<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="text-align: center;" align="center"><a name="top"></a><br>

- <b><a href="#pipeline">2010

Publication

Pipeline Update</a></b> -<b><br>

</b></p>

<p style="text-align: center;" class="MsoNormal" align="center"><i>New

publications:</i><b><br>

</b></p>

<p style="text-align: center;" class="MsoNormal" align="center">- <b>

<b><a href="#sama">LDC

Standard Arabic Morphological Analyzer (SAMA)

Version 3.1</a></b></b> -<b>

</b></p>

<p style="text-align: center;" class="MsoNormal" align="center">- <b><a

 href="#openmt">NIST

2004 Open Machine Translation (OpenMT)

Evaluation</a></b> -</p>

<hr size="2" width="100%">

<p class="MsoNormal" style="text-align: center;" align="center"><br>

<b><a name="pipeline"></a></b><b>2010 Publication

Pipeline Update</b><o:p></o:p></p>

<p>Membership Year (MY) 2010 has included a strong selection of

publications

including updates to the Arabic and Chinese treebanks, Spanish

telephone speech

and transcript data from the Fisher collection, and Chinese word

n-grams

collected from the web.  Please consult our <a

 href="http://www.ldc.upenn.edu/Catalog/ByYear.jsp">corpus catalog</a>

for a

full list of publications distributed by LDC. As we are now in the

second half

of this membership year, we would like to provide information on what

publications you can expect for the remainder of MY2010.  Our pipeline

includes the following:<o:p></o:p></p>

<blockquote>

  <p><i>Arabic Treebank Part 1 Version 4.1 ~ </i>a revision of Arabic

Treebank:

Part 1 v 3.0 (POS with full vocalization + syntactic analysis)

(LDC2005T02)

(ATB1), according to the new Arabic Treebank (ATB) annotation

guidelines. 

The Arabic Treebank project consists of two distinct phases: (a)

Part-of-Speech

(POS) tagging, which divides the text into lexical tokens and gives

relevant

information about each token such as lexical category, inflectional

features,

and a gloss, and (b) Arabic Treebanking, which characterizes the

constituent

structures of word sequences, provides categories for each non-terminal

node,

and identifies null elements, co-reference, traces, etc.

  Arabic Treebank Part 1 Version 4.1 represents the manual revision of

the

syntactic tree annotation in ATB1, the automatic revision and updating

of

certain part-of-speech tags, and the manual revision of certain

targeted POS

tags (function words, in particular).  The source data consists of 734

newswire stories from Agence France Presse.<o:p></o:p></p>

  <p><i>Microsoft Research India POS-Tagged Bengali </i>~ to support

the task of

Part-of-Speech (POS) tagging and other forms of data-driven linguistic

research

on Indian languages in general, Microsoft Research India has developed

POS

labeled data for Hindi, Bengali, and Sanskrit as a part of the Indian

Language

– Part-of-Speech Tagset (IL-POST) project.  The corpora are based on

the

IL-POST framework. IL-POST is a POS-tagset framework which has been

designed to

cover the morphosyntactic details of Indian languages. It supports a

three-level hierarchy of Categories, Types and Attributes. The Bengali

corpus

consists of two different levels of information for each lexical token:

(a)

lexical category and types, and (b) a set of morphological attributes and

their

associated values in the context.  The data consists of 7168 manually

annotated sentences (102933 words) targeted to cover written modern

standard

Bengali from various sources, including blogs, Multikulti, and

Wikipedia.<o:p></o:p></p>

  <p><i>TRECVID 2006 Keyframes and Transcripts</i> ~ TREC Video

Retrieval Evaluation

(TRECVID) is sponsored by NIST to promote progress in content-based

retrieval

from digital video via open, metrics-based evaluation. The keyframes in

this

release were extracted for use in the NIST TRECVID 2006 Evaluation. 

The

source data includes approximately 158.6 hours of English, Arabic and

Chinese

language video data collected by LDC from NBC, CNN, MSN, New Tang

Dynasty TV, Phoenix

TV, Lebanese Broadcasting Corp., and China

Central TV.  The keyframes were selected by going to the middle frame

of

the shot boundary, then parsing left and right of that frame to locate

the

nearest I-Frame. This then became the keyframe and was extracted.

Keyframes

have been provided at both the subshot (NRKF) and master shot (RKF)

levels. <o:p></o:p></p>
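  <p>The selection rule described above can be sketched in a few lines of
Python. This is only an illustration, not the tool used for the release: the
function name, the frame numbers, and the tie-breaking rule (prefer the
earlier frame) are all assumptions made for the example.</p>

<pre>
```python
def pick_keyframe(shot_start, shot_end, iframe_positions):
    """Return the I-frame closest to the middle frame of a shot.

    shot_start and shot_end are frame indices for the shot;
    iframe_positions lists the frame indices that are I-frames.
    All inputs here are hypothetical.
    """
    if not iframe_positions:
        raise ValueError("no I-frames available")
    middle = (shot_start + shot_end) // 2
    # Look left and right of the middle frame and keep the nearest I-frame.
    # Ties go to the earlier frame (an assumption, not stated in the source).
    return min(iframe_positions, key=lambda f: (abs(f - middle), f))


# Hypothetical shot spanning frames 100-220 with an I-frame every 12 frames.
keyframe = pick_keyframe(100, 220, [96, 108, 120, 132, 144, 156, 168, 180])
print(keyframe)
```
</pre>

  <p>Here the middle frame is 160, so the sketch selects the I-frame at 156
rather than the one at 168.</p>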

  <p class="MsoNormal"><i>Uda Walawe Asian Elephant Vocalizations</i> ~

a partially annotated corpus of Asian elephant

communication/vocalization. The

data set contains vocalizations primarily by adult female and juvenile

Asian

elephants. This corpus is intended to enable researchers in acoustic

communication of elephants and other species to compare acoustic

features and

repertoire diversity with those of this population. Of particular interest is

whether

there may be regional dialects that differ among Asian elephant

populations in

the wild and in captivity. A second interest is in whether structural

commonalities exist between this and other species that shed light on

underlying social and ecological factors shaping communication systems.

  <o:p></o:p></p>

</blockquote>

<p class="MsoNormal" style="">2010

Subscription Members are automatically sent all MY2010 data as it is

released.  2010 Standard Members are entitled to request 16 corpora for

free from MY2010.   Non-members may license most data for research

use.<br>

</p>

<p class="MsoNormal" style="">

[<a href="#top">

top </a>]</p>

<br>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b>New

Publications<o:p></o:p></b></p>

<p class="MsoBodyText"><b><a name="sama"></a></b>(1)  The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01">LDC

Standard Arabic Morphological Analyzer (SAMA) Version 3.1</a> was

developed by

researchers at LDC. SAMA 3.1 is based on, and updates, Tim Buckwalter's <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02">Buckwalter

Arabic Morphological Analyzer (BAMA) 2.0 (LDC2004L02)</a>. Although this

is the

first public release of SAMA, its version number continues from BAMA to

reflect the

continuity between this release and previous BAMA releases.  SAMA 3.1

is a

software tool for the morphological analysis of Standard Arabic. SAMA

3.1

considers each Arabic word token in all possible 'prefix-stem-suffix'

segmentations, and lists all known/possible annotation solutions, with

assignment of all diacritic marks, morpheme boundaries (separating

clitics and

inflectional morphemes from stems), and all Part-of-Speech (POS) labels

and

glosses for each morpheme segment. The generated output may then be

reviewed by

users, and the most appropriate annotation selected from among several

choices.<o:p></o:p></p>

<p class="MsoBodyText">The software layer of SAMA 3.1 relies on a data

layer that

consists primarily of three Arabic-English lexicon files: prefixes

(1328

entries), suffixes (945 entries), and stems (79318 entries representing

40654

lemmas). The lexicons are supplemented by three morphological

compatibility

tables used for controlling prefix-stem combinations (2497 entries),

stem-suffix combinations (1632 entries), and prefix-suffix combinations

(1180

entries). <o:p></o:p></p>
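<p class="MsoBodyText">To make the interaction between the lexicons and the
compatibility tables concrete, here is a minimal Python sketch of the general
technique: enumerate every prefix-stem-suffix split of a token, look each
piece up in its lexicon, and keep only the combinations that all three tables
allow. It is not SAMA's code. The lexicon entries are toy English strings, and
the category names and table contents are invented for illustration.</p>

<pre>
```python
# Toy lexicons: surface string -> list of morphological categories.
# All entries and category names are hypothetical.
PREFIXES = {"": ["Pref-0"], "un": ["Pref-neg"]}
STEMS = {"lock": ["Verb"], "locked": ["Adj"], "unlock": ["Verb"]}
SUFFIXES = {"": ["Suff-0"], "ed": ["Suff-past"]}

# Toy compatibility tables: pairs of categories that may combine.
PREFIX_STEM = {("Pref-0", "Verb"), ("Pref-0", "Adj"),
               ("Pref-neg", "Verb"), ("Pref-neg", "Adj")}
STEM_SUFFIX = {("Verb", "Suff-0"), ("Verb", "Suff-past"), ("Adj", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "Suff-past"),
                 ("Pref-neg", "Suff-0"), ("Pref-neg", "Suff-past")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):                  # prefix is word[:i], possibly empty
        for j in range(i + 1, n + 1):   # stem is word[i:j], never empty
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return the splits whose pieces are known and mutually compatible."""
    solutions = []
    for prefix, stem, suffix in segmentations(word):
        for p in PREFIXES.get(prefix, []):
            for s in STEMS.get(stem, []):
                for x in SUFFIXES.get(suffix, []):
                    if ((p, s) in PREFIX_STEM
                            and (s, x) in STEM_SUFFIX
                            and (p, x) in PREFIX_SUFFIX):
                        solutions.append((prefix, stem, suffix))
    return solutions


print(analyze("unlocked"))
print(analyze("lockeded"))
```
</pre>

<p class="MsoBodyText">In this toy setup "unlocked" yields three candidate
analyses, while "lockeded" yields none because the stem-suffix table does not
allow the adjective stem to take the past suffix. A real analyzer would attach
glosses, POS labels, and vocalization to each solution.</p>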

<p class="MsoBodyText">The input format, output format, and data layer

of SAMA

3.1 were designed to be backward compatible with BAMA. Incremental

changes to

the data layer in SAMA have resulted in: <o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">increased lexicon coverage in the

dictionary files<o:p></o:p></li>

  <li class="MsoNormal" style="">important changes and additions to the

inventory of POS tags<o:p></o:p></li>

  <li class="MsoNormal" style="">more possible solutions generated for

numerous word forms<o:p></o:p></li>

</ul>

<p class="MsoBodyText">The software implementation has been updated to

allow more

input/output options, installation and configuration options, and

smoother

incorporation in other Perl tools/services. The structure of the

dictionary and

morphotactic tables has remained the same (the tables provided with

SAMA 3.1

differ from the BAMA 2.0 tables only in size and content, not in

format).

Logical separation between the software layer and data layer allows the

new

software tools to be used with previous versions of the tables

(instructions

are provided with software documentation).  The basic logic that

implements the segmentation and analysis look-up for Arabic words is

essentially unchanged since BAMA 2.0. <o:p></o:p></p>

<p class="MsoBodyText">The data layer is now accessed through Berkeley

DB, with

result-caching enabled by default, leading to improved performance.

Various

utility scripts have also been added to the software package to

facilitate more

flexible interaction with tools and data.<o:p></o:p></p>

<p class="MsoBodyText">As a

Members-Only release, LDC Standard Arabic Morphological Analyzer (SAMA)

Version 3.1 is not available for non-member licensing.<o:p></o:p></p>

<p class="MsoNormal">[<a href="#top">

top </a>]<br>

<br>

</p>

<p class="MsoBodyText"><b><a name="openmt"></a></b>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T12">NIST

2004 Open Machine Translation (OpenMT) Evaluation</a> is a package

containing

source data, reference translations, and scoring software used in the

NIST 2004

OpenMT evaluation. It is designed to help evaluate the effectiveness of

machine

translation systems. The package was compiled and scoring software was

developed by researchers at NIST, making use of newswire source data

and

reference translations collected and developed by LDC.<o:p></o:p></p>

<p class="MsoNormal" style="">The

objective of the NIST OpenMT evaluation series is to support research

in, and

help advance the state of the art of, machine translation (MT)

technologies --

technologies that translate text between human languages. Input may

include all

forms of text. The goal is for the output to be an adequate and fluent

translation of the original.  The 2004 task was to evaluate translation

from Chinese to English and from Arabic to English. Additional

information

about these evaluations may be found at the <a

 href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST Open Machine

Translation

(OpenMT) Evaluation web site</a>. <o:p></o:p></p>

<p class="MsoNormal">This evaluation kit includes a single Perl script

(mteval-v11a.pl) that may be used to produce a translation quality

score for

one or more MT systems. The script works by comparing the system

output

translation with a set of (expert) reference translations of the same

source

text. Comparison is based on finding sequences of words in the

reference

translations that match word sequences in the system output

translation. <o:p></o:p></p>
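<p class="MsoNormal">The idea of matching word sequences against several
reference translations can be illustrated with a short Python sketch. It
counts how many word n-grams in a system output also occur in the references,
crediting each n-gram no more often than it appears in any single reference.
This is a simplified illustration of the general technique, not the scoring
formula implemented in mteval-v11a.pl. The function names and example
sentences are invented.</p>

<pre>
```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matched_ngrams(system, references, n):
    """Return (matched, total) n-gram counts for one system sentence.

    An n-gram is credited at most as many times as it occurs in the
    single reference where it is most frequent (clipped counting).
    """
    sys_counts = Counter(ngrams(system.split(), n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram])
                  for gram, count in sys_counts.items())
    total = sum(sys_counts.values())
    return matched, total


# Hypothetical system output and two reference translations.
refs = ["the cat is on the mat", "a cat sat on the mat"]
print(matched_ngrams("the the cat sat on mat", refs, 1))
print(matched_ngrams("the the cat sat on mat", refs, 2))
```
</pre>

<p class="MsoNormal">For this example all six single words are matched, but
only three of the five word pairs are, which shows how longer sequences reward
output whose word order agrees with the references.</p>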

<p class="MsoBodyText">This corpus consists of 150 Arabic newswire

documents, 150

Chinese newswire documents, and 29 Chinese "prepared speech"

documents. For each language, the test set consists of two files: a

source and

a reference file. Each reference file contains four independent

translations of

the data set. The evaluation year, source language, test set, version

of the

data, and source vs. reference file are reflected in the file name.

<o:p></o:p></p>

<p class="MsoNormal" style="">

[<a href="#top">

top </a>]</p>

<hr size="2" width="100%">

<div align="center">

<pre class="moz-signature" cols="72"><big><font

 face="Courier New, Courier, monospace"><small><small><big>Ilya Ahtaridis</big></small></small></font>

<font face="Courier New, Courier, monospace"><small><small><big>Membership Coordinator</big></small></small></font></big>

<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>

<font face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>