<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p class="MsoNormal" style="text-align: center;" align="center"><br>
- <b><a href="#pipeline">2010
Publication
Pipeline Update</a></b> -<b><br>
</b></p>
<p style="text-align: center;" class="MsoNormal" align="center"><i>New
publications:</i><b><br>
</b></p>
<p style="text-align: center;" class="MsoNormal" align="center">- <b>
<b><a href="#sama">LDC
Standard Arabic Morphological Analyzer (SAMA)
Version 3.1</a></b></b> -<b>
</b></p>
<p style="text-align: center;" class="MsoNormal" align="center">- <b><a
href="#openmt">NIST
2004 Open Machine Translation (OpenMT)
Evaluation</a></b> -</p>
<hr size="2" width="100%">
<p class="MsoNormal" style="text-align: center;" align="center"><br>
<b><a name="pipeline"></a></b><b>2010 Publication
Pipeline Update</b></p>
<p>Membership Year (MY) 2010 has included a strong selection of
publications
including updates to the Arabic and Chinese treebanks, Spanish
telephone speech
and transcript data from the Fisher collection, and Chinese word
n-grams
collected from the web. Please consult our <a
href="http://www.ldc.upenn.edu/Catalog/ByYear.jsp">corpus catalog</a>
for a
full list of publications distributed by LDC. As we are now in the
second half
of this membership year, we would like to provide information on what
publications you can expect for the remainder of MY2010. Our pipeline
includes the following:</p>
<blockquote>
<p><i>Arabic Treebank Part 1 Version 4.1 ~ </i>a revision of Arabic
Treebank:
Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
(LDC2005T02)
(ATB1), according to the new Arabic Treebank (ATB) annotation
guidelines.
The Arabic Treebank project consists of two distinct phases: (a)
Part-of-Speech
(POS) tagging which divides the text into lexical tokens, and gives
relevant
information about each token such as lexical category, inflectional
features,
and a gloss, and (b) Arabic Treebanking which characterizes the
constituent
structures of word sequences, provides categories for each non-terminal
node,
and identifies null elements, co-reference, traces, etc.
Arabic Treebank Part 1 Version 4.1 represents the manual revision of
the
syntactic tree annotation in ATB1, the automatic revision and updating
of
certain part-of-speech tags, and the manual revision of certain
targeted POS
tags (function words, in particular). The source data consists of 734
newswire stories from Agence France Presse.</p>
<p><i>Microsoft Research India POS-Tagged Bengali</i> ~ to support
the task of
Part-of-Speech (POS) tagging and other forms of data-driven linguistic
research
on Indian languages in general, Microsoft Research India has developed
POS
labeled data for Hindi, Bengali, and Sanskrit as a part of the Indian
Language
– Part-of-Speech Tagset (IL-POST) project. The corpora are based on
the
IL-POST framework. IL-POST is a POS-tagset framework which has been
designed to
cover the morphosyntactic details of Indian languages. It supports a
three-level hierarchy of Categories, Types and Attributes. The Bengali
corpus
consists of two different levels of information for each lexical token:
(a)
lexical category and types, and (b) a set of morphological attributes and
their
associated values in the context. The data consists of 7168 manually
annotated sentences (102933 words) targeted to cover written modern
standard
Bengali from various sources, including blogs, Multikulti, and
Wikipedia.</p>
<p><i>TRECVID 2006 Keyframes and Transcripts</i> ~ TREC Video
Retrieval Evaluation
(TRECVID) is sponsored by NIST to promote progress in content-based
retrieval
from digital video via open, metrics-based evaluation. The keyframes in
this
release were extracted for use in the NIST TRECVID 2006 Evaluation.
The
source data includes approximately 158.6 hours of English, Arabic and
Chinese
language video data collected by LDC from NBC, CNN, MSN, New Tang
Dynasty TV, Phoenix
TV, Lebanese Broadcasting Corp., and China
Central TV. The keyframes were selected by going to the middle frame
of
the shot boundary, then parsing left and right of that frame to locate
the
nearest I-Frame. This then became the keyframe and was extracted.
Keyframes
have been provided at both the subshot (NRKF) and master shot (RKF)
levels.</p>
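<p>The selection rule above can be sketched in Python as follows. This is
only an illustration under stated assumptions: frame indices are integers,
the I-frame positions are already known, the search stays inside the shot,
and ties go to the earlier frame. None of these details is given in the
release description, and the function name <code>select_keyframe</code> is
hypothetical.</p>

```python
def select_keyframe(shot_start, shot_end, iframe_positions):
    """Return the I-frame closest to the middle frame of a shot, or None.

    Assumptions (not from the release notes): frames are integer indices,
    the search is limited to the shot, and the left neighbour wins a tie.
    """
    middle = (shot_start + shot_end) // 2
    iframes = set(iframe_positions)
    shot = range(shot_start, shot_end + 1)

    # Walk outward from the middle frame, checking left and then right.
    for offset in range(len(shot)):
        for candidate in (middle - offset, middle + offset):
            if candidate in shot and candidate in iframes:
                return candidate
    return None


# Hypothetical shot covering frames 100..200 with an I-frame every 12 frames.
print(select_keyframe(100, 200, range(96, 240, 12)))
```

<p>In this made-up example the middle frame is 150, and the I-frames at 144
and 156 are equally close, so the tie-breaking assumption selects 144.</p>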
<p class="MsoNormal"><i>Uda Walawe Asian Elephant Vocalizations</i> ~
a partially annotated corpus of Asian elephant
communication and vocalization. The
data set contains vocalizations primarily by adult female and juvenile
Asian
elephants. This corpus is intended to enable researchers in acoustic
communication of elephants and other species to compare the acoustic
features and
repertoire diversity of other populations with those of this population. Of particular interest is
whether
there may be regional dialects that differ among Asian elephant
populations in
the wild and in captivity. A second interest is in whether structural
commonalities exist between this and other species that shed light on
underlying social and ecological factors shaping communication systems.
</p>
</blockquote>
<p class="MsoNormal" style="">2010
Subscription Members are automatically sent all MY2010 data as it is
released. 2010 Standard Members are entitled to request 16 corpora for
free from MY2010. Non-members may license most data for research
use.<br>
</p>
<p class="MsoNormal" style="">
[<a href="#top">
top </a>]</p>
<br>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center"><b>New
Publications</b></p>
<p class="MsoBodyText"><b><a name="sama"></a></b>(1) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01">LDC
Standard Arabic Morphological Analyzer (SAMA) Version 3.1</a> was
developed by
researchers at LDC. SAMA 3.1 is based on, and updates, Tim Buckwalter's <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02">Buckwalter
Arabic Morphological Analyzer (BAMA) 2.0 (LDC2004L02)</a>. Although this
is the
first public release of SAMA, its version number continues the BAMA
sequence to reflect the
continuity between this release and previous BAMA releases. SAMA 3.1
is a
software tool for the morphological analysis of Standard Arabic. SAMA
3.1
considers each Arabic word token in all possible 'prefix-stem-suffix'
segmentations, and lists all known/possible annotation solutions, with
assignment of all diacritic marks, morpheme boundaries (separating
clitics and
inflectional morphemes from stems), and all Part-of-Speech (POS) labels
and
glosses for each morpheme segment. The generated output may then be
reviewed by
users, and the most appropriate annotation selected from among several
choices.</p>
<p class="MsoBodyText">The software layer of SAMA 3.1 relies on a data
layer that
consists primarily of three Arabic-English lexicon files: prefixes
(1328
entries), suffixes (945 entries), and stems (79318 entries representing
40654
lemmas). The lexicons are supplemented by three morphological
compatibility
tables used for controlling prefix-stem combinations (2497 entries),
stem-suffix combinations (1632 entries), and prefix-suffix combinations
(1180
entries).</p>
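<p class="MsoBodyText">The interplay of the three lexicons and the three
compatibility tables can be illustrated with a minimal Python sketch. It is
not SAMA's Perl implementation and does not read SAMA's file formats: the
lexicon entries, category names, glosses, and table contents below are
hypothetical toy data chosen only to show the prefix-stem-suffix lookup
idea.</p>

```python
from itertools import product

# Toy data layer. All entries and category names are hypothetical
# illustrations, not taken from the SAMA lexicon files.
PREFIXES = {
    "": [("PREF_NONE", "")],
    "w": [("PREF_CONJ", "and")],
    "Al": [("PREF_DET", "the")],
    "wAl": [("PREF_CONJ_DET", "and the")],
}
STEMS = {
    "ktAb": [("NOUN", "book")],
    "ktb": [("NOUN", "books"), ("VERB_PERF", "write")],
}
SUFFIXES = {
    "": [("SUFF_NONE", "")],
    "h": [("SUFF_POSS", "his/its")],
}

# Compatibility tables: pairs of categories allowed to co-occur.
PREFIX_STEM = {
    ("PREF_NONE", "NOUN"), ("PREF_NONE", "VERB_PERF"),
    ("PREF_CONJ", "NOUN"), ("PREF_CONJ", "VERB_PERF"),
    ("PREF_DET", "NOUN"), ("PREF_CONJ_DET", "NOUN"),
}
STEM_SUFFIX = {
    ("NOUN", "SUFF_NONE"), ("NOUN", "SUFF_POSS"), ("VERB_PERF", "SUFF_NONE"),
}
PREFIX_SUFFIX = {
    ("PREF_NONE", "SUFF_NONE"), ("PREF_NONE", "SUFF_POSS"),
    ("PREF_CONJ", "SUFF_NONE"), ("PREF_CONJ", "SUFF_POSS"),
    ("PREF_DET", "SUFF_NONE"), ("PREF_CONJ_DET", "SUFF_NONE"),
}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split that leaves a non-empty stem."""
    n = len(word)
    for p in range(n):              # prefix length
        for s in range(n - p):      # suffix length
            yield word[:p], word[p:n - s], word[n - s:]


def analyze(word):
    """Return all solutions licensed by the lexicons and the three tables."""
    solutions = []
    for prefix, stem, suffix in segmentations(word):
        if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
            continue
        for (pc, pg), (sc, sg), (xc, xg) in product(
            PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
        ):
            # A solution must pass all three pairwise compatibility checks.
            if ((pc, sc) in PREFIX_STEM and (sc, xc) in STEM_SUFFIX
                    and (pc, xc) in PREFIX_SUFFIX):
                gloss = " + ".join(g for g in (pg, sg, xg) if g)
                solutions.append((prefix, stem, suffix, (pc, sc, xc), gloss))
    return solutions


for token in ("wAlktAb", "ktb", "AlktAbh"):
    print(token, analyze(token))
```

<p class="MsoBodyText">With this toy data, "ktb" receives two competing
solutions (a noun and a verb reading), while "AlktAbh" receives none because
the prefix-suffix table rejects a determiner prefix combined with a
possessive suffix, even though each morpheme is in its lexicon.</p>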
<p class="MsoBodyText">The input format, output format, and data layer
of SAMA
3.1 were designed to be backward compatible with BAMA. Incremental
changes to
the data layer in SAMA have resulted in:</p>
<ul type="disc">
<li class="MsoNormal" style="">increased lexicon coverage in the
dictionary files</li>
<li class="MsoNormal" style="">important changes and additions to the
inventory of POS tags</li>
<li class="MsoNormal" style="">more possible solutions generated for
numerous word forms</li>
</ul>
<p class="MsoBodyText">The software implementation has been updated to
allow more
input/output options, installation and configuration options, and
smoother
incorporation in other Perl tools/services. The structure of the
dictionary and
morphotactic tables has remained the same (the tables provided with
SAMA 3.1
differ from the BAMA 2.0 tables only in size and content, not in
format).
Logical separation between the software layer and data layer allows the
new
software tools to be used with previous versions of the tables
(instructions
are provided with software documentation). The basic logic that
implements the segmentation and analysis look-up for Arabic words is
essentially unchanged since BAMA 2.0.</p>
<p class="MsoBodyText">The data layer is now accessed through Berkeley
DB, with
result-caching enabled by default, leading to improved performance.
Various
utility scripts have also been added to the software package to
facilitate more
flexible interaction with tools and data.</p>
<p class="MsoBodyText">As a
Members-Only release, LDC Standard Arabic Morphological Analyzer (SAMA)
Version 3.1 is not available for non-member licensing.</p>
<p class="MsoNormal">[<a href="#top">
top </a>]<br>
<br>
</p>
<p class="MsoBodyText"><b><a name="openmt"></a></b>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T12">NIST
2004 Open Machine Translation (OpenMT) Evaluation</a> is a package
containing
source data, reference translations, and scoring software used in the
NIST 2004
OpenMT evaluation. It is designed to help evaluate the effectiveness of
machine
translation systems. The package was compiled and scoring software was
developed by researchers at NIST, making use of newswire source data
and
reference translations collected and developed by LDC.</p>
<p class="MsoNormal" style="">The
objective of the NIST OpenMT evaluation series is to support research
in, and
help advance the state of the art of, machine translation (MT)
technologies --
technologies that translate text between human languages. Input may
include all
forms of text. The goal is for the output to be an adequate and fluent
translation of the original. The 2004 task was to evaluate translation
from Chinese to English and from Arabic to English. Additional
information
about these evaluations may be found at the <a
href="http://www.itl.nist.gov/iad/mig/tests/mt/">NIST Open Machine
Translation
(OpenMT) Evaluation web site</a>.</p>
<p class="MsoNormal">This evaluation kit includes a single Perl script
(mteval-v11a.pl) that may be used to produce a translation quality
score for
one (or more) MT systems. The script works by comparing the system
output
translation with a set of (expert) reference translations of the same
source
text. Comparison is based on finding sequences of words in the
reference
translations that match word sequences in the system output
translation.</p>
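<p class="MsoNormal">The word-sequence matching described above can be
illustrated with a short Python sketch. It is not mteval-v11a.pl and does
not reproduce its scores; it only shows one common way to count matches
against several references, clipped n-gram precision, with whitespace
tokenization as a simplifying assumption. The function names and example
sentences are hypothetical.</p>

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference where it is most frequent (the "clipping" step).
    """
    hyp_counts = ngrams(hypothesis.split(), n)

    # For each n-gram, keep its highest count across all references.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


refs = ["the cat is on the mat", "there is a cat on the mat"]
print(clipped_precision("the cat sat on the mat", refs, 1))
print(clipped_precision("the cat sat on the mat", refs, 2))
```

<p class="MsoNormal">Clipping keeps a system from being rewarded for
repeating a common word more often than any reference does. A full metric
would typically combine several n-gram orders and account for output
length, which this sketch leaves out.</p>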
<p class="MsoBodyText">This corpus consists of 150 Arabic newswire
documents, 150
Chinese newswire documents, and 29 Chinese "prepared speech"
documents. For each language, the test set consists of two files: a
source and
a reference file. Each reference file contains four independent
translations of
the data set. The evaluation year, source language, test set, version
of the
data, and source vs. reference file are reflected in the file name.
</p>
<p class="MsoNormal" style="">
[<a href="#top">
top </a>]</p>
<hr size="2" width="100%">
<div align="center">
<pre class="moz-signature" cols="72"><big><font
face="Courier New, Courier, monospace"><small><small><big>Ilya Ahtaridis</big></small></small></font>
<font face="Courier New, Courier, monospace"><small><small><big>Membership Coordinator</big></small></small></font></big>
<font face="Courier New, Courier, monospace"><small>--------------------------------------------------------------------</small></font>
<font face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>