[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Jul 23 18:54:09 UTC 2010


- 2010 Publication Pipeline Update -

New publications:

- LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 -

- NIST 2004 Open Machine Translation (OpenMT) Evaluation -

------------------------------------------------------------------------


*2010 Publication Pipeline Update*

Membership Year (MY) 2010 has featured a strong selection of 
publications, including updates to the Arabic and Chinese treebanks, 
Spanish telephone speech and transcript data from the Fisher collection, 
and Chinese word n-grams collected from the web.  Please consult our 
corpus catalog <http://www.ldc.upenn.edu/Catalog/ByYear.jsp> for a full 
list of publications distributed by LDC. As we are now in the second 
half of this membership year, we would like to provide information on 
what publications you can expect for the remainder of MY2010.  Our 
pipeline includes the following:

    /Arabic Treebank Part 1 Version 4.1/ ~ a revision of Arabic
    Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic
    analysis) (LDC2005T02) (ATB1), according to the new Arabic Treebank
    (ATB) annotation guidelines.  The Arabic Treebank project consists
    of two distinct phases: (a) Part-of-Speech (POS) tagging which
    divides the text into lexical tokens, and gives relevant information
    about each token such as lexical category, inflectional features,
    and a gloss, and (b) Arabic Treebanking which characterizes the
    constituent structures of word sequences, provides categories for
    each non-terminal node, and identifies null elements, co-reference,
    traces, etc.  Arabic Treebank Part 1 Version 4.1
    represents the manual revision of the syntactic tree annotation in
    ATB1, the automatic revision and updating of certain part-of-speech
    tags, and the manual revision of certain targeted POS tags (function
    words, in particular).  The source data consists of 734 newswire
    stories from Agence France Presse.

    /Microsoft Research India POS-Tagged Bengali/ ~ to support the task
    of Part-of-Speech (POS) tagging and other forms of data-driven
    linguistic research on Indian languages in general, Microsoft
    Research India has developed POS-labeled data for Hindi, Bengali,
    and Sanskrit as a part of the Indian Language -- Part-of-Speech
    Tagset (IL-POST) project.  The corpora are based on the IL-POST
    framework, a POS-tagset framework designed to cover the
    morphosyntactic details of Indian languages. It supports a
    three-level hierarchy of Categories, Types and Attributes. The
    Bengali corpus consists of two different levels of information for
    each lexical token: (a) lexical category and types, and (b) a set of
    morphological attributes and their associated values in the
    context.  The data consists of 7168 manually annotated sentences
    (102933 words) targeted to cover written modern standard Bengali
    from various sources, including blogs, Multikulti, and Wikipedia.

    /TRECVID 2006 Keyframes and Transcripts/ ~ TREC Video Retrieval
    Evaluation (TRECVID) is sponsored by NIST to promote progress in
    content-based retrieval from digital video via open, metrics-based
    evaluation. The keyframes in this release were extracted for use in
    the NIST TRECVID 2006 Evaluation.  The source data includes
    approximately 158.6 hours of English, Arabic and Chinese language
    video data collected by LDC from NBC, CNN, MSN, New Tang Dynasty TV,
    Phoenix TV, Lebanese Broadcasting Corp.,  and China Central TV.  The
    keyframes were selected by going to the middle frame of the shot
    boundary, then parsing left and right of that frame to locate the
    nearest I-Frame. This then became the keyframe and was extracted.
    Keyframes have been provided at both the subshot (NRKF) and master
    shot (RKF) levels.

    /Uda Walawe Asian Elephant Vocalizations/ ~ a partially annotated
    corpus of Asian elephant communication/vocalization. The data set
    contains vocalizations primarily by adult female and juvenile Asian
    elephants. This corpus is intended to enable researchers in acoustic
    communication of elephants and other species to compare acoustic
    features and repertoire diversity with this population. Of particular
    interest is whether there may be regional dialects that differ among
    Asian elephant populations in the wild and in captivity. A second
    interest is in whether structural commonalities exist between this
    and other species that shed light on underlying social and
    ecological factors shaping communication systems.
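
For the TRECVID 2006 keyframes described above, the selection rule (start
at the middle frame of the shot, then search left and right for the
nearest I-Frame) can be sketched as follows. This is only an
illustration: the function name, the representation of frames as integer
indices, and the tie-breaking rule (prefer the earlier frame) are
assumptions, not details taken from the release.

```python
def select_keyframe(shot_start, shot_end, iframe_indices):
    """Return the I-frame index nearest to the middle frame of a shot.

    shot_start, shot_end: first and last frame index of the shot.
    iframe_indices: frame indices known to be I-frames (hypothetical input).
    """
    if not iframe_indices:
        raise ValueError("no I-frames supplied")
    middle = (shot_start + shot_end) // 2
    # Searching left and right from the middle is equivalent to taking the
    # I-frame with the smallest distance to it. Ties go to the earlier
    # frame, which is an assumption made for this sketch.
    return min(iframe_indices, key=lambda i: (abs(i - middle), i))


if __name__ == "__main__":
    print(select_keyframe(100, 200, [90, 140, 165, 210]))
```

The sketch does not restrict the search to I-frames inside the shot; a
real extraction tool may well do so.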

2010 Subscription Members are automatically sent all MY2010 data as it 
is released.  2010 Standard Members are entitled to request 16 corpora 
for free from MY2010.   Non-members may license most data for research use.



*New Publications*

(1)  The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01> 
was developed by researchers at LDC. SAMA 3.1 is based on, and updates, 
Tim Buckwalter's Buckwalter Arabic Morphological Analyzer (BAMA) 2.0 
(LDC2004L02) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02>. 
This is the first public release of SAMA; its version number continues 
from BAMA to reflect the continuity between this release and previous 
BAMA releases.  SAMA 3.1 is a software tool for the morphological 
analysis of Standard Arabic. SAMA 3.1 considers each Arabic word token 
in all possible 'prefix-stem-suffix' segmentations, and lists all 
known/possible annotation solutions, with assignment of all diacritic 
marks, morpheme boundaries (separating clitics and inflectional 
morphemes from stems), and all Part-of-Speech (POS) labels and glosses 
for each morpheme segment. The generated output may then be reviewed by 
users, and the most appropriate annotation selected from among several 
choices.

The software layer of SAMA 3.1 relies on a data layer that consists 
primarily of three Arabic-English lexicon files: prefixes (1328 
entries), suffixes (945 entries), and stems (79318 entries representing 
40654 lemmas). The lexicons are supplemented by three morphological 
compatibility tables used for controlling prefix-stem combinations (2497 
entries), stem-suffix combinations (1632 entries), and prefix-suffix 
combinations (1180 entries).
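
The lookup described above (try every prefix-stem-suffix segmentation,
then keep only the combinations permitted by the three compatibility
tables) can be sketched in a few lines of Python. This is a minimal
illustration, not SAMA's Perl implementation: the lexicon entries, the
category labels, and the function names are all invented for the
example, and glosses, diacritics, and POS output are omitted.

```python
# Toy lexicons mapping a surface string to its morphological categories.
# All entries and category names here are hypothetical.
PREFIXES = {"": ["Pref-0"], "wa": ["Pref-Wa"]}
STEMS = {"kitAb": ["N"], "katab": ["PV"]}
SUFFIXES = {"": ["Suff-0"], "a": ["PVSuff-a"], "u": ["NSuff-u"]}

# Toy compatibility tables: prefix-stem, stem-suffix, prefix-suffix.
AB = {("Pref-0", "N"), ("Pref-Wa", "N"), ("Pref-0", "PV"), ("Pref-Wa", "PV")}
BC = {("N", "Suff-0"), ("N", "NSuff-u"), ("PV", "PVSuff-a")}
AC = {(p, s) for p in ("Pref-0", "Pref-Wa")
      for s in ("Suff-0", "NSuff-u", "PVSuff-a")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):                 # prefix is word[:i], may be empty
        for j in range(i + 1, n + 1):  # stem is word[i:j], never empty
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return all segmentations licensed by the lexicons and tables."""
    solutions = []
    for pre, stem, suf in segmentations(word):
        for a in PREFIXES.get(pre, []):
            for b in STEMS.get(stem, []):
                for c in SUFFIXES.get(suf, []):
                    # A solution needs all three pairwise checks to pass.
                    if (a, b) in AB and (b, c) in BC and (a, c) in AC:
                        solutions.append((pre, stem, suf, a, b, c))
    return solutions


if __name__ == "__main__":
    for word in ("wakataba", "kitAbu", "kitAba"):
        print(word, analyze(word))
```

In this toy setup "kitAba" yields no solution because the noun stem is
not compatible with the verbal suffix in the stem-suffix table, which is
the kind of filtering the compatibility tables provide.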

The input format, output format, and data layer of SAMA 3.1 were 
designed to be backward compatible with BAMA. Incremental changes to the 
data layer in SAMA have resulted in:

    * increased lexicon coverage in the dictionary files
    * important changes and additions to the inventory of POS tags
    * more possible solutions generated for numerous word forms

The software implementation has been updated to allow more input/output 
options, installation and configuration options, and smoother 
incorporation in other Perl tools/services. The structure of the 
dictionary and morphotactic tables has remained the same (the tables 
provided with SAMA 3.1 differ from the BAMA 2.0 tables only in size and 
content, not in format). Logical separation between the software layer 
and data layer allows the new software tools to be used with previous 
versions of the tables (instructions are provided with software 
documentation).  The basic logic that implements the segmentation and 
analysis look-up for Arabic words is essentially unchanged since BAMA 2.0.

The data layer is now accessed through Berkeley DB, with result-caching 
enabled by default, leading to improved performance. Various utility 
scripts have also been added to the software package to facilitate more 
flexible interaction with tools and data.

As a Members-Only release, LDC Standard Arabic Morphological Analyzer 
(SAMA) Version 3.1 is not available for non-member licensing.


(2)  NIST 2004 Open Machine Translation (OpenMT) Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T12> 
is a package containing source data, reference translations, and scoring 
software used in the NIST 2004 OpenMT evaluation. It is designed to help 
evaluate the effectiveness of machine translation systems. The package 
was compiled and scoring software was developed by researchers at NIST, 
making use of newswire source data and reference translations collected 
and developed by LDC.

The objective of the NIST OpenMT evaluation series is to support 
research in, and help advance the state of the art of, machine 
translation (MT) technologies -- technologies that translate text 
between human languages. Input may include all forms of text. The goal 
is for the output to be an adequate and fluent translation of the 
original.  The 2004 task was to evaluate translation from Chinese to 
English and from Arabic to English. Additional information about these 
evaluations may be found at the NIST Open Machine Translation (OpenMT) 
Evaluation web site <http://www.itl.nist.gov/iad/mig/tests/mt/>.

This evaluation kit includes a single Perl script (mteval-v11a.pl) that 
may be used to produce a translation quality score for one (or more) MT 
systems. The script works by comparing the system output translation 
with a set of (expert) reference translations of the same source text. 
Comparison is based on finding sequences of words in the reference 
translations that match word sequences in the system output translation.
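
As a rough illustration of the word-sequence matching described above,
the following Python sketch counts how many n-grams in a system output
also occur in at least one reference translation, capping each n-gram at
the largest count seen in any single reference. It is a simplified
stand-in written for this note: the function names and whitespace
tokenization are assumptions, and the script's actual metrics,
normalization, and weighting are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    An n-gram in the hypothesis is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    hyp = ngrams(hypothesis.split(), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return matched, sum(hyp.values())


if __name__ == "__main__":
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    for n in (1, 2):
        print(n, clipped_matches("the cat sat on the mat", refs, n))
```

Dividing matched by total gives an n-gram precision for one order n;
using several references, as in this test set, gives a system more
chances to be credited for a legitimate wording.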

This corpus consists of 150 Arabic newswire documents, 150 Chinese 
newswire documents, and 29 Chinese "prepared speech" documents. For each 
language, the test set consists of two files: a source and a reference 
file. Each reference file contains four independent translations of the 
data set. The evaluation year, source language, test set, version of the 
data, and source vs. reference file are reflected in the file name.


------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
