[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jul 23 18:54:09 UTC 2010
- *2010 Publication Pipeline Update* -
/New publications:/
- *LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1* -
- *NIST 2004 Open Machine Translation (OpenMT) Evaluation* -
------------------------------------------------------------------------
*2010 Publication Pipeline Update*
Membership Year (MY) 2010 has included a strong selection of
publications including updates to the Arabic and Chinese treebanks,
Spanish telephone speech and transcript data from the Fisher collection,
and Chinese word n-grams collected from the web. Please consult our
corpus catalog <http://www.ldc.upenn.edu/Catalog/ByYear.jsp> for a full
list of publications distributed by LDC. As we are now in the second
half of this membership year, we would like to provide information on
what publications you can expect for the remainder of MY2010. Our
pipeline includes the following:
/Arabic Treebank Part 1 Version 4.1/ ~ a revision of Arabic
Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic
analysis) (LDC2005T02) (ATB1), according to the new Arabic Treebank
(ATB) annotation guidelines. The Arabic Treebank project consists
of two distinct phases: (a) Part-of-Speech (POS) tagging which
divides the text into lexical tokens, and gives relevant information
about each token such as lexical category, inflectional features,
and a gloss, and (b) Arabic Treebanking which characterizes the
constituent structures of word sequences, provides categories for
each non-terminal node, and identifies null elements, co-reference,
traces, etc. Arabic Treebank Part 1 Version 4.1
represents the manual revision of the syntactic tree annotation in
ATB1, the automatic revision and updating of certain part-of-speech
tags, and the manual revision of certain targeted POS tags (function
words, in particular). The source data consists of 734 newswire
stories from Agence France Presse.
/Microsoft Research India POS-Tagged Bengali/ ~ to support the task
of Part-of-Speech (POS) tagging and other forms of data-driven
linguistic research on Indian languages in general, Microsoft
Research India has developed POS labeled data for Hindi, Bengali,
and Sanskrit as a part of the Indian Language -- Part-of-Speech
Tagset (IL-POST) project. The corpora are based on the IL-POST
framework. IL-POST is a POS-tagset framework which has been designed
to cover the morphosyntactic details of Indian languages. It
supports a three-level hierarchy of Categories, Types and
Attributes. The Bengali corpus consists of two different levels of
information for each lexical token: (a) lexical category and types,
and (b) a set of morphological attributes and their associated values in
the context. The data consists of 7168 manually annotated sentences
(102933 words) targeted to cover written modern standard Bengali
from various sources, including blogs, Multikulti, and Wikipedia.
/TRECVID 2006 Keyframes and Transcripts/ ~ TREC Video Retrieval
Evaluation (TRECVID) is sponsored by NIST to promote progress in
content-based retrieval from digital video via open, metrics-based
evaluation. The keyframes in this release were extracted for use in
the NIST TRECVID 2006 Evaluation. The source data includes
approximately 158.6 hours of English, Arabic and Chinese language
video data collected by LDC from NBC, CNN, MSN, New Tang Dynasty TV,
Phoenix TV, Lebanese Broadcasting Corp., and China Central TV. The
keyframes were selected by going to the middle frame of the shot
boundary, then parsing left and right of that frame to locate the
nearest I-Frame. This then became the keyframe and was extracted.
Keyframes have been provided at both the subshot (NRKF) and master
shot (RKF) levels.
/Uda Walawe Asian Elephant Vocalizations/ ~ a partially annotated
corpus of Asian Elephant communication/vocalization. The data set
contains vocalizations primarily by adult female and juvenile Asian
elephants. This corpus is intended to enable researchers in acoustic
communication of elephants and other species to compare acoustic
features and repertoire diversity with those of this population. Of particular
interest is whether there may be regional dialects that differ among
Asian elephant populations in the wild and in captivity. A second
interest is in whether structural commonalities exist between this
and other species that shed light on underlying social and
ecological factors shaping communication systems.
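
The keyframe selection described for TRECVID 2006 above (start at the middle frame of the shot, then look left and right for the nearest I-frame) can be illustrated with a minimal Python sketch. It is not the tool used for the release. The function name, the frame-index representation, and the tie-breaking rule (prefer the earlier frame when two I-frames are equally close) are assumptions made for illustration.

```python
def pick_keyframe(shot_start, shot_end, iframe_positions):
    """Return the I-frame index nearest to the middle of a shot.

    shot_start, shot_end: first and last frame index of the shot.
    iframe_positions: frame indices of the I-frames in the video.
    Ties are broken in favour of the earlier frame (an assumption;
    the announcement does not say how ties were handled).
    """
    if not iframe_positions:
        return None
    middle = (shot_start + shot_end) // 2
    # Sort key: distance from the middle frame first, then frame index.
    return min(iframe_positions, key=lambda f: (abs(f - middle), f))
```

For a shot spanning frames 100 to 200 with I-frames at 90, 140, 165 and 210, the middle frame is 150 and the sketch selects frame 140.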
2010 Subscription Members are automatically sent all MY2010 data as it
is released. 2010 Standard Members are entitled to request 16 corpora
for free from MY2010. Non-members may license most data for research use.
*New Publications*
(1) The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010L01>
was developed by researchers at LDC. SAMA 3.1 is based on, and updates,
Tim Buckwalter's Buckwalter Arabic Morphological Analyzer (BAMA) 2.0
(LDC2004L02)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02>.
Since this is the first public release of SAMA, it has been numbered
continuously to reflect the continuity between this release and previous
BAMA releases. SAMA 3.1 is a software tool for the morphological
analysis of Standard Arabic. SAMA 3.1 considers each Arabic word token
in all possible 'prefix-stem-suffix' segmentations, and lists all
known/possible annotation solutions, with assignment of all diacritic
marks, morpheme boundaries (separating clitics and inflectional
morphemes from stems), and all Part-of-Speech (POS) labels and glosses
for each morpheme segment. The generated output may then be reviewed by
users, and the most appropriate annotation selected from among several
choices.
The software layer of SAMA 3.1 relies on a data layer that consists
primarily of three Arabic-English lexicon files: prefixes (1328
entries), suffixes (945 entries), and stems (79318 entries representing
40654 lemmas). The lexicons are supplemented by three morphological
compatibility tables used for controlling prefix-stem combinations (2497
entries), stem-suffix combinations (1632 entries), and prefix-suffix
combinations (1180 entries).
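
As a rough illustration of how three lexicons and three compatibility tables can work together, the Python sketch below enumerates every 'prefix-stem-suffix' split of a word and keeps only the splits whose categories are licensed by all three tables. It is a toy under stated assumptions, not SAMA itself (which is written in Perl). The lexicon entries, the category names (P_none, S_noun, and so on), and the length limits of 4 characters for prefixes and 6 for suffixes are hypothetical values chosen for the example.

```python
# Toy lexicons: surface form -> list of morphological categories.
# All entries and category names are hypothetical.
PREFIXES = {"": ["P_none"], "w": ["P_conj"]}
STEMS = {"ktAb": ["S_noun"]}
SUFFIXES = {"": ["X_none"], "At": ["X_fpl"]}

# Toy compatibility tables: sets of licensed category pairs.
PREFIX_STEM = {("P_none", "S_noun"), ("P_conj", "S_noun")}
STEM_SUFFIX = {("S_noun", "X_none")}
PREFIX_SUFFIX = {("P_none", "X_none"), ("P_conj", "X_none")}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    for p in range(0, min(max_prefix, len(word) - 1) + 1):
        for s in range(0, min(max_suffix, len(word) - p - 1) + 1):
            stem_end = len(word) - s
            yield word[:p], word[p:stem_end], word[stem_end:]


def analyze(word):
    """Return the splits whose categories pass all three tables."""
    solutions = []
    for prefix, stem, suffix in segmentations(word):
        for pcat in PREFIXES.get(prefix, []):
            for scat in STEMS.get(stem, []):
                if (pcat, scat) not in PREFIX_STEM:
                    continue
                for xcat in SUFFIXES.get(suffix, []):
                    if ((scat, xcat) in STEM_SUFFIX
                            and (pcat, xcat) in PREFIX_SUFFIX):
                        solutions.append((prefix, stem, suffix))
    return solutions
```

With these toy tables, analyze("wktAb") keeps the single split ("w", "ktAb", ""), while analyze("ktAbAt") returns nothing because the toy stem-suffix table does not license the pair (S_noun, X_fpl). A real analyzer would also attach vocalization, POS labels and glosses to each solution.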
The input format, output format, and data layer of SAMA 3.1 were
designed to be backward compatible with BAMA. Incremental changes to the
data layer in SAMA have resulted in:
* increased lexicon coverage in the dictionary files
* important changes and additions to the inventory of POS tags
* more possible solutions generated for numerous word forms
The software implementation has been updated to allow more input/output
options, installation and configuration options, and smoother
incorporation in other Perl tools/services. The structure of the
dictionary and morphotactic tables has remained the same (the tables
provided with SAMA 3.1 differ from the BAMA 2.0 tables only in size and
content, not in format). Logical separation between the software layer
and data layer allows the new software tools to be used with previous
versions of the tables (instructions are provided with software
documentation). The basic logic that implements the segmentation and
analysis look-up for Arabic words is essentially unchanged since BAMA 2.0.
The data layer is now accessed through Berkeley DB, with result-caching
enabled by default, leading to improved performance. Various utility
scripts have also been added to the software package to facilitate more
flexible interaction with tools and data.
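
The general idea of a disk-backed lexicon with result caching can be sketched in Python as follows. This is only an analogy: SAMA is a Perl tool that uses Berkeley DB, whereas the sketch uses Python's standard dbm module and functools.lru_cache. The class name, the cache size, and the single toy entry are assumptions made for illustration.

```python
import dbm
import os
import tempfile
from functools import lru_cache


def build_store(path, entries):
    """Write a dict of str -> str into an on-disk key-value store."""
    with dbm.open(path, "c") as db:
        for key, value in entries.items():
            db[key.encode("utf-8")] = value.encode("utf-8")


class CachedLexicon:
    """Read-only lexicon whose lookups are memoized in memory."""

    def __init__(self, path, cache_size=1024):
        self._db = dbm.open(path, "r")
        self.disk_reads = 0  # counts lookups that actually hit the store
        self.lookup = lru_cache(maxsize=cache_size)(self._lookup)

    def _lookup(self, key):
        self.disk_reads += 1
        raw = self._db.get(key.encode("utf-8"))
        return None if raw is None else raw.decode("utf-8")

    def close(self):
        self._db.close()


def demo():
    """Look the same key up twice; only the first lookup reads the store."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stems")
        build_store(path, {"ktAb": "book"})  # toy entry
        lexicon = CachedLexicon(path)
        try:
            first = lexicon.lookup("ktAb")
            second = lexicon.lookup("ktAb")
            return first, second, lexicon.disk_reads
        finally:
            lexicon.close()
```

Repeated word forms are common in running text, so a cache of this kind avoids most repeated reads of the store.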
As a Members-Only release, LDC Standard Arabic Morphological Analyzer
(SAMA) Version 3.1 is not available for non-member licensing.
(2) NIST 2004 Open Machine Translation (OpenMT) Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T12>
is a package containing source data, reference translations, and scoring
software used in the NIST 2004 OpenMT evaluation. It is designed to help
evaluate the effectiveness of machine translation systems. The package
was compiled and scoring software was developed by researchers at NIST,
making use of newswire source data and reference translations collected
and developed by LDC.
The objective of the NIST OpenMT evaluation series is to support
research in, and help advance the state of the art of, machine
translation (MT) technologies -- technologies that translate text
between human languages. Input may include all forms of text. The goal
is for the output to be an adequate and fluent translation of the
original. The 2004 task was to evaluate translation from Chinese to
English and from Arabic to English. Additional information about these
evaluations may be found at the NIST Open Machine Translation (OpenMT)
Evaluation web site <http://www.itl.nist.gov/iad/mig/tests/mt/>.
This evaluation kit includes a single Perl script (mteval-v11a.pl) that
may be used to produce a translation quality score for one (or more) MT
systems. The script works by comparing the system output translation
with a set of (expert) reference translations of the same source text.
Comparison is based on finding sequences of words in the reference
translations that match word sequences in the system output translation.
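
The kind of word-sequence matching described above can be illustrated with a short Python sketch of clipped n-gram precision against several references. It is not a port of mteval-v11a.pl and does not reproduce its scores. Whitespace tokenization, the function names, and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in a reference.

    Each n-gram is credited at most as many times as it appears in
    the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())
```

For the system output "the cat sat on the mat" and the two references "the cat is on the mat" and "there is a cat on the mat", 5 of the 6 unigrams and 3 of the 5 bigrams are matched.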
This corpus consists of 150 Arabic newswire documents, 150 Chinese
newswire documents, and 29 Chinese "prepared speech" documents. For each
language, the test set consists of two files: a source and a reference
file. Each reference file contains four independent translations of the
data set. The evaluation year, source language, test set, version of the
data, and source vs. reference file are reflected in the file name.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu