[Corpora-List] New from LDC

Wed Mar 25 21:19:26 UTC 2009

LDC2009T05
*-  2008 NIST Metrics for Machine Translation (MetricsMATR08) 
Development Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05>  -*

LDC2009T06
*-  GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06>  -*

LDC2009T07
*-  Unified Linguistic Annotation Text Collection 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07>  -*

-  Additional Free LDC Resources  -

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications and highlight free LDC resources.

------------------------------------------------------------------------
*New Publications*

(1) 2008 NIST Metrics for Machine Translation (MetricsMATR08) 
Development Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T05> 
contains data, reference translations, and software used for NIST 
MetricsMATR <http://www.nist.gov/speech/tests/metricsmatr/>.  NIST 
MetricsMATR is a series of research challenge events for machine 
translation (MT) metrology, promoting the development of innovative, 
even revolutionary, MT metrics that correlate highly with human 
assessments of MT quality. In this program, participants submit their 
metrics to the National Institute of Standards and Technology (NIST) 
<http://www.nist.gov>. NIST runs those metrics on certain held-back test 
data for which it has human assessments measuring quality and then 
calculates correlations between the automatic metric scores and the 
human assessments.
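
The correlation step described above can be illustrated with a minimal 
sketch. The announcement does not say which correlation statistic NIST 
uses, so the choice of Pearson and Spearman here is an assumption, and 
the scores are made-up illustrative numbers rather than MetricsMATR data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-segment scores: one automatic metric, one human judgment.
metric_scores = [0.31, 0.42, 0.18, 0.55, 0.47]
human_scores = [3, 5, 2, 6, 4]

print(f"Pearson:  {pearson(metric_scores, human_scores):.3f}")
print(f"Spearman: {spearman(metric_scores, human_scores):.3f}")
```

A metric whose scores rise and fall with the human judgments yields a 
coefficient near 1; the rank-based variant ignores the scale of the 
scores and only compares their ordering.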

In the NIST Metrics for Machine Translation 2008 Evaluation 
(MetricsMATR08) <http://www.nist.gov/speech/tests/metricsmatr/2008/>, 
participants received as development data a subset of the materials used 
in the NIST Open MT06 evaluation 
<http://nist.gov/speech/tests/mt/2006/>, specifically, human reference 
translations, system translations, and human assessments of adequacy and 
preference. The source data comprised twenty-five Arabic-language 
newswire documents with a total of 249 segments. Each segment is 
accompanied by four human reference translations in English and by 
system translations from eight different MT06 machine translation 
systems. In addition to the data and reference translations, this 
release includes software tools for evaluation and reporting and 
documentation describing how the human assessments were obtained and how 
they are represented in the data. The evaluation plan 
<http://www.nist.gov/speech/tests/metricsmatr/2008/doc/mm08_evalplan_v1.1.pdf> 
contains further information and rules on the use of this data.
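
As a rough illustration of the shape of the development data, the 
sketch below models one segment and derives the totals implied by the 
figures above. The class and field names are hypothetical and do not 
reflect the file format of the release; the totals assume every segment 
has all four references and all eight system outputs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Figures quoted in the announcement.
NUM_DOCUMENTS = 25
NUM_SEGMENTS = 249
REFS_PER_SEGMENT = 4
SYSTEMS_PER_SEGMENT = 8


@dataclass
class Segment:
    """One source segment with its translations (hypothetical layout)."""
    doc_id: str
    seg_id: int
    references: List[str] = field(default_factory=list)           # human references
    system_outputs: Dict[str, str] = field(default_factory=dict)  # system name -> output

    def is_complete(self) -> bool:
        """True when the segment has every expected translation."""
        return (len(self.references) == REFS_PER_SEGMENT
                and len(self.system_outputs) == SYSTEMS_PER_SEGMENT)


# Totals implied by the quoted figures.
total_references = NUM_SEGMENTS * REFS_PER_SEGMENT         # 996
total_system_outputs = NUM_SEGMENTS * SYSTEMS_PER_SEGMENT  # 1992
mean_segments_per_doc = NUM_SEGMENTS / NUM_DOCUMENTS       # 9.96

print(total_references, total_system_outputs, mean_segments_per_doc)
```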

The MetricsMATR program seeks to overcome several drawbacks of the 
methods currently employed to evaluate MT technology. Automatic 
metrics have not yet proved able to predict the usefulness and 
reliability of MT technologies with confidence. Nor have automatic 
metrics demonstrated that they are meaningful in target languages other 
than English. Human assessments, however, are expensive, slow, 
subjective and difficult to standardize. These problems, and the need to 
overcome them through the development of improved automatic (or even 
semi-automatic) metrics, have been a constant point of discussion at 
past NIST MT evaluation events. MetricsMATR aims to provide a platform 
to address these shortcomings.

***

(2) GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T06> 
contains transcripts and English translations of 24 hours of Chinese 
broadcast conversation programming from China Central TV (CCTV), Phoenix 
TV and Voice of America (VOA). It does not contain the audio files from 
which the transcripts and translations were generated. This release, 
along with other corpora, was used as training data in Phase 1 (year 1) 
of the DARPA-funded GALE program. GALE Phase 1 Chinese Broadcast 
Conversation Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T02> 
was released in January 2009.

A manual selection procedure was used to choose data appropriate for the 
GALE program, namely, conversation (talk) programs focusing on current 
events. Stories on topics such as sports, entertainment and business 
were excluded from the data set. 

The selected audio snippets were carefully transcribed by LDC annotators 
and professional transcription agencies following LDC's Quick Rich 
Transcription specification. Manual sentence unit/segment (SU) 
annotation was also performed as part of the transcription task. Three 
types of end-of-sentence SU were identified: statement SU, question SU, 
and incomplete SU.

After transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features (such as names and 
speech disfluencies) and quality control procedures applied to completed 
translations.

***

(3) The Unified Linguistic Annotation Text Collection 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T07> 
consists of two datasets:  the Language Understanding Annotation Corpus 
(LDC2009T10) and REFLEX Entity Translation Training/DevTest 
(LDC2009T11).  Most recent annotation efforts for language have focused 
on small pieces of the larger problem of semantic annotation rather than 
producing a single unified representation. The Unified Linguistic 
Annotation (ULA) project, sponsored by the National Science Foundation 
<http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0551615>, seeks 
to integrate into one framework different layers of annotation (e.g., 
semantics, discourse, temporal, opinions) using various existing 
resources, including PropBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>, 
NomBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T23>, 
TimeBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>, 
Penn Discourse Treebank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05> 
and coreference and opinion annotations. The project represents a 
concerted effort of researchers from several institutions to develop a 
large word corpus with balanced and annotated data. The Unified 
Linguistic Annotation Text Collection is provided as a resource for the 
ULA effort. It consists of two datasets:

    * The Language Understanding Annotation Corpus
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10>
      (LDC2009T10). The Language Understanding Annotation Corpus was
      developed at the Johns Hopkins Center of Excellence in Human
      Language Technology <http://web.jhu.edu/hltcoe>.  It consists of
      over 9000 words of English text (6949 words) and Arabic text (2183
      words) annotated for committed belief, event and entity,
      coreference, dialog acts and temporal relations. The materials
      were chosen from various sources to represent "informal input,"
      that is, text that contains colloquial forms. The documents in the
      corpus include excerpts from newswire stories, telephone
      conversation transcripts, emails, contracts and written instructions.

    * REFLEX Entity Translation Training/DevTest
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T11>
      (LDC2009T11). REFLEX Entity Translation Training/DevTest is the
      complete set of training data and development test data for the
      2007 REFLEX Entity Translation evaluation
      <http://www.nist.gov/speech/tests/ace/2007/> sponsored by the
      National Institute of Standards and Technology (NIST). It contains
      approximately 67.5K words of newswire and weblog text across
      English, Chinese and Arabic (approximately 22.5K words in each
      language), translated into each of the other two languages. The
      data is annotated for entities and TIMEX2 extents and normalization.
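
The word counts quoted for the two datasets can be cross-checked with a 
few lines of arithmetic. This sketch assumes that the 67.5K figure is 
the total across the three REFLEX languages, as the 22.5K per-language 
figure suggests; the variable names are illustrative only.

```python
from itertools import permutations

# Language Understanding Annotation Corpus: per-language counts as quoted.
luac_words = {"English": 6949, "Arabic": 2183}
luac_total = sum(luac_words.values())  # 9132, i.e. "over 9000 words"

# REFLEX Entity Translation: assumed total across the three languages.
reflex_total_words = 67_500
reflex_languages = ("English", "Chinese", "Arabic")
reflex_per_language = reflex_total_words / len(reflex_languages)  # 22500.0

# Each language is translated into the other two: six directed pairs.
translation_directions = list(permutations(reflex_languages, 2))

print(luac_total, reflex_per_language, len(translation_directions))
```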

Researchers may license this data by completing the LDC User Agreement 
for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  Please indicate on the license whether you are requesting 
the entire collection (LDC2009T07) or just one dataset (LDC2009T10 or 
LDC2009T11).  The collection is being made available at no charge.

*Additional Free LDC Resources*

LDC is pleased to distribute the Unified Linguistic Annotation Text 
Collection (LDC2009T07) corpora at no cost to support the work of the 
ULA project. As mentioned above, to license a copy of this data, 
non-members should complete the LDC User Agreement for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf> 
and fax to +1 215 573 2175 or scan and email to this address.  On the 
heels of the release of the ULA corpora, LDC would like to highlight 
other resources which are available at no cost.  Free grant-covered 
copies of the following TalkBank <http://www.talkbank.org/> databases 
can be licensed from LDC:

    * LDC2003V01  FORM2 Kinematic Gesture
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01>

    * LDC2003L01  Grassfields Bantu Fieldwork: Dschang Lexicon
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01>

    * LDC2003S02  Grassfields Bantu Fieldwork: Dschang Tone Paradigms
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02>

    * LDC2001S16  Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16>

    * LDC2004L01  Klex: Finite-State Lexical Transducer for Korean
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01>

    * LDC2004T03  Morphologically Annotated Korean Text
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03>

    * LDC2003S06  Santa Barbara Corpus of Spoken American English Part
      II
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06>

    * LDC2004S10  Santa Barbara Corpus of Spoken American English Part
      III
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10>

    * LDC2005S25  Santa Barbara Corpus of Spoken American English Part
      IV
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>

    * LDC2003T15  SLX Corpus of Classic Sociolinguistic Interviews
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15>

    * LDC2004S12  TalkBank Ethology Data: Field Recordings of Vervet
      Monkey Calls
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12>

A US$30 shipping and handling fee applies for data on disc.  Further 
information, including additional free datasets such as TimeBank 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>, 
and useful tools such as LDC's parallel text sentence aligner, 
Champollion <http://sourceforge.net/projects/champollion/>, can be found 
in our What's New! What's Free! Archive 
<http://www.ldc.upenn.edu/About/whatsnew.shtml>.

------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
