[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Oct 22 16:21:53 UTC 2010


/In this newsletter:/

- Fall 2010 LDC Data Scholarship Winners!

- Position Openings at LDC

/New Publications:/

LDC2010T18
- ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0

LDC2010T19
- Korean Newswire Second Edition

LDC2010T17
- NIST 2006 Open Machine Translation (OpenMT) Evaluation

------------------------------------------------------------------------

*Fall 2010 LDC Data Scholarship Winners!*

LDC is pleased to announce the winners in our first-ever LDC Data 
Scholarship program!  The LDC Data Scholarship program provides 
university students with access to LDC data at no cost.  Data 
scholarships are offered twice a year to correspond to the Fall and 
Spring semesters.  Students are asked to complete an application that 
consists of a data use proposal and a letter of support from their 
academic adviser.

LDC received many strong applications from both undergraduate and 
graduate students attending universities across the globe.  The decision 
process was difficult, and after much deliberation, we have selected 
eight winners!   These students will receive no-cost copies of LDC data 
valued at over US$10,000:

    Aby Abraham - Ohio University (USA), graduate student, Electrical
    Engineering.  Aby has been awarded a copy of /2003 NIST Speaker
    Recognition Evaluation (LDC2010S03)/ for his work in using long-term
    memory cells for continuous speech recognition.

    Ripandy Adha - Bandung Institute of Technology (Indonesia),
    undergraduate student, Computer Science.  Ripandy has been awarded a
    copy of /American English Spoken Lexicon (LDC99L23)/ to assist in
    the development of a voice-command Internet browser.

    Basawaraj - Ohio University (USA), PhD candidate, Electrical
    Engineering and Computer Science.  Basawaraj has been awarded a copy
    of /NIST 2002 Open Machine Translation (OpenMT) Evaluation
    (LDC2010T10)/ to assist in fine-tuning his machine translation
    system and to provide a benchmark dataset.

    Zachary Brooks - University of Arizona (USA), PhD Candidate, Second
    Language Acquisition and Teaching.  Zachary and his research group
    have been awarded a copy of /ECI Multilingual Text (LDC94T5)/ for
    research in eye movement tracking by native and non-native readers.

    Marco Carmosino - Hampshire College (USA), undergraduate student,
    Computer Science.  Marco has been awarded a copy of /English
    Gigaword Fourth Edition (LDC2009T13)/ for his work in narrative
    chain extraction.

    Xiaohui Huang - Harbin Institute of Technology (China), Shenzhen
    Graduate School.  Xiaohui has been awarded a copy of /TDT5 Topics
    and Annotations (LDC2006T19)/ for his work in topic detection and
    tracking for large-scale web data.

    Yuhuan Zhou - PLA University of Science and Technology (China),
    postgraduate student, Institute of Communications Engineering. 
    Yuhuan has been awarded a copy of /2002 NIST Speaker Recognition
    Evaluation (LDC2004S04)/ to assist in the development of a speaker
    recognition system which fuses support vector data description
    (SVDD) and Gaussian mixture model (GMM).

    Speaker Recognition Group (GEDA) with members Matias Fineschi,
    Gonzalo Lavigna, Jorge Prendes, and Pablo Vacatello -  Buenos Aires
    Institute of Technology (Argentina), Department of Electrical
    Engineering.  GEDA has been awarded a copy of /2004 NIST Speaker
    Recognition Evaluation (LDC2006S44)/ to assist in the development of
    a flexible speaker verification platform capable of implementing
    different feature extraction methods, normalizations, stochastic
    models and outputs.

Please join us in congratulating our student winners!   The next LDC 
Data Scholarship program is scheduled for the Spring 2011 semester. Stay 
tuned for further announcements.


 

*Position Openings at LDC*

The Linguistic Data Consortium at the University of Pennsylvania has a 
number of immediate openings for full-time positions to support our 
corpus development projects:

        * PROGRAMMER ANALYST (#100528459 and #100929195)

    Support linguistic data collection and annotation projects by
    providing software development, system integration, technical and
    research support, annotation tool development and/or data collection
    system management.

        * SENIOR PROJECT MANAGER (#100728923 and #100728924)

    Provide complete oversight for multiple, concurrent corpus creation
    projects, including collection, annotation and distribution of
    speech, text and/or video data in a variety of languages. Create
    project roadmaps and direct teams of programmers, linguists and
    managers to execute deliverables; represent corpus creation efforts
    to external researchers and sponsors.

        * LEAD ANNOTATOR (#100728920)

    Perform linguistic annotation on English text, speech and video
    data; recruit, train and supervise teams of annotators for multiple
    tasks and languages; define, test and document procedural approaches
    to linguistic annotation; perform quality control on annotated data.

For further information on the duties and qualifications for these 
positions, or to apply online, please visit https://jobs.hr.upenn.edu/ 
and search postings for the reference numbers indicated above.

Penn offers an excellent benefits package including medical/dental, 
retirement plans, tuition assistance and a minimum of three weeks paid 
vacation per year. The University of Pennsylvania is an affirmative 
action/equal opportunity employer.  All positions are contingent upon 
grant funding.

For more information about LDC and the programs we support, visit 
http://www.ldc.upenn.edu/.


*New Publications*

(1) ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T18> 
was developed by researchers at The MITRE Corporation 
<http://www.mitre.org/>. It contains the English evaluation data 
prepared for the 2004 Time Expression Recognition and Normalization 
(TERN) Evaluation <http://fofoca.mitre.org/tern.html>, sponsored by the 
Automatic Content Extraction (ACE) 
<http://www.itl.nist.gov/iad/mig/tests/ace/> program. The data consists 
of English broadcast news and newswire material collected by LDC. The 
training data for this evaluation can be found in ACE Time Normalization 
(TERN) 2004 English Training Data V1.0 (LDC2005T07) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T07>.

The purpose of the TERN evaluation is to advance the state of the art in 
the automatic recognition and normalization of natural language temporal 
expressions. In most language contexts such expressions are indexical. 
For example, with "Monday," "last week," or "three months starting 
October 1," one must know the narrative reference time in order to 
pinpoint the time interval being conveyed by the expression. In 
addition, for data exchange purposes, it is essential that the 
identified interval be rendered according to an established standard, 
i.e., normalized. Accurate identification and normalization of temporal 
expressions are in turn essential for the temporal reasoning being 
demanded by advanced NLP applications such as question answering, 
information extraction and summarization.
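
As a rough illustration of why the reference time matters, the sketch 
below resolves a few indexical expressions against a given date and 
renders them as ISO 8601 strings. It is not the TERN or TIDES annotation 
procedure; the function name normalize, the small set of supported 
expressions, and the rule that a bare weekday means the most recent such 
day are all assumptions made for the example.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression, reference):
    """Resolve a few indexical expressions against a reference date.

    Returns an ISO 8601 string, or None if the expression is not covered
    by this toy rule set.
    """
    text = expression.strip().lower()
    if text == "today":
        return reference.isoformat()
    if text == "yesterday":
        return (reference - timedelta(days=1)).isoformat()
    if text == "last week":
        # ISO week that contains the date seven days before the reference.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    if text in WEEKDAYS:
        # Assumption: a bare weekday name means the most recent such day
        # on or before the reference date; real text is often ambiguous.
        delta = (reference.weekday() - WEEKDAYS.index(text)) % 7
        return (reference - timedelta(days=delta)).isoformat()
    return None
```

With a reference date of Friday, October 22, 2010, normalize("Monday", 
date(2010, 10, 22)) gives "2010-10-18" and "last week" gives "2010-W41"; 
the same expressions yield different values under a different reference 
date, which is the point of the normalization task.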

The data in this release consists of English broadcast transcripts and 
newswire material from TDT4 Multilingual Text and Annotations 
(LDC2005T16) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T16>. 
The annotation specifications for this corpus were developed under 
DARPA's Translingual Information Detection Extraction and Summarization 
(TIDES) <http://projects.ldc.upenn.edu/TIDES/> program, with support 
from ACE. All files have been annotated independently by two annotators 
and then reconciled, using the TIDES 2003 Standard for the Annotation of 
Temporal Expressions.  The data directory contains the corpus, which 
consists of 192 files (54K words).



(2) Korean Newswire Second Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T19> 
is an archive of Korean newswire text that has been acquired over 
several years (1994-2009) at LDC from the Korean Press Agency. This 
release includes all of the content of Korean Newswire (LDC2000T45) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T45> 
(June 1994-March 2000) as well as newly collected data: all data 
collected by LDC from April 2000 through December 2009.

All material, including that from the first release, has been converted 
to UTF-8 (except for more recent data already in UTF-8 format) and 
processed in LDC's gigaword format. The gigaword format classifies 
newswire content into three types: story, multi and other. Here "story" 
refers to an article containing information pertaining to a particular 
event on a day; "multi" refers to an article that contains more than one 
story relating to different topics; and "other" refers to articles 
containing lists, tables or numerical data, such as sports scores.
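
For readers who want to see how the three types break down in a file, a 
minimal sketch follows. It assumes that each document opens with a tag of 
the form <DOC id="..." type="...">, as in other gigaword-format releases; 
the function name count_doc_types and the pattern itself are 
illustrative, so check the corpus documentation for the actual markup 
before relying on it.

```python
import re
from collections import Counter

# Assumed opening tag layout: <DOC id="..." type="story|multi|other">
DOC_OPEN = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def count_doc_types(text):
    """Count documents per type attribute in gigaword-style markup."""
    return Counter(doc_type for _doc_id, doc_type in DOC_OPEN.findall(text))
```

Calling count_doc_types on the contents of one data file returns a 
Counter keyed by type, which can be summed across files to profile the 
whole release.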

A word break error in the original release and in data collected from 
January 2002 through February 2005 has been corrected in the second 
edition with the result that all Korean text should display correctly. 
The error involved a line break in the middle of a word with the result 
that an affected word appeared in segments in two lines. This problem 
was resolved using word histograms and a few


(3) NIST 2006 Open Machine Translation (OpenMT) Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T17> 
is a package containing source data, reference translations and scoring 
software used in the NIST 2006 OpenMT evaluation. It is designed to help 
evaluate the effectiveness of machine translation systems. Researchers 
at NIST compiled the package and developed the scoring software, making 
use of broadcast, newswire and web newsgroup source data and reference 
translations collected and developed by LDC.

The objective of the NIST Open Machine Translation (OpenMT) evaluation 
series is to support research in, and help advance the state of the art 
of, machine translation (MT) technologies -- technologies that translate 
text between human languages. Input may include all forms of text. The 
goal is for the output to be an adequate and fluent translation of the 
original.  The OpenMT evaluations are intended to be of interest to all 
researchers working on the general problem of automatic translation 
between human languages. To this end, they are designed to be simple, to 
focus on core technology issues and to be fully supported. The 2006 task 
was to evaluate translation from Arabic to English and from Chinese to 
English.  Additional information about these evaluations may be found at 
the NIST Open Machine Translation (OpenMT) Evaluation web site 
<http://www.itl.nist.gov/iad/mig/tests/mt/>.

This evaluation kit includes a single Perl script (mteval-v11b.pl) that 
may be used to produce a translation quality score for one (or more) MT 
systems. The script works by comparing the system output translation 
with a set of (expert) reference translations of the same source text. 
Comparison is based on finding sequences of words in the reference 
translations that match word sequences in the system output translation.
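
The word-sequence matching described above can be sketched roughly as 
follows. This is not the code in mteval-v11b.pl; it is a simplified 
illustration of clipped n-gram matching against multiple references, and 
the function names ngrams and clipped_matches, as well as the whitespace 
tokenization, are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total
```

For the output "the the cat" scored against the single reference "the 
cat", clipped_matches returns (2, 3) for n=1: the second "the" earns no 
credit because the reference contains the word only once. A real scorer 
combines such counts over several n-gram lengths and applies further 
weighting and length adjustments.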

The included scoring script was released with the original evaluation, 
intended for use with SGML-formatted data files, and is provided to 
ensure compatibility of user scoring results with results from the 
original evaluation. An updated scoring software package 
(mteval-v13a-20091001.tar.gz), with XML support, additional options and 
bug fixes, documentation, and example translations, may be downloaded 
from the NIST Multimodal Information Group Tools 
<http://www.itl.nist.gov/iad/mig/tools/> website.

This release consists of 357 documents with corresponding sets of four 
separate human expert reference translations. The source data comprises 
Arabic and Chinese newswire documents, human transcriptions of broadcast 
news and broadcast conversation programs and web newsgroup documents 
collected by LDC in 2006. The newswire and broadcast material is from 
Agence France-Presse (Arabic, Chinese), Xinhua News Agency (Arabic, 
Chinese), Lebanese Broadcasting Corp. (Arabic), Dubai TV (Arabic), China 
Central TV (Chinese) and New Tang Dynasty Television (Chinese). The web 
text was collected from Google and Yahoo newsgroups.

For each language, the test set consists of two files: a source and a 
reference file. Each reference file contains four independent 
translations of the data set. The evaluation year, source language, test 
set, version of the data, and source vs. reference file are reflected in 
the file name.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
