[Corpora-List] New from LDC

Wed Oct 28 14:16:04 UTC 2009

- *LDC Data Sheets Now Available Online <#interspeech>* -

- *2007 NIST Language Recognition Evaluation Test Set <#lre>* -

- *OntoNotes 3.0 <#onto>* -

- *Web 1T 5-gram, 10 European Languages Version 1 <#euro>* -

------------------------------------------------------------------------

*LDC Data Sheets Now Available Online*


In early 2009, LDC crafted data sheets 
<http://www.ldc.upenn.edu/DataSheets/> to describe in concise form 
current and past projects, daily operations and our technical 
capabilities. Print versions of these documents debuted at Interspeech 
2009 <http://www.interspeech2009.org/> and have received positive 
feedback for both their content and design.

The data sheets were distributed on FSC-certified 30% recycled paper and 
were printed using environmentally friendly toner.  FSC certification 
means that the process that produced the paper, from seed to final 
sheet, complies with international laws and treaties, employs fair 
labor standards, and respects and conserves environmental resources.

LDC intends to expand the breadth of data sheet categories and the depth 
of information provided within each category. This will help to 
accurately represent our organization and highlight our staff's research 
and development efforts.


*New Publications*

*(1)*  2007 NIST Language Recognition Evaluation Test Set 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S04> 
consists of 66 hours of conversational telephone speech segments in the 
following languages and dialects: Arabic, Bengali, Chinese (Cantonese), 
Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American, 
Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian, 
Spanish (Caribbean, non-Caribbean), Tamil, Thai and Vietnamese.

The goal of the NIST (National Institute of Standards and Technology) 
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE) 
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the 
baseline of current performance capability for language recognition of 
conversational telephone speech and to lay the groundwork for further 
research efforts in the field. NIST conducted three previous language 
recognition evaluations, in 1996 
<http://www.itl.nist.gov/iad/mig/tests/lre/1996/>, 2003 
<http://www.itl.nist.gov/iad/mig/tests/lre/2003/> and 2005 
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/>. The most significant 
differences between those evaluations and the 2007 task were the 
increased number of languages and dialects, the greater emphasis on a 
basic detection task for evaluation and the variety of evaluation 
conditions. Thus, in 2007, given a segment of speech and a language of 
interest to be detected (i.e., a target language), the task was to 
decide whether that target language was in fact spoken in the given 
telephone speech segment (yes or no), based on an automated analysis of 
the data contained in the segment.
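
As a rough illustration of the yes/no detection task described above, the 
sketch below thresholds a per-language score for each trial. The names 
(Trial, detect), the score table and the threshold value are all 
hypothetical; the announcement does not specify how systems produced or 
thresholded their scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One detection trial: a speech segment paired with a target language."""
    segment_id: str
    target_language: str


def detect(trial, scores, threshold=0.0):
    """Return True ("yes") if the target language is judged to be spoken.

    `scores` maps (segment_id, language) to a score from some automatic
    analysis of the segment; higher means more likely. Both the scores and
    the default threshold are assumptions made for this sketch.
    """
    return scores[(trial.segment_id, trial.target_language)] >= threshold


# Hypothetical scores for a single segment.
scores = {("seg001", "Tamil"): 1.7, ("seg001", "Thai"): -2.3}
decisions = {
    lang: detect(Trial("seg001", lang), scores) for lang in ("Tamil", "Thai")
}
```

Each trial is decided independently, so one segment can be tested against 
any number of target languages.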

Each speech file in the test data is one side of a "4-wire" telephone 
conversation, stored in 8-bit, 8-kHz mu-law format. There are 7530 
speech files in SPHERE (.sph) format for a total of 66 hours of speech. 
The speech data was compiled from LDC's CALLFRIEND, Fisher Spanish and 
Mixer 3 corpora and from data collected by Oregon Health and Science 
University, Beaverton, Oregon.  The test segments contain three nominal 
durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech 
durations vary, but were constrained to be within the ranges of 2-4 
seconds, 7-13 seconds and 23-35 seconds, respectively.
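
The format and duration figures above imply some simple arithmetic, 
sketched below: 8-bit samples at 8 kHz mean 8000 bytes of audio per 
second, and a measured speech duration can be mapped to its nominal 
class. The function names are hypothetical, the byte count is assumed to 
exclude the SPHERE header, and the speech duration is assumed to come 
from a separate speech-activity measurement rather than from file size.

```python
SAMPLE_RATE_HZ = 8000   # 8-kHz sampling, as stated above
BYTES_PER_SAMPLE = 1    # 8-bit mu-law: one byte per sample

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# taken from the ranges quoted above.
DURATION_RANGES = {3: (2, 4), 10: (7, 13), 30: (23, 35)}


def seconds_from_bytes(num_audio_bytes):
    """Length of a single-channel audio stream, given its size in bytes.

    Assumes the count covers audio samples only (header already removed).
    """
    return num_audio_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def nominal_duration(speech_seconds):
    """Return the nominal class (3, 10 or 30) or None if out of range."""
    for nominal, (low, high) in DURATION_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

For example, 24000 bytes of audio correspond to 3 seconds, which falls in 
the 3-second nominal class.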



*(2)* OntoNotes 3.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T24>.  
The OntoNotes project is a collaborative effort between BBN 
Technologies, the University of Colorado, the University of 
Pennsylvania, and the University of Southern California's Information 
Sciences Institute. The goal of the project is to annotate a large 
corpus comprising various genres of text (news, conversational telephone 
speech, weblogs, usenet, broadcast, talk shows) in three languages 
(English, Chinese, and Arabic) with structural information (syntax and 
predicate argument structure) and shallow semantics (word sense linked 
to an ontology and coreference).

OntoNotes Release 1.0 (LDC2007T21) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21> 
contains 400k words of Chinese newswire data and 300k words of English 
newswire data. OntoNotes Release 2.0 (LDC2008T04) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04> 
added the following to the corpus: 274k words of Chinese broadcast news 
data and 200k words of English broadcast news data. OntoNotes Release 
3.0 incorporates the following new material: 250k words of English 
newswire data; 200k words of English broadcast news data; 200k words of 
English broadcast conversation material; 250k words of Chinese newswire 
data; 250k words of Chinese broadcast news material; 150k words of 
Chinese broadcast conversation data; and 200k words of Arabic newswire 
material.

Natural language applications like machine translation, question 
answering, and summarization currently are forced to depend on 
impoverished text models like bags of words or n-grams, while the 
decisions that they are making ought to be based on the meanings of 
those words in context. That lack of semantics causes problems 
throughout the applications. Misinterpreting the meaning of an ambiguous 
word results in failing to extract data, incorrect alignments for 
translation, and ambiguous language models. Incorrect coreference 
resolution results in missed information (because a connection is not 
made) or incorrectly conflated information (due to false connections). 
OntoNotes builds on two time-tested resources, following the Penn 
Treebank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42> 
for syntax and the Penn PropBank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14> 
for predicate-argument structure. Its semantic representation will 
include word sense disambiguation for nouns and verbs, with each word 
sense connected to an ontology, and coreference. The current goals call 
for annotation of over a million words each of English and Chinese, and 
half a million words of Arabic over five years.



*(3)* Web 1T 5-gram, 10 European Languages Version 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T25> 
was created by Google, Inc.  It consists of word n-grams and their 
observed frequency counts for ten European languages: Czech, Dutch, 
French, German, Italian, Polish, Portuguese, Romanian, Spanish and 
Swedish. The length of the n-grams ranges from unigrams (single words) 
to five-grams. The n-gram counts were generated from approximately one 
hundred billion word tokens of text for each language, or approximately 
one trillion total tokens.

The n-grams were extracted from publicly accessible web pages from 
October 2008 to December 2008. This data set contains only n-grams that 
appeared at least 40 times in the processed sentences. Less frequent 
n-grams were discarded. While the aim was to identify and collect pages 
from the specific target languages only, the final data likely contains 
some text from other languages. This dataset will be useful for 
statistical language modeling in applications such as machine 
translation and speech recognition.  The input encoding of documents was 
automatically detected, and all text was converted to UTF-8.
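
To make the counting and cutoff concrete, here is a minimal sketch of 
collecting n-grams of length one to five and discarding those seen fewer 
than 40 times. It is an illustration only, not Google's pipeline: the 
whitespace tokenization and the function names (count_ngrams, prune) are 
assumptions, and the corpus's actual tokenization rules are not 
described here.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5):
    """Count all n-grams of length 1..max_n within each sentence.

    Tokenization by whitespace is an assumption made for this sketch.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count=40):
    """Keep only n-grams observed at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Toy input: one sentence seen 40 times, another seen 39 times.
sentences = ["la casa"] * 40 + ["la mesa"] * 39
kept = prune(count_ngrams(sentences))
```

In this toy input, "la casa" survives the cutoff while "la mesa" and 
"mesa" fall one occurrence short and are dropped.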

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
