[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Oct 28 14:16:04 UTC 2009
- *LDC Data Sheets Now Available Online*
- *2007 NIST Language Recognition Evaluation Test Set*
- *OntoNotes 3.0*
- *Web 1T 5-gram, 10 European Languages Version 1*
------------------------------------------------------------------------
*LDC Data Sheets Now Available Online*
In early 2009, LDC created data sheets
<http://www.ldc.upenn.edu/DataSheets/> that concisely describe our
current and past projects, daily operations and technical
capabilities. Print versions of these documents debuted at Interspeech
2009 <http://www.interspeech2009.org/> and have received positive
feedback for both their content and design.
The data sheets were distributed on FSC-certified 30% recycled paper and
were printed using environmentally-friendly toner. FSC certification
means that the process that produced the paper, from seed to final
sheet, complies with international laws and treaties, employs fair
labor standards, and respects and conserves environmental resources.
LDC intends to expand the breadth of data sheet categories and the depth
of information provided within each category. This will help to
accurately represent our organization and highlight our staff's research
and development efforts.
*New Publications*
*(1)* 2007 NIST Language Recognition Evaluation Test Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S04>
consists of 66 hours of conversational telephone speech segments in the
following languages and dialects: Arabic, Bengali, Chinese (Cantonese),
Mandarin Chinese (Mainland, Taiwan), Chinese (Min), English (American,
Indian), Farsi, German, Hindustani (Hindi, Urdu), Korean, Russian,
Spanish (Caribbean, non-Caribbean), Tamil, Thai and Vietnamese.
The goal of the NIST (National Institute of Standards and Technology)
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE)
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the
baseline of current performance capability for language recognition of
conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted three previous language
recognition evaluations, in 1996
<http://www.itl.nist.gov/iad/mig/tests/lre/1996/>, 2003
<http://www.itl.nist.gov/iad/mig/tests/lre/2003/> and 2005
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/>. The most significant
differences between those evaluations and the 2007 task were the
increased number of languages and dialects, the greater emphasis on a
basic detection task for evaluation and the variety of evaluation
conditions. Thus, in 2007, given a segment of speech and a language of
interest to be detected (i.e., a target language), the task was to
decide whether that target language was in fact spoken in the given
telephone speech segment (yes or no), based on an automated analysis of
the data contained in the segment.
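The detection task described above reduces to a score-and-threshold decision: score the segment against a model of the target language, compare against competing languages, and answer yes or no. The sketch below illustrates that shape using character-n-gram profiles as a stand-in; actual LRE systems model acoustics, and every function name here is an illustrative assumption, not NIST or LDC code.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-grams of a padded, lowercased string."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profile(samples, n=3):
    """Relative-frequency profile of character n-grams for one language."""
    counts = Counter()
    for s in samples:
        counts.update(char_ngrams(s, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def log_likelihood(segment, profile, floor=1e-6):
    """Average per-gram log probability of the segment under a profile."""
    grams = char_ngrams(segment)
    return sum(math.log(profile.get(g, floor)) for g in grams) / max(len(grams), 1)

def detect(segment, target_profile, background_profiles, threshold=0.0):
    """Yes/no decision: is the target language spoken in this segment?"""
    target = log_likelihood(segment, target_profile)
    background = max(log_likelihood(segment, p) for p in background_profiles)
    return (target - background) > threshold
```

The threshold controls the trade-off between misses and false alarms, which is exactly the axis along which LRE systems are evaluated.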
Each speech file in the test data is one side of a "4-wire" telephone
conversation stored in 8-bit, 8-kHz mu-law format. There are 7530
speech files in SPHERE (.sph) format for a total of 66 hours of speech.
The speech data was compiled from LDC's CALLFRIEND, Fisher Spanish and
Mixer 3 corpora and from data collected by Oregon Health and Science
University, Beaverton, Oregon. The test segments contain three nominal
durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech
durations vary, but were constrained to be within the ranges of 2-4
seconds, 7-13 seconds and 23-35 seconds, respectively.
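SPHERE (.sph) files begin with a plain-text header, typically 1024 bytes, made of `name type value` lines terminated by `end_head`; the samples follow the header. A minimal parsing sketch is below, assuming an uncompressed file and conventional field names such as `sample_rate`; real corpus files may additionally use shorten-compressed sample data, which this does not handle.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the plain-text NIST SPHERE header at the start of a .sph file."""
    lines = data.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(lines[1].strip())  # total header bytes, typically 1024
    fields = {"header_size": header_size}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # field name, type flag (-i/-r/-sN), value
        if len(parts) == 3:
            name, type_flag, value = parts
            if type_flag == "-i":
                fields[name] = int(value)
            elif type_flag == "-r":
                fields[name] = float(value)
            else:
                fields[name] = value  # -sN string fields, e.g. sample_coding
    return fields
```

For the files described here, one would expect fields along the lines of `sample_rate 8000` and a mu-law sample coding.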
*(2)* OntoNotes 3.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T24>.
The OntoNotes project is a collaborative effort between BBN
Technologies, the University of Colorado, the University of
Pennsylvania, and the University of Southern California's Information
Sciences Institute. The goal of the project is to annotate a large
corpus comprising various genres of text (news, conversational telephone
speech, weblogs, Usenet, broadcast, talk shows) in three languages
(English, Chinese, and Arabic) with structural information (syntax and
predicate argument structure) and shallow semantics (word sense linked
to an ontology and coreference).
OntoNotes Release 1.0 (LDC2007T21)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>
contains 400k words of Chinese newswire data and 300k words of English
newswire data. OntoNotes Release 2.0 (LDC2008T04)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>
added the following to the corpus: 274k words of Chinese broadcast news
data; and 200k words of English broadcast news data. OntoNotes Release
3.0 incorporates the following new material: 250k words of English
newswire data; 200k words of English broadcast news data; 200k words of
English broadcast conversation material; 250k words of Chinese newswire
data; 250k words of Chinese broadcast news material; 150k words of
Chinese broadcast conversation data; and 200k words of Arabic newswire
material.
Natural language applications like machine translation, question
answering, and summarization are currently forced to depend on
impoverished text models like bags of words or n-grams, while the
decisions that they are making ought to be based on the meanings of
those words in context. That lack of semantics causes problems
throughout the applications. Misinterpreting the meaning of an ambiguous
word results in failing to extract data, incorrect alignments for
translation, and ambiguous language models. Incorrect coreference
resolution results in missed information (because a connection is not
made) or incorrectly conflated information (due to false connections).
OntoNotes builds on two time-tested resources, following the Penn
Treebank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>
for syntax and the Penn PropBank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14>
for predicate-argument structure. Its semantic representation will
include word sense disambiguation for nouns and verbs, with each word
sense connected to an ontology, and coreference. The current goals call
for annotation of over a million words each of English and Chinese, and
half a million words of Arabic over five years.
*(3)* Web 1T 5-gram, 10 European Languages Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T25>
was created by Google, Inc. It consists of word n-grams and their
observed frequency counts for ten European languages: Czech, Dutch,
French, German, Italian, Polish, Portuguese, Romanian, Spanish and
Swedish. The length of the n-grams ranges from unigrams (single words)
to five-grams. The n-gram counts were generated from approximately one
billion word tokens of text for each language, or approximately one
trillion total tokens.
The n-grams were extracted from publicly-accessible web pages from
October 2008 to December 2008. This data set contains only n-grams that
appeared at least 40 times in the processed sentences. Less frequent
n-grams were discarded. While the aim was to identify and collect pages
in the target languages only, some text from other languages may appear
in the final data. This dataset will be useful for statistical language
modeling, machine translation, speech recognition and other
applications. The input encoding of documents was automatically
detected, and all text was converted to UTF-8.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora