[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Nov 28 15:26:24 UTC 2007
- Free Google Data (Web 1T 5-gram) Available
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>
- LDC2007T40 - Arabic Gigaword Third Edition
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40>
- LDC2007S18 - CSLU: Kids' Speech Version 1.1
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18>
- LDC2007T20 - GALE Phase 1 Distillation Training
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of free Web 1T 5-gram data as well as the release of three
new publications.
------------------------------------------------------------------------
Free Google Data (Web 1T 5-gram) Available

We are pleased to announce that Google Inc. is once again providing
financial support for the distribution of its Web 1T 5-gram (LDC2006T13)
corpus to universities. As a result, LDC will make the corpus available
at no charge to 100 non-member universities requesting a copy. Shipping
and handling fees are also being covered by Google. We appreciate
Google's continued generosity and its interest in supporting language
research.
To obtain a free copy, universities will need to sign and submit a copy
of the User License Agreement for Web 1T 5-gram Version 1
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Web_1T_5gram_V1_User_Agreement.html>.
This can be faxed to +1 215 573 2175 or scanned and emailed to
ldc at ldc.upenn.edu. Complete contact details, including shipping
address, phone number, and email are also required.
New Publications

(1) Arabic Gigaword Third Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40>
is a comprehensive archive of newswire text data acquired from Arabic
news sources by the LDC at the University of Pennsylvania. Arabic
Gigaword Third Edition includes all of the content of Arabic Gigaword
Second Edition (LDC2006T02)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02>
as well as new data collected after the publication of that edition.
Also, an archive from a new newswire source -- Assabah -- has been
included in the third edition.
The six distinct sources of Arabic newswire represented in the third
edition are:
* Agence France Presse (afp_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)
The seven-character codes in the parentheses above consist of the
three-character source name IDs and the three-character language code
("arb") separated by an underscore ("_") character.
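
The naming convention just described can be illustrated with a short sketch. This is not part of the release; the function name split_source_code and the lowercase-letters-only pattern are assumptions made for illustration, based solely on the six codes listed above.

```python
import re

# Assumed pattern, inferred from the six codes listed above:
# three lowercase letters (source ID), an underscore, three lowercase
# letters (language code).
CODE_PATTERN = re.compile(r"([a-z]{3})_([a-z]{3})")


def split_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language_code)."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return match.group(1), match.group(2)


# The six codes given in the announcement.
codes = ["afp_arb", "asb_arb", "hyt_arb", "nhr_arb", "umh_arb", "xin_arb"]
for code in codes:
    source_id, language = split_source_code(code)
    print(source_id, language)
```

Each code splits into a source ID and the shared language code "arb"; anything that is not exactly three letters, an underscore, and three letters is rejected.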
The epochs and document counts for the data in the third edition are set
forth below:
Newly Added Data

Source                  Date Span           Document Count
Agence France Presse    2005.01 - 2006.12   137815
Assabah News Agency     2004.09 - 2006.12   15410    (new source)
Al Hayat News Agency    2005.01 - 2006.1    8799     (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12   104950   (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12   135472

This release contains 547 files, totaling approximately 1.8GB in
compressed form (6,673 MB uncompressed) and 1,994,735 K-words.
***
(2) CSLU: Kids' Speech Version 1.1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18>
is a collection of spontaneous and prompted speech from 1100 children
between Kindergarten and Grade 10 in the Forest Grove School District in
Oregon. All children -- approximately 100 children at each grade level
-- read approximately 60 items from a total list of 319
phonetically-balanced but simple words, sentences or digit strings. Each
utterance of spontaneous speech begins with a recitation of the alphabet
and contains a monologue of about one minute in duration. This release
consists of 1017 files containing approximately 8-10 minutes of speech
per speaker. Corresponding word-level transcriptions are also included.
This corpus was developed to facilitate research about the
characteristics of children's speech at different ages and to train and
evaluate recognizers for use in language training and other interactive
tasks involving children, including recognizers used in language
development with deaf children. Information about each subject's age,
gender, languages spoken and physical conditions affecting speech was
also collected.
***
(3) GALE Phase 1 Distillation Training
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20>
constitutes the final release of training data created by LDC for the
DARPA GALE Program Phase 1 Distillation technology evaluation.
Distillation is one of three primary technology components for the DARPA
GALE Program, along with Transcription and Translation. Distillation
engines respond to queries from English-speaking users, delivering
pertinent, consolidated information in easy-to-understand forms. The
distillation engine processes English and foreign language material,
both speech and text, from multiple sources and documents, removing
redundancy and presenting an integrated response to the user.
This release consists of 248 English, Chinese and/or Arabic queries and
their responses created by LDC annotators. Queries conform to one of ten
template types. Query responses may include document and snippet
relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries
have been annotated for all features, while the remainder are labeled
for only some features.
The annotation task involves responding to a series of user queries. For
each query, annotators first find relevant documents and identify
snippets (strings of contiguous text that answer the query) in the
Arabic, Chinese or English source document. Annotators then create a
nugget for each fact expressed in the snippet. Semantically equivalent
nuggets are grouped into cross-language, cross-document "supernugs".
Queries in this release have been annotated for the following tasks:
* searching for relevant documents and providing yes/no judgments
* extracting snippets
* resolving pronouns and certain types of temporal and locative
expressions contained in the snippets
* creating nuggets, i.e. atomic pieces of information that an
annotator considers a valid answer to the query
* building nugs, i.e. clusters of semantically-equivalent nuggets
for each language
* building supernugs, i.e. clusters of semantically-equivalent nugs
across languages
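
The nugget, nug and supernug hierarchy in the list above can be pictured with a minimal sketch. It does not reflect the file format of the release, which is not described here. The class Nugget, the functions build_nugs and build_supernugs, the sample records, and in particular the meaning_key field are hypothetical: in the corpus, semantic equivalence is judged by annotators, and the key merely stands in for that judgment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    text: str         # atomic piece of information answering the query
    language: str     # language of the source document
    doc_id: str       # hypothetical document identifier
    meaning_key: str  # hypothetical stand-in for an annotator's equivalence judgment


def build_nugs(nuggets):
    """Cluster semantically equivalent nuggets within each language."""
    nugs = defaultdict(list)
    for nugget in nuggets:
        nugs[(nugget.language, nugget.meaning_key)].append(nugget)
    return dict(nugs)


def build_supernugs(nugs):
    """Cluster semantically equivalent nugs across languages."""
    supernugs = defaultdict(dict)
    for (language, key), members in nugs.items():
        supernugs[key][language] = members
    return dict(supernugs)


# Invented placeholder records, for illustration only.
nuggets = [
    Nugget("fact A, first English wording", "English", "doc1", "A"),
    Nugget("fact A, second English wording", "English", "doc2", "A"),
    Nugget("fact A, Arabic wording", "Arabic", "doc3", "A"),
    Nugget("fact B, Chinese wording", "Chinese", "doc4", "B"),
]
nugs = build_nugs(nuggets)
supernugs = build_supernugs(nugs)
for key, by_language in supernugs.items():
    print(key, sorted(by_language))
```

Here each nug is keyed by language and meaning, so a supernug is simply the set of nugs that share a meaning across languages and documents.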
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu