[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Nov 28 15:26:24 UTC 2007


-  Free Google Data (Web 1T 5-gram) Available
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>

-  LDC2007T40: Arabic Gigaword Third Edition
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40>

-  LDC2007S18: CSLU Kids' Speech Version 1.1
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18>

-  LDC2007T20: GALE Phase 1 Distillation Training
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of free Web 1T 5-gram data as well as the release of three
new publications.

------------------------------------------------------------------------


Free Google Data (Web 1T 5-gram) Available

We are pleased to announce that Google Inc. is once again providing 
financial support for the distribution of its Web 1T 5-gram (LDC2006T13) 
corpus to universities. As a result, LDC will make the corpus available 
at no charge to 100 non-member universities requesting a copy.  Shipping 
and handling fees are also being covered by Google.  We appreciate 
Google's continued generosity and its interest in supporting language 
research. 

To obtain a free copy, universities will need to sign and submit a copy
of the User License Agreement for Web 1T 5-gram Version 1
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/Web_1T_5gram_V1_User_Agreement.html>.
The signed agreement can be faxed to +1 215 573 2175 or scanned and
emailed to ldc at ldc.upenn.edu.  Complete contact details, including
shipping address, phone number, and email address, are also required.

 

New Publications

(1) Arabic Gigaword Third Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40> 
is a comprehensive archive of newswire text data acquired from Arabic 
news sources by the LDC at the University of Pennsylvania. Arabic 
Gigaword Third Edition includes all of the content of Arabic Gigaword 
Second Edition (LDC2006T02) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02> 
as well as new data collected after the publication of that edition. 
The third edition also includes an archive from a new newswire source,
Assabah.

The six distinct sources of Arabic newswire represented in the third 
edition are:

    * Agence France Presse (afp_arb)
    * Assabah (asb_arb)
    * Al Hayat (hyt_arb)
    * An Nahar (nhr_arb)
    * Ummah Press (umh_arb)
    * Xinhua News Agency (xin_arb)

The seven-character codes in the parentheses above consist of a
three-character source name ID and the three-character language code
("arb"), separated by an underscore ("_") character.

The epochs and document counts for the data in the third edition are set 
forth below:

Newly Added Data

Source                   Date Span           Document Count   Notes
Agence France Presse     2005.01 - 2006.12   137815
Assabah News Agency      2004.09 - 2006.12   15410            (new source)
Al Hayat News Agency     2005.01 - 2006.1    8799             (no data for 2004)
An Nahar News Agency     2005.01 - 2006.12   104950           (no data for 2004)
Xinhua News Agency       2005.01 - 2006.12   135472

This release contains 547 files, totaling approximately 1.8GB in 
compressed form (6,673 MB uncompressed) and 1,994,735 K-words. 

***

(2) CSLU: Kids' Speech Version 1.1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18> 
is a collection of spontaneous and prompted speech from 1100 children 
between Kindergarten and Grade 10 in the Forest Grove School District in 
Oregon. All children -- approximately 100 children at each grade level 
-- read approximately 60 items from a total list of 319 
phonetically balanced but simple words, sentences, or digit strings. Each
spontaneous speech recording begins with a recitation of the alphabet and
continues with a monologue of about one minute in duration. This release
consists of 1017 files containing approximately 8-10 minutes of speech 
per speaker. Corresponding word-level transcriptions are also included.

This corpus was developed to facilitate research on the characteristics
of children's speech at different ages and to train and evaluate
recognizers for use in language training and other interactive tasks
involving children, including recognizers used in language development
with deaf children.  Information about each subject's age, gender,
languages spoken, and physical conditions affecting speech was also
collected.

***

(3) GALE Phase 1 Distillation Training 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20> 
constitutes the final release of training data created by LDC for the 
DARPA GALE Program Phase 1 Distillation technology evaluation. 
Distillation is one of three primary technology components for the DARPA 
GALE Program, along with Transcription and Translation. Distillation 
engines respond to queries from English-speaking users, delivering 
pertinent, consolidated information in easy-to-understand forms. The 
distillation engine processes English and foreign language material, 
both speech and text, from multiple sources and documents, removing 
redundancy and presenting an integrated response to the user.

This release consists of 248 English, Chinese and/or Arabic queries and 
their responses created by LDC annotators. Queries conform to one of ten 
template types. Query responses may include document and snippet 
relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries 
have been annotated for all features, while the remainder are labeled 
for only some features.

The annotation task involves responding to a series of user queries. For 
each query, annotators first find relevant documents and identify 
snippets (strings of contiguous text that answer the query) in the 
Arabic, Chinese or English source document. Annotators then create a 
nugget for each fact expressed in the snippet. Semantically equivalent 
nuggets are grouped into cross-language, cross-document "supernugs".

Queries in this release have been annotated for the following tasks (an
illustrative sketch of the resulting hierarchy follows the list):

    * searching for relevant documents and providing yes/no judgments
    * extracting snippets
    * resolving pronouns and certain types of temporal and locative
      expressions contained in the snippets
    * creating nuggets, i.e. atomic pieces of information that an
      annotator considers a valid answer to the query
    * building nugs, i.e. clusters of semantically-equivalent nuggets
      for each language
    * building supernugs, i.e. clusters of semantically-equivalent nugs
      across languages
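
To make the layering of snippets, nuggets, nugs, and supernugs concrete,
here is a minimal sketch of one way the hierarchy could be modeled in
Python.  The class and field names are assumptions made for this example
only; they do not reflect the actual data format or schema of the LDC
release.

    # Illustrative sketch only -- not the schema of the LDC release.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Snippet:
        doc_id: str     # source document judged relevant to the query
        text: str       # contiguous text that answers the query

    @dataclass
    class Nugget:
        snippet: Snippet
        fact: str       # one atomic fact an annotator considers a valid answer

    @dataclass
    class Nug:
        language: str   # e.g. "Arabic", "Chinese", or "English"
        nuggets: List[Nugget] = field(default_factory=list)  # equivalent nuggets, one language

    @dataclass
    class SuperNug:
        nugs: List[Nug] = field(default_factory=list)        # equivalent nugs across languages

    @dataclass
    class Query:
        template_type: str   # one of the ten template types
        supernugs: List[SuperNug] = field(default_factory=list)

In this view, a distillation response to a query is essentially a set of
supernugs, each bundling per-language nug clusters built up from
document-level snippets.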

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
