[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Jan 6 22:54:40 UTC 2009


LDC2008T25 - AQUAINT-2 Information-Retrieval Text Research Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T25>

LDC2008L03 - Global Yoruba Lexical Database v. 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L03>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of two new publications.
------------------------------------------------------------------------

*New Publications*

(1) AQUAINT-2 Information-Retrieval Text Research Collection 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T25> was 
developed by LDC for NIST's (National Institute of Standards and 
Technology) AQUAINT 2007 Question-Answer (QA) track 
<http://www-nlpir.nist.gov/projects/aquaint/>. It consists of 
approximately 2.5 GB of English news text from six distinct sources 
collected by LDC (Agence France Presse, Associated Press, Central News 
Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and 
Xinhua News Agency) covering the period from October 2004 through March 
2006. The AQUAINT-2 collection is the second part of a series intended 
to provide data useful for developing, evaluating and testing 
information extraction and retrieval systems. It follows the publication 
of The AQUAINT Corpus of English News Text (LDC2002T31) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T31>.

The AQUAINT (Advanced Question-Answering for Intelligence)  program 
addresses interactivity with scenarios or tasks. The scenario provides a 
context in which questions will be asked and answered, and the task 
reflects the overall assignment. The program is committed to solving a 
single problem: how to find topically relevant, semantically related, 
timely information in massive amounts of data in diverse languages, 
formats, and genres.

For each source, all of the usable data collected by LDC was processed 
into a consistent XML format in which the stories for a given month are 
concatenated in chronological order into a single "DOCSTREAM" element; 
each story is a single "DOC" element within that stream and has a 
globally unique "id" attribute.
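
As an illustration of the layout described above, the following is a minimal 
Python sketch that reads one monthly stream and indexes its stories by id. 
Only the "DOCSTREAM" and "DOC" element names and the "id" attribute come 
from this announcement; the sample id values, the story text, and the 
function name index_stories are hypothetical, and the actual files may 
contain additional attributes or child elements not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only the DOCSTREAM/DOC element names and the
# "id" attribute are taken from the announcement; the ids and text are invented.
SAMPLE = """<DOCSTREAM>
<DOC id="EXAMPLE_ENG_20041001.0001">First story text.</DOC>
<DOC id="EXAMPLE_ENG_20041001.0002">Second story text.</DOC>
</DOCSTREAM>"""


def index_stories(xml_text):
    """Map each DOC id to the text contained in that DOC element."""
    root = ET.fromstring(xml_text)
    if root.tag != "DOCSTREAM":
        raise ValueError("expected a DOCSTREAM root element")

    stories = {}
    for doc in root.iter("DOC"):
        doc_id = doc.get("id")
        if doc_id is None:
            raise ValueError("DOC element without an id attribute")
        if doc_id in stories:
            # The ids are described as globally unique, so a repeat is an error.
            raise ValueError("duplicate DOC id: " + doc_id)
        # Join all text nodes so nested markup, if any, does not hide text.
        stories[doc_id] = "".join(doc.itertext()).strip()
    return stories


stories = index_stories(SAMPLE)
for doc_id, text in stories.items():
    print(doc_id, len(text))
```

For full monthly files, which may be large, an incremental parser such as 
xml.etree.ElementTree.iterparse would avoid loading the whole stream into 
memory at once.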

(2) The Global Yoruba Lexical Database v. 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L03> 
is a set of related dictionaries providing definitions and translations 
for over 450,000 words from the Yoruba language and its variants: 
Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí 
(over 8,000 words) and Trinidadian (over 1,000 words).

Yoruba is a Niger-Congo language (subclassification: Kwa > Yoruboid) 
spoken natively by nearly 20 million people, the vast majority of them 
in southwestern Nigeria. The Yoruba language diaspora is wide, 
stretching from southwestern Nigeria and Benin westward to the Caribbean 
and islands along the southeastern United States coast.  Throughout the 
region, Yoruba dialects blended with each other and with languages like 
Spanish and French to form a variety of creoles such as Gullah in the 
United States and Nagô in Brazil.  The ultimate goal of this dictionary 
is to provide coverage for all Yoruba dialects across the globe. For 
that reason, it will continue to be a work in progress.

The Yoruba dialect continuum consists of over fifteen varieties, with 
considerable phonological and lexical differences among them and some 
grammatical ones as well. Peripheral areas of dialectal regions often 
have some similarities to adjoining dialects. /Standard Yoruba/ is a 
koine used for education, writing, broadcasting, and contact between 
speakers of different dialects.

The dictionaries in this publication are presented in two formats, 
Toolbox databases and XML. Short for The Field Linguist's Toolbox, 
<http://www.sil.org/computing/catalog/show_software.asp?id=79> Toolbox 
is a lexicographical database system published by SIL 
<http://www.sil.org/>. SIL makes Toolbox freely available for download 
<http://www.sil.org/computing/toolbox/downloads.htm>. In order to use 
the Global Yoruba Lexical Database v. 1.0, Toolbox must first be 
installed on the user's local computer.
------------------------------------------------------------------------



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

