[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Jan 6 22:54:40 UTC 2009
LDC2008T25 - AQUAINT-2 Information-Retrieval Text Research Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T25>
LDC2008L03 - Global Yoruba Lexical Database v. 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L03>
The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.
------------------------------------------------------------------------
*New Publications*
(1) AQUAINT-2 Information-Retrieval Text Research Collection
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T25> was
developed by LDC for the National Institute of Standards and Technology
(NIST) AQUAINT 2007 Question-Answer (QA) track
<http://www-nlpir.nist.gov/projects/aquaint/>. It consists of
approximately 2.5 GB of English news text from six distinct sources
collected by LDC (Agence France Presse, Associated Press, Central News
Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and
Xinhua News Agency) covering the period from October 2004 through March
2006. The AQUAINT-2 collection is the second part of a series intended
to provide data useful for developing, evaluating and testing
information extraction and retrieval systems. It follows the publication
of The AQUAINT Corpus of English News Text (LDC2002T31)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T31>.
The AQUAINT (Advanced Question-Answering for Intelligence) program
addresses interactivity with scenarios or tasks. The scenario provides a
context in which questions will be asked and answered, and the task
reflects the overall assignment. The program is committed to solving a
single problem: how to find topically relevant, semantically related,
timely information in massive amounts of data in diverse languages,
formats, and genres.
For each source, all of the usable data collected by LDC was processed
into a consistent XML format in which the stories for a given month are
concatenated in chronological order into a single "DOCSTREAM" element;
each story is a single "DOC" element within that stream and has a
globally unique "id" attribute.
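
As a rough illustration of the layout described above, the following Python
sketch streams through one DOCSTREAM file and collects the "id" attribute of
each DOC element. Only the element names "DOCSTREAM" and "DOC" and the "id"
attribute come from the description; the sample ids, the TEXT child element
and the helper name iter_doc_ids are hypothetical, and the actual files may
need extra handling (for example decompression) before they can be parsed.

```python
import io
import xml.etree.ElementTree as ET


def iter_doc_ids(source):
    """Yield the "id" attribute of every DOC element found in `source`.

    `source` is a filename or a binary file object holding one DOCSTREAM.
    """
    # iterparse reads incrementally, so a month of stories need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "DOC":
            yield elem.get("id")
            elem.clear()  # drop the story's content once it has been handled


# Hypothetical two-story stream; ids and the TEXT element are made up.
SAMPLE = b"""<DOCSTREAM>
<DOC id="SRC_ENG_20041001.0001"><TEXT>First story.</TEXT></DOC>
<DOC id="SRC_ENG_20041001.0002"><TEXT>Second story.</TEXT></DOC>
</DOCSTREAM>"""

ids = list(iter_doc_ids(io.BytesIO(SAMPLE)))
print(ids)

# The ids are described as globally unique, so duplicates would signal a problem.
assert len(ids) == len(set(ids))
```

Streaming with iterparse rather than loading the whole tree is a design
choice for large monthly files; for small samples ET.parse would work as well.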
(2) The Global Yoruba Lexical Database v. 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008L03>
is a set of related dictionaries providing definitions and translations
for over 450,000 words from the Yoruba language and its variants:
Standard Yoruba (over 368,000 words), Gullah (over 3,600 words), Lucumí
(over 8,000 words) and Trinidadian (over 1,000 words).
Yoruba is a Niger-Congo language (subclassification: Kwa > Yoruboid)
spoken natively by nearly 20 million people, the vast majority of them
in southwestern Nigeria. The Yoruba language diaspora is wide,
stretching from southwestern Nigeria and Benin westward to the Caribbean
and islands along the southeastern United States coast. Throughout the
region, Yoruba dialects blended with each other and with languages like
Spanish and French to form a variety of creoles such as Gullah in the
United States and Nagô in Brazil. The ultimate goal of this dictionary
is to provide coverage for all Yoruba dialects across the globe. For
that reason, it will continue to be a work in progress.
The Yoruba dialect continuum consists of over fifteen varieties, with
considerable phonological and lexical differences among them and some
grammatical ones as well. Peripheral areas of dialectal regions often
have some similarities to adjoining dialects. Standard Yoruba is a
koine used for education, writing, broadcasting, and contact between
speakers of different dialects.
The dictionaries in this publication are presented in two formats,
Toolbox databases and XML. Short for The Field Linguist's Toolbox,
<http://www.sil.org/computing/catalog/show_software.asp?id=79> Toolbox
is a lexicographical database system published by SIL
<http://www.sil.org/>. SIL makes Toolbox freely available for download
<http://www.sil.org/computing/toolbox/downloads.htm>. In order to use
the Global Yoruba Lexical Database v. 1.0, Toolbox must first be
installed on the user's local computer.
------------------------------------------------------------------------
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu