[Corpora-List] New LDC Corpora

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Jan 5 21:07:23 UTC 2006


LDC2005T35
*ANC Second Release 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35>*

LDC2005T28
*HARD 2004 Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28>*

LDC2005T29
*HARD 2004 Topics and Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29>*

The Linguistic Data Consortium (LDC) is pleased to announce the 
availability of three new publications.

------------------------------------------------------------------------

*New LDC Publications*

(1) The American National Corpus (ANC) project fosters the development 
of a corpus comparable to the British National Corpus (BNC), covering 
American English. Corpus-analytic work has demonstrated that the BNC is 
inappropriate for the study of American English, due to the numerous 
differences in use of the language.

The availability of a corpus of American English will significantly 
contribute to language and linguistic research, the development of 
language understanding computer applications (e.g., language translation 
and search and retrieval software), and the compilation of reference 
works such as dictionaries and thesauri. It will also provide a rich 
national resource for use in education at all levels.

ANC Second Release 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35> 
contains over 20 million words: 10+ million words added in the Second 
Release, and a new corrected and validated version of the 11 million 
word ANC First Release. The Second Release also contains software for 
searching and retrieving multiple stand-off annotations.

ANC Second Release contains texts from the following sources (* denotes 
new source in the Second Release):

Transcribed telephone speech (LDC and Project MORE)
New York Times
Berlitz Travel Guides (Langenscheidt Publishers)
Slate Magazine (Microsoft)
ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural 
Communication)*
The Michigan Corpus of Academic Spoken English (MICASE) (University of 
Michigan, English Language Institute)*
Various non-fiction
Various fiction (Orin Hargraves, Ferd Eggan)*
Various medical research articles (BioMed Central, Public Library of 
Science)*
Anonymized Posts to the Phoenix Board/Buffistas.org*

*NOTE:*  The cost of the first 50 copies of this publication (not 
counting the copies distributed to LDC members) is covered by NSF Grant 
Number BCS-998009. These copies are therefore free of charge to 
qualified researchers, although a $30 shipping and handling fee applies. 
After these first 50 copies are distributed, additional copies will be 
available for the nonmember fee of US$75.


(2)  The HARD 2004 Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28> 
corpus contains source data for the 2004 TREC HARD (High Accuracy 
Retrieval from Documents) Evaluation. HARD 2004 was a track within the 
NIST Text REtrieval Conference (TREC), with the objective of achieving 
high accuracy retrieval from documents by leveraging additional 
information about the searcher and/or the search context, through 
techniques like passage retrieval and the use of targeted interaction 
with the searcher.  The topics and annotations that correspond to this 
release are distributed as LDC2005T29, HARD 2004 Topics and Annotations. 
This corpus was created with support from the DARPA TIDES Program and LDC.

HARD 2004 Text comprises eight English newswire and web text sources 
from January-December 2003. The sources are:

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English


(3)  The HARD 2004 Topics and Annotations 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29> 
corpus contains topics and annotations (clarification forms, responses 
and relevance assessments) for the 2004 TREC HARD (High Accuracy 
Retrieval from Documents) Evaluation. HARD 2004 was a track within the 
NIST Text REtrieval Conference (TREC), with the objective of achieving 
high accuracy retrieval from documents by leveraging additional 
information about the searcher and/or the search context, through 
techniques like passage retrieval and the use of targeted interaction 
with the searcher.  The source data that corresponds to this release is 
distributed as LDC2005T28, HARD 2004 Text. This corpus was created with 
support from the DARPA TIDES Program and LDC.

Three major annotation tasks are represented in this release: Topic 
Creation, Clarification Form Responses, and Relevance Assessment. Topics 
include a short title, query plus context, and a number of limiting 
parameters known as "metadata" which include targeted geographical 
region, target data domain or genre, and level of searcher expertise. 
Clarification Forms are brief HTML questionnaires that system developers 
submitted to LDC searchers to glean further detail about information 
needs directly from the topic creators. Relevance assessment consisted 
of adjudicating pooled system responses, and included document-level 
judgments for all topics and passage-level relevance judgments for a 
subset of topics.

The release is divided into training and evaluation resources. The 
training set comprises twenty-one topics and 100 document-level 
relevance judgments per topic. The evaluation set contains fifty topics, 
clarification forms and responses, document-level relevance assessments 
for all topics, and passage-level judgments for half of the topics.

------------------------------------------------------------------------


If you need further information, or would like to inquire about 
membership in the LDC, please email ldc at ldc.upenn.edu or call 
+1 215 573 1275.



--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                       ldc at ldc.upenn.edu
Philadelphia, PA 19104                      http://www.ldc.upenn.edu

