<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<div align="center">LDC2005T35<br>
<b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">ANC
Second Release</a></b><br>
<br>
LDC2005T28<br>
<b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28">HARD
2004 Text</a></b><br>
<br>
LDC2005T29<br>
<b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29">HARD
2004 Topics and Annotations</a></b><br>
<b><br>
</b></div>
<br>
<div align="center">The Linguistic Data Consortium (LDC) is pleased to
announce the
availability of three new publications.</div>
<br>
<hr size="2" width="100%"><br>
<div align="center"><b>New LDC Publications</b><br>
</div>
<br>
(1) The American National Corpus (ANC) project fosters the development
of a corpus of American English comparable to the British National
Corpus (BNC). Corpus-analytic work has demonstrated that the BNC is
inappropriate for the study of American English because of the numerous
differences in usage between the two varieties. <br>
<br>
The availability of a corpus of American English will significantly
contribute to language and linguistic research, the development of
language understanding computer applications (e.g., language
translation and search and retrieval software), and the compilation of
reference works such as dictionaries and thesauri. It will also provide
a rich national resource for use in education at all levels. <br>
<br>
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">ANC
Second Release</a> contains over 20 million words: more than 10 million
words added in the Second Release, plus a newly corrected and validated
version of the 11-million-word ANC First Release. The Second Release also
contains software for searching and retrieving multiple stand-off
annotations.<br>
<br>
ANC Second Release contains texts from the following sources (* denotes
a source that is new in the Second Release):<br>
<br>
Transcribed telephone speech (LDC and Project MORE) <br>
New York Times <br>
Berlitz Travel Guides (Langenscheidt Publishers) <br>
Slate Magazine (Microsoft) <br>
ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural
Communication)* <br>
The Michigan Corpus of Academic Spoken English (MICASE) (University of
Michigan, English Language Institute)* <br>
Various non-fiction <br>
Various fiction (Orin Hargraves, Ferd Eggan)* <br>
Various medical research articles (BioMed Central, Public Library of
Science)* <br>
Anonymized Posts to the Phoenix Board/Buffistas.org* <br>
<br>
<b>NOTE:</b> The cost of the first 50 copies of this publication (not
counting the copies distributed to LDC members) is covered by NSF Grant
Number BCS-998009, so these copies are free of charge to qualified
researchers, apart from a $30 shipping and handling fee. After these
first 50 copies are distributed, additional copies will be available for
the nonmember fee of US$75.<br>
<br>
<br>
(2) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28">HARD
2004 Text</a> corpus contains source data for the 2004 TREC HARD (High
Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track
within the NIST Text REtrieval Conference (TREC), with the objective of
achieving high accuracy retrieval from documents by leveraging
additional information about the searcher and/or the search context,
through techniques like passage retrieval and the use of targeted
interaction with the searcher. The topics and annotations that
correspond to this release are distributed as LDC2005T29, HARD 2004
Topics and Annotations. This corpus was created with support from the
DARPA TIDES Program and LDC. <br>
<br>
HARD 2004 Text comprises eight English newswire and web text sources
from January through December 2003. The sources are:<br>
<br>
AFE: Agence France Presse - English<br>
APE: Associated Press Newswire<br>
CNE: Central News Agency Taiwan - English<br>
LAT: Los Angeles Times/Washington Post<br>
NYT: New York Times<br>
SLN: Salon.com<br>
UME: Ummah Press - English<br>
XIE: Xinhua News Agency - English<br>
<br>
<br>
(3) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29">HARD
2004 Topics and Annotations</a> corpus contains topics and annotations
(clarification forms, responses and relevance assessments) for the 2004
TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD
2004 was a track within the NIST Text REtrieval Conference (TREC), with
the objective of achieving high accuracy retrieval from documents by
leveraging additional information about the searcher and/or the search
context, through techniques like passage retrieval and the use of
targeted interaction with the searcher. The source data that
corresponds to this release is distributed as LDC2005T28, HARD 2004
Text. This corpus was created with support from the DARPA TIDES Program
and LDC. <br>
<br>
Three major annotation tasks are represented in this release: Topic
Creation, Clarification Form Responses, and Relevance Assessment.
Topics include a short title, a query plus context, and a number of
limiting parameters known as "metadata", which include the targeted
geographical region, the target data domain or genre, and the level of
searcher expertise. Clarification Forms are brief HTML questionnaires
that system developers submitted to LDC searchers to glean additional
details about information needs directly from the topic creators. Relevance
assessment consisted of adjudication of pooled system responses, and
included document-level judgments for all topics, and passage-level
relevance judgments for a subset of topics.<br>
<br>
The release is divided into training and evaluation resources. The
training set comprises twenty-one topics and 100 document-level
relevance judgments per topic. The evaluation set contains fifty
topics, clarification forms and responses, document-level relevance
assessments for all topics, and passage-level judgments for half of the
topics. <br>
<br>
<hr size="2" width="100%"><br>
<br>
<div align="center"><font face="Courier New"><small><big><font
face="Times New Roman">If
you need further
information, or would like to inquire about
membership in the LDC, please email <a class="moz-txt-link-abbreviated"
href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215
573 1275.</font></big></small></font><br>
</div>
<font face="Courier New"><small><br>
</small></font><font face="Courier New"><small><br>
<br>
</small></font>
<div align="center">--------------------------------------------------------------------<br>
</div>
<div align="center">
<pre class="moz-signature" cols="72">Linguistic Data Consortium Phone: (215) 573-1275
3600 Market Street Fax: (215) 573-2175
Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
<br>
</body>
</html>