<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<div align="center">LDC2005T35<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">ANC

Second Release</a></b><br>

<br>

LDC2005T28<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28">HARD

2004 Text</a></b><br>

<br>

LDC2005T29<br>

<b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29">HARD

2004 Topics and Annotations</a></b><br>

<br>

</div>

<br>

<div align="center">The Linguistic Data Consortium (LDC) is pleased to

announce the

availability of three new publications.</div>

<br>

<hr size="2" width="100%"><br>

<div align="center"><b>New LDC Publications</b><br>

</div>

<br>

(1) The American National Corpus (ANC) project fosters the development

of a

corpus comparable to the British National Corpus (BNC), covering

American English. Corpus-analytic work has demonstrated that the BNC is

inappropriate for the study of American English because of the numerous

differences in usage between British and American English. <br>

<br>

The availability of a corpus of American English will significantly

contribute to language and linguistic research, the development of

language understanding computer applications (e.g., language

translation and search and retrieval software), and the compilation of

reference works such as dictionaries and thesauri. It will also provide

a rich national resource for use in education at all levels. <br>

<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">ANC

Second Release</a> contains over 20 million words: more than 10 million words

added in the Second Release, plus a newly corrected and validated version

of the 11-million-word ANC First Release. The Second Release also

contains software for searching and retrieving multiple stand-off

annotations.<br>

<br>

ANC Second Release contains texts from the following sources (* denotes

a source that is new in the Second Release):<br>

<br>

Transcribed telephone speech (LDC and Project MORE) <br>

New York Times <br>

Berlitz Travel Guides (Langenscheidt Publishers) <br>

Slate Magazine (Microsoft) <br>

ICIC Corpus of Fundraising Texts (Indiana Center for Intercultural

Communication)* <br>

The Michigan Corpus of Academic Spoken English (MICASE) (University of

Michigan, English Language Institute)* <br>

Various non-fiction <br>

Various fiction (Orin Hargraves, Ferd Eggan)* <br>

Various medical research articles (BioMed Central, Public Library of

Science)* <br>

Anonymized Posts to the Phoenix Board/Buffistas.org* <br>

<br>

<b>NOTE:</b>  The cost of the first 50 copies of this publication (not

counting the copies distributed to LDC members) is covered by NSF Grant

Number BCS-998009, so these copies are free of charge to qualified

researchers; a $30 shipping and handling fee applies. After these first

50 copies are distributed, additional copies will be available for the

nonmember fee of US$75.<br>

<br>

<br>

(2)  The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T28">HARD

2004 Text</a> corpus contains source data for the 2004 TREC HARD (High

Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track

within the NIST Text REtrieval Conference (TREC), with the objective of

achieving high accuracy retrieval from documents by leveraging

additional information about the searcher and/or the search context,

through techniques like passage retrieval and the use of targeted

interaction with the searcher.  The topics and annotations that

correspond to this release are distributed as LDC2005T29, HARD 2004

Topics and Annotations. This corpus was created with support from the

DARPA TIDES Program and LDC. <br>

<br>

HARD 2004 Text comprises eight English newswire and web text sources

from January-December 2003. The sources are:<br>

<br>

AFE: Agence France Presse - English<br>

APE: Associated Press Newswire<br>

CNE: Central News Agency Taiwan - English<br>

LAT: Los Angeles Times/Washington Post<br>

NYT: New York Times<br>

SLN: Salon.com<br>

UME: Ummah Press - English<br>

XIE: Xinhua News Agency - English<br>

<br>

<br>

(3)  The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T29">HARD

2004 Topics and Annotations</a> corpus contains topics and annotations

(clarification forms, responses and relevance assessments) for the 2004

TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD

2004 was a track within the NIST Text REtrieval Conference (TREC), with

the objective of achieving high accuracy retrieval from documents by

leveraging additional information about the searcher and/or the search

context, through techniques like passage retrieval and the use of

targeted interaction with the searcher.  The source data that

corresponds to this release is distributed as LDC2005T28, HARD 2004

Text. This corpus was created with support from the DARPA TIDES Program

and LDC. <br>

<br>

Three major annotation tasks are represented in this release: Topic

Creation, Clarification Form Responses, and Relevance Assessment.

Topics include a short title, query plus context, and a number of

limiting parameters known as "metadata", which include the targeted

geographical region, the target data domain or genre, and the level of searcher

expertise. Clarification Forms are brief HTML questionnaires that system

developers submitted to LDC searchers in order to glean additional information

about information needs directly from the topic creators. Relevance

assessment consisted of adjudication of pooled system responses, and

included document-level judgments for all topics, and passage-level

relevance judgments for a subset of topics.<br>

<br>

The release is divided into training and evaluation resources. The

training set comprises twenty-one topics and 100 document-level

relevance judgments per topic. The evaluation set contains fifty

topics, clarification forms and responses, document-level relevance

assessments for all topics, and passage-level judgments for half of the

topics. <br>

<br>

<hr size="2" width="100%"><br>

<br>

<div align="center"><font face="Courier New"><small><big><font

 face="Times New Roman">If

you need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.</font></big></small></font><br>

</div>

<font face="Courier New"><small><br>

</small></font><font face="Courier New"><small><br>

<br>

</small></font>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

3600 Market Street                             Fax:   (215) 573-2175

Suite 810                                          <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104                      <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

<br>

</body>

</html>