FW: is Google reliable?
lists at SPANISHTRANSLATOR.ORG
Fri Dec 13 17:35:49 UTC 2002
On 12/13/2002 06:15, Frank Abate wrote the following:
>Whatever happened to the effort to build a US corpus a la BNC (but bigger)?
>There was an American National Corpus effort going -- status, anyone?
They're about to put out about 10 million words, it seems. See
"The first release of 10 million words of the ANC corpus is scheduled for
mid-fall. The corpus will be annotated for part of speech and include a
"base set" of tools for search and extraction, a preliminary version of the
access tools that will be provided with the final corpus. A password-protected
web interface for testing the tools on a sub-set of the data is expected to
be accessible to ANC Consortium members at the end of this month.
"Acquisition of texts is proceeding, although more slowly than planned. The
first release of the ANC will contain whatever texts are in hand, and will
therefore not be balanced for genre (as originally stipulated in ANC
documentation). So far, we have acquired about 2 million words of spoken
data (the LDC Switchboard corpus and a portion of the CallHome corpus), 1.5
million words of previously un-released newspaper data from the New York
Times, a few hundred thousand words of "ephemera" (pamphlets, newsletters,
etc.), and several novels published by Oxford University Press USA. We
expect to receive substantially more data from the contributing consortium
members to include in the first release, including not only fiction and
journals but also various Berlitz Travel Guides (Langenscheidt) and
technical manuals from Microsoft and IBM. We are also negotiating to
acquire research papers from the Association for Computational Linguistics
and articles from the IBM Research Journal."
Scott Sadowsky -- Spanish-English / English-Spanish Translator
sadowsky at spanishtranslator.org · sadowsky at bigfoot.com
"The sovereignty of the native speaker has no limits other than those of
his actual mental system, and he need not care in the slightest about the
opinions of the supposed 'authorities on the subject' intent on
kicking at grammaticality with prescriptivisms of their own making."
-- Carlos-Peregrín Otero