[Corpora-List] Newspaper Corpora
Tony Rose
tr at acl.icnet.uk
Mon Apr 14 15:09:40 UTC 2003
You could also try the Reuters Corpus:
http://about.reuters.com/researchandstandards/corpus/
It's an archive of some 800,000 English-language news stories that is freely
available and marked up in XML (NewsML, in fact).
Regards,
Tony
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no]
On Behalf Of Jan Strunk
Sent: 14 April 2003 15:16
To: CORPORA at HIT.UIB.NO
Subject: [Corpora-List] Newspaper Corpora
Hello,
I would like to evaluate a sentence boundary
and abbreviation detection algorithm on as
many different languages as possible.
Therefore, I am searching for newspaper corpora
that are either freely available or not too expensive.
The languages in question should use the period
as an ambiguous token denoting either a sentence
boundary, an abbreviation or both.
I am already using parts of the Wall Street Journal Corpus,
the Neue Zürcher Zeitung and some corpora
included in the Multilingual Corpus I from the European Corpus Initiative.
I also know about TRACTOR.
I would be very thankful for any suggestions.
Best regards,
Jan Strunk
strunk at linguistics.ruhr-uni-bochum.de
Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Germany