[Corpora-List] Newspaper Corpora

Tony Rose tr at acl.icnet.uk
Mon Apr 14 15:09:40 UTC 2003


You could also try the Reuters Corpus:

http://about.reuters.com/researchandstandards/corpus/

It's an archive of some 800,000 English-language news stories. It is freely
available and marked up in XML (NewsML, in fact).

Regards,
Tony
  -----Original Message-----
  From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Jan Strunk
  Sent: 14 April 2003 15:16
  To: CORPORA at HIT.UIB.NO
  Subject: [Corpora-List] Newspaper Corpora


  Hello,

  I would like to evaluate a sentence boundary
  and abbreviation detection algorithm on as
  many different languages as possible.
  Therefore, I am searching for newspaper corpora
  that are either freely available or not too expensive.

  The languages in question should use the period
  as an ambiguous token denoting either a sentence
  boundary, an abbreviation or both.

  I am already using parts of the Wall Street Journal Corpus,
  the Neue Zürcher Zeitung and some corpora
  included in the Multilingual Corpus I from the European Corpus Initiative.
  I also know about TRACTOR.

  I would be very thankful for any suggestions.

  Best regards,

  Jan Strunk
  strunk at linguistics.ruhr-uni-bochum.de
  Sprachwissenschaftliches Institut
  Ruhr-Universität Bochum
  Germany
