<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY bgColor=#ffffff>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff size=2>You
could also try the Reuters Corpus:</FONT></SPAN></DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff size=2><A
href="http://about.reuters.com/researchandstandards/corpus/">http://about.reuters.com/researchandstandards/corpus/</A></FONT></SPAN></DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff size=2>It's
an archive of some 800,000 English-language news stories that is freely available
and marked up in XML (NewsML, in fact).</FONT></SPAN></DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff
size=2>Regards,</FONT></SPAN></DIV>
<DIV><SPAN class=973470615-14042003><FONT face=Arial color=#0000ff
size=2>Tony</FONT></SPAN></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader dir=ltr align=left><FONT face=Tahoma
size=2>-----Original Message-----<BR><B>From:</B> owner-corpora@lists.uib.no
[mailto:owner-corpora@lists.uib.no] <B>On Behalf Of </B>Jan
Strunk<BR><B>Sent:</B> 14 April 2003 15:16<BR><B>To:</B>
CORPORA@HIT.UIB.NO<BR><B>Subject:</B> [Corpora-List] Newspaper
Corpora<BR><BR></FONT></DIV>
<DIV><FONT face=Arial size=2>Hello,</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>I would like to evaluate a sentence
boundary</FONT></DIV>
<DIV><FONT face=Arial size=2>and abbreviation detection algorithm on
as</FONT></DIV>
<DIV><FONT face=Arial size=2>many different languages as
possible.</FONT></DIV>
<DIV><FONT face=Arial size=2>Therefore, I am searching for newspaper
corpora</FONT></DIV>
<DIV><FONT face=Arial size=2>that are either freely available or not too
expensive.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>The languages in question should use the
period</FONT></DIV>
<DIV><FONT face=Arial size=2>as an ambiguous token denoting either a
sentence</FONT></DIV>
<DIV><FONT face=Arial size=2>boundary, an abbreviation or both.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>I am already using parts of the Wall Street
Journal Corpus,</FONT></DIV>
<DIV><FONT face=Arial size=2>the Neue Zürcher Zeitung and some
corpora</FONT></DIV>
<DIV><FONT face=Arial size=2>included in the Multilingual Corpus I from the
European Corpus Initiative.</FONT></DIV>
<DIV><FONT face=Arial size=2>I also know about TRACTOR.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>I would be very thankful for any
suggestions.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Best regards,</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Jan Strunk</FONT></DIV>
<DIV><FONT face=Arial size=2><A
href="mailto:strunk@linguistics.ruhr-uni-bochum.de">strunk@linguistics.ruhr-uni-bochum.de</A></FONT></DIV>
<DIV><FONT face=Arial size=2>Sprachwissenschaftliches Institut</FONT></DIV>
<DIV><FONT face=Arial size=2>Ruhr-Universität Bochum</FONT></DIV>
<DIV><FONT face=Arial size=2>Germany</FONT></DIV>
<DIV> </DIV></BLOCKQUOTE></BODY></HTML>