<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">

<meta name=Generator content="Microsoft Word 12 (filtered medium)">

<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

pre

        {mso-style-priority:99;

        mso-style-link:"HTML Vorformatiert Zchn";

        margin:0cm;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Courier New";}

span.HTMLVorformatiertZchn

        {mso-style-name:"HTML Vorformatiert Zchn";

        mso-style-priority:99;

        mso-style-link:"HTML Vorformatiert";

        font-family:Consolas;}

span.E-MailFormatvorlage19

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:70.85pt 70.85pt 2.0cm 70.85pt;}

div.WordSection1

        {page:WordSection1;}

-->

</style>


</head>


<body lang=DE link=blue vlink=purple>


<div class=WordSection1>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>Hi Irina,<o:p></o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>The Java Wikipedia Library (JWPL) includes a parser for

MediaWiki syntax that lets you, among other things, access the plain text

of a Wikipedia article:<o:p></o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><a href="http://www.ukp.tu-darmstadt.de/software/jwpl/">http://www.ukp.tu-darmstadt.de/software/jwpl/</a><o:p></o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>-Torsten<o:p></o:p></span></p>


<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p> </o:p></span></p>


<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm'>


<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span

style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>

corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>Irina

Temnikova<br>

<b>Sent:</b> Friday, 27 August 2010 19:52<br>

<b>To:</b> corpora@uib.no<br>

<b>Subject:</b> [Corpora-List] Extracting text from Wikipedia articles<o:p></o:p></span></p>


</div>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>Dear CORPORA mailing list members,<br>

<br>

Do any of you know of any tool for extracting text specifically from Wikipedia

articles, besides those for extracting text from HTML pages?<br>

<br>

I only need the title and the text, without any of the formal elements present

in every Wikipedia article (such as "From Wikipedia, the free

encyclopedia", "This article is about ...", [edit], the list of

languages, "Main article:", "Categories:") and without

"Contents", "See also", "References",

"Notes" and "External links".<br>

<br>

Can you give me any suggestions?<br>

<br>

Thank you very much in advance,<br>

<br>

Irina<br>

<br>

<br>

<o:p></o:p></p>


<pre>Irina Temnikova<br>

<br>

PhD Student in Computational Linguistics<br>

Editorial Assistant for the Journal of Natural Language Engineering<br>

Research Group in Computational Linguistics<br>

<br>

<o:p></o:p></pre><pre><o:p> </o:p></pre><pre>Research Institute of Information and Language Processing<br>

University of Wolverhampton, UK<o:p></o:p></pre>


<p class=MsoNormal><br>

-- <br>

If you want to build a ship, don't drum up the men to gather wood, divide the

work and give orders. Instead, teach them to yearn for the vast and endless

sea. (Antoine de Saint-Exupery)<o:p></o:p></p>


</div>


</body>


</html>