<font face="verdana,sans-serif">I believe we should do some further reading, lest we insult the developers replying to the thread.<br></font><br><div class="gmail_quote">On Mon, Aug 2, 2010 at 9:46 AM, Tsvi Sadan <span dir="ltr">&lt;<a href="mailto:tsvi.sadan@gmail.com">tsvi.sadan@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Constantin Orasan:<br>

<div class="im"><br>

> When you deal with newspaper articles, one thing you want to check is<br>

> if there is a print version of the page. Usually the print version<br>

> contains mainly the text of the article without menus and extra<br>

> information.<br>

<br>

</div>After this step, you can save the articles and use the following<br>

regular expression in any text editor that supports regex to remove all the<br>

(X)HTML tags (and extract the actual text); no bloatware is required:<br>

<br>

Find: &lt;[^&gt;]+&gt;<br>

Replace: (leave this line blank)<br>

<font color="#888888"><br>

--<br>

Tsvi Sadan (Tsuguya Sasaki), PhD<br>

Senior Lecturer<br>

Department of Hebrew and Semitic Languages<br>

Bar-Ilan University, Israel<br>

<a href="mailto:tsvi.sadan@gmail.com">tsvi.sadan@gmail.com</a><br>

<a href="http://sites.google.com/site/tsvisadan/" target="_blank">http://sites.google.com/site/tsvisadan/</a><br>

</font><div><div></div><div class="h5"><br>


</div></div></blockquote></div><br>
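For anyone who would rather script the find-and-replace quoted above than run it in a text editor, here is a minimal Python sketch of the same substitution. The function name `strip_tags` and the sample string are illustrative assumptions, not something from the original message. Like the editor version, this only deletes anything that looks like a tag: it does not decode entities such as `&amp;`, it leaves the contents of script and style elements in place, and it can be fooled by a `>` inside an attribute value.

```python
import re

# Same pattern as the "Find" field above:
# "<", then one or more characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Replace every tag-like match with the empty string (the blank "Replace" field)."""
    return TAG_PATTERN.sub("", markup)


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = "<p>Hello, <b>world</b>!</p>"
    print(strip_tags(sample))  # prints: Hello, world!
```

For pages with embedded scripts or irregular markup, a real HTML parser is likely to be more robust than a single regular expression, though for clean print-version pages the one-line pattern may be enough.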