[Corpora-List] How do we extract actual text in html?

Tsvi Sadan tsvi.sadan at gmail.com
Mon Aug 2 16:46:24 UTC 2010


Constantin Orasan:

> When you deal with newspaper articles, one thing you want to check is
> if there is a print version of the page. Usually the print version
> contains mainly the text of the article without menus and extra
> information.

Once you have the print version, you can save the articles and apply the
following regular expression in any text editor that supports regular
expressions to remove all the (X)HTML tags and extract the actual text;
no bloatware is required:

Find: <[^>]+>
Replace: (leave this line blank)
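
For batches of saved files, the same find-and-replace can be scripted. The
following is only a minimal sketch in Python, assuming the standard re
module; the names TAG_PATTERN, strip_tags and sample are made up for
illustration and are not part of any existing tool:

```python
import re

# Same pattern as in the editor recipe above: "<", then one or more
# characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html: str) -> str:
    """Replace every match of the tag pattern with an empty string."""
    return TAG_PATTERN.sub("", html)


# Hypothetical sample input, for illustration only.
sample = "<p>Hello, <b>world</b>!</p>"
print(strip_tags(sample))  # prints: Hello, world!
```

As in the editor, the pattern only deletes the tags themselves: character
entities such as &amp; stay as they are, and any text inside script or
style elements is kept.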

-- 
Tsvi Sadan (Tsuguya Sasaki), PhD
Senior Lecturer
Department of Hebrew and Semitic Languages
Bar-Ilan University, Israel
tsvi.sadan at gmail.com
http://sites.google.com/site/tsvisadan/

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


