[Corpora-List] How do we extract actual text in html?
Tsvi Sadan
tsvi.sadan at gmail.com
Mon Aug 2 16:46:24 UTC 2010
Constantin Orasan:
> When you deal with newspaper articles, one thing you want to check is
> if there is a print version of the page. Usually the print version
> contains mainly the text of the article without menus and extra
> information.
After that, you can save the articles and apply the following regular
expression in any text editor that supports regex search and replace to
remove all the (X)HTML tags (and extract the actual text); no bloatware is
required:
Find: <[^>]+>
Replace: (leave this line blank)
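
For anyone who prefers to script the same find-and-replace over many saved
files, here is a minimal sketch in Python. It uses the same pattern as
above. The function name strip_tags and the sample string are only
illustrative assumptions, not part of the original suggestion.

```python
import re

# Same pattern as the editor recipe: "<", one or more non-">" characters, ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Delete every match of <[^>]+>, i.e. replace each tag with nothing."""
    return TAG_RE.sub("", markup)


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the saved print version instead.
    sample = "<p>Corpus <b>linguistics</b> on the web.</p>"
    print(strip_tags(sample))  # prints: Corpus linguistics on the web.
```

As with the editor version, this only deletes the tags themselves. Text
inside script or style elements and character entities such as &amp; are
left as they are, so the print version of a page is still the best input.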
--
Tsvi Sadan (Tsuguya Sasaki), PhD
Senior Lecturer
Department of Hebrew and Semitic Languages
Bar-Ilan University, Israel
tsvi.sadan at gmail.com
http://sites.google.com/site/tsvisadan/
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora