[Corpora-List] How do we extract actual text in html?

Siddhartha Jonnalagadda sid.kgp at gmail.com
Mon Aug 2 17:04:40 UTC 2010


I believe we should do some further reading, lest we insult the developers
replying to the thread.

On Mon, Aug 2, 2010 at 9:46 AM, Tsvi Sadan <tsvi.sadan at gmail.com> wrote:

> Constantin Orasan:
>
> > When you deal with newspaper articles, one thing you want to check is
> > if there is a print version of the page. Usually the print version
> > contains mainly the text of the article without menus and extra
> > information.
>
> And after this process, you can save the articles and use the following
> regular expression in any text editor that supports regex to remove all
> the (X)HTML tags (and extract the actual text); no bloatware is required:
>
> Find: <[^>]+>
> Replace: (leave this line blank)
>
> --
> Tsvi Sadan (Tsuguya Sasaki), PhD
> Senior Lecturer
> Department of Hebrew and Semitic Languages
> Bar-Ilan University, Israel
> tsvi.sadan at gmail.com
> http://sites.google.com/site/tsvisadan/
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
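
To make the concern concrete, here is a minimal sketch of the quoted Find/Replace recipe next to a parser-based alternative. It assumes Python and its standard library only (the thread names no language), and the names strip_tags, TextExtractor, extract_text and the sample string are hypothetical, chosen just for illustration. It is not meant as a complete extractor.

```python
import re
from html.parser import HTMLParser

# The pattern from the quoted message: any "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html_text):
    """Apply the quoted Find/Replace recipe: delete every <...> span."""
    return TAG_RE.sub("", html_text)


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html_text):
    """Parse the markup and return only the visible text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample, not taken from any real page.
sample = "<p>Hello, <b>world</b> &amp; all</p><script>var x = 1;</script>"

print(strip_tags(sample))    # the regex leaves the script body and the raw entity
print(extract_text(sample))  # the parser drops the script and decodes the entity
```

On this sample the regex keeps the JavaScript and the undecoded `&amp;`, while the parser-based version returns only the visible text. Neither handles everything (comments containing ">", malformed markup, layout whitespace), which is why the dedicated tools mentioned in the thread deserve a look.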