[Corpora-List] tool for extracting text from web forum and websites

Stefan Th. Gries stgries at gmail.com
Fri Oct 16 15:13:51 UTC 2009


2009/10/16 Bjørn Arild Mæland <bjorn.maeland at gmail.com>:
> This regexp is a good start, but it's important to note that it isn't enough for cleaning documents that use inline JavaScript and/or CSS.
Of course not, but if I remember correctly, the query said something
like 'cleaning is not an issue right now' so I focused on the download
part ;-) There are of course *many* ugly issues to be dealt with.
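
To illustrate the point about inline JavaScript and CSS, here is a minimal
Python sketch. It is not the regexp from the earlier message; the patterns,
the function name strip_markup and the sample string are all made up for
illustration. It shows that removing tags alone leaves script and style
contents behind, whereas dropping those blocks first gives cleaner text.
Regexes remain a rough tool for HTML, so treat this as a starting point only.

```python
import re
from html import unescape

# Hypothetical patterns for illustration, not the regexp discussed in the thread.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def strip_markup(html_text):
    """Remove script/style blocks and comments, then the remaining tags."""
    text = SCRIPT_OR_STYLE.sub(" ", html_text)
    text = COMMENT.sub(" ", text)
    text = TAG.sub(" ", text)
    text = unescape(text)          # turn entities such as &amp; into characters
    return " ".join(text.split())  # collapse runs of whitespace


# Made-up sample page with inline CSS and JavaScript.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello <b>forum</b></p>"
    '<script>var x = "<p>not text</p>";</script></body></html>'
)

# Tag stripping alone: the CSS rule and the script source leak into the text.
naive = " ".join(TAG.sub(" ", sample).split())

# Dropping script/style blocks first keeps only the visible text.
cleaned = strip_markup(sample)

print(naive)
print(cleaned)
```

For anything beyond quick experiments, a real HTML parser is probably the
safer choice, since malformed markup will defeat patterns like these.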

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
