[Corpora-List] tool for extracting text from web forum and websites
Stefan Th. Gries
stgries at gmail.com
Fri Oct 16 15:13:51 UTC 2009
2009/10/16 Bjørn Arild Mæland <bjorn.maeland at gmail.com>:
> This regexp is a good start, but it's important to note that it isn't enough for cleaning documents that use inline JavaScript and/or CSS.
Of course not, but if I remember correctly, the query said something
like 'cleaning is not an issue right now' so I focused on the download
part ;-) There are of course *many* ugly issues to be dealt with.
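
To illustrate the point about inline JavaScript and CSS, here is a minimal
sketch in Python (my choice for illustration; the thread does not name a
language). The sample page, the tag-stripping pattern, and the names
strip_tags_regex, TextExtractor and extract_text are all hypothetical and
not taken from the original query. The sketch contrasts a naive tag-stripping
regexp with a parser that skips the contents of script and style elements.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, made up for this illustration.
SAMPLE = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello, corpus!</p></body></html>"
)


def strip_tags_regex(html):
    """Naive approach: delete anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", html)


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(repr(strip_tags_regex(SAMPLE)))  # JS and CSS source survive
    print(repr(extract_text(SAMPLE)))      # only the paragraph text
```

The regexp removes the tags themselves but leaves the stylesheet and script
source in the output, whereas the parser-based version drops them. Real
forum pages will likely need more than this (comments, entities, malformed
markup), so treat it as a starting point only.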
Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora