[Corpora-List] tool for extracting text from web forum and websites
    Stefan Th. Gries 
    stgries at gmail.com
       
    Fri Oct 16 15:13:51 UTC 2009
    
    
  
2009/10/16 Bjørn Arild Mæland <bjorn.maeland at gmail.com>:
> This regexp is a good start, but it's important to note that it isn't enough for cleaning documents that use inline JavaScript and/or CSS.
Of course not, but if I remember correctly, the query said something
like 'cleaning is not an issue right now' so I focused on the download
part ;-) There are of course *many* ugly issues to be dealt with.
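
To illustrate the point about inline JavaScript and CSS, here is a minimal
sketch, assuming Python and its standard html.parser module. The
tag-stripping pattern in naive_strip is a hypothetical stand-in, not the
regexp from the earlier message, and the names TextExtractor, extract_text
and naive_strip are made up for this example. It only shows why removing
tags alone leaves script and style contents behind; real forum pages will
need more care.

```python
import re
from html.parser import HTMLParser


def naive_strip(html):
    """Hypothetical tag-stripping regex: removes tags, keeps everything else."""
    return re.sub(r"<[^>]+>", " ", html)


class TextExtractor(HTMLParser):
    """Collect text, skipping the contents of <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a script/style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p { color: red; }</style>"
    '<script>var x = "<p>not text</p>";</script></head>'
    "<body><p>Hello <b>forum</b> post</p></body></html>"
)

print(naive_strip(page))    # still contains the CSS rule and the JavaScript
print(extract_text(page))   # only the text of the page body
```

The regex version keeps "color: red" and the JavaScript string because
they are not tags; the parser version drops them because it tracks which
element the text belongs to.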
Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------