[Corpora-List] tool for extracting text from web forum and websites

Rob Malouf rmalouf at mail.sdsu.edu
Thu Oct 22 14:13:00 UTC 2009


For those who want to get serious about cleaning up HTML, I strongly  
recommend:

http://www.crummy.com/software/BeautifulSoup/
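
BeautifulSoup is a third-party package that parses the markup rather than pattern-matching it. As a rough, standard-library-only illustration of that parser-based idea (this is not BeautifulSoup's API), the sketch below uses Python's html.parser module. The names TextExtractor and extract_text are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <STYLE> is handled too.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def extract_text(markup):
    """Return the visible text of an HTML string (comments are ignored)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Because the parser tracks element boundaries itself, a '>' inside a comment or a script does not end anything prematurely.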

--
Rob Malouf <rmalouf at mail.sdsu.edu>
Department of Linguistics and Asian / Middle Eastern Languages
San Diego State University

On Oct 16, 2009, at 7:17 AM, Bjørn Arild Mæland wrote:

>> You can use R (<http://www.r-project.org/>) to download files and
>> clean them easily: to load the contents of
>> <http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html>,
>> you just enter this at the console
>>
>> (x <- gsub("<[^>]*?>", "",
>> scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
>> what=character(0), sep="\n", quote="", comment.char=""), perl=T))
>
> This regexp is a good start, but it's important to note that it isn't
> enough for cleaning documents that use inline JavaScript and/or CSS.
> HTML comments can also cause problems since they can contain the '>'
> character without ending the comment. In NLTK (http://www.nltk.org/)
> we use the following cascade of regular expressions (in Python):
>
>   cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
>   cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
>   cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
>
> ((?is) is the Python way of saying that the expression should be
> matched case insensitively, and that the '.' character also should
> match newlines.)
>
> HTML entities are another matter, but that is more application-specific.
>
> -Bjørn Arild Mæland
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
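
The regular-expression cascade quoted above can be wrapped into a small self-contained function. This is a minimal sketch: the three patterns are the ones from the quoted message, while the function names, the sample string, and the use of html.unescape for entities are assumptions added here. Entity handling is application-specific, as the quoted message notes, so treat that step as one possible choice.

```python
import html
import re


def strip_html(markup):
    """Apply the three-step cascade: script/style blocks, comments, tags."""
    # (?is): match case-insensitively and let '.' match newlines.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", markup.strip())
    # Comments go next, since they may contain '>' before the closing '-->'.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # Finally remove every remaining tag.
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    return cleaned


def strip_html_and_entities(markup):
    """Assumed extra step: decode entities such as &amp; after stripping."""
    return html.unescape(strip_html(markup))


# Hypothetical sample document for illustration only.
sample = """<html><head><STYLE type="text/css">p > a { color: red; }</STYLE>
<script>if (a > b) { alert("x"); }</script></head>
<body><!-- a comment with > inside --><p>Fish &amp; chips</p></body></html>"""

if __name__ == "__main__":
    print(strip_html_and_entities(sample).strip())
```

The order matters: if the generic tag pattern ran first, the '>' characters inside the style sheet, the script, and the comment would end the match early and leave fragments of them in the output.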

