[Corpora-List] tool for extracting text from web forum and websites

Bjørn Arild Mæland bjorn.meland at student.uib.no
Fri Oct 16 14:17:40 UTC 2009


> You can use R (<http://www.r-project.org/>) to download files and
> clean them easily: to load the contents of
> <http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html>,
> you just enter this at the console
>
> (x <- gsub("<[^>]*?>", "",
> scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
> what=character(0), sep="\n",  quote="", comment.char=""), perl=T))

This regexp is a good start, but it's important to note that it isn't
enough for cleaning documents that use inline JavaScript and/or CSS.
HTML comments can also cause problems, since they can contain the '>'
character without ending the comment. In NLTK (http://www.nltk.org/)
we use the following cascade of regular expressions (in Python):

   cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
   cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
   cleaned = re.sub(r"(?s)<.*?>", "", cleaned)

((?is) is the Python way of saying that the expression should be
matched case-insensitively, and that the '.' character should also
match newlines.)
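To illustrate, here is the cascade wrapped into a small self-contained
function (the function name and the sample HTML are my own, for
demonstration only):

```python
import re

def clean_html(html):
    """Strip scripts, styles, comments, and remaining tags (NLTK-style cascade)."""
    # 1. Remove <script>...</script> and <style>...</style> blocks, contents included
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # 2. Remove HTML comments, which may contain '>' without closing a tag
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # 3. Remove all remaining tags
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    return cleaned

sample = ('<html><head><style>p { color: red; }</style></head>'
          '<body><!-- note: a > b --><p>Hello, <b>world</b>!</p>'
          '<script>if (1 > 0) alert("hi");</script></body></html>')
print(clean_html(sample))  # -> Hello, world!
```

Note that the first substitution must run before the last one:
otherwise the naive tag-stripper would remove the <script> and <style>
tags but leave their contents behind in the "cleaned" text.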

HTML entities are another matter, but how to handle them is more
application-specific.
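For completeness, a minimal sketch of entity decoding in modern Python
(the sample string is mine; html.unescape handles both named and
numeric entities, and was added in Python 3.4):

```python
import html

# Decode named (&amp;, &lt;) and numeric (&#38;) entities to plain text
print(html.unescape("fish &amp; chips, 3 &lt; 5"))  # -> fish & chips, 3 < 5
```

Whether you want entities decoded, dropped, or preserved depends on
what the downstream corpus tools expect.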

-Bjørn Arild Mæland

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
