There is another good Python solution: <a href="http://scrapy.org/">http://scrapy.org/</a><br><br>Regards,<br><br><div class="gmail_quote">2009/10/22 Rob Malouf <span dir="ltr"><<a href="mailto:rmalouf@mail.sdsu.edu">rmalouf@mail.sdsu.edu</a>></span><br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">For those who want to get serious about cleaning up HTML, I strongly recommend:<br>

<br>

<a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">http://www.crummy.com/software/BeautifulSoup/</a><br><font color="#888888">

<br>

--<br>

Rob Malouf <<a href="mailto:rmalouf@mail.sdsu.edu" target="_blank">rmalouf@mail.sdsu.edu</a>><br>

Department of Linguistics and Asian / Middle Eastern Languages<br>

San Diego State University</font><div><div></div><div class="h5"><br>

<br>

On Oct 16, 2009, at 7:17 AM, Bjørn Arild Mæland wrote:<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


You can use R (<<a href="http://www.r-project.org/" target="_blank">http://www.r-project.org/</a>>) to download files and<br>

clean them easily: to load the contents of<br>

<<a href="http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html" target="_blank">http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html</a>>,<br>

you just enter this at the console<br>

<br>

(x <- gsub("<[^>]*?>", "",<br>

scan("<a href="http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html" target="_blank">http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html</a>",<br>

what=character(0), sep="\n",  quote="", comment.char=""), perl=TRUE))<br>

</blockquote>

<br>

This regexp is a good start, but it's important to note that it isn't<br>

enough for cleaning documents that use inline JavaScript and/or CSS.<br>

HTML comments can also cause problems since they can contain the '>'<br>

character without ending the comment. In NLTK (<a href="http://www.nltk.org/" target="_blank">http://www.nltk.org/</a>)<br>

we use the following cascade of regular expressions (in Python):<br>

<br>

  cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())<br>

  cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)<br>

  cleaned = re.sub(r"(?s)<.*?>", "", cleaned)<br>

<br>

((?is) is the Python way of saying that the expression should be<br>

matched case-insensitively (the i flag), and that the '.' character<br>

should also match newlines (the s flag).)<br>
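<br>

As a minimal sketch, the cascade above could be wrapped in a function and<br>

tried on a small page. The name strip_markup and the sample string are<br>

illustrative assumptions, not NLTK's actual API:<br>

```python
import re

def strip_markup(html):
    """Apply the three substitutions in order and return the remaining text."""
    # 1. Drop <script> and <style> elements together with their contents.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # 2. Drop comments before ordinary tags, since a comment may contain '>'.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # 3. Drop whatever tags are left.
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    return cleaned

# Hypothetical sample page, for illustration only.
sample = """<html><head><style>p { color: red; }</style>
<script type="text/javascript">if (a > b) { alert("x"); }</script></head>
<body><!-- a comment with > inside --><p>Hello, <b>world</b>!</p></body></html>"""

print(strip_markup(sample).strip())
```

The order matters here: if the generic tag pattern ran first, the script<br>

body and the tail of the comment would be left behind as text.<br>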

<br>

HTML entities are another matter, but handling them is more application-specific.<br>
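<br>

For the simple case of decoding entities into plain characters, a sketch<br>

using html.unescape from the standard library could look like this. It<br>

assumes a current Python 3, and the input string is made up for illustration:<br>

```python
import html

# Text as it might look after the tags have already been stripped.
stripped = "Tom &amp; Jerry &lt;3 caf&eacute;"

# Decode named and numeric character references into plain characters.
decoded = html.unescape(stripped)
print(decoded)
```

Decoding would normally come after tag removal; otherwise an escaped<br>

sequence such as &amp;lt;b&amp;gt; turns into something that looks like a tag and<br>

gets stripped along with the real markup.<br>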

<br>

-Bjørn Arild Mæland<br>

<br>

_______________________________________________<br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br>

</blockquote>

<br>

<br>


</div></div></blockquote></div><br><br clear="all"><br>-- <br>Bruno JM Melo <<a href="mailto:bjmm@acm.org">bjmm@acm.org</a>><br>