There is another good Python solution: <a href="http://scrapy.org/">http://scrapy.org/</a><br><br>Regards,<br><br><div class="gmail_quote">2009/10/22 Rob Malouf <span dir="ltr"><<a href="mailto:rmalouf@mail.sdsu.edu">rmalouf@mail.sdsu.edu</a>></span><br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">For those who want to get serious about cleaning up HTML, I strongly recommend:<br>
<br>
<a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">http://www.crummy.com/software/BeautifulSoup/</a><br><font color="#888888">
<br>
--<br>
Rob Malouf <<a href="mailto:rmalouf@mail.sdsu.edu" target="_blank">rmalouf@mail.sdsu.edu</a>><br>
Department of Linguistics and Asian / Middle Eastern Languages<br>
San Diego State University</font><div><div></div><div class="h5"><br>
<br>
On Oct 16, 2009, at 7:17 AM, Bjørn Arild Mæland wrote:<br>
<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
You can use R (<<a href="http://www.r-project.org/" target="_blank">http://www.r-project.org/</a>>) to download files and<br>
clean them easily: to load the contents of<br>
<<a href="http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html" target="_blank">http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html</a>>,<br>
you just enter this at the console<br>
<br>
(x <- gsub("<[^>]*?>", "",<br>
scan("<a href="http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html" target="_blank">http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html</a>",<br>
what=character(0), sep="\n", quote="", comment.char=""), perl=TRUE))<br>
</blockquote>
<br>
This regexp is a good start, but it's important to note that it isn't<br>
enough for cleaning documents that use inline JavaScript and/or CSS.<br>
HTML comments can also cause problems since they can contain the '>'<br>
character without ending the comment. In NLTK (<a href="http://www.nltk.org/" target="_blank">http://www.nltk.org/</a>)<br>
we use the following cascade of regular expressions (in Python):<br>
<br>
import re<br>
cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())<br>
cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)<br>
cleaned = re.sub(r"(?s)<.*?>", "", cleaned)<br>
<br>
((?is) is the Python way of saying that the expression should be<br>
matched case insensitively, and that the '.' character also should<br>
match newlines.)<br>
<br>
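A minimal sketch of the cascade above wrapped in a function, assuming Python 3.<br>
The function name strip_markup and the sample string are made up for illustration;<br>
the naive single-regex version is included only for comparison.<br>

```python
import re

def strip_markup(markup):
    """Remove script/style blocks, then comments, then all remaining tags."""
    # 1. Drop <script> and <style> elements together with their contents.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", markup.strip())
    # 2. Drop HTML comments, which may contain '>' without ending the comment.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # 3. Drop whatever tags are left.
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    return cleaned

# Hypothetical input: a comment containing '>' and an inline script.
sample = """<p>Hello<!-- a > b --> <b>world</b></p>
<script type="text/javascript">if (1 < 2) { alert("x"); }</script>"""

# Single-pass tag stripping leaves comment and script debris behind.
naive = re.sub(r"<[^>]*?>", "", sample)

print(repr(naive))
print(repr(strip_markup(sample)))
```

On this sample the naive pattern leaves part of the comment and the script in the text, while the cascade leaves only the visible words.<br>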
HTML entities are another matter, but handling them is more application-specific.<br>
<br>
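For entities, one possible approach (an assumption, not something NLTK is stated to do here) is<br>
the standard library's html.unescape, applied after the tags have been removed.<br>
The sample string is hypothetical.<br>

```python
import html
import re

# Hypothetical input whose text contains escaped angle brackets.
source = "<p>x &lt; y &amp;&amp; y &gt; z</p>"

# Strip tags first, then decode entities.
stripped = re.sub(r"(?s)<.*?>", "", source)
decoded = html.unescape(stripped)

# Decoding first turns the escaped '<' and '>' into tag-like text,
# so the tag regex then deletes part of the content.
wrong_order = re.sub(r"(?s)<.*?>", "", html.unescape(source))

print(repr(decoded))
print(repr(wrong_order))
```

The order matters: decoding before stripping lets escaped angle brackets be mistaken for markup.<br>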
-Bjørn Arild Mæland<br>
<br>
_______________________________________________<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
</blockquote>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Bruno JM Melo <<a href="mailto:bjmm@acm.org">bjmm@acm.org</a>><br>