[Corpora-List] tool for extracting text from web forum and websites

Bruno J. M. Melo brunojm at gmail.com
Thu Oct 22 19:11:57 UTC 2009


There is another good Python solution: http://scrapy.org/

Regards,

2009/10/22 Rob Malouf <rmalouf at mail.sdsu.edu>

> For those who want to get serious about cleaning up HTML, I strongly
> recommend:
>
> http://www.crummy.com/software/BeautifulSoup/
>
> --
> Rob Malouf <rmalouf at mail.sdsu.edu>
> Department of Linguistics and Asian / Middle Eastern Languages
> San Diego State University
>
>
> On Oct 16, 2009, at 7:17 AM, Bjørn Arild Mæland wrote:
>
>>> You can use R (<http://www.r-project.org/>) to download files and
>>> clean them easily: to load the contents of
>>> <http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html>,
>>> you just enter this at the console
>>>
>>> (x <- gsub("<[^>]*?>", "",
>>>   scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
>>>        what=character(0), sep="\n", quote="", comment.char=""), perl=T))
>>>
>>
>> This regexp is a good start, but it's important to note that it isn't
>> enough for cleaning documents that use inline JavaScript and/or CSS.
>> HTML comments can also cause problems since they can contain the '>'
>> character without ending the comment. In NLTK (http://www.nltk.org/)
>> we use the following cascade of regular expressions (in Python):
>>
>>  cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
>>  cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
>>  cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
>>
>> ((?is) is the Python way of saying that the expression should be
>> matched case insensitively, and that the '.' character also should
>> match newlines.)
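
To make the difference concrete, here is a minimal sketch that wraps the
cascade quoted above in a function and compares it with a single-pass tag
regexp of the kind used in the R one-liner. The function names and the
sample string are my own, chosen only for illustration, and I am assuming
Python 3.

```python
import re


def strip_tags_naive(html):
    # Single-pass tag removal, comparable to the gsub() call in the R example.
    return re.sub(r"<[^>]*?>", "", html)


def clean_html(html):
    # Step 1: drop <script> and <style> elements together with their contents.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Step 2: drop HTML comments, which may contain '>' characters.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # Step 3: drop whatever tags remain.
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    return cleaned


# Hypothetical sample with inline JavaScript and a comment containing '>'.
sample = (
    "<p>Hello</p>"
    "<script>if (a > b) { alert('x'); }</script>"
    "<!-- a > b -->"
    "<p>world</p>"
)

print(repr(strip_tags_naive(sample)))  # script body and comment debris survive
print(repr(clean_html(sample)))        # only the paragraph text is left
```

The order matters: scripts and comments have to go before the generic tag
pattern runs, otherwise their contents are left behind as text.
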
>>
>> HTML entities are another matter, but that is more application-specific.
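
For the simple case where you just want entities turned back into
characters, a rough sketch follows. It assumes Python 3.4 or later, where
the standard library provides html.unescape; the helper name is
hypothetical, and this is not something the NLTK cascade above does.

```python
import html
import re


def strip_then_unescape(markup):
    # Remove tags first, then decode entities. Decoding first would turn
    # "&lt;tag&gt;" into "<tag>", which the tag pattern would then delete.
    text = re.sub(r"(?s)<.*?>", "", markup)
    return html.unescape(text)


print(strip_then_unescape("<p>x &lt; y &amp;&amp; y &gt; z</p>"))
```

Whether decoding is the right thing to do depends on what the cleaned text
is used for, which is the application-specific part.
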
>>
>> -Bjørn Arild Mæland
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>



-- 
Bruno JM Melo <bjmm at acm.org>