[Corpora-List] Extracting only editorial content from a HTML page

Rob Malouf rmalouf at mail.sdsu.edu
Tue Aug 9 17:43:06 UTC 2005


Hi,

For this task I use Python and BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

It's an extremely flexible and robust DOM-ish parser, very well-suited
for extracting bits of text out of web pages.
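
To give a rough idea, here is a minimal sketch of how this can look. It
assumes the current bs4 package (installed as beautifulsoup4) and Python's
built-in "html.parser" backend; the sample page, the "content" id, and the
function name extract_editorial_text are made up for illustration, so a real
site will need its own way of locating the editorial container.

```python
from bs4 import BeautifulSoup  # assumption: the modern bs4 package is installed

# Hypothetical sample page; real pages will use different ids and layout.
SAMPLE_HTML = """
<html><head><title>Example</title><script>var x = 1;</script></head>
<body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div id="footer">Copyright</div>
</body></html>
"""


def extract_editorial_text(html, container_id="content"):
    """Return the text of each <p> inside the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents never leak into the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    container = soup.find(id=container_id)
    if container is None:
        # No such container on this page: return nothing rather than guess.
        return []
    return [p.get_text(" ", strip=True) for p in container.find_all("p")]


if __name__ == "__main__":
    for paragraph in extract_editorial_text(SAMPLE_HTML):
        print(paragraph)
```

Navigation and footer text are skipped only because they sit outside the
chosen container; pages without such a clean split will need extra filtering.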

-- 
Rob Malouf <rmalouf at mail.sdsu.edu>
Department of Linguistics and Oriental Languages
San Diego State University


