[Corpora-List] Extracting only editorial content from a HTML page
Rob Malouf
rmalouf at mail.sdsu.edu
Tue Aug 9 17:43:06 UTC 2005
Hi,
For this task I use Python and BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
It's an extremely flexible and robust DOM-ish parser, well suited to
extracting bits of text from web pages.
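
A rough sketch of the kind of thing I mean is below. It is only an
illustration under some assumptions: it uses the current `bs4` package
name for BeautifulSoup, the sample page is made up, and the idea that
the editorial text sits in `<p>` tags inside an element with
id="content" is hypothetical, as is the helper name
`extract_paragraphs`. Real pages will need their own selectors.

```python
from bs4 import BeautifulSoup

# Made-up sample page: navigation, a content block, a script, a footer.
SAMPLE_HTML = """
<html><head><title>Example</title><style>p {color: red}</style></head>
<body>
<div id="nav"><a href="/">Home</a> | <a href="/about">About</a></div>
<div id="content">
<p>First paragraph of the <b>article</b>.</p>
<p>Second
   paragraph.</p>
</div>
<script>var tracking = 1;</script>
<div id="footer">Copyright notice</div>
</body></html>
"""


def extract_paragraphs(html, container_id="content"):
    """Return the text of each <p> inside the element with the given id.

    The container id is an assumption about the page layout, not
    something BeautifulSoup works out for you.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents never show up
    # as text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    container = soup.find(id=container_id)
    if container is None:
        return []

    # get_text() concatenates the text nodes; split/join collapses runs
    # of whitespace and line breaks into single spaces.
    return [" ".join(p.get_text().split()) for p in container.find_all("p")]


if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

The navigation and footer are skipped simply because they fall outside
the chosen container, so picking that container is where the
page-specific work goes.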
--
Rob Malouf <rmalouf at mail.sdsu.edu>
Department of Linguistics and Oriental Languages
San Diego State University