[Corpora-List] Extracting only editorial content from a HTML page

Rob Malouf rmalouf at mail.sdsu.edu
Tue Aug 9 17:43:06 UTC 2005

Hi,

For this task I use Python and BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

It's an extremely flexible and robust DOM-ish parser, very well-suited
for extracting bits of text out of web pages.
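
As a rough sketch of what that looks like in practice, the snippet below pulls paragraph text out of a page. It assumes the current `bs4` package (installed as `beautifulsoup4`) rather than the release that was available when this was posted, and the sample HTML, the `extract_paragraphs` helper, and the choice to keep only `<p>` elements are all illustrative assumptions, not a general recipe for every site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration only.
SAMPLE_HTML = """
<html><head><title>Example</title><script>var x = 1;</script></head>
<body>
<div id="nav"><a href="/">Home</a></div>
<p>First paragraph of the article.</p>
<p>Second paragraph, with <b>markup</b> inside.</p>
<div id="footer">Copyright notice</div>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every <p> element (hypothetical helper)."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements that never contain editorial text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Assumption: editorial content lives in <p> tags; collapse whitespace.
    return [" ".join(p.get_text().split()) for p in soup.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
for text in paragraphs:
    print(text)
```

Real pages usually need a site-specific rule in place of the bare `<p>` filter, for example restricting the search to one container element, but the overall shape stays the same.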

-- 
Rob Malouf <rmalouf at mail.sdsu.edu>
Department of Linguistics and Oriental Languages
San Diego State University
