[Corpora-List] Extracting only editorial content from a HTML page

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Tue Aug 9 18:09:46 UTC 2005


The other tool for this purpose which no-one has (so far) mentioned is 
tidy -- http://tidy.sourceforge.net

It will take almost any HTML and turn it into something usable very 
fast; it's also very robust, and there is a choice of APIs for 
integrating it into your own production system.
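
To make the task concrete, here is a rough sketch of pulling the visible
text out of messy HTML. Note that it uses neither tidy nor BeautifulSoup:
it relies only on Python's standard-library html.parser, and the names
TextExtractor and extract_text are made up for illustration. It skips
script and style blocks but makes no attempt to tell editorial content
from navigation, which a real pipeline would still need to handle.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that never hold editorial content."""

    # Elements whose contents are dropped entirely (an assumption; extend as needed).
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<script>var x = 1;</script>"
        "<p>Real &amp; useful text<p>Second para</body>"
    )
    print(extract_text(sample))
```

The parser tolerates the unclosed <p> tags in the sample, but for badly
broken markup it would probably still be worth running the page through
tidy first.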

Lou


On 9 Aug 2005, at 18:43, Rob Malouf wrote:

> Hi,
>
> For this task I use Python and BeautifulSoup:
>
> http://www.crummy.com/software/BeautifulSoup/
>
> It's an extremely flexible and robust DOM-ish parser, very well-suited
> for extracting bits of text out of web pages.
>
> -- 
> Rob Malouf <rmalouf at mail.sdsu.edu>
> Department of Linguistics and Oriental Languages
> San Diego State University
>
>
>
>
>
 From the Macmini at Burnard Towers


