[Corpora-List] Extracting only editorial content from a HTML page
Lou Burnard
lou.burnard at computing-services.oxford.ac.uk
Tue Aug 9 18:09:46 UTC 2005
The other tool for this purpose which no-one has (so far) mentioned is
tidy -- http://tidy.sourceforge.net
It will take almost any html and turn it into something usable very
fast; it's also very robust and there is a choice of APIs for
integrating it into your own production system.
Lou
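
For readers who want to see the general idea in code, here is a minimal
sketch of pulling the visible text out of sloppy HTML. It is an
illustration only: it uses Python's standard-library html.parser as a
stand-in and does not call tidy or BeautifulSoup. The names
TextExtractor and extract_text are hypothetical, chosen for this
example, and skipping only script and style elements is an assumption
about what counts as non-editorial content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script/style."""

    SKIP = {"script", "style"}  # assumption: treat these as non-editorial

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # Deliberately malformed: unclosed <b> and <p>, no closing </html>.
    messy = ("<html><body><p>Hello <b>world<p>Second para"
             "<script>var x = 1;</script></body>")
    print(extract_text(messy))
```

A real pipeline would still need a rule for separating editorial text
from navigation, adverts and the like; this sketch only strips markup.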
On 9 Aug 2005, at 18:43, Rob Malouf wrote:
> Hi,
>
> For this task I use Python and BeautifulSoup:
>
> http://www.crummy.com/software/BeautifulSoup/
>
> It's an extremely flexible and robust DOM-ish parser, very well-suited
> for extracting bits of text out of web pages.
>
> --
> Rob Malouf <rmalouf at mail.sdsu.edu>
> Department of Linguistics and Oriental Languages
> San Diego State University
From the Macmini at Burnard Towers