[Corpora-List] Extracting only editorial content from a HTML page
Alexander Schutz
goalscoringsuperstarhero at gmail.com
Tue Aug 9 13:41:34 UTC 2005
Helge,
Aidan Finn and Nick Kushmerick did some interesting research on how to
identify and extract the relevant parts of a given web page (i.e. the
parts containing the plain text).
Their boilerplate removal tool worked quite well for me when I tested
it, and I've heard good things about it from other people, too.
Check out this link and follow the BTE link:
http://www.smi.ucd.ie/hyppia/
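
In case it helps, here is a rough Python sketch of the idea behind BTE as
I understand it. This is my own illustration, not the authors' code, and
the names tokenize and extract_body_text are made up for this example.
The page is split into tag tokens and word tokens, and we look for the
contiguous stretch that keeps as many words inside and as many tags
outside as possible. Scoring a word as +1 and a tag as -1 turns that into
a maximum-sum subarray search.

```python
import re

# A token is either a whole tag ("<...>") or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split HTML into (is_tag, token) pairs."""
    return [(tok.startswith("<"), tok) for tok in TOKEN_RE.findall(html)]


def extract_body_text(html):
    """Return the words of the token range with the best words-minus-tags score.

    Maximising (words inside the range) + (tags outside the range) is the
    same as maximising the sum of +1 per word and -1 per tag inside the
    range, so a single Kadane-style pass is enough.
    """
    tokens = tokenize(html)
    best_sum, best_start, best_end = 0, 0, 0  # empty range by default
    cur_sum, cur_start = 0, 0
    for idx, (is_tag, _) in enumerate(tokens):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; start a new range here.
            cur_sum, cur_start = 0, idx
        cur_sum += -1 if is_tag else 1
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, idx + 1
    # Keep only the words of the winning range; drop any tags inside it.
    return " ".join(tok for is_tag, tok in tokens[best_start:best_end] if not is_tag)


if __name__ == "__main__":
    page = (
        '<html><body><div><a href="/">Home</a><a href="/n">News</a></div>'
        "<p>The quick brown fox jumps over the lazy dog.</p>"
        '<div><a href="/c">Contact</a></div></body></html>'
    )
    print(extract_body_text(page))
```

Note that this sketch treats the contents of script and style elements
as ordinary words, so a real page would need those stripped first.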
Best,
Alex
On 8/9/05, Helge Thomas Hellerud <helgetho at stud.ntnu.no> wrote:
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and extract only the editorial article text?
>
> Of course, I can use a regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but that solution would be a
> hack that stops working if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
>
>
--
Alexander Schutz
Student of Computational Linguistics
University of Saarland, Germany