[Corpora-List] Extracting only editorial content from a HTML page

Alexander Schutz goalscoringsuperstarhero at gmail.com
Tue Aug 9 13:41:34 UTC 2005


Helge,

Aidan Finn and Nick Kushmerick did some interesting research on how to
identify and extract the relevant parts of a given webpage (i.e. the
parts containing plain text).
Their boilerplate removal tool worked quite well for me when I tested
it, and I've heard good things about it from other people, too.
Check out this link and follow the "BTE" link:
http://www.smi.ucd.ie/hyppia/
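
To give a rough idea of how this kind of extraction can work, here is a
minimal sketch in Python. It is my own illustration of the general idea as
I understand it, not the code of the BTE tool: treat the page as a sequence
of tag tokens and word tokens, then pick the contiguous stretch with the
most words and the fewest tags. The names tokenize and extract_body_text
and the regex tokenizer are assumptions made up for this example; a real
implementation would probably want a proper HTML parser.

```python
import re

# One token is either a whole tag ("<...>") or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split HTML into tag tokens and word tokens (hypothetical helper)."""
    # Drop script/style bodies and comments so their contents are not
    # mistaken for article words.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    return TOKEN_RE.findall(html)


def extract_body_text(html):
    """Return the words of the most text-dense contiguous token span.

    Each word scores +1 and each tag scores -1; the best span is the
    maximum-sum contiguous run of tokens (Kadane's algorithm).
    """
    tokens = tokenize(html)
    best_sum, best_start, best_end = 0, 0, 0  # empty span by default
    cur_sum, cur_start = 0, 0
    for i, tok in enumerate(tokens):
        score = -1 if tok.startswith("<") else 1
        if cur_sum <= 0:
            # A non-positive prefix can never help; restart the span here.
            cur_sum, cur_start = 0, i
        cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


page = (
    '<html><body><div><a href="/">Home</a><a href="/n">News</a></div>'
    "<p>The quick brown fox jumps over the lazy dog near the river bank.</p>"
    '<div><a href="/c">Contact</a></div></body></html>'
)
article = extract_body_text(page)
```

On this toy page the menu and footer links fall outside the selected span
because they are surrounded by more tags than words. Pages with several
separate text blocks, or with heavy markup inside the article itself, would
need something more careful than this.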

Best,
Alex

On 8/9/05, Helge Thomas Hellerud <helgetho at stud.ntnu.no> wrote:
> Hello,
> 
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and extract only the editorial article text?
> 
> Of course, I can use a regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but that solution would be a
> hack that stops working as soon as the pattern in the HTML page changes.
> So do you know how to extract the content without using such a hack?
> 
> Thanks in advance.
> 
> Helge Thomas Hellerud
> 
> 
> 


-- 
Alexander Schutz
Student of Computational Linguistics
University of Saarland, Germany


