[Corpora-List] Extracting only editorial content from a HTML page

Helge Thomas Hellerud helgetho at stud.ntnu.no
Tue Aug 9 09:43:12 UTC 2005


Hello,

I want to extract the article text from an HTML page (for instance, the text
of a news article). But an HTML page contains a lot of "noise", such as menus
and ads. So I want to ask whether anyone knows a way to eliminate unwanted
elements like menus and ads and extract only the editorial article text.

Of course, I can use a regex to look for patterns in the HTML code (by
defining a starting point and an ending point), but that solution would be a
hack that stops working if the pattern in the HTML page suddenly changes.
So do you know how to extract the content without using such a hack?
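
To make the regex hack concrete, here is a minimal Python sketch of what I
mean. The marker comments, the sample page and the function name are all
hypothetical; a real site would have its own markup, which is exactly why
this approach is fragile.

```python
import re
from typing import Optional

# Hypothetical markers (assumption): each site would need its own pair.
START = "<!-- article start -->"
END = "<!-- article end -->"


def extract_between_markers(
    html: str, start: str = START, end: str = END
) -> Optional[str]:
    """Return the plain text between the two markers, or None if not found."""
    pattern = re.compile(
        re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL
    )
    match = pattern.search(html)
    if match is None:
        # The page layout changed (or never matched), so the hack fails.
        return None
    # Crudely replace any remaining tags with spaces, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())


# Made-up sample page, for illustration only.
page = (
    "<html><body><div>Menu | Ads</div>"
    "<!-- article start --><p>Some <b>news</b> text.</p><!-- article end -->"
    "<div>Footer</div></body></html>"
)

print(extract_between_markers(page))
# Simulate a small change in the page template: the start marker is renamed.
print(extract_between_markers(page.replace("article start", "story start")))
```

The first call returns the article text, but as soon as the site renames a
marker the function returns nothing, so every site (and every redesign)
needs its own hand-written pattern.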

Thanks in advance.

Helge Thomas Hellerud


