[Corpora-List] Extracting only editorial content from a HTML page
Helge Thomas Hellerud
helgetho at stud.ntnu.no
Tue Aug 9 09:43:12 UTC 2005
Hello,
I want to extract the article text from an HTML page (for instance, the text
of a news article). But an HTML page contains a lot of "noise", such as menus
and ads. Does anyone know a way to eliminate these unwanted elements and
extract only the editorial article text?
Of course, I can use regular expressions to look for patterns in the HTML
code (by defining a starting point and an ending point), but that solution
is a hack that will stop working as soon as the pattern in the HTML page
changes. Do you know how to extract the content without resorting to such a
hack?
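To make the start-point/end-point hack concrete, here is a minimal sketch in
Python. The marker strings, the sample page, and the function name
extract_between_markers are all hypothetical and chosen only for
illustration; a real page would need whatever markers its template happens
to emit.

```python
import re
from typing import Optional

# Hypothetical markers (an assumption for this sketch): a real site would
# need whatever strings its page template happens to contain.
START_MARKER = "<!-- article start -->"
END_MARKER = "<!-- article end -->"


def extract_between_markers(
    html: str, start: str = START_MARKER, end: str = END_MARKER
) -> Optional[str]:
    """Return the plain text between the two markers, or None if absent."""
    # Non-greedy match across lines from the start marker to the end marker.
    pattern = re.compile(
        re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL
    )
    match = pattern.search(html)
    if match is None:
        return None
    # Crude tag removal, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())


# Made-up sample page with a menu, an article, and an ad.
page = """<html><body>
<div class="menu">Home | Sports</div>
<!-- article start -->
<h1>Headline</h1>
<p>First paragraph of the story.</p>
<!-- article end -->
<div class="ad">Buy now</div>
</body></html>"""

print(extract_between_markers(page))
# The same call fails as soon as the template renames its marker.
print(extract_between_markers(page.replace("article start", "story start")))
```

The second call returns None because the start marker no longer matches,
which is exactly the fragility in question: the extraction depends on one
site's template rather than on any property of the article text itself.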
Thanks in advance.
Helge Thomas Hellerud