[Corpora-List] Extracting only editorial content from an HTML page

Min-Yen Kan knmnyn at gmail.com
Tue Aug 9 14:49:14 UTC 2005


Hi Helge, all:

In addition to all the tools that people have mentioned, I will add my
own.  We have developed a tool in Java, available through SourceForge,
to help people with this task and with others where some fragment of a
web page needs to be identified and/or extracted.  We have experimented
with tagging and extracting the main text, navigation links, title,
headers, etc. from news stories on various sites on the web.  Our
software, PARCELS, also partially handles sites that use XHTML/CSS
(e.g. <DIV> tags) to place text.

You can find PARCELS on sourceforge at http://parcels.sourceforge.net

It may be overkill for a simple problem, but if you need to extract
the same type of information from multiple websites with different
formats, this toolkit may be of help.
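To give a feel for what block-level extraction involves, here is a toy
sketch.  It is not the PARCELS API: it uses only the stock Swing HTML
parser that ships with the JDK, and the "keep the longest text block"
heuristic is a deliberate oversimplification of what a real extractor
does.

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Toy extractor: split the page at block-level tags and keep the
    // block with the most text, which on news pages is often the story.
    public class LargestBlock {
        public static void main(String[] args) throws IOException {
            final StringBuffer current = new StringBuffer();
            final StringBuffer best = new StringBuffer();
            HTMLEditorKit.ParserCallback cb =
                    new HTMLEditorKit.ParserCallback() {
                public void handleText(char[] data, int pos) {
                    current.append(data).append(' ');
                }
                public void handleStartTag(HTML.Tag t,
                        MutableAttributeSet a, int pos) {
                    // Treat common block-level tags as block boundaries.
                    if (t == HTML.Tag.DIV || t == HTML.Tag.TD
                            || t == HTML.Tag.P) {
                        if (current.length() > best.length()) {
                            best.setLength(0);
                            best.append(current);
                        }
                        current.setLength(0);
                    }
                }
            };
            Reader in = new FileReader(args[0]);
            new ParserDelegator().parse(in, cb, true);
            // The last block never hits another boundary, so check it too.
            if (current.length() > best.length()) {
                best.setLength(0);
                best.append(current);
            }
            System.out.println(best.toString().trim());
        }
    }

A real system has to weigh many more cues than raw block length, but
even this crude heuristic shows why working from the parsed page
structure is more robust than matching raw HTML strings.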

Min-Yen Kan
National University of Singapore

On 8/9/05, Helge Thomas Hellerud <helgetho at stud.ntnu.no> wrote:
> Hello,
> 
> I want to extract the article text of an HTML page (for instance the text
> of a news article). But an HTML page contains a lot of "noise", like menus
> and ads. So I want to ask if anyone knows a way to eliminate unwanted
> elements like menus and ads, and to extract only the editorial article text.
> 
> Of course, I can use regexes to look for patterns in the HTML code, by
> defining a starting point and an ending point (see the sketch below). But
> that solution is a hack that will break if the pattern in the HTML page
> suddenly changes. So do you know how to extract the content without
> resorting to such a hack?
> 
> Thanks in advance.
> 
> Helge Thomas Hellerud
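
As an aside, the brittle regex approach Helge describes would look
something like the fragment below.  The start/end markers are invented
for illustration; each site (and each site redesign) would need its own
pattern, which is exactly why the approach is a hack.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Naive extraction between hard-coded markers. Any change to the
    // site's template silently breaks the pattern.
    public class RegexScrape {
        public static void main(String[] args) {
            String html =
                "<html>... <div class=\"story\">Article text.</div> ...</html>";
            Pattern p = Pattern.compile(
                "<div class=\"story\">(.*?)</div>", Pattern.DOTALL);
            Matcher m = p.matcher(html);
            if (m.find()) {
                System.out.println(m.group(1).trim());
            }
        }
    }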


