[Corpora-List] Extracting only editorial content from a HTML page
Ken Litkowski
ken at clres.com
Tue Aug 9 15:00:55 UTC 2005
My approach is based on the HTML tags themselves, rather than the more
elaborate DOMs and REs suggested in other responses to this message. The
problem in basic HTML is that <p>'s don't have to be closed. But you
can assume that if you've hit an opening <p>, any prior one is now
closed. So now you've got a stretch of material, and you can examine it
for any other tags (which almost always do have a closing tag), remove
those tags, and perhaps what's inside them. This will get rid of links,
<img> elements, etc. This is the starting point for your algorithm, and
you then refine it from there. (One main problem with a <p> is that it
may be embedded in a table, so you have to decide what you want to do
with tabular material.)
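A minimal sketch of that heuristic in Python (the function name and the regular expressions are my own, not anything from the original post): each opening <p> is treated as implicitly closing the previous one, and the remaining tags in each stretch are stripped out.

```python
import re

def extract_paragraphs(html):
    """Rough sketch of the heuristic above: split the page at each
    opening <p> (which implicitly closes the prior one), then strip
    the other tags from each stretch of material."""
    # Everything before the first <p> is discarded; each remaining
    # piece is one paragraph's worth of material.
    chunks = re.split(r'<p[^>]*>', html, flags=re.IGNORECASE)[1:]
    paragraphs = []
    for chunk in chunks:
        # Honor an explicit </p> if one happens to be present.
        chunk = re.split(r'</p>', chunk, flags=re.IGNORECASE)[0]
        # Remove elements whose contents we don't want at all...
        chunk = re.sub(r'<script.*?</script>', '', chunk,
                       flags=re.IGNORECASE | re.DOTALL)
        # ...then strip the remaining tags (links, <img>, <b>, ...)
        # while keeping the text between them.
        text = re.sub(r'<[^>]+>', '', chunk)
        text = ' '.join(text.split())
        if text:
            paragraphs.append(text)
    return paragraphs
```

As the post notes, a <p> inside a table still comes through here; refining what to do with tabular (and other) material is the part you'd iterate on.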
Clearly, basic HTML is the most difficult; XHTML wouldn't have as many
problems. And then you start getting into all sorts of other web
pages. Unless you have the resources (both time and money) to devote to
a more elaborate solution, you can do surprisingly well this way.
Ken
Helge Thomas Hellerud wrote:
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and only extract the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
--
Ken Litkowski TEL.: 301-482-0237
CL Research EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA Home Page: http://www.clres.com