[Corpora-List] Extracting only editorial content from a HTML page

Hal Daume III hdaume at ISI.EDU
Tue Aug 9 11:02:36 UTC 2005


I looked at this a while ago; the solution I came up with is not perfect,
but it seems to do a pretty good job, at least with news-like web pages.
The key idea is to look for a subsequence of the text with the highest
ratio of (# of words) to (# of HTML tags).  You can do this fairly easily
if you first tokenize the web page into sequences of HTML tags and
sequences of non-tag text, and then use simple dynamic programming to find
the longest contiguous sequence that's "mostly" words.  The only thing this
misses is web pages that look like "First paragraph of article <some ads>
rest of article."  In those cases the first paragraph is often lost.  You
could probably fix this heuristically, but I didn't.
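
Here's a rough Python sketch of what I mean (the tokenizer regex and the
+1-per-word / -3-per-tag weights are just placeholders I picked for
illustration, not something carefully tuned; adjust them for your own
pages):

import re

TOKEN = re.compile(r'<[^>]+>|[^<\s]+')  # a token is a tag or a run of non-space text
TAG_PENALTY = 3                         # assumed cost per tag; tune to taste

def extract_main_text(html):
    tokens = TOKEN.findall(html)
    # score each token: words count toward the text, tags count against it
    scores = [-TAG_PENALTY if t.startswith('<') else 1 for t in tokens]

    # Kadane-style dynamic programming: find the contiguous span of tokens
    # with the highest total score, i.e. the region that is "mostly" words
    best_sum = cur_sum = 0
    best_start = best_end = cur_start = 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = s, i   # restart the candidate span here
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1

    # keep only the word tokens inside the winning span
    return ' '.join(t for t in tokens[best_start:best_end]
                    if not t.startswith('<'))

(This doesn't strip <script> or comment contents, and as noted above it
will still lose a leading paragraph that an ad block separates from the
rest of the article.)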

On Tue, 9 Aug 2005, Helge Thomas Hellerud wrote:

> Hello,
> 
> I want to extract the article text of an HTML page (for instance the text
> of a news article).  But an HTML page contains a lot of "noise", like menus
> and ads.  So I want to ask if anyone knows a way to eliminate unwanted
> elements like menus and ads, and extract only the editorial article text?
> 
> Of course, I can use regexes to look for patterns in the HTML code (by
> defining a starting point and an ending point), but that solution is a hack
> that will break if the pattern in the HTML page suddenly changes.  So do
> you know how to extract the content without resorting to such a hack?
> 
> Thanks in advance.
> 
> Helge Thomas Hellerud
> 
> 

-- 
 Hal Daume III                                   | hdaume at isi.edu
 "Arrest this man, he talks in maths."           | www.isi.edu/~hdaume


