[Corpora-List] Extracting only editorial content from a HTML page
Ken Litkowski
ken at clres.com
Tue Aug 9 15:00:55 UTC 2005
My approach is based on the HTML tags themselves, rather than the more
elaborate DOMs and REs suggested in other responses to this message. The
problem in basic HTML is that <p>'s don't have to be closed. But you
can assume that if you've hit an opening <p>, any prior one is now
closed. So now you've got a stretch of material, and you can examine it
for any other tags (which almost always do have a closing tag), remove
those tags, and perhaps what's inside them. This will get rid of links,
<img> elements, etc. This is the starting point for your algorithm, and
you then refine it from there. (One main problem with a <p> is that it
may be embedded in a table, so you have to decide what you want to do
with tabular material.)
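A minimal sketch of that heuristic in Python (the function name and the regular expressions are my own, not anything from the original post): each opening <p> is treated as implicitly closing the previous one, and the remaining tags in each stretch are stripped out.

```python
import re

def extract_paragraphs(html):
    """Rough sketch of the heuristic above: split the page at each
    opening <p> (which implicitly closes the prior one), then strip
    the other tags from each stretch of material."""
    # Everything before the first <p> is discarded; each remaining
    # piece is one paragraph's worth of material.
    chunks = re.split(r'<p[^>]*>', html, flags=re.IGNORECASE)[1:]
    paragraphs = []
    for chunk in chunks:
        # Honor an explicit </p> if one happens to be present.
        chunk = re.split(r'</p>', chunk, flags=re.IGNORECASE)[0]
        # Remove elements whose contents we don't want at all...
        chunk = re.sub(r'<script.*?</script>', '', chunk,
                       flags=re.IGNORECASE | re.DOTALL)
        # ...then strip the remaining tags (links, <img>, <b>, ...)
        # while keeping the text between them.
        text = re.sub(r'<[^>]+>', '', chunk)
        text = ' '.join(text.split())
        if text:
            paragraphs.append(text)
    return paragraphs
```

As the post notes, a <p> inside a table still comes through here; refining what to do with tabular (and other) material is the part you'd iterate on.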
Clearly, basic HTML is the most difficult; XHTML wouldn't have as many
problems. And then you start getting into all sorts of other web
pages. Unless you have the resources (both time and money) to devote to
a more elaborate solution, you can do surprisingly well this way.
Ken
Helge Thomas Hellerud wrote:
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and only extract the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
--
Ken Litkowski TEL.: 301-482-0237
CL Research EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA Home Page: http://www.clres.com