[Corpora-List] SUMMARY: Extracting only editorial content from a HTML page

Helge Thomas Hellerud helgetho at stud.ntnu.no
Wed Aug 10 20:38:28 UTC 2005


Hello,

Thanks to everyone who answered my question. The response has been
enormous. Some answers relate to my description of using regular
expressions to look for patterns, followed by HTML cleaning. Here is a
summary of the other approaches (some replies were also sent directly to
the list):

- Aidan Finn's BTE module: http://www.smi.ucd.ie/hyppia/. 

- A Java-based sample that needs to be modified:
http://javaalmanac.com/egs/javax.swing.text.html/GetText.html

- An object model to load and walk the page (I used Microsoft's
implementation of the DOM (http://www.webreference.com/js/column40/)).
Essentially, any web page is parsed and loaded into this, and then
represented by a number of software objects that one can walk, manipulate,
etc.  The main advantage of this approach is that the DOM essentially
reformats the source HTML so that it is consistent (adding elements as
needed to make it 'good').
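
A rough Python sketch of the "parse, then walk the objects" idea. Python's
standard library has no HTML DOM builder, so this uses the event-based
html.parser instead of Microsoft's DOM, and which elements count as
editorial content is an assumption:

    from html.parser import HTMLParser

    class TextCollector(HTMLParser):
        KEEP = {'p', 'h1', 'h2', 'h3', 'li'}   # assumed "editorial" elements

        def __init__(self):
            super().__init__()
            self.depth = 0        # > 0 while inside an element we keep
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in self.KEEP:
                self.depth += 1

        def handle_endtag(self, tag):
            if tag in self.KEEP and self.depth > 0:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth > 0 and data.strip():
                self.chunks.append(data.strip())

    parser = TextCollector()
    parser.feed(open('page.html', encoding='utf-8').read())
    print('\n'.join(parser.chunks))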

- If you have access to several articles from the same source, you can
delete everything that is identical (or very similar) across articles,
working inwards from the top and the bottom of each page.
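
A rough sketch of that idea in Python, for two articles that have already
been converted to plain text (the file names are hypothetical):

    def strip_common_edges(a, b):
        """Return the lines of a with the top and bottom it shares with b removed."""
        top = 0
        while top < min(len(a), len(b)) and a[top] == b[top]:
            top += 1
        bottom = 0
        while bottom < min(len(a), len(b)) - top and a[-1 - bottom] == b[-1 - bottom]:
            bottom += 1
        return a[top:len(a) - bottom]

    a = open('article1.txt', encoding='utf-8').read().splitlines()
    b = open('article2.txt', encoding='utf-8').read().splitlines()
    print('\n'.join(strip_common_edges(a, b)))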

- I have used UNIX lynx (with the -dump option) to extract plain text from
HTML pages, which gets rid of most of the unwanted text you mentioned. I
have also been looking at some research from Microsoft based on the DOM and
on segmenting web pages according to their visual appearance. They are able
to spot regular patterns on web pages such as ads, menus, etc.
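
For example, lynx can be called from a script; -dump is the option
mentioned above, while -nolist (which suppresses the list of link URLs at
the end of the dump) is an assumption worth checking against your lynx
version:

    import subprocess

    text = subprocess.run(
        ['lynx', '-dump', '-nolist', 'http://example.com/article.html'],
        capture_output=True, text=True, check=True
    ).stdout
    print(text)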

- I looked at this a while ago; the solution I came up with is not perfect,
but seems to do a pretty good job, at least with news-like web pages.  The
key idea is to look for a subsequence of the text with the highest (# of
words) to (# of HTML tags) ratio.  You can do this fairly easily if you
first tokenize the web page into sequences of HTML tags and sequences of
non-tag text, and then do simple dynamic programming to find the longest
contiguous sequence that's "mostly" words.  The only thing this misses is
web pages that look like "First paragraph of article <some ads> rest of
article."  In these cases, the first paragraph is often lost.  You could
probably fix this heuristically, but I didn't.
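
A rough sketch of this word-to-tag idea in Python. The scoring (+1 per
word, -1 per tag) and the maximum-sum contiguous run (Kadane's algorithm)
are assumptions of mine, not the contributor's exact method:

    import re

    def extract_main_text(html):
        # Tokenize into tags and runs of non-tag text.
        tokens = re.findall(r'<[^>]*>|[^<]+', html)
        scored = [(t, -1 if t.startswith('<') else len(t.split())) for t in tokens]

        # Find the maximum-scoring contiguous run of tokens.
        best_sum, best_range = 0, (0, 0)
        cur_sum, cur_start = 0, 0
        for i, (_, s) in enumerate(scored):
            if cur_sum <= 0:
                cur_sum, cur_start = 0, i
            cur_sum += s
            if cur_sum > best_sum:
                best_sum, best_range = cur_sum, (cur_start, i + 1)

        start, end = best_range
        return ' '.join(t for t, _ in scored[start:end] if not t.startswith('<'))

    print(extract_main_text(open('page.html', encoding='utf-8').read()))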

- Since I was scraping text from a limited number of Albanian-language
websites, it was easy for me to search for repeated text in every page
coming from the same site. The repeated text was removed, which meant I
kept only one copy containing everything. One of the sites I was spidering
changed its format three times in two months, which generated quite a bit
of noise. The only way to get rid of it was to go back to regexes; I ended
up using only regexes in the end. As for the "sudden" changes, you could
use the absence of text repetition as a signal that something has changed
and then modify the regexes accordingly.

- We have developed a tool in Java, available through SourceForge, to help
people do this task and others where some fragment of the web page needs to
be identified and/or extracted.  We have experimented with tagging and
extracting the main text, navigation links, title, headers, etc. from news
stories on various sites on the web.  Our software, PARCELS, also partially
handles sites that use XHTML/CSS (e.g. <DIV> tags) to place text.

You can find PARCELS on sourceforge at http://parcels.sourceforge.net

It may be overkill for a simple problem, but if you need to extract the same
type of information from multiple websites with different formats, this
toolkit may be of help.

- My approach is based on the HTML tags, rather than the more elaborate DOMs
and REs (as suggested in other responses to this message).  The problem in
basic HTML is that <p>'s don't have to be closed.  But, you can assume that
if you've got an opening <p>, then any prior one is now closed.  So, now
you've got a stretch of material and you can examine it for any other tags,
which almost always have a closing tag, and remove those tags, and perhaps
what's in them.  This will get rid of links, <img> elements, etc.  This is
the starting point for your algorithm, and you then refine it from there.
(One main problem with a <p> is that it may be embedded in a table, so you
have to decide what you want to do with tabular material.)

Clearly, basic HTML is the most difficult; XHTML wouldn't have as many
problems.  And then you start getting into all sorts of other web pages.
Even if you don't have the resources (both time and money) to devote to a
more elaborate solution, you can do surprisingly well with this.
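
A rough Python sketch of the <p>-based approach: treat each opening <p> as
closing the previous one, drop the content of a few tags entirely, and
strip whatever other tags remain. The choice of tags whose content is
discarded is an assumption:

    import re

    DROP_CONTENT = ('script', 'style')   # assumed: tags whose content is thrown away

    def paragraphs(html):
        # Cut the page into stretches that start at an opening <p>.
        parts = re.split(r'(?i)<p\b[^>]*>', html)[1:]
        cleaned = []
        for part in parts:
            part = re.split(r'(?i)</p>', part)[0]       # honour an explicit </p> if present
            for tag in DROP_CONTENT:
                part = re.sub(rf'(?is)<{tag}\b.*?</{tag}>', ' ', part)
            part = re.sub(r'(?s)<[^>]*>', ' ', part)     # remove any remaining tags
            part = re.sub(r'\s+', ' ', part).strip()
            if part:
                cleaned.append(part)
        return cleaned

    for p in paragraphs(open('page.html', encoding='utf-8').read()):
        print(p)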

- For this task I use Python and BeautifulSoup: 
http://www.crummy.com/software/BeautifulSoup/. It's an extremely flexible
and robust DOM-ish parser, very well-suited for extracting bits of text out
of web pages.
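
A minimal sketch with the current bs4 package (newer than the version
available in 2005, but the idea is the same); treating <p> as the editorial
element and dropping script/style is an assumption:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open('page.html', encoding='utf-8').read(), 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()                     # remove non-content elements
    text = '\n'.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))
    print(text)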

- The other tool for this purpose which no-one has (so far) mentioned is
tidy -- http://tidy.sourceforge.net. It will take almost any HTML and turn
it into something usable very quickly; it's also very robust, and there is
a choice of APIs for integrating it into your own production system.
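
For example, tidy can be run as a preprocessing step from a script; the
-asxhtml and -quiet options are my assumption about the command line, so
check tidy -help for your build:

    import subprocess

    # tidy exits non-zero when it only issues warnings, so don't use check=True.
    result = subprocess.run(
        ['tidy', '-asxhtml', '-quiet', 'page.html'],
        capture_output=True, text=True
    )
    xhtml = result.stdout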

- First I extract the text from the HTML (I use lynx -dump for this).
Next I count the number of times every line in the collection of files
occurs.  Then I (manually) scan through the generated list and set a more or
less arbitrary threshold for filtering out the stuff I don't want, e.g. any
line that occurs more than 10 times (keeping an eye out for lines which may
have a high frequency for some other reason).

This edited list is then used as a filter - all lines which feature in it
are deleted from the collection of files.

Despite its dirtiness, this might have certain advantages.  It seems to work
quite robustly and is very quick (at least, for modest corpora of
~1 million words).  It allows you to remove things like "More >>" links
which often occur at the end of paras, rather than in header/footer or
navigation panels.  Moreover, you are able to keep information about the
frequency of boilerplate and furniture elements, while filtering them out of
the main corpus.

On the down side, it requires tailoring to each website from which you wish
to collect data - which in our specific case happens not to be a problem.
Some revision would be necessary if the corpus were to be updated with new
material from a previously collected site.  It is also likely that some
things are cut which you might want to keep (e.g.
frequent subheadings, which occur on many pages, whilst not coming under
header/footer/navigation panel categories).  Similarly, some unwanted text
gets through.  

On the whole it seems to work well enough for us, though.
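
A rough Python sketch of the frequency filter described above; the
threshold of 10 comes from the description, the paths are hypothetical, and
in practice the candidate list would be inspected by hand before being
applied:

    import glob
    from collections import Counter

    THRESHOLD = 10
    files = glob.glob('dump/*.txt')        # lynx -dump output, one file per page

    counts = Counter()
    for f in files:
        counts.update(line.strip() for line in open(f, encoding='utf-8'))

    # Candidate boilerplate lines (to be checked by hand before use).
    stoplist = {line for line, n in counts.items() if n > THRESHOLD and line}

    for f in files:
        kept = [line for line in open(f, encoding='utf-8')
                if line.strip() not in stoplist]
        open(f + '.filtered', 'w', encoding='utf-8').writelines(kept)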

- Another useful reference is the VIPS work from Microsoft:
http://research.microsoft.com/research/pubs/view.aspx?tr_id=690

Helge Thomas Hellerud


