[Corpora-List] Extracting only editorial content from a HTML page
peetm
peet.morris at comlab.ox.ac.uk
Tue Aug 9 10:48:06 UTC 2005
I started looking at this problem a couple of years ago (I've since changed
tack, so I'm no longer working on it).
However, the approach I used was roughly as follows.
I first used regular expressions, but soon gave up on them - it's amazing
how well [some] browsers cope with badly formatted HTML (which can throw off
your regexps).
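As a rough illustration of the brittleness (a Python sketch for this message, not the code I actually used): a pattern that assumes every <p> has a matching </p> silently drops any paragraph whose close tag is missing.

```python
import re

# Badly formatted HTML: misordered close tags, and a <p> that is never closed
html = "<p><b>first paragraph</p></b><p>second paragraph"

# A naive pattern that assumes every <p> has a matching </p>
paras = re.findall(r"<p>(.*?)</p>", html, re.DOTALL)

# Only the first paragraph is found; the unclosed one is silently lost
print(paras)  # ['<b>first paragraph']
```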
So, in the end, I used an object model to load and walk the page (I used
Microsoft's implementation of the DOM
(http://www.webreference.com/js/column40/)) - essentially, any webpage is
parsed and loaded into this, and then represented by a number of software
objects that one can walk, manipulate, etc. The main advantage of this
approach is that the DOM essentially reformats the source HTML so that it is
consistent (adding elements as needed to make it 'good').
For example, if the source contained this
1. <p><b>this is some text</p></b>
Or this
2. <p><b>this is some text
Or this
3. <p><b>this is some text</b>
The object model I used 'rendered' it as
<p>
<b>
this is some text
</b>
</p>
So, it 'fixed' the bad tag ordering in '1', added the missing </b></p> in
'2', and the missing </p> in '3' - very clever parsing! BTW, some DOMs do
this better than others, of course - it's one reason some browsers display
certain pages better than others: how well does their DOM 'fix' the HTML?
The object model also makes it easy to ignore tags (tags are simply node
types in the model) - or to select just, say, the paragraph sections of a
page.
I did the latter, and then threw out any paragraph that contained only a
single sentence, or other junk (like images).
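In Python terms, that step might look like this (a sketch of the filtering idea, with made-up thresholds - min_sentences and the sentence-splitting regexp are my illustrative choices, not the values I actually used):

```python
from html.parser import HTMLParser
import re

class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element, noting whether it
    contains an <img> so junk paragraphs can be filtered later."""

    def __init__(self):
        super().__init__()
        self.paras = []   # (text, has_image) pairs
        self.in_p = False
        self.buf = []
        self.has_img = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.buf, self.has_img = True, [], False
        elif tag == "img" and self.in_p:
            self.has_img = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            text = " ".join("".join(self.buf).split())  # collapse whitespace
            self.paras.append((text, self.has_img))
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)

def editorial_paragraphs(html, min_sentences=2):
    """Keep paragraphs with at least min_sentences sentences and no
    images - a crude stand-in for the junk filter described above."""
    p = ParagraphExtractor()
    p.feed(html)
    keep = []
    for text, has_img in p.paras:
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
        if len(sentences) >= min_sentences and not has_img:
            keep.append(text)
    return keep

html = (
    "<p>Home</p>"
    "<p><img src='ad.gif'>Buy now. Limited offer.</p>"
    "<p>The first sentence of the article. The second sentence of the article.</p>"
)
print(editorial_paragraphs(html))
```

The single-word menu item and the paragraph containing an image are discarded; only the multi-sentence article text survives.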
It worked pretty well, although it was a little slower than it might have
been using regexps.
peetm
email: peet.morris at clg.ox.ac.uk
addr: Computational Linguistics Group
University of Oxford
The Clarendon Institute
Walton Street
Oxford
OX1 2HG
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Helge Thomas Hellerud
Sent: 09 August 2005 10:43
To: corpora at uib.no
Subject: [Corpora-List] Extracting only editorial content from a HTML page
Hello,
I want to extract the article text of an HTML page (for instance, the text
of a news article). But an HTML page contains a lot of "noise", like menus
and ads. So I want to ask if anyone knows a way to eliminate unwanted
elements like menus and ads, and extract only the editorial article text?
Of course, I can use regexes to look for patterns in the HTML code (by
defining a starting point and an ending point), but that solution is a hack
that will break if the pattern in the HTML page suddenly changes. So, do you
know how to extract the content without using such a hack?
Thanks in advance.
Helge Thomas Hellerud