[Corpora-List] Extracting only editorial content from a HTML page

Andy Roberts andyr at comp.leeds.ac.uk
Tue Aug 9 11:44:46 UTC 2005


I agree with this approach. Provided you are happy to assume that all of
the content sits between <p> tags, parse the HTML document into a DOM
tree, then walk the tree and extract the text from every <p> element.
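
As a rough illustration of the idea, here is a minimal sketch in Python
using only the standard library's html.parser. It is not the JTidy or
Microsoft DOM code discussed in this thread, and it does not build a full
DOM tree: it just collects the text seen while a <p> is open. The names
ParagraphExtractor and extract_paragraphs are made up for this example. It
assumes that a new <p>, a </p>, or the end of the input closes the current
paragraph, which is enough to cope with the three malformed snippets quoted
below.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements, tolerating missing </p> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text chunks of the open <p>, or None outside one
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def _close_paragraph(self):
        if self._buffer is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            # A new <p> implicitly closes any paragraph still open.
            self._close_paragraph()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        # Keep text only while a paragraph is open and we are not in a script.
        if self._buffer is not None and self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._close_paragraph()  # flush a paragraph left open at end of input


def extract_paragraphs(html):
    """Return the text of each <p> in the given HTML string, in order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Inline tags such as <b> are simply ignored, so their text ends up in the
enclosing paragraph, and anything outside a <p> (menus in <div>s, for
instance) is dropped. A real DOM parser will repair far more kinds of
broken markup than this does.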

I'm not sure whether there is an existing program that does this. I
have used the JTidy library to do something similar, but I had to write
some code of my own to make use of it. JTidy is a Java port of the HTML
Tidy project. It does more than tidy up markup, though: it can also
parse HTML into a DOM structure, and it can do plenty of other things
with HTML too.

Andy

On Tue, 9 Aug 2005, peetm wrote:

> I started looking at this problem a couple of years ago (I have since
> changed tack, so I am no longer working on it).
>
> However, the approach I used was roughly as follows.
>
> I first used regular expressions, but soon gave up on them. It's amazing
> how well [some] browsers cope with badly formed HTML of the kind that can
> throw your regexps.
>
> So, in the end, I used an object model to load/walk the page (I used
> Microsoft's implementation of the DOM
> (http://www.webreference.com/js/column40/)) - essentially, any webpage is
> parsed and loaded into this, and then represented by a number of software
> objects that one can walk, and manipulate etc.  The main advantage of this
> approach is that the DOM essentially reformats the source HTML so that it is
> consistent (adding elements as needed etc to make it 'good').
>
> For example, if the source contained this
>
> 1. <p><b>this is some text</p></b>
>
> Or this
>
> 2. <p><b>this is some text
>
> Or this
>
> 3. <p><b>this is some text</b>
>
> The object model I used 'rendered' it as
>
> <p>
> 	<b>
> 		This is some text
> 	</b>
> </p>
>
> So, it 'fixed' the bad tag ordering in '1', added the </b></p> in '2', and
> the </p> in '3' - very clever parsing!  BTW, some DOMs do better at this
> than others of course (one of the reasons that some browsers display certain
> pages better than others do - does their DOM 'fix' the HTML?)!
>
> The object model also allows one to easily ignore tags (the tags are simply
> node types in the model) - or - enables one to just select (say) paragraph
> sections of a page.
>
> I did the latter, and then threw out any paragraphs that contained single
> sentences, or other junk stuff (like images).
>
> It worked pretty well, although it was a little slower than it might have
> been using regexps.
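
The "throw out single-sentence paragraphs" step could look something like
the following minimal sketch in Python. This is not peetm's code. The
function names, the crude sentence-boundary regular expression, and the
default threshold of two sentences are all assumptions made for the example.
It takes paragraph strings that have already been pulled out of the page.

```python
import re

# Rough sentence boundary: '.', '!' or '?' followed by whitespace or end of text.
# Abbreviations such as "Dr." will be miscounted; this is only a heuristic.
_SENTENCE_END = re.compile(r"[.!?](?:\s+|$)")


def count_sentences(text):
    """Estimate the number of sentences in a piece of text."""
    return len(_SENTENCE_END.findall(text))


def keep_editorial(paragraphs, min_sentences=2):
    """Keep only the paragraphs that contain at least min_sentences sentences."""
    return [p for p in paragraphs if count_sentences(p.strip()) >= min_sentences]
```

For example, given the paragraphs "Home | News | Sport", "Read more." and
"The council met on Tuesday. It approved the budget.", only the last one
survives the filter. Menu fragments and captions tend to contain no sentence
punctuation, or just one sentence, which is why so simple a test removes
much of the noise.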
>
>
>
> peetm
>
> email: peet.morris at clg.ox.ac.uk
>
> addr: Computational Linguistics Group
>       University of Oxford
>       The Clarendon Institute
>       Walton Street
>       Oxford
>       OX1 2HG
>
>
> -----Original Message-----
> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
> Behalf Of Helge Thomas Hellerud
> Sent: 09 August 2005 10:43
> To: corpora at uib.no
> Subject: [Corpora-List] Extracting only editorial content from a HTML page
>
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text
> of a news article). But an HTML page contains a lot of "noise", like menus
> and ads. So I want to ask whether anyone knows a way to eliminate unwanted
> elements like menus and ads, and extract only the editorial article text.
>
> Of course, I can use regexes to look for patterns in the HTML code (by
> defining a starting point and an ending point), but that solution will be a
> hack that stops working if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
>
>
>

