[Corpora-List] Extracting only editorial content from a HTML page

Martin Thomas M.Thomas at leeds.ac.uk
Wed Aug 10 07:38:51 UTC 2005


Hi all,

I have been playing with a very crude approach to this problem
(removal of boilerplate and other furniture such as
header/footer/navigation panels), which I don't think anyone has
mentioned here yet...

First I extract the text from the HTML (I use lynx -dump for this).
Next I count the number of times every line in the collection of files
occurs.  Then I (manually) scan through the generated list and set a
more or less arbitrary threshold for filtering out the stuff I don't
want, e.g. any line that occurs more than 10 times (keeping an eye out
for lines which may have a high frequency for some other reason).

This edited list is then used as a filter - all lines which feature in
it are deleted from the collection of files.
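The two steps above can be sketched roughly as follows (this is not my
actual script, just an illustration; it assumes the HTML has already
been dumped to plain text, e.g. with "lynx -dump page.html > page.txt",
and the function names are made up for the example):

```python
from collections import Counter

def build_filter_list(texts, threshold=10):
    """Count how often each (stripped) line occurs across the whole
    collection of files, and return the set of lines occurring more
    than `threshold` times -- candidate boilerplate/furniture."""
    counts = Counter()
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if line:
                counts[line] += 1
    return {line for line, n in counts.items() if n > threshold}

def strip_boilerplate(text, filter_lines):
    """Delete every line that features in the filter list."""
    kept = [ln for ln in text.splitlines()
            if ln.strip() not in filter_lines]
    return "\n".join(kept)
```

In practice there is a manual step in between: I scan the generated
frequency list by eye and edit it before using it as the filter, to
rescue lines that are frequent for some other reason (e.g. recurring
subheadings).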

Despite its dirtiness, this might have certain advantages.  It seems to
work quite robustly and is very quick (at least, for modest corpora of
~1 million words).  It allows you to remove things like "More >>" links
which often occur at the end of paras, rather than in header/footer or
navigation panels.  Moreover, you are able to keep information about the
frequency of boilerplate and furniture elements, while filtering them
out of the main corpus.

On the down side, it requires tailoring to each website from which you
wish to collect data - which in our specific case happens not to be a
problem.  Some revision would be necessary if the corpus were to be
updated with new material from a previously collected site.  It is also
likely that some things are cut which you might want to keep (e.g.
frequent subheadings, which occur on many pages, whilst not coming under
header/footer/navigation panel categories).  Similarly, some unwanted
text gets through.  

On the whole it seems to work well enough for us, though.

Best,
Martin Thomas

Centre for Translation Studies
University of Leeds

On Tue, 2005-08-09 at 22:02 -0400, Mike Maxwell wrote:
> Lou Burnard wrote:
>  > The other tool for this purpose which no-one has (so far) mentioned is
>  > tidy -- http://tidy.sourceforge.net
>  >
>  > It will take almost any html and turn it into something usable very
>  > fast; it's also very robust and there is a choice of APIs for
>  > integrating it into your own production system
> 
> I think the original question was how to deal with the boilerplate text 
> that often appears at the top and bottom of html files, so it doesn't get 
> included in the text one extracts from a web page.  (If that wasn't the 
> original question, it's mine :-).)  By "boilerplate", I mean things like 
> copyright notices, "Enroll in our big extravaganza", "Download our super 
> font", menu items, and other such trash.
> 
> I dealt with that in some work I did by using regexs tailored to the sort 
> of trash that each web site used.  But the regexs had to be tailored, they 
> were fragile when a site changed its boilerplate (as someone else pointed 
> out), and you could in fact run out of stack space in Python (and 
> presumably other interpreters), so you had to be careful how you designed 
> your regexs.  All in all, not a very good solution.
> 
> I should look back and see if I can just skip to the first <p> tag, but 
> again, I doubt whether that will work for all sites: some of them put the 
> main text into tables, IIRC.
> 
> Possibly I could do some sort of language ID (since all of the texts I 
> wanted were non-English).  But then again, some of the menu items were 
> non-English.  Or given that this stuff is boilerplate, and tends to change 
> slowly at any one web site, maybe I could train a recognizer for the 
> boilerplate (as opposed to a recognizer for the text).  Has anyone tried 
> that?  (One piece that sometimes occurs inside the boilerplate, and which 
> changes rapidly, is the date.  Again, I used a regex "solution".)
> 
> I haven't tried the "look for the place where you start to get a higher 
> text-to-tag ratio" method that was also mentioned.
> 
> It looks to me like 'tidy' is intended to handle incorrectly structured 
> html.  Can it be used to extract text, and in particular to throw away 
> header and footer boilerplate?
