[Corpora-List] Extracting only editorial content from a HTML page

Serge Sharoff S.Sharoff at leeds.ac.uk
Wed Aug 10 08:54:17 UTC 2005


Just to extend the applicability of the "crude" method.  I used it on corpora of 100-150 million words collected from thousands of websites, and it works just fine:
sort <collected_corpus_file | uniq -c | sort -nr -k 1 | head -1000
takes about 2 hours on a modern computer and detects the majority of navigation frames and boilerplate to filter out (a sketch of the filtering step itself is given below).  This method is corpus-dependent, though.  Another, corpus-independent approach has already been hinted at on this list: Finn's BTE module
http://www.smi.ucd.ie/hyppia/
which analyses the density of links in a file and removes high-density areas, which typically correspond to navigation frames.  This works remarkably well even on single files, but it doesn't remove boilerplate, such as "Powered by vBulletin" or "The BBC is not responsible for the content of external internet sites".
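
To actually apply the output of that one-liner as a filter, something along
these lines should work (the file names here are just placeholders, and the
top-1000 list would of course be inspected by hand first, as Martin describes):

  sort <collected_corpus_file | uniq -c | sort -nr -k 1 | head -1000 \
    | sed 's/^ *[0-9]* //' >boilerplate_lines
  grep -vxFf boilerplate_lines collected_corpus_file >filtered_corpus_file

The sed step strips the counts that uniq -c adds; grep -vxF then deletes only
those lines that match an entry in the list exactly and in full.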
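
And for anyone who just wants the flavour of the link-density idea without
installing BTE, a toy approximation is to drop every line of the raw HTML in
which anchor tags account for a large share of the tokens, e.g.

  awk '{ n = gsub(/<[aA][ >]/, "&"); if (NF == 0 || n / NF < 0.5) print }' \
    page.html >page_stripped.html

The 0.5 threshold and the line-by-line granularity are arbitrary, and this is
not Finn's actual algorithm, which works over the whole token stream rather
than line by line; but it illustrates the principle.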
Serge

> -----Original Message-----
> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
> Behalf Of Martin Thomas
> Sent: Wednesday, August 10, 2005 9:10 AM
> To: corpora at uib.no
> Subject: Re: [Corpora-List] Extracting only editorial content from a
> HTML page
> 
> Hi all,
> 
> I have been playing with a very crude approach to this problem
> (removal of boilerplate and other furniture such as
> header/footer/navigation panels), which I don't think anyone has
> mentioned here yet...
> 
> First I extract the text from the HTML (I use lynx -dump for this).
> Next I count the number of times every line in the collection of files
> occurs.  Then I (manually) scan through the generated list and set a
> more or less arbitrary threshold for filtering out the stuff I don't
> want, e.g. any line that occurs more than 10 times (keeping an eye out
> for lines which may have a high frequency for some other reason).
> 
> This edited list is then used as a filter - all lines which feature in
> it are deleted from the collection of files.
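> 
> In shell terms the whole procedure is roughly the following (the file
> names are placeholders, and hand-editing the list is the manual step
> described above):
> 
>   for f in *.html; do lynx -dump "$f" >"$f.txt"; done
>   cat *.txt | sort | uniq -c | sort -rn >line_counts
>   # inspect line_counts by hand, strip the counts, save as filter_list
>   for f in *.txt; do grep -vxFf filter_list "$f" >"$f.clean"; done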
> 
> Despite its dirtiness, this might have certain advantages.  It seems to
> work quite robustly and is very quick (at least, for modest corpora of
> ~1 million words).  It allows you to remove things like "More >>" links
> which often occur at the end of paras, rather than in header/footer or
> navigation panels.  Moreover, you are able to keep information about the
> frequency of boilerplate and furniture elements, while filtering them
> out of the main corpus.
> 
> On the down side, it requires tailoring to each website from which you
> wish to collect data - which in our specific case happens not to be a
> problem.  Some revision would be necessary if the corpus were to be
> updated with new material from a previously collected site.  It is also
> likely that some things are cut which you might want to keep (e.g.
> frequent subheadings, which occur on many pages but do not fall under
> the header/footer/navigation panel categories).  Similarly, some unwanted
> text gets through.
> 
> On the whole it seems to work well enough for us, though.
> 
> Best,
> Martin Thomas
> 
> Centre for Translation Studies
> University of Leeds
> 
> PS - I hope this message doesn't appear twice - sorry if it does - I
> originally sent it from a non-member email account.
> 


