[Corpora-List] Extracting only editorial content from a HTML page

Niels Ott niels at drni.de
Fri Aug 19 10:21:10 UTC 2005


Alexander et al,

this is a late reply. We are currently working on a project that aims
to extract corpora from the web, so of course we came across this
topic. Boilerplate removal is something we worked on recently.

Alexander Schutz wrote:
> The boilerplate removal tool worked quite well for me when I tested
> it and I've heard some good things from other people about it, too.
> check out this link and follow BTE
> http://www.smi.ucd.ie/hyppia/

This was the first approach we implemented/took over into our "toolbox".
It turned out that the BTE code is on the right track but has several
problems.

Apart from being slow, the algorithm misses boilerplate in the middle
of a page.
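
To illustrate why boilerplate in the middle of a page slips through,
here is a minimal sketch of the BTE idea as it is usually described:
pick the single contiguous token span that keeps the most words while
excluding the most tags. This is not the original BTE code. The names
TOKEN_RE, best_span and extract are made up for this example, the
tokenizer is deliberately crude, and the linear-time maximum-subarray
scan (Kadane's algorithm) is my own substitution for whatever search
the original uses.

```python
import re

# Crude tokenizer (an assumption for this sketch): a token is either
# something that looks like a tag or a run of non-space characters.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def best_span(is_tag):
    """Return inclusive (i, j) maximizing words minus tags in the span.

    Maximizing "tags outside + words inside" is the same as maximizing
    the sum of +1 per word and -1 per tag inside the span, so a
    maximum-subarray scan finds it in one pass.
    """
    best_sum, best_i, best_j = 0, 0, -1  # (0, -1) means "empty span"
    cur_sum, cur_i = 0, 0
    for k, tag in enumerate(is_tag):
        if cur_sum <= 0:
            # A non-positive prefix can never help; restart here.
            cur_sum, cur_i = 0, k
        cur_sum += -1 if tag else 1
        if cur_sum > best_sum:
            best_sum, best_i, best_j = cur_sum, cur_i, k
    return best_i, best_j


def extract(html):
    """Return the words inside the single best span, joined by spaces."""
    tokens = TOKEN_RE.findall(html)
    is_tag = [t.startswith("<") for t in tokens]
    i, j = best_span(is_tag)
    return " ".join(t for t in tokens[i:j + 1] if not t.startswith("<"))


html = (
    "<div><a>Home</a><a>News</a></div>"
    "<p>First paragraph of the article text here.</p>"
    "<div><a>Ad</a></div>"
    "<p>Second paragraph of the article continues here.</p>"
    "<div><a>Imprint</a></div>"
)
words = extract(html).split()
```

In this toy page the navigation at the top and the imprint at the
bottom are cut off, but the "Ad" block between the two paragraphs
stays in the output, because a single span cannot skip over it.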

Additionally, the original tag recognition code does not find all tags.
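
As an illustration of the kind of tags a simple pattern can miss, here
is a small sketch. NAIVE_TAG is a hypothetical pattern written for this
example, not the expression used in BTE, and TOLERANT_TAG is only one
possible repair. It handles comments, tags that span several lines, and
quoted attribute values containing ">", and nothing more.

```python
import re

# Hypothetical naive pattern: stops at the first ">" and never crosses
# a line break, so it breaks on quoted ">" and on multi-line tags.
NAIVE_TAG = re.compile(r"<[^>\n]*>")

# A more tolerant pattern (still a sketch, not a full HTML tokenizer).
TOLERANT_TAG = re.compile(
    r"""
    <!--.*?-->                        # HTML comments, possibly multi-line
  | <(?:"[^"]*"|'[^']*'|[^'">])*>     # tags; quoted values may hold a ">"
    """,
    re.DOTALL | re.VERBOSE,
)

sample = '<a title="x > y"\n   href="/home">Home</a><!-- nav > end -->text'

# Replace every recognized tag by a space and look at what is left.
naive_words = NAIVE_TAG.sub(" ", sample).split()
tolerant_words = TOLERANT_TAG.sub(" ", sample).split()
```

With the tolerant pattern only "Home" and "text" remain, while the
naive one leaves fragments of the tag and of the comment in the text.
For anything beyond such simple cases, a real HTML parser such as
Python's html.parser module is probably the sturdier choice.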

If you want to use BTE, you should be comfortable enough with
programming to repair or modify the regular expressions involved. You
should also check your output over and over again. HTML writers tend to
produce things you wouldn't dream of in your worst nightmares. ;-)

Greetingens from Tübingen/Germany,

  Niels

-- 
http://www.drni.de/niels/


