[Corpora-List] Extracting only editorial content from a HTML page

Lars Nygaard lars.nygaard at iln.uio.no
Tue Aug 16 12:45:41 UTC 2005


List members,

I've had a couple of requests for the source code I wrote for extracting 
editorial content, so I reworked my original script into something that 
should be more usable for other people (though a superficial knowledge 
of Perl is still required - this should be fixed in the next version):

http://search.cpan.org/~larsnyg/Text-Identify-BoilerPlate-0.2/

The functions can be accessed from inside Perl programs, or as a 
standalone program (rem-boilerplate-text).

My approach to this problem is to detect lines (in plain text files) 
that are repeated more than a certain number of times (optionally only 
consecutive lines at the start and end of the document), and remove them.
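
The idea above can be illustrated with a rough sketch. This is not the 
code from the module (which is Perl); it is a minimal Python illustration 
under a few assumptions: repetition is counted across a collection of 
documents, a line counts once per document, and lines are compared after 
stripping surrounding whitespace. The names find_repeated_lines, 
remove_boilerplate, max_repeats and edges_only are hypothetical.

```python
from collections import Counter


def find_repeated_lines(documents, max_repeats=2):
    """Return the lines that occur in more than max_repeats documents.

    Assumption: each distinct non-empty line is counted once per document.
    """
    counts = Counter()
    for text in documents:
        unique_lines = {ln.strip() for ln in text.splitlines() if ln.strip()}
        counts.update(unique_lines)
    return {line for line, n in counts.items() if n > max_repeats}


def remove_boilerplate(text, repeated, edges_only=False):
    """Drop repeated lines from one document.

    With edges_only=True, only the consecutive run of repeated (or blank)
    lines at the start and at the end of the document is removed; repeated
    lines in the middle are kept.
    """
    lines = text.splitlines()

    if not edges_only:
        # Remove every repeated line, wherever it occurs.
        return "\n".join(ln for ln in lines if ln.strip() not in repeated)

    def removable(line):
        # Blank lines are skipped too, so they do not end the run early.
        stripped = line.strip()
        return not stripped or stripped in repeated

    start = 0
    while start < len(lines) and removable(lines[start]):
        start += 1
    end = len(lines)
    while end > start and removable(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])
```

A caller would first run find_repeated_lines over the whole collection 
and then pass the resulting set to remove_boilerplate for each file. The 
edges_only mode is the more conservative choice when a repeated line 
could also be legitimate content in the body of a text.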

The system is in dire need of testing, so if anyone has workloads that 
need to be processed, I'd be happy to help.

    regards,
    lars nygaard


Helge Thomas Hellerud wrote:

>Hello,
>
>I want to extract the article text of an HTML page (for instance the text of
>a news article). But an HTML page contains much "noise", like menus and ads.
>So I want to ask if anyone knows a way to eliminate unwanted elements like
>menus and ads, and only extract the editorial article text?
>
>Of course, I can use Regex to look for patterns in the HTML code (by
>defining a starting point and an ending point), but the solution will be a
>hack that will not work if the pattern in the HTML page suddenly changes.
>So do you know how to extract the content without using such a hack?
>
>Thanks in advance.
>
>Helge Thomas Hellerud


