[Corpora-List] Extracting only editorial content from a HTML page

Lars Nygaard lars.nygaard at iln.uio.no
Tue Aug 9 10:49:31 UTC 2005


If you have access to several articles from the same source, you can 
compare them and delete everything that is identical (or very similar) 
at the top and at the bottom of every article; what remains should be 
mostly the article text. Feel free to contact me if you need Perl 
source code to do this; I also have a draft of a paper explaining the 
approach in more detail, if you are interested.
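
To make the idea concrete, here is a rough Python sketch (not the Perl 
code mentioned above). It assumes each page has already been converted 
to plain text, compares the pages line by line, and treats two lines as 
"very similar" when difflib's similarity ratio is at least 0.9. The 
function names and the 0.9 threshold are illustrative assumptions only.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """True when two lines are equal or nearly equal (threshold is an assumed cutoff)."""
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def shared_run(docs, from_end=False, threshold=0.9):
    """Count the leading (or trailing) lines that are similar across all documents."""
    shortest = min(len(d) for d in docs)
    count = 0
    for i in range(shortest):
        idx = -1 - i if from_end else i
        first = docs[0][idx]
        if all(similar(first, d[idx], threshold) for d in docs[1:]):
            count += 1
        else:
            break
    return count


def strip_shared(articles, threshold=0.9):
    """Drop the lines shared at the top and bottom of every article."""
    # Work on non-empty, whitespace-trimmed lines.
    docs = [[ln.strip() for ln in a.splitlines() if ln.strip()] for a in articles]
    if len(docs) < 2:
        return docs  # nothing to compare against
    top = shared_run(docs, threshold=threshold)
    # Trim the top first so the top and bottom runs cannot overlap.
    docs = [d[top:] for d in docs]
    bottom = shared_run(docs, from_end=True, threshold=threshold)
    return [d[:len(d) - bottom] for d in docs]


articles = [
    "Home | News | Sport\nLogin\nStorm hits the coast.\n"
    "Damage is severe.\nCopyright 2005 Example News",
    "Home | News | Sport\nLogin\nElection results announced.\n"
    "Copyright 2005 Example News",
]
cleaned = strip_shared(articles)
```

With the two made-up pages above, the shared menu lines and the shared 
copyright line are dropped and only the article sentences are kept. 
Boilerplate that appears in the middle of a page, or that differs a lot 
between pages, would not be caught by this sketch.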

Regards,
Lars Nygaard, The Text Laboratory, University of Oslo

Helge Thomas Hellerud wrote:

>Hello,
>
>I want to extract the article text of an HTML page (for instance the text of
>a news article). But an HTML page contains a lot of "noise", like menus and
>ads. So I want to ask if anyone knows a way to eliminate unwanted elements
>like menus and ads, and extract only the editorial article text?
>
>Of course, I can use regular expressions to look for patterns in the HTML
>code (by defining a starting point and an ending point), but the solution
>will be a hack that stops working if the pattern in the HTML page suddenly
>changes. So do you know how to extract the content without using such a hack?
>
>Thanks in advance.
>
>Helge Thomas Hellerud


