[Corpora-List] Extracting only editorial content from a HTML page

Lars Nygaard lars.nygaard at iln.uio.no
Tue Aug 9 12:27:23 UTC 2005


Helge,

If you have access to several articles from the same source, you can
delete everything that is identical (or very similar) across articles,
working inward from the top and from the bottom of each page. What is
left should be mostly the article text. Feel free to contact me if you
need the Perl source code for this; I also have a draft paper that
explains the approach in more detail, if you are interested.
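
A minimal Python sketch of the idea above, for illustration only. It is
not the Perl code mentioned, and the function names, the line-by-line
comparison, and the 0.9 similarity threshold are all assumptions. It
needs at least two pages from the same source:

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """True if two lines are equal or nearly equal (threshold is an assumed cutoff)."""
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def shared_run(docs, threshold=0.9):
    """Count the leading lines that are (near-)identical across all documents."""
    n = 0
    for lines in zip(*docs):  # zip stops at the shortest document
        if not all(similar(lines[0], other, threshold) for other in lines[1:]):
            break
        n += 1
    return n


def strip_shared(pages, threshold=0.9):
    """Remove the shared top and bottom lines from each page (given as text)."""
    if len(pages) < 2:
        raise ValueError("need at least two pages to compare")
    # Work on non-empty, whitespace-trimmed lines.
    docs = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]
    top = shared_run(docs, threshold)
    rest = [d[top:] for d in docs]
    # Reverse each remaining document to measure the shared run at the bottom.
    bottom = shared_run([d[::-1] for d in rest], threshold)
    return ["\n".join(d[:len(d) - bottom]) for d in rest]
```

The input here is assumed to be text with the HTML tags already removed,
one block per line; comparing raw HTML lines would work the same way.
The more articles you compare, the less likely it is that real article
text gets deleted by accident.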

Good luck!

Regards,
Lars Nygaard, The Text Laboratory, University of Oslo


Helge Thomas Hellerud wrote:

>Hello,
>
>I want to extract the article text from an HTML page (for instance, the text
>of a news article). But an HTML page contains a lot of "noise", such as menus
>and ads. Does anyone know a way to eliminate these unwanted elements and
>extract only the editorial article text?
>
>Of course, I can use a regex to look for patterns in the HTML code (by
>defining a starting point and an ending point), but that solution is a
>hack that will stop working if the pattern in the HTML page suddenly changes.
>So do you know how to extract the content without resorting to such a hack?
>
>Thanks in advance.
>
>Helge Thomas Hellerud


