[Corpora-List] Extracting only editorial content from a HTML page
Lars Nygaard
lars.nygaard at iln.uio.no
Tue Aug 9 10:49:31 UTC 2005
If you have access to several articles from the same source, you can
compare them line by line from the top and from the bottom, and delete
everything that is equal (or very similar) across the articles; what
remains should be mostly the article text. Feel free to contact me if you
need Perl source code to do this; I also have a draft of a paper explaining
the approach in more detail, if you are interested.
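
As a rough illustration of the idea (this is not the Perl code mentioned
above), here is a minimal Python sketch. It assumes each article has already
been reduced to a list of text lines; the function names and the 0.9
similarity threshold are hypothetical choices made for this example only.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """True if two lines are equal or nearly equal.

    The threshold is an assumed value, not one taken from the original post.
    """
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def shared_run(pages, threshold=0.9):
    """Count how many leading lines are similar across all pages."""
    count = 0
    for lines in zip(*pages):  # zip stops at the shortest page
        if not all(similar(lines[0], other, threshold) for other in lines[1:]):
            break
        count += 1
    return count


def strip_shared(pages, threshold=0.9):
    """Remove the shared top and bottom blocks from each page (a list of lines)."""
    if len(pages) < 2:
        # Nothing to compare against, so leave the input unchanged.
        return [list(page) for page in pages]
    top = shared_run(pages, threshold)
    rest = [page[top:] for page in pages]
    # Reverse the remaining lines so the same routine finds the shared footer.
    bottom = shared_run([page[::-1] for page in rest], threshold)
    return [page[:len(page) - bottom] for page in rest]
```

Calling strip_shared() on two or more pages from the same site returns the
pages with the common header and footer lines dropped. Short lines that
differ by only a character or two can pass the similarity test by accident,
so the threshold probably needs tuning per source.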
Regards,
Lars Nygaard, The Text Laboratory, University of Oslo
Helge Thomas Hellerud wrote:
>Hello,
>
>I want to extract the article text of an HTML page (for instance the text of
>a news article). But an HTML page contains much "noise", like menus and ads.
>So I want to ask if anyone knows a way to eliminate unwanted elements like
>menus and ads, and extract only the editorial article text?
>
>Of course, I can use a regex to look for patterns in the HTML code (by
>defining a starting point and an ending point), but the solution will be a
>hack that will not work if the pattern in the HTML page suddenly changes.
>So do you know how to extract the content without using such a hack?
>
>Thanks in advance.
>
>Helge Thomas Hellerud