[Corpora-List] Extracting only editorial content from a HTML page

Alex Murzaku lissus at gmail.com
Tue Aug 9 13:10:30 UTC 2005


Since I was scraping text from a limited number of Albanian-language
websites, it was easy for me to search for text repeated on every page
coming from the same site. The repeated text was removed, so I was left
with only a single copy of the material the pages had in common. One of the
sites I was spidering changed its format three times in two months,
which generated quite a bit of noise. The only way to get rid of it
was to go back to regex, and in the end I used regex exclusively. As
for the "sudden" changes, you could use the absence of text repetition
as a signal that the format has changed and then modify the regex
accordingly.
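
A rough Python sketch of the idea, in case it helps. This is not the
code I actually ran; the function names, the line-by-line comparison and
the 0.5 threshold are all assumptions made for illustration. It treats
each non-empty line of a page's extracted text as a block, calls a block
"repeated" if it shows up on at least half of the pages from one site,
strips those blocks, and reports how much of a new page is known
repeated text.

```python
from collections import Counter


def split_blocks(page_text):
    """Split a page's extracted text into stripped, non-empty lines."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def repeated_blocks(pages, min_share=0.5):
    """Return the blocks found on at least min_share of the pages.

    min_share is an illustrative threshold, not a tuned value.
    """
    counts = Counter()
    for page in pages:
        # Count each block once per page, however often it occurs there.
        counts.update(set(split_blocks(page)))
    threshold = max(2, min_share * len(pages))
    return {block for block, n in counts.items() if n >= threshold}


def strip_repeated(page_text, boilerplate):
    """Drop every block that is known to repeat across the site."""
    kept = [b for b in split_blocks(page_text) if b not in boilerplate]
    return "\n".join(kept)


def repeated_share(page_text, boilerplate):
    """Fraction of a page's blocks that are known repeated text."""
    blocks = split_blocks(page_text)
    if not blocks:
        return 0.0
    return sum(b in boilerplate for b in blocks) / len(blocks)


# Made-up pages from one hypothetical site.
pages = [
    "Home | News | Sport\nArticle one text.\nCopyright 2005",
    "Home | News | Sport\nArticle two text.\nCopyright 2005",
    "Home | News | Sport\nArticle three text.\nCopyright 2005",
]
boilerplate = repeated_blocks(pages)
article = strip_repeated(pages[0], boilerplate)

# A page fetched after a hypothetical redesign shares nothing with the old ones.
new_page = "NEWS - SPORT - HOME\nArticle four text.\n(c) 2005"
share = repeated_share(new_page, boilerplate)
print(article)
print(share)
```

A share close to zero on a freshly fetched page would be the cue to
look at the site again and update the regex.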

Good luck,

Alex

On 8/9/05, Helge Thomas Hellerud <helgetho at stud.ntnu.no> wrote:
> Hello,
> 
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains a lot of "noise", like menus and ads.
> So I want to ask whether anyone knows a way to eliminate unwanted elements like
> menus and ads, and extract only the editorial article text?
> 
> Of course, I can use regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
> 
> Thanks in advance.
> 
> Helge Thomas Hellerud


