[Corpora-List] How do we extract actual text in html?

Constantin Orasan C.Orasan at wlv.ac.uk
Mon Aug 2 16:16:28 UTC 2010


Hi,

> Is it trivial to extract the title and relevant text (ignoring the ads
> and other irrelevant stuff)? For example, in the website:
> http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168
> 
> I am only interested in extracting the title: "Chelsea Clinton marries
> in NY"
> and the subject below. How easy is this?

When you deal with newspaper articles, one thing worth checking is whether
there is a print version of the page. The print version usually contains
mainly the text of the article, without menus and other extra information.
For the page you indicated, obtaining the print version is a bit difficult
(though not impossible if you know enough JavaScript), but for the BBC, for
example (and many other sites), it is really easy:

E.g., for http://www.bbc.co.uk/news/world-us-canada-10828516 the print
version can be obtained by adding a parameter to the URL:
http://www.bbc.co.uk/news/world-us-canada-10828516?print=true
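
As a rough illustration of the idea, here is a minimal Python sketch that
builds such a print URL and then pulls the title and paragraph text out of
the returned HTML. The helper names (print_version_url, TitleAndParagraphs)
are made up for this example, and the "print=true" parameter is only what
the BBC example above uses; other sites will need a different parameter or
may have no print version at all. Fetching the page is left out, and the
assumption that the article text sits in <p> elements will not hold for
every site.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def print_version_url(url, param="print", value="true"):
    """Return url with param=value set in its query string.

    The default parameter is the one from the BBC example; it is an
    assumption for any other site.
    """
    parts = urlparse(url)
    # Keep existing query parameters, dropping any earlier value of param.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, value))
    return urlunparse(parts._replace(query=urlencode(query)))


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract_title_and_text(html):
    """Return (title, list of paragraph strings) for an HTML string."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    return parser.title.strip(), parser.paragraphs
```

You would call print_version_url() on the article URL, download the result
with whatever HTTP client you prefer, and pass the HTML string to
extract_title_and_text(). Text outside <p> elements (menus in <div> or <li>
elements, for instance) is ignored by this sketch.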

Regards,

Constantin

-- 
Dr. Constantin Orasan <C.Orasan at wlv.ac.uk>
Senior Lecturer in Computational Linguistics
Research Group in Computational Linguistics
http://www.wlv.ac.uk/~in6093/
University of Wolverhampton
