[Corpora-List] Extracting only editorial content from a HTML page

Tom Emerson tree at basistech.com
Wed Aug 10 03:56:56 UTC 2005


Mike Maxwell writes:
> It looks to me like 'tidy' is intended to handle incorrectly structured 
> html.  Can it be used to extract text, and in particular to throw away 
> header and footer boilerplate?

Tidy does have an API that can be used for this: I have C++ code
around here that uses it that way. If you are interested in seeing
it, let me know off-list and I'll try to dig it up. Unfortunately, when
I last looked, the API was not well documented (i.e., not at all).

The real problem still remains, however: distinguishing the
boilerplate from the useful content. Tidy doesn't help with that at
all.

Language id is difficult because in many languages I've dealt with the
text isn't encoded in a clean way, and the encoding is often
misidentified in the HTTP headers. For example, I regularly see Arabic
and Persian pages that are declared to be encoded in CP1252 (the
Windows Latin 1 codepage) and use HTML character entities for all
Arabic text (e.g., &#1571;&#1604; for أل). I've seen the same thing
with Hungarian.
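
To make the entity problem concrete, here is a rough Python sketch (not
the C++ code mentioned above). It decodes the bytes with the declared
charset and then expands the character references before looking at
which script the text is in. The names decode_page and arabic_ratio are
made up for illustration, and the script check is a crude stand-in for
a real language identifier.

```python
import html


def decode_page(raw: bytes, declared_charset: str = "cp1252") -> str:
    """Decode with the declared charset, then expand character references.

    Hypothetical helper: assumes the declared charset is at least good
    enough to recover the ASCII markup and the &#NNNN; references.
    """
    text = raw.decode(declared_charset, errors="replace")
    return html.unescape(text)


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Arabic block U+0600-U+06FF."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06ff")
    return arabic / len(letters)


# A page fragment declared as CP1252 whose Arabic text is all entities.
raw = b"<p>&#1571;&#1604;</p>"
undecoded = raw.decode("cp1252")
decoded = decode_page(raw)

print(arabic_ratio(undecoded))  # only the markup letters are visible here
print(arabic_ratio(decoded))    # the Arabic letters now count
```

Anything that inspects the bytes or the declared charset without first
expanding the references sees nothing but ASCII, which is why such
pages get misidentified.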

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


