[Corpora-List] Web Content Extractor / Screen Scraper

Alexandre Rafalovitch arafalov at gmail.com
Mon Jun 18 20:50:08 UTC 2007


Resty,

Not a fully standalone application, but a Java library with examples
that might do what you want is Jericho:
http://jerichohtml.sourceforge.net/doc/index.html

It is especially good if you only need to extract or render part of the
document, for example to leave out menus, sidebars, etc.
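
To illustrate the idea of keeping only part of a page, here is a rough
sketch. It does not use Jericho; it swaps in Python's standard
html.parser module instead, so it says nothing about Jericho's actual
API. The names SKIP_TAGS, TextOnly and extract_text are made up for this
example, and the choice of which elements count as "menus and sidebars"
is an assumption that will vary from site to site.

```python
from html.parser import HTMLParser

# Assumption: these elements hold navigation, sidebars, or non-text content.
# Real pages often mark such regions with site-specific ids or classes instead.
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class TextOnly(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the kept text of an HTML string, one fragment per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<p>Corpus text.</p><aside>Ads</aside></body></html>"
    )
    print(extract_text(page))
```

This only strips markup and skips the listed elements; it does not lay
the text out the way a browser would render it.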

Regards,
   Alex.
P.S. Haven't there been a number of papers on "corpus from the web"
tasks over the last year? What did they use?

On 6/18/07, Resty Cena <restycena at gmail.com> wrote:
> Hello,
> I am looking for a free or open-source Windows utility/application that
> extracts the text-only rendered (not raw) contents of web pages, such as one
> would use for automatically scraping news feeds. Does anyone use such an
> application?
>
> Basically the application will be used to harvest texts on the internet to
> build a corpus.


