[Corpora-List] Preprocessing the Project Gutenberg DVD

Phil Gooch philgooch at gmail.com
Thu Apr 25 08:37:28 UTC 2013


Without seeing examples of the text files it's hard to give a complete
answer, but you could look at Boilerpipe (
http://code.google.com/p/boilerpipe/), which removes 'boilerplate' text
from HTML-encoded text. Perhaps this could be adapted for your purposes.

Boilerpipe has been integrated with GATE, and you might find success
combining this with the GATE segmentation processing component (
http://gate.ac.uk/sale/tao/splitch19.html#sec:alignment:segment-processing)
and some custom rules written in the JAPE language. I've done this in the
past to strip out unwanted boilerplate text, disclaimers, etc. from
consumer health websites.
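
As a cheap first pass before bringing in Boilerpipe or GATE, a short script
could try a handful of known header markers and flag the files where none
match. The following is only a sketch under assumptions: the two marker
patterns are examples of delimiters seen in some Project Gutenberg files, not
a complete list, and the function name and the 600-line scan limit are
made up for illustration.

```python
import re

# Assumed marker patterns. Real files vary, so this list is illustrative
# and would need extending after inspecting the corpus.
START_MARKERS = [
    re.compile(r"^\*{3}\s*START OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE),
    re.compile(r"^\*END\*?\s*THE SMALL PRINT", re.IGNORECASE),
]


def strip_front_matter(text, max_scan_lines=600):
    """Drop everything up to and including the last marker line found
    within the first max_scan_lines lines (hypothetical limit).
    Return the text unchanged if no marker matches."""
    lines = text.splitlines()
    cut = None
    for i, line in enumerate(lines[:max_scan_lines]):
        if any(p.match(line.strip()) for p in START_MARKERS):
            cut = i  # keep the last match, in case several markers occur
    if cut is None:
        return text
    return "\n".join(lines[cut + 1:]).lstrip("\n")
```

Files that come back unchanged could be collected for manual inspection or
passed to the heavier rule-based approach described above.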

Regards

Phil


On Thu, Apr 25, 2013 at 8:47 AM, Tristan Miller <
miller at ukp.informatik.tu-darmstadt.de> wrote:

> Greetings.
>
> I'm looking for a large corpus of English-language books, preferably
> general literature such as novels.  The corpus need not be annotated;
> raw text is fine, though I don't want OCR'd text unless the errors have
> been manually corrected.
>
> One such corpus I found is the Project Gutenberg DVD
> <http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project>, which
> by my count contains 24,743 English books in plain text format.
> However, it has one problem for those wishing to extract the texts for
> corpus analysis:  each text file contains some front matter inserted by
> the publisher which describes Project Gutenberg, the licensing and
> distribution terms, and various other things.  It is difficult to
> automatically remove this front matter because there is no consistent
> delimiter which marks the end of it and the beginning of the actual
> book.  (The actual front matter text often varies from file to file.)
> Thus regular expression tools like sed aren't of much use.
>
> Does anyone know of a tool which could help me automatically filter out
> the front matter?
>
> Alternatively, does anyone know of a corpus of similar size which I
> could use instead?
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Doctoral Researcher
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>