[Corpora-List] Preprocessing the Project Gutenberg DVD

Matthew L. Jockers mjockers at unl.edu
Thu Apr 25 15:06:03 UTC 2013


Tristan,
One problem with the Project Gutenberg files is that the boilerplate is not standard across the entire corpus.  A few years ago I modified a Python script by Michiel Overtoom in an attempt to strip out the boilerplate and convert the plain text to TEI.  I blogged about this and included my code here: http://www.matthewjockers.net/2010/08/26/auto-converting-project-gutenberg-text-to-tei/
Unfortunately, the solution does not cover all the variations in the way the boilerplate is written across Project Gutenberg files.
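The core of the approach is just marker matching: scan for one of the known start-of-text markers, then for an end-of-text marker after it, and keep what lies between.  A minimal sketch below, with a couple of the common marker variants; the list of patterns is illustrative only and, as noted, will not cover every file in the corpus.

```python
import re

# A few known variants of the Project Gutenberg start/end markers.
# Illustrative, not exhaustive -- the corpus contains many more.
START_MARKERS = [
    r"\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK.*\*\*\*",
    r"\*END\*? ?THE SMALL PRINT!?.*",
]
END_MARKERS = [
    r"\*\*\* ?END OF TH(IS|E) PROJECT GUTENBERG EBOOK.*\*\*\*",
    r"End of (the )?Project Gutenberg.*",
]

def strip_boilerplate(text):
    """Return the text between the first start marker and the first
    end marker that follows it.  If no start marker is found, the
    text is returned unchanged (minus surrounding whitespace)."""
    start = 0
    for pat in START_MARKERS:
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            start = m.end()
            break
    end = len(text)
    for pat in END_MARKERS:
        m = re.search(pat, text[start:], re.IGNORECASE)
        if m:
            end = start + m.start()
            break
    return text[start:end].strip()
```

The hard part, of course, is not this function but compiling a marker list that actually covers the corpus; files that match none of the patterns pass through untouched, so they can at least be flagged for manual inspection.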
Matt

--
Matthew L. Jockers
Assistant Professor of English
Fellow, Center for Digital Research in the Humanities
325 Andrews Hall
University of Nebraska-Lincoln
Lincoln, NE 68588
402-472-1896
www.matthewjockers.net

On Apr 25, 2013, at 2:47 AM, Tristan Miller wrote:

> Greetings.
> 
> I'm looking for a large corpus of English-language books, preferably
> general literature such as novels.  The corpus need not be annotated;
> raw text is fine, though I don't want OCR'd text unless the errors have
> been manually corrected.
> 
> One such corpus I found is the Project Gutenberg DVD
> <http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project>, which
> by my count contains some 24,743 English books in plain text format.
> However, it has one problem for those wishing to extract the texts for
> corpus analysis:  each text file contains some front matter inserted by
> the publisher which describes Project Gutenberg, the licensing and
> distribution terms, and various other things.  It is difficult to
> automatically remove this front matter because there is no consistent
> delimiter which marks the end of it and the beginning of the actual
> book.  (The actual front matter text often varies from file to file.)
> Thus regular expression tools like sed aren't of much use.
> 
> Does anyone know of a tool which could help me automatically filter out
> the front matter?
> 
> Alternatively, does anyone know of a corpus of similar size which I
> could use instead?
> 
> Regards,
> Tristan
> 
> -- 
> Tristan Miller, Doctoral Researcher
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora


