[Corpora-List] Preprocessing the Project Gutenberg DVD

Tristan Miller miller at ukp.informatik.tu-darmstadt.de
Thu Apr 25 07:47:29 UTC 2013


Greetings.

I'm looking for a large corpus of English-language books, preferably
general literature such as novels.  The corpus need not be annotated;
raw text is fine, though I don't want OCR'd text unless the errors have
been manually corrected.

One such corpus I found is the Project Gutenberg DVD
<http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project>, which
by my count contains 24,743 English books in plain text format.
However, it has one problem for those wishing to extract the texts for
corpus analysis:  each text file contains some front matter inserted by
the publisher which describes Project Gutenberg, the licensing and
distribution terms, and various other things.  It is difficult to
automatically remove this front matter because there is no consistent
delimiter which marks the end of it and the beginning of the actual
book.  (The actual front matter text often varies from file to file.)
Thus regular expression tools like sed aren't of much use.
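
To make the difficulty concrete, here is a minimal Python sketch of a
marker-based filter.  It assumes that some files carry a line starting
with "*** START OF THIS PROJECT GUTENBERG" or with "*END*THE SMALL
PRINT" near the top; these two patterns, the function name
strip_front_matter, and the 600-line search window are illustrative
assumptions, not an exhaustive or verified description of the DVD's
contents.

```python
import re

# Assumed marker patterns (illustrative only; real files vary and many
# may contain neither of these).
START_MARKERS = [
    re.compile(r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE),
    re.compile(r"^\*END\*\s*THE SMALL PRINT", re.IGNORECASE),
]


def strip_front_matter(text, max_lines=600):
    """Return (body, found).

    Searches the first `max_lines` lines for a known marker and drops
    everything up to and including the last marker found there.
    `found` is False when no marker matched, in which case the text is
    returned unchanged so that the file can be set aside for manual
    inspection.
    """
    lines = text.splitlines(keepends=True)
    last = None
    for i, line in enumerate(lines[:max_lines]):
        if any(p.match(line) for p in START_MARKERS):
            last = i
    if last is None:
        return text, False
    # Drop the marker line itself and any blank lines that follow it.
    return "".join(lines[last + 1:]).lstrip("\n"), True


if __name__ == "__main__":
    sample = (
        "Licensing and distribution terms...\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EMMA ***\n"
        "\n"
        "EMMA\n"
    )
    body, found = strip_front_matter(sample)
    print(found, repr(body))
```

A filter of this kind handles only the files that happen to match one
of the assumed patterns.  Every other file comes back untouched with
found set to False, and that remainder is the part that needs a smarter
tool.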

Does anyone know of a tool which could help me automatically filter out
the front matter?

Alternatively, does anyone know of a corpus of similar size which I
could use instead?

Regards,
Tristan

-- 
Tristan Miller, Doctoral Researcher
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
