[Corpora-List] Preprocessing the Project Gutenberg DVD

Damir Cavar dcavar at me.com
Thu Apr 25 16:26:44 UTC 2013


Hi there,

Malgosia Cavar and I converted all of the books that were available at the time in one sweep, that is, the ones listed in the RDF catalogue on the Gutenberg site. The Java code for converting the RDF catalogue information to TEI headers, converting the books to TEI XML files, generating new RDF files for each book, etc. is on GitHub:

http://github.com/dcavar/PG2TEI
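
To give an idea of what the metadata conversion involves, a rough Python sketch (just an illustration, not the Java code above) that pulls the title and author out of a per-book RDF file and wraps them in a skeletal TEI header could look like this; the dcterms/pgterms namespaces are an assumption based on the current Gutenberg RDF files:

import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespaces as used in the current per-book Gutenberg RDF files (assumption).
NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

def tei_header_from_rdf(rdf_path):
    # Pull the title and the first agent name (usually the author) from the RDF.
    root = ET.parse(rdf_path).getroot()
    title = root.findtext(".//dcterms:title", default="(untitled)", namespaces=NS)
    author = root.findtext(".//pgterms:agent/pgterms:name", default="(unknown)", namespaces=NS)
    return ("<teiHeader><fileDesc><titleStmt>"
            "<title>%s</title><author>%s</author>"
            "</titleStmt></fileDesc></teiHeader>" % (escape(title), escape(author)))

print(tei_header_from_rdf("350.rdf"))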

The RDF formats and their locations on the Gutenberg site have changed since then. We will update the code above to cope with the new location and distribution. We also generated new RDF files that describe our TEI XML files and their URLs. The complete conversion output is here:

http://ltl.emich.edu/gutenberg/

Every book is in a subfolder named after its number, together with the respective files. The files for book ID 350, for example, are under these URLs:

http://ltl.emich.edu/gutenberg/350/350.html
http://ltl.emich.edu/gutenberg/350/350.odt
http://ltl.emich.edu/gutenberg/350/350.rdf
http://ltl.emich.edu/gutenberg/350/350.xml

The .xml file is the TEI version, the .odt is the OpenOffice document, etc.
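
If you want to mirror some books locally, a small Python sketch that follows this naming scheme (purely illustrative; fetch_book is a hypothetical helper, not part of our code) would be:

import urllib.request

BASE = "http://ltl.emich.edu/gutenberg"

def fetch_book(book_id, extensions=("html", "odt", "rdf", "xml")):
    # The per-book layout is BASE/<id>/<id>.<ext>, as in the URLs above.
    for ext in extensions:
        url = "%s/%d/%d.%s" % (BASE, book_id, book_id, ext)
        with urllib.request.urlopen(url) as response:
            with open("%d.%s" % (book_id, ext), "wb") as out:
                out.write(response.read())

fetch_book(350)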

Not all book numbers had been made available via the RDF catalogue by the time we ran the conversion, so not all books are in this archive right now. We will set up a daemon to keep the converted books up to date and make them available on the LTL pages mentioned above.

The converted and checked TEI XML files are indexed with the online PhiloLogic tool:

http://ltl.emich.edu/philologic/

for basic corpus analysis and concordancing. We will add as many books as possible to it over the coming weeks (once grading is finally over). All converted books will also be made available for download and further processing. We might host them on a LINGUIST List sub-site or at the links above; more information on that will follow soon.

There is a brief introduction to PhiloLogic here:

http://cavar.me/damir/blog/files/philologic-corpus-introduction-part-1.php

Note also that we have set up a basic linguistic processing chain using various NLP tools to annotate and convert the TEI XML files. A small demo for English is available here:

http://ltl.emich.edu/txt2tei/

This currently uses only the Stanford CoreNLP components for English; more languages should be covered this summer. The complete setup, the UIMA-based processing chain, and the TEI XML wrapping should also be made available soon. At the moment there is just a simple Python-based wrapper from CoreNLP XML to TEI, with an RPC script and some HTML code. The txt2tei demo relies on a manually controlled CoreNLP daemon, so it might die with nobody around to revive it immediately. If you want to test it and it doesn't work, try again later… :-)
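
To illustrate the wrapping step, a stripped-down Python sketch (not the actual wrapper script) that takes the default CoreNLP XML output, with its <sentence> and <token> elements, and turns the sentences and tokens into TEI <s> and <w> elements might look like this:

import xml.etree.ElementTree as ET

def corenlp_to_tei_body(corenlp_xml_path):
    # Assumes the default CoreNLP XMLOutputter format with <sentence>,
    # <token>, <word>, and <POS> elements (no namespaces).
    root = ET.parse(corenlp_xml_path).getroot()
    body = ET.Element("body")
    p = ET.SubElement(body, "p")
    for sentence in root.iter("sentence"):
        s = ET.SubElement(p, "s")
        for token in sentence.iter("token"):
            w = ET.SubElement(s, "w")
            w.text = token.findtext("word")
            pos = token.findtext("POS")
            if pos:
                w.set("type", pos)
    return ET.tostring(body, encoding="unicode")

print(corenlp_to_tei_body("corenlp-output.xml"))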

We would be happy to cooperate and to share and exchange data and code related to this effort. Once our workload normalizes, from the end of next week on, we will be able to engage more actively.

Best wishes

DC


--
Dr. Damir Cavar
Eastern Michigan University
Institute for Language Information and Technology
http://cavar.me/damir/



On Apr 25, 2013, at 11:06 AM, Matthew L. Jockers <mjockers at unl.edu> wrote:

> Tristan,
> One problem with the Project Gutenberg files is that the boilerplate is not standard across the entire corpus. A few years ago I modified a Python script from Michiel Overtoom in an attempt to strip out the boilerplate and convert the plain text to TEI. I blogged about this and included my code here: http://www.matthewjockers.net/2010/08/26/auto-converting-project-gutenberg-text-to-tei/
> Unfortunately the solution does not cover all the variations in the way the boilerplate gets written across Project Gutenberg files.
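
[For readers hitting the same problem: the usual workaround is to scan each file for the known variants of the Project Gutenberg START/END markers and keep only what lies between them. A rough Python sketch along those lines follows; the marker list is an assumption and, as noted above, will not cover every file in the corpus.]

# Rough sketch only: strip Project Gutenberg boilerplate by scanning for
# known START/END marker variants.  The marker list below is incomplete.
START_MARKERS = ("*** START OF THIS PROJECT GUTENBERG EBOOK",
                 "*** START OF THE PROJECT GUTENBERG EBOOK",
                 "*END*THE SMALL PRINT")
END_MARKERS = ("*** END OF THIS PROJECT GUTENBERG EBOOK",
               "*** END OF THE PROJECT GUTENBERG EBOOK",
               "End of the Project Gutenberg EBook",
               "End of Project Gutenberg's")

def strip_boilerplate(path):
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if any(line.startswith(m) for m in START_MARKERS):
            start = i + 1          # body begins after the START marker
        if any(line.startswith(m) for m in END_MARKERS):
            end = i                # body ends before the END marker
            break
    return "".join(lines[start:end])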
> Matt
> 
> --
> Matthew L. Jockers
> Assistant Professor of English
> Fellow, Center for Digital Research in the Humanities
> 325 Andrews Hall
> University of Nebraska-Lincoln
> Lincoln, NE 68588
> 402-472-1896
> www.matthewjockers.net
> 
> On Apr 25, 2013, at 2:47 AM, Tristan Miller wrote:
> 
>> Greetings.
>> 
>> I'm looking for a large corpus of English-language books, preferably
>> general literature such as novels.  The corpus need not be annotated;
>> raw text is fine, though I don't want OCR'd text unless the errors have
>> been manually corrected.
>> 
>> One such corpus I found is the Project Gutenberg DVD
>> <http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project>, which
>> by my count contains some 24,743 English books in plain text format.
>> However, it has one problem for those wishing to extract the texts for
>> corpus analysis:  each text file contains some front matter inserted by
>> the publisher which describes Project Gutenberg, the licensing and
>> distribution terms, and various other things.  It is difficult to
>> automatically remove this front matter because there is no consistent
>> delimiter which marks the end of it and the beginning of the actual
>> book.  (The actual front matter text often varies from file to file.)
>> Thus regular expression tools like sed aren't of much use.
>> 
>> Does anyone know of a tool which could help me automatically filter out
>> the front matter?
>> 
>> Alternatively, does anyone know of a corpus of similar size which I
>> could use instead?
>> 
>> Regards,
>> Tristan
>> 
>> -- 
>> Tristan Miller, Doctoral Researcher
>> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
>> Department of Computer Science, Technische Universität Darmstadt
>> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>> 


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
