<div dir="ltr">Without seeing examples of the text files, it's hard to give a complete answer, but you could look at Boilerpipe (<a href="http://code.google.com/p/boilerpipe/">http://code.google.com/p/boilerpipe/</a>), which removes 'boilerplate' text from HTML-encoded text. Perhaps it could be adapted for your purposes.<div>


<br></div><div style>Boilerpipe has been integrated with GATE, and you might have some success combining it with the GATE segment processing component (<a href="http://gate.ac.uk/sale/tao/splitch19.html#sec:alignment:segment-processing">http://gate.ac.uk/sale/tao/splitch19.html#sec:alignment:segment-processing</a>) and some custom rules written in the JAPE language. I've done this in the past to strip out unwanted boilerplate text, disclaimers, etc. from consumer health websites.</div>
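<div style><br></div><div style>As a rough illustration of the rule-based idea only (plain Python, not Boilerpipe, GATE or JAPE), here is a minimal sketch that keeps a small list of rules, each describing a line that might close the front matter, and drops everything up to the last such line near the top of a file. The two patterns, the function name strip_front_matter and the max_scan_lines limit are my own assumptions for the example; the real files would need their own, probably longer, rule list.</div>
<pre>
```python
import re

# Hypothetical marker patterns: illustrative guesses at lines that close the
# front matter. They are assumptions, not an exhaustive or verified list.
END_OF_FRONT_MATTER_RULES = [
    re.compile(r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.IGNORECASE),
    re.compile(r"^\*END\*\s*THE SMALL PRINT", re.IGNORECASE),
]


def strip_front_matter(text, max_scan_lines=600):
    """Return text with everything up to the last matching marker removed.

    Only the first max_scan_lines lines are scanned, so a marker-like line
    deep inside the book is not mistaken for the end of the front matter.
    If no rule matches, the text is returned unchanged so that the file can
    be set aside for manual inspection.
    """
    lines = text.splitlines()
    last_marker = None
    for index, line in enumerate(lines[:max_scan_lines]):
        if any(rule.search(line.strip()) for rule in END_OF_FRONT_MATTER_RULES):
            last_marker = index
    if last_marker is None:
        return text
    # Keep only what follows the marker, minus any leading blank lines.
    return "\n".join(lines[last_marker + 1:]).lstrip("\n")
```
</pre>
<div style>Files that come back unchanged are the ones no rule covered, so they can be collected and used to write further rules. That is the same iterative process I would expect with JAPE.</div>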


<div style><br></div><div style>Regards</div><div style><br></div><div style>Phil</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 25, 2013 at 8:47 AM, Tristan Miller <span dir="ltr"><<a href="mailto:miller@ukp.informatik.tu-darmstadt.de" target="_blank">miller@ukp.informatik.tu-darmstadt.de</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Greetings.<br>

<br>

I'm looking for a large corpus of English-language books, preferably<br>

general literature such as novels.  The corpus need not be annotated;<br>

raw text is fine, though I don't want OCR'd text unless the errors have<br>

been manually corrected.<br>

<br>

One such corpus I found is the Project Gutenberg DVD<br>

<<a href="http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project" target="_blank">http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project</a>>, which<br>

by my count contains 24,743 English books in plain text format.<br>

However, it has one problem for those wishing to extract the texts for<br>

corpus analysis:  each text file contains some front matter inserted by<br>

the publisher which describes Project Gutenberg, the licensing and<br>

distribution terms, and various other things.  It is difficult to<br>

automatically remove this front matter because there is no consistent<br>

delimiter which marks the end of it and the beginning of the actual<br>

book.  (The actual front matter text often varies from file to file.)<br>

Thus regular expression tools like sed aren't of much use.<br>

<br>

Does anyone know of a tool which could help me automatically filter out<br>

the front matter?<br>

<br>

Alternatively, does anyone know of a corpus of similar size which I<br>

could use instead?<br>

<br>

Regards,<br>

Tristan<br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Tristan Miller, Doctoral Researcher<br>

Ubiquitous Knowledge Processing Lab (UKP-TUDA)<br>

Department of Computer Science, Technische Universität Darmstadt<br>

Tel: <a href="tel:%2B49%206151%2016%206166" value="+496151166166">+49 6151 16 6166</a> | Web: <a href="http://www.ukp.tu-darmstadt.de/" target="_blank">http://www.ukp.tu-darmstadt.de/</a><br>

<br>

</font></span><br>_______________________________________________<br>

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

Corpora mailing list<br>

<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

<br></blockquote></div><br></div>