[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Mark Davies Mark_Davies at byu.edu
Fri Oct 12 23:37:58 UTC 2012


>> For tens of thousands of documents or more, pdftotext is the only really fast solution.

I used ScanSoft PDF Converter and ScanSoft OmniPage to process about 145,000 PDF files of historical newspapers and magazines for the 400 million word Corpus of Historical American English (COHA; http://corpus.byu.edu/coha), and I was very pleased with the results. They did a great job even with some newspapers from the 1800s that were printed in very poor typefaces.

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Maximilian Haeussler [max at soe.ucsc.edu]
Sent: Friday, October 12, 2012 4:20 PM
To: Laurence Anthony; r.krishnamurthy at aston.ac.uk
Cc: corpora at uib.no
Subject: Re: [Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

It completely depends on the size and age of your input data.

For best results for a few hundred documents, especially if they are
older and not OCRed yet, I'd use any standard commercial OCR software
on Windows and convert to HTML. This will give the real flow of the
text and separate images nicely from the text. It will also recognize
all text formatting. But these tools are very slow.

If the documents are new, already OCRed and easy to parse, PDFx as a
web service might be useful, as it separates the document into title,
authors, abstract, etc. (but you might use pdfinfo for that, too, or
third-party databases like SFX or CrossRef to get the metadata).

For intermediate sizes (several hundred up to tens of thousands of
PDFs), or if you don't want to optimize your software too much,
PDFMiner, Poppler or pdfbox and derivatives are fast enough and easy
to adapt. They are better than pdftotext, which sometimes stumbles
over images and outputs lots of junk characters, but they are slower.

For tens of thousands of documents or more, pdftotext is the only
really fast solution.
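
As a rough illustration of the large-batch case, here is a minimal
Python sketch that calls the pdftotext command-line tool on every PDF
in a directory, a few files at a time. It assumes pdftotext is
installed and on the PATH. The function names, the flat directory
layout and the default of 4 workers are my own assumptions for the
example, not anything prescribed by the tool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdftotext command line for one PDF (hypothetical helper)."""
    txt_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    # -q suppresses messages; -enc sets the output text encoding.
    return ["pdftotext", "-q", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_one(pdf_path, out_dir):
    """Run pdftotext on one file and report whether it exited cleanly."""
    result = subprocess.run(build_command(pdf_path, out_dir))
    return pdf_path, result.returncode == 0


def convert_all(in_dir, out_dir, workers=4):
    """Convert every *.pdf in in_dir; return the list of files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(Path(in_dir).glob("*.pdf"))
    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: convert_one(p, out_dir), pdfs)
        return [path for path, ok in results if not ok]
```

Calling convert_all("pdfs", "txt") would write one .txt file per PDF
into txt/ and hand back the PDFs that pdftotext could not process, so
those can be retried with one of the slower tools.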

For best results you can combine any of the main solutions, but that
will take even more time.

--
Maximilian Haeussler, max at soe.ucsc.edu
mob +1 831 295 0653 office: +1 831 459 5232


On Fri, Oct 12, 2012 at 6:21 AM, Laurence Anthony <anthony0122 at gmail.com> wrote:
> I've just started working on a simple PDF to text converter. It's
> basically a wrapper around the Python PDFMiner module. I plan to
> extend this shortly to convert .doc(x) files and other file types to
> plain text. Just drag and drop in any PDF files (or use the file menu)
> and hit "Start".
>
> You can download the alpha version (0.0.2) here:
> http://www.antlab.sci.waseda.ac.jp/software/antconverter002/AntConverter.exe
>
> I'll make an official release shortly that you'll be able to download
> from the regular software page of my website:
> http://www.antlab.sci.waseda.ac.jp/software.html
>
> If anyone would like to see a Mac or Linux version developed, please
> let me know.
>
> Regards,
> Laurence.
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora