[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Maximilian Haeussler max at soe.ucsc.edu
Fri Oct 12 23:46:48 UTC 2012


I'm very interested in exactly how you did that. I tried OmniPage
around three years ago on a relatively big Windows server, and it
consistently crashed after a few thousand files. Did you use a
cluster?

Do you remember how you configured it or if there were any problems?


On Fri, Oct 12, 2012 at 4:37 PM, Mark Davies <Mark_Davies at byu.edu> wrote:
>>> For tens of thousands of documents or more, pdftotext is the only really fast solution.
>
> I used ScanSoft PDF Converter and ScanSoft OmniPage to process about 145,000 PDF files of historical newspapers and magazines for the 400 million word Corpus of Historical American English (COHA; http://corpus.byu.edu/coha), and I was very pleased with the results. It did a great job even with some newspapers from the 1800s that had very poor typefaces.
>
> Mark D.
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
> ________________________________________
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Maximilian Haeussler [max at soe.ucsc.edu]
> Sent: Friday, October 12, 2012 4:20 PM
> To: Laurence Anthony; r.krishnamurthy at aston.ac.uk
> Cc: corpora at uib.no
> Subject: Re: [Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"
>
> It completely depends on the size and age of your input data.
>
> For best results with a few hundred documents, especially if they are
> older and not OCRed yet, I'd use any standard commercial OCR software
> on Windows and convert to HTML. This will give the real flow of the
> text and separate images nicely from the text. It will also recognize
> all text formatting. But these programs are very slow.
>
> If the documents are new, already OCRed and easy to parse, PDFx as a
> web service might be useful, as it separates the document into title,
> authors, abstract, etc. (but you might use pdfinfo for that, too, or
> third-party databases like SFX or CrossRef to get the metadata).
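As a rough illustration of the pdfinfo route mentioned above, here is a
minimal Python sketch. It assumes the pdfinfo command-line tool is
installed and on the PATH; the function names parse_pdfinfo and
pdf_metadata are made up for this example. Note that pdfinfo only reports
what is stored in the PDF's own info fields (title, author, page count and
so on), so it will not give you an abstract.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a plain dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as timestamps
        # that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_metadata(path: str) -> dict:
    """Run pdfinfo on one file (assumed to be installed) and parse its output."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Calling pdf_metadata("paper.pdf") would return a dict of whatever fields
pdfinfo prints for that file; fields that the PDF does not carry are
simply missing from the dict.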
>
> For intermediate sizes, from several hundred up to tens of thousands
> of PDFs, or if you don't want to optimize your software too much,
> PDFMiner, Poppler or PDFBox and derivatives are fast enough and easy
> to adapt. They are better than pdftotext, which sometimes stumbles
> over images and outputs lots of junk characters, but they are slower.
>
> For tens of thousands of documents or more, pdftotext is the only
> really fast solution.
>
> For best results you can combine any of the main solutions but that
> will take even more time...
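One way to combine the solutions above, sketched in Python: run pdftotext
over everything first, then flag the files whose output looks like junk so
that only those go to a slower tool. This is only a sketch under
assumptions: it expects the pdftotext command-line tool on the PATH, and
the function names, the junk heuristic, the 0.3 threshold and the worker
count are all invented for illustration and would need tuning on real
data.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdftotext_cmd(pdf: Path, out_dir: Path) -> list[str]:
    """Build the argv for one call of the form: pdftotext in.pdf out.txt"""
    return ["pdftotext", str(pdf), str(out_dir / (pdf.stem + ".txt"))]


def junk_ratio(text: str) -> float:
    """Share of non-whitespace characters that are neither alphanumeric nor
    common punctuation. A crude, assumed signal of garbled output."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty output as a failed extraction
    ok = sum(c.isalnum() or c in ".,;:!?'\"()[]-/%&" for c in chars)
    return 1.0 - ok / len(chars)


def convert_one(pdf: Path, out_dir: Path, max_junk: float = 0.3) -> tuple[Path, bool]:
    """Run pdftotext on one file; return (pdf, True) if the text looks usable."""
    cmd = pdftotext_cmd(pdf, out_dir)
    result = subprocess.run(cmd, capture_output=True)
    out = Path(cmd[2])
    if result.returncode != 0 or not out.exists():
        return pdf, False
    text = out.read_text(encoding="utf-8", errors="replace")
    return pdf, junk_ratio(text) <= max_junk


def convert_all(pdf_dir: Path, out_dir: Path, workers: int = 4) -> list[Path]:
    """Convert every *.pdf in pdf_dir; return the files to retry with a slower tool."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(pdf_dir.glob("*.pdf"))
    # Threads are enough here because the real work happens in subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: convert_one(p, out_dir), pdfs))
    return [pdf for pdf, ok in results if not ok]
```

The list returned by convert_all(Path("pdfs"), Path("txt")) is the set of
files worth handing to PDFMiner or an OCR package, so the slow tools only
see the documents that the fast pass could not handle.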
>
> --
> Maximilian Haeussler, max at soe.ucsc.edu
> mob +1 831 295 0653 office: +1 831 459 5232
>
>
> On Fri, Oct 12, 2012 at 6:21 AM, Laurence Anthony <anthony0122 at gmail.com> wrote:
>> I've just started working on a simple PDF to text converter. It's
>> basically a wrapper around the Python PDFMiner module. I plan to
>> extend this shortly to convert .doc(x) files and other file types to
>> plain text. Just drag and drop in any PDF files (or use the file menu)
>> and hit "Start".
>>
>> You can download the alpha version (0.0.2) here:
>> http://www.antlab.sci.waseda.ac.jp/software/antconverter002/AntConverter.exe
>>
>> I'll make an official release shortly that you'll be able to download
>> from the regular software page of my website:
>> http://www.antlab.sci.waseda.ac.jp/software.html
>>
>> If anyone would like to see a Mac or Linux version developed, please
>> let me know.
>>
>> Regards,
>> Laurence.
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora



