[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Maximilian Haeussler max at soe.ucsc.edu
Fri Oct 12 22:20:36 UTC 2012


It completely depends on the size and age of your input data.

For best results on a few hundred documents, especially if they are
older and not OCRed yet, I'd use any standard commercial OCR software
on Windows and convert to HTML. This will give the real flow of the
text and separate images nicely from the text. It will also recognize
all text formatting. But such software is very slow.

If the documents are new, already OCRed and easy to parse, PDFx as a
web service might be useful, as it separates the document into title,
authors, abstract, etc. (but you might use pdfinfo for that, too, or
third-party databases like SFX or CrossRef to get the metadata).

For intermediate sizes, from several hundred up to tens of thousands
of PDFs, or if you don't want to optimize your software too much,
PDFMiner, Poppler or pdfbox and derivatives are fast enough and easy
to adapt. They are slower than pdftotext but give better output, as
pdftotext sometimes stumbles over images and outputs lots of junk
characters.

For tens of thousands of documents or more, pdftotext is the only
really fast solution.
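
A minimal sketch of such a batch run, assuming pdftotext is installed
and on the PATH. The function names and the directory layout are made
up for illustration; only the pdftotext call itself (with its -enc
option) is the real command line.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Build the pdftotext call for one file (hypothetical helper)."""
    # -enc UTF-8 sets the encoding of the text output.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_all(src_dir, dst_dir, runner=subprocess.run):
    """Convert every *.pdf in src_dir; return the names of files that failed.

    runner is injectable so the loop can be tried without pdftotext installed.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        txt = dst_dir / (pdf.stem + ".txt")
        result = runner(build_command(pdf, txt))
        # A non-zero exit status means pdftotext could not process the file.
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

Calling convert_all("pdfs", "txt") would write one .txt file per PDF
and return the names that need a second look. If pdftotext is missing,
subprocess.run raises FileNotFoundError instead.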

For best results you can combine any of the main solutions, but that
will take even more time...
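
One way to combine them, as a rough sketch: run the fast converter
first and fall back to a slower one only when the output looks like
junk. Everything here is an assumption for illustration: the
definition of a "junk" character, the 10% threshold, and the fast and
slow callables, which stand in for whatever tools you actually use.

```python
import unicodedata


def junk_ratio(text):
    """Share of characters that look like extraction junk.

    Junk here means control codes, unassigned or private-use code points,
    and the U+FFFD replacement character (an assumed definition).
    """
    if not text:
        return 1.0  # treat empty output as a failed extraction
    junk = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is not junk
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Cn", "Co"):
            junk += 1
    return junk / len(text)


def extract_with_fallback(path, fast, slow, max_junk=0.1):
    """Use the fast extractor unless its output exceeds the junk threshold."""
    text = fast(path)
    if junk_ratio(text) <= max_junk:
        return text
    return slow(path)
```

Since only the problem files go through the slow path, the extra time
stays proportional to how many documents the fast tool gets wrong.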

--
Maximilian Haeussler, max at soe.ucsc.edu
mob +1 831 295 0653 office: +1 831 459 5232


On Fri, Oct 12, 2012 at 6:21 AM, Laurence Anthony <anthony0122 at gmail.com> wrote:
> I've just started working on a simple PDF to text converter. It's
> basically a wrapper around the Python PDFMiner module. I plan to
> extend this shortly to convert .doc(x) files and other file types to
> plain text. Just drag and drop in any PDF files (or use the file menu)
> and hit "Start".
>
> You can download the alpha version (0.0.2) here:
> http://www.antlab.sci.waseda.ac.jp/software/antconverter002/AntConverter.exe
>
> I'll make an official release shortly that you'll be able to download
> from the regular software page of my website:
> http://www.antlab.sci.waseda.ac.jp/software.html
>
> If anyone would like to see a Mac or Linux version developed, please
> let me know.
>
> Regards,
> Laurence.
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
