[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Mark Davies Mark_Davies at byu.edu
Wed Jun 16 16:13:41 UTC 2010


Angus Grieve-Smith noted:

>>     It's not clear from the initial description: are these PDF files with text in them, or with scanned images?  That makes a big difference.

This is the crucial point, which few others seem to have addressed. If these PDF files were created with Acrobat or a similar tool in the past 5-10 years, they will probably have the text stored in the PDF itself. But if they are PDFs of journal articles from 20-30 years ago, they are likely just *images* of those pages, with no text stored in the PDF. In that case, you'll have to run OCR on the PDFs, just as you would with any scanned image of a book.
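
As a rough way to triage a large batch, one could run any text extractor over each PDF first and check how much text comes back. The sketch below assumes the per-page text has already been extracted by some other tool; the function name `needs_ocr` and the 50-character cutoff are hypothetical choices for illustration, not something from a particular package.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is image-only, given the text extracted from each page.

    page_texts: list of strings, one per page, as returned by whatever
    text extractor was used (assumed to exist; not shown here).
    min_chars_per_page: hypothetical cutoff; tune it on a sample of your files.
    """
    if not page_texts:
        # No pages with text at all: treat as image-only.
        return True
    # Count non-whitespace characters only, so blank lines do not inflate the total.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

A file flagged this way would go to the OCR queue; the rest can be used as extracted.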

I've just finished the 400 million word Corpus of *Historical* American English (http://corpus.byu.edu/coha), which is based on 140,000 texts from the 1810s to the 2000s, many of which were originally PDF *images* (newspapers and magazines from the 1800s, for example). I used OmniPage to OCR the 50,000-60,000 PDFs. The nice thing about OmniPage (and FineReader, and others too, I imagine) is that it can be run in batch mode to process many texts. Then you can run other scripts on the output to clean things up further. In my case, for example, I compared each of the 50,000-60,000 files to texts from the Corpus of Contemporary American English (http://www.americancorpus.org) to find the ones that had obviously been OCR'ed very poorly, and these were then excluded from the corpus.
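
One simple way to do this kind of filtering is to measure what share of the words in each OCR'ed file also appear in a word list built from a clean reference corpus, and to reject files that fall below some cutoff. The following is only a minimal sketch of that idea, not the actual scripts used for COHA; the function names, the 0.8 threshold, and the tiny vocabulary in the usage example are all assumptions.

```python
import re

# Alphabetic runs only; digits and punctuation produced by bad OCR are skipped.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def known_word_ratio(text, vocabulary):
    """Fraction of alphabetic tokens in `text` found in `vocabulary` (a set of lowercase words)."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


def split_by_quality(texts, vocabulary, threshold=0.8):
    """Split files into (kept, rejected) lists of names.

    texts: dict mapping a file name to its OCR output.
    threshold: hypothetical cutoff; choose it by inspecting a sample of files.
    """
    kept, rejected = [], []
    for name, text in texts.items():
        if known_word_ratio(text, vocabulary) >= threshold:
            kept.append(name)
        else:
            rejected.append(name)
    return kept, rejected


if __name__ == "__main__":
    # Toy vocabulary standing in for a word list from a reference corpus.
    vocab = {"the", "cat", "sat", "on", "mat"}
    sample = {
        "good.txt": "The cat sat on the mat",
        "bad.txt": "Tlie eat sat 0n tlie rnat",
    }
    print(split_by_quality(sample, vocab))
```

For historical texts the threshold would need to be fairly forgiving, since older spellings and proper names will be missing from a contemporary word list even when the OCR is good.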

The bottom line, I think, is that if there are a lot of old, image-based PDFs, you may want to invest in a program like OmniPage or FineReader.

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================