[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Emmanuel Prochasson eemmanuel at ust.hk
Thu Jun 17 01:35:19 UTC 2010


On 16/06/2010 18:40, John MCKENNY wrote:
>
> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the
> publishers and editors to download all issues of a journal from the
> last 30 years obtainable from our university e-library in the form of
> PDFs amounting to about 3,000,000 words. Starting with a small sample
> (250,000 words), we tried using various methods and software
> including Wordsmith Tools 5 to convert the PDFs into text-only files.
> The result so far has been text-only files with many words clumped
> together, e.g. 'inthefinalanalysisitseems'. Breaking up these clumps
> is a time-consuming business. For this reason, we haven't started
> compiling our larger corpus. We would only build the larger corpus if
> there was some kind of automated or semi-automated way to generate
> text-only files which contained all and only the alphanumeric
> sequences bounded by spaces in the original PDFs, in other words,
> without clumps.
>
> We would be very grateful for any suggestions you might have.
>


I have used Multivalent in the past (http://multivalent.sourceforge.net/);
it gives surprisingly good results, especially for multi-column
documents. It deals pretty well with footnotes (which tend to be
concatenated to the previous paragraph, leading to weird sentence
segmentation).

One issue I had was that it is too clever: some sequences of letters,
like "fi", are output as the single ligature character 'ﬁ' (U+FB01)
rather than as 'f' and 'i'. You'll need a simple script to deal with
those cases.
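
Such a script could look roughly like the sketch below. It is only an
illustration: the choice of Python, the name expand_ligatures and the
list of ligatures covered (the Latin ones in the Unicode range
U+FB00 to U+FB06) are assumptions, not something Multivalent provides.

```python
# Map each Latin ligature code point to the plain letters it stands for.
# Assumption: these seven characters are the ones worth handling; extend
# the table if the extracted text contains others.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with ligature characters replaced by separate letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    import sys

    # Usage: python expand_ligatures.py < extracted.txt > cleaned.txt
    # Assumes a UTF-8 locale so the ligature characters are decoded properly.
    for line in sys.stdin:
        sys.stdout.write(expand_ligatures(line))
```

An explicit table is used here rather than a general Unicode
normalisation so that nothing else in the text is altered.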

And it's free software.

Regards,

-- 
Emmanuel