[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Emmanuel Prochasson
eemmanuel at ust.hk
Thu Jun 17 01:35:19 UTC 2010
On 16/06/2010 18:40, John MCKENNY wrote:
>
> Does anyone have a solution to the problem we are facing in a corpus
> linguistic research project? We have been given permission by the
> publishers and editors to download all issues of a journal from the
> last 30 years obtainable from our university e-library in the form of
> PDFs amounting to about 3,000,000 words. Starting with a small sample
> (250,000 words), we tried using various methods and software
> including Wordsmith Tools 5 to convert the PDFs into text-only files.
> The result so far has been text-only files with many words clumped
> together e.g. 'inthefinalanalysisitseems'. Breaking up these clumps
> is a time-consuming business. For this reason, we haven't started
> compiling our larger corpus. We would only build the larger corpus if
> there was some kind of automated or semi-automated way to generate
> text-only files which contained all and only the alphanumeric
> sequences bounded by spaces in the original PDFs, in other words,
> without clumps.
>
> We would be very grateful for any suggestions you might have.
>
I have used Multivalent in the past (http://multivalent.sourceforge.net/).
It gives surprisingly good results, especially for multi-column
documents. It deals pretty well with footnotes (which tend to be
concatenated to the previous paragraph, leading to odd sentence
segmentation).
One issue I had was that it is too clever: some letter sequences,
like "fi", are output as the single ligature character 'ﬁ' rather than
as 'f' and 'i'. You'll need a simple script to deal with those issues.
And it's free software.
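As a rough idea of what such a script might look like, here is a minimal
Python sketch that expands the Latin ligature characters back into plain
letters. The names LIGATURES and expand_ligatures are made up for this
example, and the list of ligatures is an assumption based on the Unicode
"Alphabetic Presentation Forms" block, not on anything Multivalent
documents; check which characters actually turn up in your own output.

```python
# Map each single-character Latin ligature to its plain-letter spelling.
# Assumption: these are the ligatures that may appear in the extracted text.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long-s + t ligature
    "\ufb06": "st",
}

# Build the translation table once; str.translate accepts
# single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with ligature characters replaced by separate letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    import sys

    # Filter usage: read extracted text on stdin, write cleaned text to stdout.
    for line in sys.stdin:
        sys.stdout.write(expand_ligatures(line))
```

Saved under a name of your choice, it can be run as a filter over each
extracted text file, reading from standard input and writing to standard
output. Python's unicodedata.normalize("NFKC", text) would also expand
these ligatures, but it rewrites many other characters as well, so an
explicit table is the more conservative choice for a corpus.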
Regards,
--
Emmanuel