[Corpora-List] converting PDFs to ASCII or text-only files without clumps
Christian Chiarcos
christian.chiarcos at web.de
Wed Jun 16 11:34:01 UTC 2010
A more suitable candidate may be pdftohtml (http://pdftohtml.sourceforge.net),
which converts PDF to HTML or XML. I've applied it successfully to
compile corpora of German and Hausa. For German, everything worked
surprisingly well (including the conversion of the special characters äöüß).
The results for Hausa were a bit more problematic, as pdftohtml had problems
with the special characters for glottalized b, k, d (which were rendered as
plain b, k, d). Clumps occurred occasionally, but less frequently than with
ps2ascii, for example.
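
In case it helps, here is a rough Python sketch of how such a conversion
could be scripted. It is not the exact procedure I used. It assumes that a
pdftohtml binary is on the PATH, that it accepts the -xml, -i and -enc
options, and that it writes <output base>.xml with one <text> element per
text chunk. The function names (pdf_to_xml, xml_to_lines,
has_hooked_letters) are made up for illustration. The last function is a
quick check for whether the hooked Hausa letters survived the conversion.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

# Hooked letters of Hausa orthography (glottalized b, d, k), both cases.
HAUSA_HOOKED = set("ɓɗƙƁƊƘ")


def pdf_to_xml(pdf_path, out_base):
    """Run pdftohtml in XML mode (assumes the binary is on the PATH)."""
    # -xml: XML output, -i: ignore images, -enc: output text encoding
    subprocess.run(
        ["pdftohtml", "-xml", "-i", "-enc", "UTF-8", str(pdf_path), str(out_base)],
        check=True,
    )
    # Assumption: the tool appends ".xml" to the output base name.
    return Path(str(out_base) + ".xml")


def xml_to_lines(xml_text):
    """Collect the text of every <text> element, skipping empty chunks."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_el in root.iter("text"):
        # itertext() also picks up text nested in <b> or <i> children.
        line = "".join(text_el.itertext()).strip()
        if line:
            lines.append(line)
    return lines


def has_hooked_letters(lines):
    """Return True if any hooked Hausa letter occurs in the lines."""
    return any(ch in HAUSA_HOOKED for line in lines for ch in line)
```

If has_hooked_letters() returns False for a Hausa document that is known to
contain these letters, the special characters were most likely lost during
the conversion.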
Best,
Christian
--
Christian Chiarcos
University of Potsdam/Germany
Collaborative Research Center 632
Project D1 "Linguistic Data Base for Information Structure"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora