[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Christian Chiarcos christian.chiarcos at web.de
Wed Jun 16 11:34:01 UTC 2010


A more suitable candidate may be pdftohtml (http://pdftohtml.sourceforge.net),
which converts PDF to HTML or XML. I have used it successfully to compile
corpora of German and Hausa. For German, everything worked surprisingly well,
including the conversion of the special characters ä, ö, ü and ß. The results
for Hausa were more problematic: pdftohtml had trouble with the special
characters for glottalized b, k and d, which were rendered as plain b, k and d.
Clumps occurred occasionally, but less frequently than with ps2ascii, for
example.
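
In case it helps, here is a rough Python sketch of how the XML output could be
turned into plain text lines and checked for the Hausa hooked letters. It is
not the script I used. It assumes that pdftohtml is on the PATH, that
"-xml" selects XML output and "-i" skips images, that the result is written to
<out_base>.xml, and that each line of text sits in a <text> element inside a
<page> element. The function names and the sample string are made up for
illustration.

```python
import subprocess
import xml.etree.ElementTree as ET

# Hausa hooked letters (lower and upper case) that may get flattened to b, d, k.
HOOKED = "ɓɗƙƁƊƘ"


def run_pdftohtml(pdf_path, out_base):
    """Call pdftohtml; the flags and output name are assumptions, see above."""
    subprocess.run(["pdftohtml", "-xml", "-i", pdf_path, out_base], check=True)
    return out_base + ".xml"


def text_lines(xml_string):
    """Return the non-empty text of every <text> element, page by page."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            line = "".join(node.itertext()).strip()
            if line:
                lines.append(line)
    return lines


def count_hooked(lines):
    """Count hooked letters; zero for a Hausa text suggests they were lost."""
    return sum(ch in HOOKED for line in lines for ch in line)


# Hand-written sample in the assumed output format, not real pdftohtml output.
SAMPLE = (
    '<pdf2xml><page number="1">'
    '<text top="10" left="5" width="80" height="12" font="0">'
    'Straße und <b>Äpfel</b></text>'
    '<text top="30" left="5" width="80" height="12" font="0">'
    'ɗaki da ƙofa</text>'
    '</page></pdf2xml>'
)

lines = text_lines(SAMPLE)
hooked = count_hooked(lines)
```

For a real file, the contents of the XML file returned by run_pdftohtml would
be read and passed to text_lines in place of SAMPLE.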

Best,
Christian
-- 
Christian Chiarcos
University of Potsdam/Germany
Collaborative Research Center 632
Project D1 "Linguistic Data Base for Information Structure"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


