[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Jörg Tiedemann Jorg.Tiedemann at lingfil.uu.se
Sun Sep 14 08:43:30 UTC 2014


You could also try pdf2xml, which combines Apache Tika, pdftotext and other tools:
https://bitbucket.org/tiedemann/pdf2xml/
It also integrates a language identifier to automatically filter out some garbage.
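
As a rough illustration of the filtering idea (not pdf2xml's actual implementation, which uses a real language identifier), the sketch below keeps only those lines of extracted text in which most letters are Arabic-script characters. The function names `arabic_ratio` and `keep_arabic_lines` and the 0.5 threshold are assumptions made for this example.

```python
import unicodedata
from typing import List


def arabic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Arabic-script.

    A character counts as Arabic if its Unicode name starts with "ARABIC",
    which also covers the presentation forms often found in PDF output.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


def keep_arabic_lines(text: str, threshold: float = 0.5) -> List[str]:
    """Keep lines whose Arabic-letter share is at least `threshold` (assumed value)."""
    return [line for line in text.splitlines() if arabic_ratio(line) >= threshold]
```

A script-based ratio like this is much cruder than language identification: it cannot tell Arabic from Persian or Urdu, and it will not catch Arabic letters extracted in the wrong order.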

Best,
Jörg


**********************************************************************************
 Jörg Tiedemann                                          jorg.tiedemann at lingfil.uu.se
 Dep. of Linguistics and Philology            http://stp.lingfil.uu.se/~joerg/
 Uppsala University                                     tel:  +46 (0)18 - 471 1412
 Box 635, SE-751 26 Uppsala/Sweden   fax: +46 (0)18 - 471 1094

On Sep 11, 2014, at 7:29 PM, Craig Pfeifer wrote:

Another option is the open source Apache Tika project:
https://tika.apache.org/

It *should* handle Arabic properly, with the standard caveats about needing OCR for image PDFs.
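
One way to spot the image-only case is to extract text first and check whether anything usable came back. The sketch below is only an assumption-laden example: it shells out to the pdftotext command-line tool mentioned elsewhere in this thread rather than to Tika, and the helper names `extract_text` and `probably_needs_ocr` and the 20-letter cutoff are hypothetical choices for illustration.

```python
import shutil
import subprocess
from typing import Optional


def extract_text(pdf_path: str) -> Optional[str]:
    """Run pdftotext on `pdf_path` and return the text as a string.

    Returns None if the pdftotext executable is not on the PATH.
    """
    exe = shutil.which("pdftotext")
    if exe is None:
        return None
    # "-enc UTF-8" requests UTF-8 output; "-" sends the text to stdout.
    result = subprocess.run(
        [exe, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def probably_needs_ocr(extracted: str, min_letters: int = 20) -> bool:
    """Guess that a PDF is image-only if almost no letters were extracted.

    `min_letters` is an arbitrary cutoff assumed for this sketch.
    """
    return sum(ch.isalpha() for ch in extracted) < min_letters
```

If `probably_needs_ocr(extract_text("article.pdf") or "")` returns True for a hypothetical file `article.pdf`, the file most likely contains scanned page images and would have to go through an OCR tool instead.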

Craig

______________
craig.pfeifer at gmail.com

On Thu, Sep 11, 2014 at 6:45 AM, Eric Atwell <E.S.Atwell at leeds.ac.uk> wrote:
Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
I have had enquiries from several Arabic corpus linguistics researchers,
example below from Anastasiya Andrusenko in Valencia

thanks - Eric Atwell, Leeds University
 WWW: http://www.comp.leeds.ac.uk/eric
      http://www.comp.leeds.ac.uk/arabic

---------- Forwarded message ----------
Date: Thu, 11 Sep 2014 10:50:36 +0100
From: Anastasiya Andrusenko <anisika2002 at gmail.com>
To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
Subject: Converting PDFs in Arabic to txt. for further corpus analysis


Hi,

I saw your profile on the internet and thought maybe you could help me.
My name is Anastasiia Andrusenko. I am currently doing research on
metadiscourse features in Arabic research articles (analysis of an Arabic corpus)
at the Department of Applied Linguistics of the Universitat Politècnica de
València.
I have PDF files in Arabic and need them in .txt format. The problem
is that when I convert them with Adobe Acrobat Professional, the resulting .txt files are not
readable.

Could you please advise on a solution to this problem, or do you perhaps know of a
tool for text analysis of Arabic?
Thank you in advance.

Regards,

Anastasiia

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


