[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)
Jörg Tiedemann
Jorg.Tiedemann at lingfil.uu.se
Sun Sep 14 08:43:30 UTC 2014
You could also try pdf2xml, which combines Apache Tika, pdftotext and other tools:
https://bitbucket.org/tiedemann/pdf2xml/
It also integrates a language identifier to automatically filter out some garbage.
Best,
Jörg
**********************************************************************************
Jörg Tiedemann jorg.tiedemann at lingfil.uu.se
Dep. of Linguistics and Philology http://stp.lingfil.uu.se/~joerg/
Uppsala University tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/Sweden fax: +46 (0)18 - 471 1094
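The garbage filtering mentioned above can be approximated with a simple script-ratio check. The sketch below is not pdf2xml's language identifier; it is a hypothetical stand-in that measures how many letters in the extracted text belong to the basic Arabic Unicode block. The function names and the 0.5 threshold are assumptions made for illustration. NFKC normalization is applied first because PDF extraction often yields Arabic presentation forms rather than base letters.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Return the share of letters that fall in the basic Arabic block."""
    # NFKC folds Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF)
    # back to the base letters in U+0600-U+06FF.
    normalized = unicodedata.normalize("NFKC", text)
    letters = [ch for ch in normalized if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_like_arabic(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep text whose letters are mostly Arabic."""
    # The threshold is an assumed value, not taken from any tool.
    return arabic_ratio(text) >= threshold
```

Output that fails this check (for example, Latin-1 mojibake produced by a bad conversion) can be set aside for a second pass with another extractor.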
On Sep 11, 2014, at 7:29 PM, Craig Pfeifer wrote:
Another option is the open source Apache Tika project:
https://tika.apache.org/
It *should* handle Arabic properly, with the standard caveats about needing OCR for image PDFs.
Craig
______________
craig.pfeifer at gmail.com
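The OCR caveat above can be checked mechanically. The sketch below uses pdftotext (from Poppler) rather than Tika, simply because its command line is short; it assumes pdftotext is installed and on the PATH. The helper names and the 20-letter cutoff are hypothetical. A PDF that yields almost no letters after extraction is probably a scanned image and needs OCR first.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call; -enc UTF-8 requests UTF-8 output."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def needs_ocr(extracted_text: str, min_letters: int = 20) -> bool:
    """Guess that a PDF is image-only if extraction yields few letters."""
    # min_letters is an assumed cutoff chosen for illustration.
    letters = sum(1 for ch in extracted_text if ch.isalpha())
    return letters < min_letters


def extract(pdf_path) -> str:
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

A typical use would be to call `extract()` on each file and route any file for which `needs_ocr()` returns True to an OCR tool instead.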
On Thu, Sep 11, 2014 at 6:45 AM, Eric Atwell <E.S.Atwell at leeds.ac.uk> wrote:
Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
I have had enquiries from several Arabic corpus linguistics researchers,
example below from Anastasiya Andrusenko in Valencia
thanks - Eric Atwell, Leeds University
WWW: http://www.comp.leeds.ac.uk/eric
http://www.comp.leeds.ac.uk/arabic
---------- Forwarded message ----------
Date: Thu, 11 Sep 2014 10:50:36 +0100
From: Anastasiya Andrusenko <anisika2002 at gmail.com>
To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
Subject: Converting PDFs in Arabic to txt. for further corpus analysis
Hi,
I saw your profile on the internet and thought maybe you could help me.
My name is Anastasiia Andrusenko; I am currently doing research on
metadiscourse features in Arabic Research Articles (Analysis of Arabic corpus)
at the Department of Applied Linguistics of the Universitat Politècnica de
València.
I have PDF files in Arabic. I need them to be in txt format. But the problem
is that after converting them with Adobe Acrobat Professional, the txt files
are not readable.
Could you please advise on any solution to this problem, or maybe you know of
any tool for text analysis for Arabic?
Thank you in advance
Regards,
Anastasiia
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora