[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)
Djamel MOSTEFA
djamel.mostefa at techlimed.com
Thu Sep 11 12:29:56 UTC 2014
Hi,
It depends on the type of PDF you want to convert.
If your PDF contains actual text (an office document converted to PDF,
for instance), then pdftotext or Adobe should do the conversion properly.
But if your PDF file is made of images (scans of documents), which is
very common for Arabic PDF files, then you need OCR software
supporting the Arabic language.
For the latter case I would recommend ABBYY FineReader, which gives good
recognition results on Arabic.
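To tell the two cases apart on a batch of files, one rough approach is to run pdftotext first and check how much of the output is actually Arabic script. The sketch below is only an illustration under stated assumptions: it assumes the pdftotext command-line tool is installed and on the PATH, and the function names and the thresholds (min_letters, min_ratio) are hypothetical values that would need tuning, for example for articles with long English abstracts. It will not catch text that contains Arabic letters in a scrambled order.

```python
import subprocess
import unicodedata


def extract_text(pdf_path):
    """Run the pdftotext command-line tool and return its output as a string.

    Assumes pdftotext is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def arabic_ratio(text):
    """Return the fraction of alphabetic characters that are Arabic script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "ARABIC" in unicodedata.name(ch, ""))
    return arabic / len(letters)


def needs_ocr(text, min_letters=50, min_ratio=0.5):
    """Heuristic: too little text, or too little Arabic, suggests a scanned PDF.

    The default thresholds are illustrative assumptions, not tested values.
    """
    letter_count = sum(1 for ch in text if ch.isalpha())
    return letter_count < min_letters or arabic_ratio(text) < min_ratio
```

For a single file, needs_ocr(extract_text("article.pdf")) returning True would suggest sending that file to OCR rather than trusting the extracted text ("article.pdf" is a placeholder name).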
Hope it helps
Djamel
--
*Djamel MOSTEFA*
Technical Director / CTO
42, rue de l'Université 69007 Lyon
Tel: +33 (0) 4 78 58 32 35
Mob: +33 (0) 6 04 42 19 66
www.techlimed.com <http://www.techlimed.com>
On 11/09/2014 12:45, Eric Atwell wrote:
> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
> I have had enquiries from several Arabic corpus linguistics researchers,
> example below from Anastasiya Andrusenko in Valencia
>
> thanks - Eric Atwell, Leeds University
> WWW: http://www.comp.leeds.ac.uk/eric
> http://www.comp.leeds.ac.uk/arabic
>
> ---------- Forwarded message ----------
> Date: Thu, 11 Sep 2014 10:50:36 +0100
> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
>
>
> Hi,
>
> I saw your profile on the internet and thought maybe you could help me.
> My name is Anastasiia Andrusenko. I am currently doing research on
> metadiscourse features in Arabic research articles (analysis of an
> Arabic corpus) at the Department of Applied Linguistics of the
> Universitat Politècnica de València.
> I have PDF files in Arabic and need them in txt format. The problem is
> that when I convert them with Adobe Acrobat Professional, the resulting
> txt files are not readable.
>
> Could you please advise a solution to this problem, or perhaps you know
> of a tool for text analysis for Arabic?
> Thank you in advance
>
> Regards,
>
> Anastasiia
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora