[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Kristian Kankainen kristian at eki.ee
Thu Sep 11 11:38:11 UTC 2014


Hello!

On GNU/Linux systems there are, among many others, easy-to-use programs
such as 'pdftotext' and 'pdftohtml'. They have served me well, but
everything I work with is in Latin script.
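
For batch conversion, 'pdftotext' can be driven from a small script. The following is only a sketch, assuming 'pdftotext' is installed and on the PATH; the function names are made up for illustration, and whether the output is usable for Arabic still depends on how the PDF was produced.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a pdftotext call with UTF-8 output.

    Hypothetical helper, not part of pdftotext itself.
    """
    # -enc UTF-8 sets the output text encoding explicitly.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path):
    """Convert one PDF to a .txt file next to it and return the new path."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling convert_pdf("article.pdf") would then write article.txt in the same directory.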

What have the complications been with the files converted with Adobe
Acrobat Professional? Problems with encoding or problems with fonts?
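
One rough way to tell the two apart is to look at which characters actually ended up in the converted file. The sketch below is an assumption-laden illustration, not a diagnosis tool: the bucket names and functions are invented here. Many replacement characters or mostly "other" characters would point towards an encoding or character-mapping problem, while many presentation forms may suggest the text was extracted as font glyph shapes rather than plain letters.

```python
from collections import Counter


def classify_char(ch):
    """Put one character into a coarse bucket for a quick sanity check."""
    cp = ord(ch)
    if cp == 0xFFFD:
        return "replacement"        # U+FFFD: a decoding step failed
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"             # basic Arabic block
    if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
        return "presentation_form"  # Arabic Presentation Forms-A and -B
    if ch.isspace():
        return "whitespace"
    return "other"


def profile_text(text):
    """Count how many characters of the text fall into each bucket."""
    return Counter(classify_char(ch) for ch in text)
```

For example, profile_text(open("article.txt", encoding="utf-8").read()) gives the counts for one converted file (the file name is a placeholder).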

Best wishes
Kristian K


11.09.2014 13:45, Eric Atwell wrote:
> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
> I have had enquiries from several Arabic corpus linguistics researchers,
> example below from Anastasiya Andrusenko in Valencia
>
> thanks - Eric Atwell, Leeds University
>  WWW: http://www.comp.leeds.ac.uk/eric
>       http://www.comp.leeds.ac.uk/arabic
>
> ---------- Forwarded message ----------
> Date: Thu, 11 Sep 2014 10:50:36 +0100
> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
>
>
> Hi,
>
> I saw your profile on the internet and thought you might be able to help me.
> My name is Anastasiia Andrusenko. I am currently doing research on
> metadiscourse features in Arabic research articles (analysis of an Arabic
> corpus) at the Department of Applied Linguistics of the Universitat
> Politècnica de València.
> I have PDF files in Arabic and need them in .txt format. The problem is
> that when I convert them with Adobe Acrobat Professional, the resulting
> .txt files are not readable.
>
> Could you please advise on a solution to this problem, or perhaps you
> know of a tool for text analysis of Arabic?
> Thank you in advance
>
> Regards,
>
> Anastasiia
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
