[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Craig Pfeifer craig.pfeifer at gmail.com
Thu Sep 11 17:29:02 UTC 2014


Another option is the open source Apache Tika project:
https://tika.apache.org/

It *should* handle Arabic properly, with the standard caveats about needing
OCR for image PDFs.
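
A rough sketch of how this might look in practice, assuming Java is installed
and the Tika command-line jar has been downloaded (the file name tika-app.jar
and the helper names extract_text and arabic_ratio are placeholders made up for
this example, not part of Tika). It calls the Tika app with its --text option,
then checks how much of the output is actually Arabic script. Output with few
or no letters would suggest an image-only PDF that needs OCR first.

```python
import subprocess
import unicodedata


def extract_text(pdf_path, tika_jar="tika-app.jar"):
    """Run the Tika command-line app and return the extracted plain text.

    Assumes `java` is on the PATH and `tika_jar` points at a downloaded
    tika-app jar (the default path here is only a placeholder).
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", "--encoding=UTF-8", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def arabic_ratio(text):
    """Return the fraction of alphabetic characters that are Arabic script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all: possibly a scanned (image-only) PDF.
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)
```

Something like arabic_ratio(extract_text("article.pdf")) could then be used to
flag files for manual inspection when the value is unexpectedly low. The
threshold to use is a judgment call and would depend on the corpus.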

Craig

______________
craig.pfeifer at gmail.com

On Thu, Sep 11, 2014 at 6:45 AM, Eric Atwell <E.S.Atwell at leeds.ac.uk> wrote:

> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
> I have had enquiries from several Arabic corpus linguistics researchers,
> example below from Anastasiya Andrusenko in Valencia
>
> thanks - Eric Atwell, Leeds University
>  WWW: http://www.comp.leeds.ac.uk/eric
>       http://www.comp.leeds.ac.uk/arabic
>
> ---------- Forwarded message ----------
> Date: Thu, 11 Sep 2014 10:50:36 +0100
> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
>
>
> Hi,
>
> I saw your profile on the internet and thought maybe you could help me.
> My name is Anastasiia Andrusenko, currently I am doing research on
> metadiscourse features in Arabic Research Articles (Analysis of Arabic
> corpus)
> at the Department of Applied Linguistics of the Universitat Politècnica de
> València.
> I have PDF files in Arabic. I need them to be in txt format. But the
> problem is that when I convert them with Adobe Acrobat Pro, the txt files
> are not readable.
>
> Could you please advise on any solution to this problem, or maybe you know
> of a tool for text analysis for Arabic?
> Thank you in advance
>
> Regards,
>
> Anastasiia
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>