[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Carsten Schnober c.schnober at jpberlin.de
Thu Sep 11 12:09:49 UTC 2014


Hi,
Apart from ABBYY FineReader, there is also Tesseract, which includes a
pre-trained model for Arabic: https://code.google.com/p/tesseract-ocr/

We have only applied it to German Fraktur text, with acceptable
results, so I cannot say anything about Arabic. I don't have
any numbers at hand, but it has been reported to perform slightly worse
than ABBYY FineReader. Also, it does not come with a nice GUI, but there
are some available on the web. On the other hand, it is free and
open-source and has plenty of fine-tuning parameters with which you
might be able to improve corpus-specific results.
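
As a rough illustration of how such a pipeline could look, here is a
minimal Python sketch. It is not something we have run on Arabic data.
It assumes that the pdftoppm tool (from Poppler) and the tesseract
command-line program are installed and on the PATH, and that the Arabic
language data ("ara") is available to Tesseract. The function names and
the "page" file prefix are made up for this example. Tesseract works on
page images rather than PDFs, so each page is rendered to a PNG first.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG.

    pdftoppm writes files named <prefix>-<page number>.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, outbase, lang="ara"):
    """Build the tesseract call for one page image.

    Tesseract appends ".txt" to outbase itself; "-l" selects the
    language model (assumed here: "ara" for Arabic).
    """
    return ["tesseract", str(image), str(outbase), "-l", lang]


def page_number(image):
    """Extract the page number from a name such as page-07.png."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_txt(pdf, workdir, lang="ara"):
    """OCR every page of a PDF and return the concatenated text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)

    texts = []
    # Sort numerically so that page 10 does not come before page 2.
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        outbase = image.with_suffix("")
        subprocess.run(ocr_cmd(image, outbase, lang), check=True)
        txt_file = outbase.with_suffix(".txt")
        texts.append(txt_file.read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling pdf_to_txt("article.pdf", "ocr_tmp") would then return the
recognised text of a (hypothetical) article.pdf as one UTF-8 string. The
resolution and any further Tesseract parameters would probably need
tuning per corpus.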

Best,
Carsten


Am 11.09.2014 um 12:45 schrieb Eric Atwell:
> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
> I have had enquiries from several Arabic corpus linguistics researchers,
> example below from Anastasiya Andrusenko in Valencia
> 
> thanks - Eric Atwell, Leeds University
>  WWW: http://www.comp.leeds.ac.uk/eric
>       http://www.comp.leeds.ac.uk/arabic
> 
> ---------- Forwarded message ----------
> Date: Thu, 11 Sep 2014 10:50:36 +0100
> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
> 
> 
> Hi,
> 
> I saw your profile on the internet and thought maybe you could help me.
> My name is Anastasiia Andrusenko. I am currently doing research on
> metadiscourse features in Arabic research articles (analysis of an
> Arabic corpus) at the Department of Applied Linguistics of the
> Universitat Politècnica de València.
> I have PDF files in Arabic and need them in .txt format. The problem
> is that when I convert them with Adobe Acrobat Professional, the
> resulting .txt files are not readable.
> 
> Could you please advise on a solution to this problem, or do you
> perhaps know of a tool for text analysis of Arabic?
> Thank you in advance.
> 
> Regards,
> 
> Anastasiia
> 
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 
