[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Pavel Vondřička Pavel.Vondricka at ff.cuni.cz
Thu Sep 11 15:04:22 UTC 2014


Hello,

Our experience from the InterCorp project shows that FineReader is basically
usable, but the results are full of mistakes (probably mainly because of
its very limited dictionary). We doubt that you will achieve better
results with any other software, but let us know if you do!

(Of course, the results may be much better (or not) if the PDF is
actually textual rather than purely graphical, as previously suggested, but
that is a completely different story. We have a lot of experience with
that process as well, though not specifically with Arabic.)
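
For the textual-PDF case, one quick sanity check on whatever text comes out
of the extraction step might look like the sketch below. It is only an
illustration under assumptions: the text is assumed to be already extracted
by some tool, the function names arabic_ratio and
normalize_presentation_forms are hypothetical, and Python's standard
unicodedata module is the only dependency. Some PDFs store Arabic as
presentation-form glyphs (U+FB50..U+FDFF, U+FE70..U+FEFF) rather than base
letters; NFKC normalization maps those back to the basic Arabic block,
though it also rewrites other compatibility characters, which may or may
not be wanted in a corpus.

```python
import unicodedata


def arabic_ratio(text):
    """Share of alphabetic characters in the basic Arabic block (U+0600..U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def normalize_presentation_forms(text):
    """Map Arabic presentation forms back to base letters via NFKC.

    Note: NFKC also changes other compatibility characters
    (e.g. Latin ligatures, full-width forms).
    """
    return unicodedata.normalize("NFKC", text)
```

A low ratio before normalization and a high one afterwards would suggest
that the text layer is usable once normalized; a low ratio in both cases
would suggest that the extraction itself failed and OCR may be needed. This
does not address character order (right-to-left) problems.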

Best regards,
Pavel Vondřička, ICNC


> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
> I have had enquiries from several Arabic corpus linguistics researchers,
> example below from Anastasiya Andrusenko in Valencia
>
> thanks - Eric Atwell, Leeds University
>   WWW: http://www.comp.leeds.ac.uk/eric
>        http://www.comp.leeds.ac.uk/arabic
>
> ---------- Forwarded message ----------
> Date: Thu, 11 Sep 2014 10:50:36 +0100
> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
>
>
> Hi,
>
> I saw your profile on the internet and thought maybe you could help me.
> My name is Anastasiia Andrusenko; I am currently doing research on
> metadiscourse features in Arabic research articles (analysis of an Arabic
> corpus) at the Department of Applied Linguistics of the Universitat
> Politècnica de València.
> I have PDF files in Arabic and need them in txt format. The problem
> is that when I convert them with Adobe Acrobat Professional, the txt files
> are not readable.
>
> Could you please advise on a solution to this problem, or perhaps you know
> of a tool for text analysis of Arabic?
> Thank you in advance.
>
> Regards,
>
> Anastasiia
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>


