[Corpora-List] Converting PDFs in Arabic to txt/xml for further corpus analysis (fwd)

Laurence Anthony anthony0122 at gmail.com
Thu Sep 11 20:23:16 UTC 2014


Hi,

My own PDF-to-text converter, AntFileConverter, should also work, although I
have not tested it on Arabic script. You can find it on my software page
(just scroll to the middle of the list of tools):

http://www.laurenceanthony.net/software.html
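
Whichever converter is used, it may help to sanity-check the extracted text before building a corpus from it. The following is only a minimal sketch using the Python standard library, not part of AntFileConverter or Tika. It assumes the output is already a Unicode string, and the function names `normalize_arabic` and `arabic_ratio` are hypothetical helpers introduced here for illustration. PDF extraction sometimes yields Arabic "presentation form" code points instead of the base letters; NFKC normalization folds those back, and a simple ratio then indicates whether the text is mostly Arabic script.

```python
import unicodedata


def normalize_arabic(text: str) -> str:
    """Fold Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF)
    back to base letters using NFKC compatibility normalization.
    Note: NFKC also folds other compatibility characters, not only Arabic."""
    return unicodedata.normalize("NFKC", text)


def arabic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that lie in the
    basic Arabic block (U+0600-U+06FF); 0.0 if there are no letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


if __name__ == "__main__":
    # Two presentation-form alefs, as a converter might emit them.
    raw = "\uFE8D\uFE8E"
    cleaned = normalize_arabic(raw)
    print(arabic_ratio(raw), arabic_ratio(cleaned))
```

A low ratio after normalization suggests the converter produced mis-encoded output (or that the PDF is a scanned image and needs OCR). This check does not detect or repair text whose visual order was reversed during extraction.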


Laurence.


###############################################################
Laurence ANTHONY, Ph.D.
Professor
Center for English Language Education in Science and Engineering (CELESE)
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: anthony0122 at gmail.com
WWW: http://www.antlab.sci.waseda.ac.jp/
###############################################################

On Fri, Sep 12, 2014 at 2:29 AM, Craig Pfeifer <craig.pfeifer at gmail.com>
wrote:

> Another option is the open source Apache Tika project:
> https://tika.apache.org/
>
> It *should* handle Arabic properly, with the standard caveat that
> image-only (scanned) PDFs need OCR first.
>
> Craig
>
> ______________
> craig.pfeifer at gmail.com
>
> On Thu, Sep 11, 2014 at 6:45 AM, Eric Atwell <E.S.Atwell at leeds.ac.uk>
> wrote:
>
>> Can anyone recommend PDF-to-txt (or PDF-to-xml) tools for Arabic?
>> I have had enquiries from several Arabic corpus linguistics researchers;
>> an example from Anastasiya Andrusenko in Valencia is below.
>>
>> thanks - Eric Atwell, Leeds University
>>  WWW: http://www.comp.leeds.ac.uk/eric
>>       http://www.comp.leeds.ac.uk/arabic
>>
>> ---------- Forwarded message ----------
>> Date: Thu, 11 Sep 2014 10:50:36 +0100
>> From: Anastasiya Andrusenko <anisika2002 at gmail.com>
>> To: Eric Atwell <E.S.Atwell at leeds.ac.uk>
>> Subject: Converting PDFs in Arabic to txt. for further corpus analysis
>>
>>
>> Hi,
>>
>> I saw your profile on the internet and thought maybe you could help me.
>> My name is Anastasiia Andrusenko. I am currently doing research on
>> metadiscourse features in Arabic research articles (analysis of an Arabic
>> corpus) at the Department of Applied Linguistics of the Universitat
>> Politècnica de València.
>> I have PDF files in Arabic and need them in txt format. The problem is
>> that when I convert them with Adobe Acrobat Professional, the resulting
>> txt files are not readable.
>>
>> Could you please advise on a solution to this problem, or do you know of
>> any tool for text analysis of Arabic?
>> Thank you in advance.
>>
>> Regards,
>>
>> Anastasiia
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>