[Corpora-List] Arabic language under Linux

Andy Roberts andyr at comp.leeds.ac.uk
Sun May 29 10:43:02 UTC 2005


This is not an operating system issue. You read an Arabic file much in
the same way as any file. The main difference is that you will need to
specify a character encoding.
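
As a rough sketch of what that looks like, assuming Python and a file
saved as UTF-8 (the helper names read_text and demo_round_trip are made
up for illustration, they are not from any library):

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding its bytes with the given encoding.

    Hypothetical helper: the encoding must match how the file was saved,
    UTF-8 is only an assumption here.
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


def demo_round_trip():
    """Write a small Arabic/French sample to a temp file and read it back."""
    sample = "مرحبا بالعالم\nBonjour le monde\n"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(sample)
        # True when the decoded text matches what was written
        return read_text(path) == sample
    finally:
        os.remove(path)
```

The only Arabic-specific part is the encoding argument; everything else
is the same as reading any other text file.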

In terms of adapting your current tokeniser, it's difficult to advise
what to do because it depends on what programming language you've used.
I've always found Java to be the best for multilingual support,
including Arabic. I've also written an Arabic transliterator in Python
which wasn't too difficult. All programming languages will let you
specify an encoding, but it's easier in some than others.
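
To give a flavour of it, here is a minimal tokeniser sketch in Python.
It is only an illustration under my own assumptions (split on runs of
word characters, text already decoded to Unicode strings), not a
description of your tokeniser; the names WORD and tokenise are
hypothetical.

```python
import re

# For str patterns in Python 3, \w is Unicode-aware, so it matches
# Arabic letters as well as accented Latin ones.
# Caveat: Arabic vowel marks (combining diacritics) are not matched by
# \w, so vocalised text would need extra handling.
WORD = re.compile(r"\w+")


def tokenise(text):
    """Return the runs of word characters found in text, in order."""
    return WORD.findall(text)
```

Because the pattern works on decoded text rather than raw bytes, the
same function can be applied to French, Arabic, or a line mixing both.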

If you are unsure about encodings, I found this article to be
particularly good:
http://www.joelonsoftware.com/articles/Unicode.html

If you have a bilingual file, with Arabic and French, then I'd recommend
using the same encoding throughout the file. A Unicode encoding is
ideal. UTF-8 should be adequate, and UTF-16 will certainly be fine
(that is, make sure you save your files as UTF-8 or UTF-16 *before*
trying to tokenise them).
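
If your files are currently in some older encoding, a conversion step
along these lines would do it. This is a sketch in Python; the function
names are hypothetical, and the source encoding (cp1256, the Windows
Arabic code page, in the demo) is just an assumption, so substitute
whatever your files actually use.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-save a text file as UTF-8, decoding it from src_encoding."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


def demo_conversion():
    """Convert a small cp1256 sample and return the UTF-8 text read back."""
    sample = "مرحبا بالعالم\nBonjour le monde\n"
    workdir = tempfile.mkdtemp()
    src_path = os.path.join(workdir, "legacy.txt")
    dst_path = os.path.join(workdir, "unicode.txt")
    try:
        # Assumed legacy encoding for the demo only
        with open(src_path, "w", encoding="cp1256", newline="") as handle:
            handle.write(sample)
        convert_to_utf8(src_path, dst_path, "cp1256")
        with open(dst_path, "r", encoding="utf-8", newline="") as handle:
            return handle.read()
    finally:
        for path in (src_path, dst_path):
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(workdir)
```

Once everything is in one Unicode encoding, the tokeniser only ever has
to deal with a single kind of input.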

Andy

On Sun, 29 May 2005, nouha.chaaben wrote:

>
>
> Dear all,
>
> I have a French documents tokenizer under Linux; I want to adapt it to Arabic documents.
> Does anyone know how to use Arabic language and how to read bilingual file under Linux?
>
> Thanks
>
> Nouha
> ******************************
> Nouha Chaâben
> PhD Student at Faculty
> of Economic Sciences
> and management of Sfax, Tunisia
>
> Email : nouha.chaaben at laposte.net


