Arabic-L:GEN:Arabic under Linux
Dilworth Parkinson
dilworth_parkinson at byu.edu
Thu Jun 2 16:53:05 UTC 2005
------------------------------------------------------------------------
Arabic-L: Thu 02 Jun 2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic under Linux
-------------------------Messages-----------------------------------
1)
Date: 02 Jun 2005
From:reposted from Corpora
Subject:Arabic under Linux
[moderator's note: the following exchange took place on Corpora, and
I thought some of you might benefit from seeing it. Dil]
>
> Dear all,
>
> I have a tokenizer for French documents under Linux; I want to adapt
> it to Arabic documents.
> Does anyone know how to work with Arabic text and how to read a
> bilingual file under Linux?
>
> Thanks
>
> Nouha
From: andyr at comp.leeds.ac.uk
Subject: Re: [Corpora-List] Arabic language under Linux
Date: May 29, 2005 4:43:02 AM MDT
To: nouha.chaaben at laposte.net
Cc: corpora at uib.no
This is not an operating system issue. You read an Arabic file much in
the same way as any file. The main difference is that you will need to
specify a character encoding.
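As a minimal sketch of that point, assuming Python 3 and a file saved as UTF-8: the encoding is passed explicitly when the file is opened, and nothing else about reading the file changes. The helper name read_text and the file name sample.txt are made up for this illustration.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding its bytes with the given encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Write a small bilingual sample first so the example is self-contained.
sample = "Bonjour le monde\nمرحبا بالعالم\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    # Reading Arabic is the same call as reading French; only the
    # encoding argument has to match how the file was saved.
    text = read_text(path)
```

If the encoding argument does not match the bytes on disk, Python either raises UnicodeDecodeError or silently produces wrong characters, depending on the encodings involved.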
In terms of adapting your current tokeniser, it's difficult to advise
what to do because it depends on which programming language you've used.
I've always found Java to be the best for multilingual support,
including Arabic. I've also written an Arabic transliterator in Python,
which wasn't too difficult. All programming languages will let you
specify an encoding, but it's easier in some than others.
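For illustration only, and assuming Python 3 rather than whatever language the original tokeniser uses, a very simple script-agnostic tokeniser can be sketched with a regular expression. This is not the poster's tokeniser; the name tokenize is hypothetical.

```python
import re

# In Python 3 string patterns, \w matches letters and digits from any
# script, so the same pattern covers both French and Arabic words.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split already-decoded text into runs of word characters."""
    return TOKEN.findall(text)


tokens = tokenize("Bonjour le monde مرحبا بالعالم")
```

This sketch works on decoded text, not bytes, which is why the encoding has to be handled first. Vocalised Arabic (text with short-vowel diacritics) and French elisions such as "l'école" may need extra handling beyond this pattern.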
If you are unsure about encodings, I found this article to be
particularly good:
http://www.joelonsoftware.com/articles/Unicode.html
If you have a bilingual file, with Arabic and French, then I'd recommend
using the same encoding throughout the file. A Unicode encoding is
ideal. UTF-8 should be adequate, although UTF-16 will certainly be fine
(that is, make sure you save your files in the chosen encoding *before*
trying to tokenise them).
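One way to do that conversion step can be sketched in Python 3. The assumption here is that the source file is in the legacy Windows-1256 code page (Python codec name cp1256), which covers Arabic plus the common French accented letters; the actual source encoding of any given file has to be known or checked. The helper name convert_encoding and the file names are hypothetical.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Re-save a text file in another encoding, leaving line endings alone."""
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


sample = "café مرحبا\n"
with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "legacy.txt")    # hypothetical file names
    unified = os.path.join(tmp, "unified.txt")
    # Simulate a bilingual file saved in a legacy code page (assumed cp1256).
    with open(legacy, "wb") as handle:
        handle.write(sample.encode("cp1256"))
    convert_encoding(legacy, unified, "cp1256")
    with open(unified, encoding="utf-8") as handle:
        converted = handle.read()
```

Passing "utf-16" as dst_encoding would produce a UTF-16 file instead; the tokeniser then only needs to open the converted file with that one encoding.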
Andy
From: tree at basistech.com
Subject: Re: [Corpora-List] Arabic language under Linux
Date: May 29, 2005 8:35:30 AM MDT
To: nouha.chaaben at laposte.net
Cc: corpora at uib.no
Reply-To: tree at basistech.com
nouha.chaaben writes:
> I have a tokenizer for French documents under Linux; I want to adapt
> it to Arabic documents.
> Does anyone know how to work with Arabic text and how to read a
> bilingual file under Linux?
>
http://www.arabeyes.org/
-tree
--
Tom Emerson                                   Basis Technology Corp.
Software Architect                          http://www.basistech.com
------------------------------------------------------------------------
End of Arabic-L: 02 Jun 2005