Arabic-L:GEN:Arabic under Linux

Dilworth Parkinson dilworth_parkinson at byu.edu
Thu Jun 2 16:53:05 UTC 2005


------------------------------------------------------------------------
Arabic-L: Thu 02 Jun  2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic under Linux

-------------------------Messages-----------------------------------
1)
Date: 02 Jun  2005
From:reposted from Corpora
Subject:Arabic under Linux

[moderator's note: the following exchange took place on Corpora, and
I thought some of you might benefit from seeing it.  Dil]

>
> Dear all,
>
> I have a French documents tokenizer under Linux; I want to adapt it
> to Arabic documents.
> Does anyone know how to use Arabic language and how to read
> bilingual file under Linux?
>
> Thanks
>
> Nouha

     From:       andyr at comp.leeds.ac.uk
     Subject:     Re: [Corpora-List] Arabic language under Linux
     Date:     May 29, 2005 4:43:02 AM MDT
     To:       nouha.chaaben at laposte.net
     Cc:       corpora at uib.no


This is not an operating system issue. You read an Arabic file in much
the same way as any other file. The main difference is that you will
need to specify the character encoding the file was saved in.
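
As a minimal sketch of what "specifying a character encoding" looks like
in practice, here is how it might be done in Python 3. The function name
read_text_file and the sample file name are hypothetical, and UTF-8 is
only an assumed default; the encoding has to match whatever the file was
actually saved in.

```python
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the decoded contents of a text file.

    The encoding must match the one the file was saved in;
    "utf-8" is only an assumed default (hypothetical helper).
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical sample file holding one French and one Arabic line.
    sample = Path("sample_ar_fr.txt")
    sample.write_text("Bonjour le monde\nمرحبا بالعالم\n", encoding="utf-8")
    print(read_text_file(sample))
```

Nothing here is specific to Linux; the same call should behave the same
way on any system where the file's bytes are the same.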

In terms of adapting your current tokeniser, it's difficult to advise
what to do because it depends on which programming language you've
used. I've always found Java to be the best for multilingual support,
including Arabic. I've also written an Arabic transliterator in Python,
which wasn't too difficult. Most programming languages will let you
specify an encoding, but it's easier in some than in others.
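
To illustrate why the language matters less than the decoding step, here
is a rough sketch of a word tokeniser in Python 3 that treats French and
Arabic alike once the text has been decoded. It is not the original
poster's tokeniser; the name tokenize is hypothetical, and the approach
(splitting on runs of Unicode word characters) is a deliberately simple
assumption. Note that Arabic short-vowel diacritics are combining marks
rather than letters, so fully vowelled text would probably need extra
handling.

```python
import re

# For str patterns in Python 3, \w is Unicode-aware, so it matches
# Arabic letters as well as accented Latin letters.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Split already-decoded text into runs of word characters.

    Hypothetical helper: punctuation and whitespace act as separators.
    """
    return WORD_PATTERN.findall(text)


if __name__ == "__main__":
    for token in tokenize("Le chat dort. مرحبا بالعالم!"):
        print(token)
```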

If you are unsure about encodings, I found this article to be
particularly good:
http://www.joelonsoftware.com/articles/Unicode.html

If you have a bilingual file, with Arabic and French, then I'd recommend
using the same encoding throughout the file. A Unicode encoding is
ideal: UTF-8 and UTF-16 can both represent every Arabic and French
character. (That is, make sure you save your files as UTF-8 or UTF-16
*before* trying to tokenise them.)
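
If a file is currently in a legacy encoding, one way to get it into a
single Unicode encoding before tokenising is to re-encode it. The sketch
below assumes Python 3 and that you already know the source encoding;
convert_to_utf8 is a hypothetical name, and "cp1256" (Windows Arabic) is
just an example of a codec name Python recognises.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file as UTF-8 (hypothetical helper).

    src_encoding is the encoding the file was originally saved in,
    for example "cp1256" or "iso-8859-1". newline="" keeps the
    original line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


if __name__ == "__main__":
    # Hypothetical file names; create a small cp1256 file, then convert it.
    with open("legacy.txt", "wb") as raw:
        raw.write("Bonjour مرحبا\n".encode("cp1256"))
    convert_to_utf8("legacy.txt", "unicode.txt", "cp1256")
```

If the source encoding is guessed wrongly, the read step will either
raise an error or silently produce the wrong characters, so it is worth
checking the output by eye.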

Andy

     From:       tree at basistech.com
     Subject:     Re: [Corpora-List] Arabic language under Linux
     Date:     May 29, 2005 8:35:30 AM MDT
     To:       nouha.chaaben at laposte.net
     Cc:       corpora at uib.no
     Reply-To:       tree at basistech.com

nouha.chaaben writes:

> I have a French documents tokenizer under Linux; I want to adapt it
> to Arabic documents.
> Does anyone know how to use Arabic language and how to read
> bilingual file under Linux?
>

http://www.arabeyes.org/

     -tree
-- Tom Emerson                          Basis Technology Corp.
Software Architect                      http://www.basistech.com

------------------------------------------------------------------------
End of Arabic-L:  02 Jun  2005


