<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; "><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">-------------------------------------------------------------------------</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Arabic-L: Thu 02 Jun 2005</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Moderator: Dilworth Parkinson <<A href="mailto:dilworth_parkinson@byu.edu">dilworth_parkinson@byu.edu</A>></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">[To post messages to the list, send them to <A href="mailto:arabic-l@byu.edu">arabic-l@byu.edu</A>]</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">[To unsubscribe, send message from same address you subscribed from to</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><A href="mailto:listserv@byu.edu">listserv@byu.edu</A> with first line reading:</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> unsubscribe arabic-l ]</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">-------------------------Directory------------------------------------</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">1) Subject:Arabic under Linux</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">-------------------------Messages-----------------------------------</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">1)</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Date: 02 Jun 2005</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">From:reposted from Corpora</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Subject:Arabic under Linux</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">[moderator's note: the following exchange took place on Corpora, and I thought some of you might benefit from seeing it. Dil]</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Dear all,</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">I have a French documents tokenizer under Linux; I want to adapt it to Arabic documents.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Does anyone know how to use Arabic language and how to read bilingual file under Linux?</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Thanks</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Nouha</DIV></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>From: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:andyr@comp.leeds.ac.uk">andyr@comp.leeds.ac.uk</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Subject: </B><SPAN class="Apple-converted-tab"> </SPAN><B>Re: [Corpora-List] Arabic language under Linux</B></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Date: </B><SPAN class="Apple-converted-tab"> </SPAN>May 29, 2005 4:43:02 AM MDT</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>To: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:nouha.chaaben@laposte.net">nouha.chaaben@laposte.net</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Cc: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:corpora@uib.no">corpora@uib.no</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">This is not an operating system issue. You read an Arabic file much in</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">the same way as any file. The main difference is that you will need to</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">specify a character encoding.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">In terms of adapting your current tokeniser, it's difficult to advise</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">what to do because it depends what programming language you've used.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">I've always found Java to be the best for multilingual support,</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">including Arabic. I've also written an Arabic transliterator in Python</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">which wasn't too difficult. All programming will let you specify an</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">encoding, but it's easier in some than others.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">If you are unsure about encodings, I found this article to be</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">particularly good:</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><A href="http://www.joelonsoftware.com/articles/Unicode.html">http://www.joelonsoftware.com/articles/Unicode.html</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">If you have a bilingual file, with Arabic and French, then I'd recommend</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">using the same encoding through out the file. The Unicode encoding is</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">ideal. UTF8 should be adequate, although UTF-16 will certainly be fine.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">(that is, make sure you save your files as utf16 *before* trying to</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">tokenise them).</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Andy</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>From: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:tree@basistech.com">tree@basistech.com</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Subject: </B><SPAN class="Apple-converted-tab"> </SPAN><B>Re: [Corpora-List] Arabic language under Linux</B></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Date: </B><SPAN class="Apple-converted-tab"> </SPAN>May 29, 2005 8:35:30 AM MDT</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>To: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:nouha.chaaben@laposte.net">nouha.chaaben@laposte.net</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Cc: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:corpora@uib.no">corpora@uib.no</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-converted-tab"> </SPAN><B>Reply-To: </B><SPAN class="Apple-converted-tab"> </SPAN> <A href="mailto:tree@basistech.com">tree@basistech.com</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">nouha.chaaben writes:</DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">I have a French documents tokenizer under Linux; I want to adapt it to Arabic documents.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Does anyone know how to use Arabic language and how to read bilingual file under Linux?</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><A href="http://www.arabeyes.org">http://www.arabeyes.org</A>/</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> -tree</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">-- Tom Emerson Basis Technology Corp.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Software Architect <A href="http://www.basistech.com">http://www.basistech.com</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">--------------------------------------------------------------------------</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">End of Arabic-L: 02 Jun 2005</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR><BR><BR class="khtml-block-placeholder"></DIV></BODY></HTML>