[Corpora-List] Looking for Arabic corpus

Kais Dukes sckd at leeds.ac.uk
Thu Feb 18 08:11:53 UTC 2010


Hello Wafya,

I think that Eric raises a good point here. Do you mean a character-based transliteration or instead a phonetic transcription? These terms often mean different things to Arabic linguists than to the general public. People generally use transliteration to refer to a phonetic conversion, or possibly any type of conversion between two character sets. For example, English WordNet gives the definition: "Transliteration: a transcription from one alphabet to another"

This can cause some confusion :-)  However, when discussing Arabic from a technical linguistic perspective (especially computational Arabic) these two terms have a more precise formal meaning:
http://en.wikipedia.org/wiki/Transliteration#Difference_from_transcription

In summary, Arabic transliteration (e.g. Buckwalter transliteration) is usually a 1-1 mapping, which is typically lossless and reversible, and is based on mapping character codes. Transcription, on the other hand, tends to be phonetic: it is based on pronunciation, is not usually reversible, and uses phonemes instead of characters.

For example, the Arabic word for "Sun" can be converted into Roman/ASCII characters in at least two different ways. Using Buckwalter transliteration gives:

(1) … $ms (without diacritics)

(2) … or $amos (with diacritics)

However, this is quite different from a phonetic transcription. For a pronunciation-based transcription, the following is more likely:

(3) … shams

So, if you are after a Buckwalter-style transliteration (1) or (2), this is quite straightforward. As indicated by others, you can simply take any existing Arabic corpora and apply a straightforward character-based mapping to it based on the specific transliteration scheme, to map from Unicode to your transliteration character set.
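As a rough sketch of the character-based mapping described above, the following Python builds the standard Buckwalter table from Unicode code points and applies it in both directions. The function names (to_buckwalter, from_buckwalter) are made up for illustration and are not taken from any existing tool; characters outside the table are simply passed through unchanged.

```python
# Sketch of a Buckwalter-style 1-1 character mapping (illustrative names only).

def _build_table():
    """Map Arabic Unicode characters to their Buckwalter ASCII symbols."""
    table = {}
    ranges = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza forms .. ghain
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun (incl. diacritics)
        (0x0670, "`{"),                          # dagger alif, alif wasla
    )
    for start, symbols in ranges:
        for offset, symbol in enumerate(symbols):
            table[chr(start + offset)] = symbol
    return table

ARABIC_TO_BW = _build_table()
# The mapping is 1-1, so it can be inverted without loss.
BW_TO_ARABIC = {bw: ar for ar, bw in ARABIC_TO_BW.items()}

def to_buckwalter(text):
    """Transliterate Arabic script to Buckwalter; unknown characters are kept."""
    return "".join(ARABIC_TO_BW.get(ch, ch) for ch in text)

def from_buckwalter(text):
    """Reverse the transliteration; unknown characters are kept."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)

# "Sun" without and with diacritics, written as escapes to avoid encoding issues.
sun_plain = "\u0634\u0645\u0633"
sun_vowelled = "\u0634\u064e\u0645\u0652\u0633"

print(to_buckwalter(sun_plain))     # $ms
print(to_buckwalter(sun_vowelled))  # $amos
```

Because the table is 1-1, from_buckwalter(to_buckwalter(s)) gives back the original string for any text made up of characters in the table, which is what makes this kind of transliteration reversible.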

However, if by transliteration you mean what is usually technically referred to as a phonetic transcription, then this is more interesting :-) For the Quranic Arabic Corpus, the original Arabic text was used to automatically derive both a Buckwalter transliteration AND a phonetic transcription. We have a rule-based engine that can convert Arabic words to their corresponding phonetic pronunciation using English characters. This is then displayed online, see: http://corpus.quran.com/wordbyword.jsp. Users have found this automatic phonetic transcription of the Quran into English characters to be quite accurate. However, this system only works on corpora which are fully diacriticized, since the diacritics are what allow words to be divided into phonemes. So, if you have a corpus with full Arabic diacritics, then in principle you can reuse this phonetic transcription engine to produce an Arabic corpus which also has a transcription into English characters.
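To give a feel for why full diacritics matter, here is a deliberately tiny rule-based sketch in Python. It is NOT the engine used for the Quranic Arabic Corpus, and the rules and names (SPECIAL, transcribe) are assumptions chosen purely for illustration: it takes Buckwalter input, spells out a few consonants with English letters, drops sukun, and doubles a consonant marked with shadda. A real system needs many more rules (long vowels, the definite article, assimilation, and so on).

```python
# Toy pronunciation-style transcription from diacritized Buckwalter text.
# Illustrative assumptions only; not the Quranic Arabic Corpus rule set.

# Buckwalter symbols whose usual English spelling differs from the symbol itself.
SPECIAL = {
    "$": "sh",  # sheen
    "v": "th",  # thaa
    "*": "dh",  # dhaal
    "x": "kh",  # khaa
    "g": "gh",  # ghain
}

def transcribe(buckwalter_word):
    """Return a rough English-letter transcription of one Buckwalter word."""
    out = []
    for ch in buckwalter_word:
        if ch == "o":
            # Sukun: the consonant has no following vowel, so emit nothing.
            continue
        if ch == "~":
            # Shadda: repeat the previous consonant.
            if out:
                out.append(out[-1])
            continue
        # Short vowels (a, u, i) and plain consonants are copied as they are.
        out.append(SPECIAL.get(ch, ch))
    return "".join(out)

print(transcribe("$amos"))  # shams
print(transcribe("$ms"))    # shms
```

The second call shows the limitation mentioned above: without diacritics the vowels are simply not in the input, so no set of character rules can recover the pronunciation.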

Note that automatically producing a phonetic transcription from Arabic text without diacritics is non-trivial in the general case, since you will first need to perform a disambiguation task to restore the missing diacritics. I hope that helps. Please forgive me if you did actually mean a Buckwalter-style character-based transliteration.

Kind Regards,

-- Kais Dukes

Language Research Group
School of Computing
University of Leeds
http://corpus.quran.com - The Quranic Arabic Corpus
comp-quran at comp.leeds.ac.uk - Computational Quranic Arabic discussion list

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Eric Atwell [csc6ea at leeds.ac.uk]
Sent: 17 February 2010 19:02
To: wafya hamouda
Cc: corpora at uib.no
Subject: Re: [Corpora-List] looking for Arabic corpus

Dear Wafya,

perhaps you should state more precisely what you need when you say it
must be "transliterated" - if you just need each arabic character to be
represented by a Roman / ASCII character, this can be achieved readily by
a simple mapping program.  BUT I think that maybe what you really want
is a corpus with all vowels explicitly included, which is harder to find
- am I right?

eric atwell
Leeds University

On Wed, 17 Feb 2010, wafya hamouda wrote:

> Dear All,
>  Kindly can you help me in finding an Arabic corpus that is
> transliterated . At the moment I have got the Leeds corpus for Quran but
> I need to have another one so I can compare results . I have looked on
> the LDC but they do not give it out for free  for research students .
> Thanks
> Best
> Wafya Ibrahim
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


