[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Kais Dukes sckd at leeds.ac.uk
Wed Dec 11 10:29:57 UTC 2013


Hi All,

I've received several good responses to my original question, so I thought I should send out a brief summary.

A few people have asked (off the list) why I'm trying to produce such a resource. There are two reasons. Firstly, from a computational perspective, I'm interested in a resource to train and develop new NLP tools for Arabic. I have a specific morphosyntactic tagset in mind which is not currently used for MSA (based on work I've done for Classical Arabic - http://corpus.quran.com). Secondly, although parallel corpora for MSA do exist, these are not free and open. I'm quite keen on free and open resources, not just for myself but to encourage wider research (not all new researchers can afford to pay for high-quality data, so free resources can help encourage participation). So, collecting high-quality sentence pairs is a good starting point for me. Longer term, I'm looking to do annotation on top of this, for new computational work.

Following offline discussion with list members, it looks like a good choice for MSA will be to use Arabic Wikipedia as a starting point, due to its licensing model. However, from a computational perspective, I'm interested in sentence-aligned data. Based on Radu's e-mail below, automatically identifying parallel sentences in the Arabic/English Wikipedia editions looks like a potential way to produce data automatically (which could then be improved by manual correction). Unfortunately, it looks like LEXACC needs an Arabic-English dictionary to work, and given Arabic's complex morphology, I would imagine a morphological analyser would also be required in addition to a dictionary.
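
To illustrate why the morphology matters: Arabic clitics mean that surface tokens often won't match dictionary entries directly, so even naive matching needs some affix stripping. A minimal Python sketch of light clitic stripping (the affix lists below are illustrative only, nowhere near a real analyser):

    # Minimal sketch: light clitic stripping for Arabic dictionary lookup.
    # The affix lists are illustrative, not exhaustive; a real morphological
    # analyser for MSA handles far more phenomena than this.
    PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
    SUFFIXES = ["هما", "كما", "ها", "هم", "هن", "نا", "كم", "ات", "ون", "ين", "ة", "ه", "ي"]

    def light_stem(token):
        """Strip at most one leading and one trailing clitic, longest first."""
        for p in PREFIXES:
            if token.startswith(p) and len(token) - len(p) >= 2:
                token = token[len(p):]
                break
        for s in SUFFIXES:
            if token.endswith(s) and len(token) - len(s) >= 2:
                token = token[:-len(s)]
                break
        return token

    # "والكتاب" (and-the-book) reduces to "كتاب" (book), the form a seed
    # dictionary is most likely to list.
    print(light_stem("والكتاب"))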

However, this has inspired me to come up with another related idea, which may be simpler and still achieve the same thing. Do you think the following plan could work? I see that the Google Translate API is cheap to use, and runs in the cloud (https://developers.google.com/translate/v2/pricing). So, how about the following process:

Step 1. Automatically run the Google Translate API on an Arabic article to produce a list of candidate English sentences.
Step 2. Compare this list to the corresponding English article: do some kind of matching between the two lists of English sentences to identify well-aligned sentence pairs (does anyone think this could be done automatically? See the sketch after this list).
Step 3. Once we have a list of aligned sentences, pay for manual correction to improve the translations (but at least we have a baseline to work with).
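
To make this concrete, here is a minimal Python sketch of Steps 1 and 2. It assumes the Translate v2 REST endpoint from the pricing page above; the difflib string similarity and the 0.8 cutoff are placeholders I would expect to tune, not tested settings:

    # Sketch of Steps 1-2: translate Arabic sentences, then match the MT
    # output against the English article's sentences. API_KEY and the
    # similarity threshold are placeholders.
    import difflib
    import requests

    API_URL = "https://www.googleapis.com/language/translate/v2"
    API_KEY = "YOUR_API_KEY"

    def translate_ar_to_en(sentence):
        """Step 1: machine-translate one Arabic sentence into English."""
        resp = requests.get(API_URL, params={"key": API_KEY, "q": sentence,
                                             "source": "ar", "target": "en"})
        resp.raise_for_status()
        return resp.json()["data"]["translations"][0]["translatedText"]

    def similarity(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def align(arabic_sentences, english_sentences, threshold=0.8):
        """Step 2: pair each Arabic sentence with its closest English
        sentence, keeping only pairs that clear the similarity threshold."""
        pairs = []
        for ar in arabic_sentences:
            mt = translate_ar_to_en(ar)
            best = max(english_sentences, key=lambda en: similarity(mt, en))
            if similarity(mt, best) >= threshold:
                pairs.append((ar, best))
        return pairs  # candidates for Step 3: paid manual correction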

I'd be interested to hear back from the list on the idea of using the Google Translate API, both to identify sentence pairs and to provide a baseline for paid translation. Has anyone done this before?

Also, a question for Radu – can we combine LEXACC with the Google Translate API instead of using an Arabic dictionary and morphological analyser, i.e. leverage what Google has already built? Getting hold of a dictionary/analyser will not be easy otherwise.

Looking forward to hearing people's thoughts.

Kind Regards,

-- Kais Dukes
School of Computing
University of Leeds

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Radu Ion [radu at racai.ro]
Sent: 11 December 2013 09:34
To: Eric Atwell; Tiberiu Boros
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Hello,

We have used our parallel text mining tool LEXACC to extract parallel
sentences from the English, Romanian, Spanish and Slovak Wikipedias. We have
found that, depending on the size of a particular language pair's
Wikipedias, there are lots of parallel sentences to be found (useful,
e.g., for SMT). This is because, at least for the Wikipedias we
investigated, many articles are based on the English version of Wikipedia.

Now, I don't know how large the Arabic Wikipedia is, or how original,
but after checking a random article, e.g. "The Moon", and translating it
into English with Google Translate, I could find many truly parallel
sentences. Thus, I think it's worth experimenting with searching for
parallel sentence pairs for the English-Arabic language pair. We will
offer assistance in adapting LEXACC to support Arabic (it needs a
seed English-Arabic dictionary, a stop word list and, if possible, a
list of inflectional affixes for content words).
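
As a possible shortcut for the seed dictionary, English-Arabic title pairs could be harvested from Wikipedia's interlanguage links. A rough Python sketch using the standard MediaWiki API (treating linked titles as translation equivalents is only an approximation, and single-word titles are the safest entries):

    # Rough sketch: build a seed English-Arabic lexicon from Wikipedia
    # interlanguage links. Requires only the public MediaWiki API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def arabic_title(english_title):
        """Return the linked Arabic article title, or None if absent."""
        resp = requests.get(API, params={"action": "query",
                                         "titles": english_title,
                                         "prop": "langlinks",
                                         "lllang": "ar", "format": "json"})
        resp.raise_for_status()
        for page in resp.json()["query"]["pages"].values():
            for link in page.get("langlinks", []):
                return link["*"]
        return None

    # e.g. seed entries from a list of frequent English content words
    for en in ["Moon", "Water", "Book"]:
        ar = arabic_title(en)
        if ar:
            print("%s\t%s" % (en, ar))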

Best regards,
Radu Ion
Research Institute for AI, Romanian Academy

On 11-Dec-13 01:37 AM, Eric Atwell wrote:
> There is a range of "parallelism", from word-by-word literal
> translations (as in your Quranic Arabic Corpus), to less literal
> translations capturing the overall message, to "comparable corpora"
> containing similar texts but not translations. I think the Arabic Wikipedia
> is not a direct or even a loose translation of the English Wikipedia, as
> even articles on the same topic will be written independently by the
> Arabic and English contributor communities.
> It depends what you want to do with the parallel/comparable corpora
>
> Eric Atwell, School of Computing, Leeds University
>
>
> On Tue, 10 Dec 2013, Tiberiu Boros wrote:
>
>> Hi Kais,
>>
>> I don't know whether this has been done before, but it is possible to use
>> Wikipedia as a source of comparable corpora and then use a tool such as
>> LEXACC
>> (http://metashare.elda.org/repository/browse/lexacc-lucene-based-parallel-phrase-extractor-from-comparable-corpora/facd55e0fb6711e2a8ad00237df3e35881478db1bebb4b4f93a7b21e2fc91ab5/)
>> to automatically extract parallel data.
>>
>> Here are some reference papers:
>> This is about using LEXACC to build parallel corpora from wikipedia
>> http://www.researchgate.net/publication/236319575_Parallel-Wiki_A_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia/file/3deec517953143eb34.pdf
>>
>> And this one is about the tool itself:
>> http://mt-archive.info/EAMT-2012-Stefanescu.pdf
>>
>>
>> On 10.12.2013 12:02, Kais Dukes wrote:
>>> I'm performing a small feasibility study to understand how expensive
>>> it would be to build a parallel Arabic-English corpus. I'm aware
>>> that such resources already exist (e.g. LDC), but these don't suit
>>> my purpose. I want to develop something free that can be easily
>>> downloaded and used without cost by the wider research community.
>>> e.g. under a Creative Commons or other open-source license. With
>>> regard to Arabic genre/dialect, I'm only interested in MSA in its
>>> standard register (so not generally social media, blogs, etc.).
>>> Ideally, I'm looking for well-written news articles in Arabic by a
>>> prominent news agency, or something of comparable quality.
>>> Some options I could pursue, from most expensive to least expensive:
>>> 1. Collect some Arabic news articles online, and then pay to have
>>> these translated into English (either by a professional translation
>>> service or via crowdsourcing). Sources could include Al Jazeera,
>>> Associated Press, Arabic Wikinews etc.
>>>
>>> 2. Use the United Nations as a source of parallel translated texts.
>>> My only concern with this option is that these texts seem quite
>>> specific in terms of genre compared to more general news articles,
>>> so they might not be an ideal solution for what I want to achieve.
>>>
>>> 3. Use some other high-quality source of Arabic-English (free and
>>> easily available) parallel text that I've not thought of for Modern
>>> Standard Arabic.
>>> My aim is to work out whether option 1 is the only way to
>>> develop and publish (to the research community) a free, high-quality
>>> Arabic-English parallel corpus, or whether there is something I'm
>>> missing. I would also like to have the corpus sentence-aligned,
>>> although this is something I could do myself (semi-automatically
>>> with manual correction) if the two corpora form a good structural
>>> translation pair.
>>> In a nutshell, my question is: is my best option to pay to have
>>> high-quality Arabic news articles translated into English and then
>>> distribute the resulting corpus as a free resource, or is there
>>> a better (high-quality) starting point for MSA news?
>>>
>>> Any advice is most welcome. For option 1, it would also be great to
>>> get a general feel for high-quality translation costs in terms of
>>> words/dollars.
>>> Also, if anyone is generally interested in helping with this effort,
>>> please do get in touch.
>>> Kind Regards,
>>> -- Kais Dukes
>>> School of Computing
>>> University of Leeds

--
Radu Ion, PhD
Research Institute for Artificial Intelligence
Romanian Academy
Web:   http://www.racai.ro/~radu/
Phone: 0040213188103
Fax:   0040213188142

