[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Radu Ion radu at racai.ro
Wed Dec 11 09:34:16 UTC 2013


Hello,

We have used our parallel text mining tool LEXACC to extract parallel 
sentences from the English, Romanian, Spanish and Slovak Wikipedias. We 
have found that, depending on the sizes of the Wikipedias for a 
particular language pair, there are many parallel sentences (useful, 
e.g., for SMT) to be found. This is because, at least for the Wikipedias 
we investigated, many articles are based on the English version of the 
Wikipedia.

Now, I don't know how large the Arabic Wikipedia is, or how original, 
but checking a random article, e.g. "The Moon", and translating it to 
English with Google Translate, I could find many truly parallel 
sentences. Thus, I think it's worth experimenting with searching for 
parallel sentence pairs for the English-Arabic language pair. We will 
offer assistance with adapting LEXACC to support Arabic (it needs a 
seed English-Arabic dictionary, a stop word list and, if possible, a 
list of inflectional affixes for content words).
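
To illustrate why those three resources matter, here is a minimal sketch of 
a dictionary-overlap score for one candidate sentence pair. It is not 
LEXACC's actual algorithm; the function names, the suffix-only stemming and 
the toy English-Spanish entries are all assumptions made for illustration. 
A real Arabic adaptation would also need prefix handling and proper 
tokenization.

```python
from typing import Dict, Iterable, List, Set


def strip_suffix(token: str, suffixes: Iterable[str]) -> str:
    """Remove the longest matching suffix, keeping a stem of at least 3 characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def content_stems(sentence: str, stop_words: Set[str], suffixes: Iterable[str]) -> List[str]:
    """Lowercase, split on whitespace, drop stop words, strip inflectional suffixes."""
    tokens = sentence.lower().split()
    return [strip_suffix(t, suffixes) for t in tokens if t not in stop_words]


def translation_overlap(
    src: str,
    tgt: str,
    seed_dict: Dict[str, Set[str]],
    src_stop: Set[str],
    tgt_stop: Set[str],
    src_suffixes: Iterable[str],
    tgt_suffixes: Iterable[str],
) -> float:
    """Fraction of source content stems with a dictionary translation in the target."""
    src_stems = content_stems(src, src_stop, src_suffixes)
    tgt_stems = set(content_stems(tgt, tgt_stop, tgt_suffixes))
    if not src_stems:
        return 0.0
    hits = sum(1 for s in src_stems if seed_dict.get(s, set()) & tgt_stems)
    return hits / len(src_stems)


# Hypothetical toy resources (English -> Spanish), for illustration only.
SEED_DICT = {"moon": {"luna"}, "orbit": {"orbita"}, "earth": {"tierra"}}
EN_STOP = {"the", "is"}
ES_STOP = {"la", "el", "es", "una"}
EN_SUFFIXES = ["s"]
ES_SUFFIXES = ["s"]

parallel_score = translation_overlap(
    "The Moon orbits the Earth", "La Luna orbita la Tierra",
    SEED_DICT, EN_STOP, ES_STOP, EN_SUFFIXES, ES_SUFFIXES,
)
unrelated_score = translation_overlap(
    "The Moon orbits the Earth", "El Sol es una estrella",
    SEED_DICT, EN_STOP, ES_STOP, EN_SUFFIXES, ES_SUFFIXES,
)
print(parallel_score, unrelated_score)
```

In this sketch the stop word list keeps function words from diluting the 
score, and the affix list lets an inflected form such as "orbits" match the 
dictionary entry "orbit". Pairs scoring above some threshold would be kept 
as candidate parallel sentences.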

Best regards,
Radu Ion
Research Institute for AI, Romanian Academy

On 11-Dec-13 01:37 AM, Eric Atwell wrote:
> There is a range of "parallelism", from word-by-word literal
> translations (as in your Quranic Arabic Corpus), to less literal
> translations capturing the overall message, to "comparable corpora"
> containing similar texts but not translations. I think Arabic Wikipedia
> is not a direct or even a loose translation of English Wikipedia, as
> even articles on the same topic will be written independently by 
> Arabic and English crowd-sources.
> It depends on what you want to do with the parallel/comparable corpora.
>
> Eric Atwell, School of Computing, Leeds University
>
>
> On Tue, 10 Dec 2013, Tiberiu Boros wrote:
>
>> Hi Kais,
>>
>> I'm not aware whether this has been done before, but it is possible to
>> use Wikipedia as a source of comparable corpora and then use a tool
>> such as LEXACC
>> (http://metashare.elda.org/repository/browse/lexacc-lucene-based-parallel-phrase-extractor-from-comparable-corpora/facd55e0fb6711e2a8ad00237df3e35881478db1bebb4b4f93a7b21e2fc91ab5/)
>> to automatically extract parallel data.
>>
>> Here are some reference papers:
>> This one is about using LEXACC to build parallel corpora from Wikipedia:
>> http://www.researchgate.net/publication/236319575_Parallel-Wiki_A_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia/file/3deec517953143eb34.pdf 
>>
>> And this one is about the tool itself:
>> http://mt-archive.info/EAMT-2012-Stefanescu.pdf
>>
>>
>> On 10.12.2013 12:02, Kais Dukes wrote:
>>> I'm performing a small feasibility study to understand how expensive 
>>> it would be to build a parallel Arabic-English corpus. I'm aware 
>>> that such resources already exist (e.g. LDC), but these don't suit 
>>> my purpose. I want to develop something free that can be easily 
>>> downloaded and used without cost by the wider research community, 
>>> e.g. under a Creative Commons or other open source license. With 
>>> regard to Arabic genre/dialect, I'm only interested in MSA in its 
>>> standard register (so not generally social media, blogs, etc.). 
>>> Ideally, I'm looking for well-written news articles in Arabic by a 
>>> prominent news agency, or something of comparable quality.
>>> Some options I could pursue, from most expensive to least expensive:
>>> 1. Collect some Arabic news articles online, and then pay to have 
>>> these translated into English (either by a professional translation 
>>> service or via crowdsourcing). Sources could include Al Jazeera, 
>>> Associated Press, Arabic Wikinews etc.
>>>
>>> 2. Use the United Nations as a source of parallel translated texts. 
>>> My only concern with this option is that these texts sound quite 
>>> specific in terms of genre compared to more general news articles, 
>>> so might not be an ideal solution for what I want to achieve.
>>>
>>> 3. Use some other high-quality source of Arabic-English (free and 
>>> easily available) parallel text that I've not thought of for Modern 
>>> Standard Arabic.
>>> My aim is to work out whether or not option 1 is the only way to 
>>> develop and publish (to the research community) a free high-quality 
>>> Arabic-English parallel corpus, or if there is something I'm 
>>> missing. I would also like to have the corpus sentence-aligned, 
>>> although this is something I could do myself (semi-automatically 
>>> with manual correction) if the two corpora form a good structural 
>>> translation pair.
>>> In a nutshell my question is ... is my best option to pay for 
>>> high-quality Arabic news articles to get translated into English, 
>>> then distribute the resulting corpus as a free resource, or is there 
>>> a better (high-quality) starting point for MSA news?
>>>
>>> Any advice is most welcome. For option 1, it would also be great to 
>>> get a general feel for high-quality translation costs in terms of 
>>> words/dollars.
>>> Also, if anyone is generally interested in helping with this effort, 
>>> please do get in touch.
>>> Kind Regards,
>>> -- Kais Dukes
>>> School of Computing
>>> University of Leeds
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>
>

-- 
Radu Ion, PhD
Research Institute for Artificial Intelligence
Romanian Academy
Web:   http://www.racai.ro/~radu/
Phone: 0040213188103
Fax:   0040213188142




