[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Tue Dec 10 23:37:11 UTC 2013

There is a range of "parallelism", from word-by-word literal
translations (as in your Quranic Arabic Corpus), to less literal
translations capturing the overall message, to "comparable corpora"
containing similar texts but not translations. I think Arabic Wikipedia
is not a direct or even a loose translation of English Wikipedia, as
even articles on the same topic will be written independently by Arabic and
English crowd-sourced contributors.
It depends on what you want to do with the parallel/comparable corpora.

Eric Atwell, School of Computing, Leeds University

On Tue, 10 Dec 2013, Tiberiu Boros wrote:

> Hi Kais,
>
> I'm not aware if this has been done before, but it is possible to use
> Wikipedia as a source of comparable corpora and then use a tool such as
> LEXACC
> (http://metashare.elda.org/repository/browse/lexacc-lucene-based-parallel-phrase-extractor-from-comparable-corpora/facd55e0fb6711e2a8ad00237df3e35881478db1bebb4b4f93a7b21e2fc91ab5/)
> to automatically extract parallel data.
>
> Here are some reference papers:
> This one is about using LEXACC to build parallel corpora from Wikipedia:
> http://www.researchgate.net/publication/236319575_Parallel-Wiki_A_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia/file/3deec517953143eb34.pdf
> And this one is about the tool itself:
> http://mt-archive.info/EAMT-2012-Stefanescu.pdf
>
>
> On 10.12.2013 12:02, Kais Dukes wrote:
>> I'm performing a small feasibility study to understand how expensive it would be to build a parallel Arabic-English corpus. I'm aware that such resources already exist (e.g. LDC), but these don't suit my purpose. I want to develop something free that the wider research community can easily download and use without cost, e.g. under a Creative Commons or other open-source license. With regard to Arabic genre/dialect, I'm only interested in MSA in its standard register (so not generally social media, blogs, etc.). Ideally, I'm looking for well-written news articles in Arabic by a prominent news agency, or something of comparable quality.
>> Some options I could pursue, from most expensive to least expensive:
>> 1. Collect some Arabic news articles online, and then pay to have these translated into English (either by a professional translation service or via crowdsourcing). Sources could include Al Jazeera, Associated Press, Arabic Wikinews etc.
>>
>> 2. Use the United Nations as a source of parallel translated texts. My only concern with this option is that these texts sound quite specific in terms of genre compared to more general news articles, so might not be an ideal solution for what I want to achieve.
>>
>> 3. Use some other high-quality source of Arabic-English (free and easily available) parallel text that I've not thought of for Modern Standard Arabic.
>> My aim is to work out whether or not option 1 is the only way to develop and publish (to the research community) a free high-quality Arabic-English parallel corpus, or if there is something I'm missing. I would also like to have the corpus sentence-aligned, although this is something I could do myself (semi-automatically with manual correction) if the two corpora form a good structural translation pair.
>> In a nutshell, my question is: is my best option to pay to have high-quality Arabic news articles translated into English, then distribute the resulting corpus as a free resource, or is there a better (high-quality) starting point for MSA news?
>>
>> Any advice is most welcome. For option 1, it would also be great to get a general feel for high-quality translation costs in terms of words/dollars.
>> Also, if anyone is generally interested in helping with this effort, please do get in touch.
>> Kind Regards,
>> -- Kais Dukes
>> School of Computing
>> University of Leeds
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>
>

-- 
Eric Atwell, Associate Professor, Language research group,
  I-AIBS Institute for Artificial Intelligence and Biological Systems
  School of Computing, Faculty of Engineering, UNIVERSITY OF LEEDS
  Leeds LS2 9JT, England.        TEL: 0113-3435430  FAX: 0113-3435468
  WWW: http://www.comp.leeds.ac.uk/eric
       http://www.comp.leeds.ac.uk/nlp
       http://www.comp.leeds.ac.uk/arabic
