[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost
Radu Ion
radu at racai.ro
Wed Dec 11 09:34:16 UTC 2013
Hello,
We have used our parallel text mining tool LEXACC to extract parallel
sentences from English, Romanian, Spanish and Slovak Wikipedias. We have
found that, depending on the size of the Wikipedias for a particular
language pair, there are many parallel sentences (useful, e.g., for SMT).
This is because, at least for the Wikipedias we investigated, there are
a lot of articles that are based on the English version of the Wikipedia.
Now, I don't know how large the Arabic Wikipedia is, or how original,
but checking a random article, e.g. "The Moon", and translating it to
English with Google Translate, I could find many truly parallel
sentences. Thus, I think it is worth experimenting with a search for
parallel sentence pairs for the English-Arabic language pair. We will
offer assistance with adapting LEXACC to support Arabic (it needs a
seed English-Arabic dictionary, a stop word list and, if possible, a
list of inflectional affixes for content words).
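To give a rough idea of how those three resources could fit together when
scoring a candidate sentence pair, here is a minimal sketch. It is not
LEXACC's actual algorithm: the function names, the scoring formula and the
tiny English-Spanish data are all assumptions made up for illustration.

```python
# Toy dictionary-based scoring of a candidate sentence pair.
# NOT LEXACC's method: names, formula and data are illustrative assumptions.

def strip_affixes(token, suffixes):
    """Remove the longest matching suffix, keeping a stem of at least 3 characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def content_words(sentence, stop_words, suffixes):
    """Lowercase, split on whitespace, drop stop words, strip inflectional suffixes."""
    tokens = sentence.lower().split()
    return [strip_affixes(t, suffixes) for t in tokens if t not in stop_words]


def translation_score(src, tgt, dictionary, src_stop, tgt_stop,
                      src_suffixes, tgt_suffixes):
    """Fraction of source content words with a dictionary translation in the target."""
    src_words = content_words(src, src_stop, src_suffixes)
    tgt_words = set(content_words(tgt, tgt_stop, tgt_suffixes))
    if not src_words:
        return 0.0
    matched = sum(1 for w in src_words if dictionary.get(w, set()) & tgt_words)
    return matched / len(src_words)


# Hypothetical seed resources (English-Spanish, chosen only for readability).
SEED_DICT = {"moon": {"luna"}, "orbit": {"orbita"}, "earth": {"tierra"}}
EN_STOP = {"the", "is"}
ES_STOP = {"la", "es"}
EN_SUFFIXES = ["s"]
ES_SUFFIXES = []

good = translation_score("the moon orbits the earth", "la luna orbita la tierra",
                         SEED_DICT, EN_STOP, ES_STOP, EN_SUFFIXES, ES_SUFFIXES)
bad = translation_score("the moon is bright", "la tierra es grande",
                        SEED_DICT, EN_STOP, ES_STOP, EN_SUFFIXES, ES_SUFFIXES)
print(good, bad)
```

Pairs scoring above some threshold would be kept as parallel candidates; for
Arabic, the dictionary, stop word list and affix list would be replaced by
English-Arabic resources, and prefixes would probably need handling as well.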
Best regards,
Radu Ion
Research Institute for AI, Romanian Academy
On 11-Dec-13 01:37 AM, Eric Atwell wrote:
> There is a range of "parallelism", from word-by-word literal
> translations (as in your Quranic Arabic Corpus), to less literal
> translations capturing the overall message, to "comparable corpora"
> containing similar texts but not translations. I think Arabic Wikipedia
> is not a direct or even a loose translation of English Wikipedia, as
> even articles on the same topic will be written independently by
> Arabic and English crowd-sources.
> It depends on what you want to do with the parallel/comparable corpora.
>
> Eric Atwell, School of Computing, Leeds University
>
>
> On Tue, 10 Dec 2013, Tiberiu Boros wrote:
>
>> Hi Kais,
>>
>> I'm not aware if this has been done before, but it is possible to use
>> Wikipedia as a source of comparable corpora and then use a tool such as
>> LEXACC
>> (http://metashare.elda.org/repository/browse/lexacc-lucene-based-parallel-phrase-extractor-from-comparable-corpora/facd55e0fb6711e2a8ad00237df3e35881478db1bebb4b4f93a7b21e2fc91ab5/)
>> to automatically extract parallel data.
>>
>> Here are some reference papers:
>> This one is about using LEXACC to build parallel corpora from Wikipedia:
>> http://www.researchgate.net/publication/236319575_Parallel-Wiki_A_Collection_of_Parallel_Sentences_Extracted_from_Wikipedia/file/3deec517953143eb34.pdf
>>
>> And this one is about the tool itself:
>> http://mt-archive.info/EAMT-2012-Stefanescu.pdf
>>
>>
>> On 10.12.2013 12:02, Kais Dukes wrote:
>>> I'm performing a small feasibility study to understand how expensive
>>> it would be to build a parallel Arabic-English corpus. I'm aware
>>> that such resources already exist (e.g. LDC), but these don't suit
>>> my purpose. I want to develop something free that can be easily
>>> downloaded and used without cost by the wider research community,
>>> e.g. under a Creative Commons or other open source license. With
>>> regards to Arabic genre/dialect, I'm only interested in MSA in its
>>> standard register (so not generally social media, blogs, etc.).
>>> Ideally, I'm looking for well-written news articles in Arabic by a
>>> prominent news agency, or something of comparable quality.
>>> Some options I could pursue, from most expensive to least expensive:
>>> 1. Collect some Arabic news articles online, and then pay to have
>>> these translated into English (either by a professional translation
>>> service or via crowdsourcing). Sources could include Al Jazeera,
>>> Associated Press, Arabic Wikinews etc.
>>>
>>> 2. Use the United Nations as a source of parallel translated texts.
>>> My only concern with this option is that these texts sound quite
>>> specific in terms of genre compared to more general news articles,
>>> so might not be an ideal solution for what I want to achieve.
>>>
>>> 3. Use some other high-quality source of Arabic-English (free and
>>> easily available) parallel text that I've not thought of for Modern
>>> Standard Arabic.
>>> My aim is to work out whether or not option 1 is the only way to
>>> develop and publish (to the research community) a free high-quality
>>> Arabic-English parallel corpus, or if there is something I'm
>>> missing. I would also like to have the corpus sentence-aligned,
>>> although this is something I could do myself (semi-automatically
>>> with manual correction) if the two corpora form a good structural
>>> translation pair.
>>> In a nutshell, my question is: is my best option to pay for
>>> high-quality Arabic news articles to be translated into English,
>>> then distribute the resulting corpus as a free resource, or is there
>>> a better (high-quality) starting point for MSA news?
>>>
>>> Any advice is most welcome. For option 1, it would also be great to
>>> get a general feel for high-quality translation costs in terms of
>>> words/dollars.
>>> Also, if anyone is generally interested in helping with this effort,
>>> please do get in touch.
>>> Kind Regards,
>>> -- Kais Dukes
>>> School of Computing
>>> University of Leeds
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
--
Radu Ion, PhD
Research Institute for Artificial Intelligence
Romanian Academy
Web: http://www.racai.ro/~radu/
Phone: 0040213188103
Fax: 0040213188142