[Corpora-List] Constructing a parallel Arabic-English corpus that can be freely distributed without cost

Darren Cook darren at dcook.org
Mon Dec 16 00:46:42 UTC 2013


> I'm performing a small feasibility study to understand how expensive
>  it would be to build a parallel Arabic-English corpus. ... 1.
> Collect some Arabic news articles online, and then pay to have these
>  translated into English (either by a professional translation 
> service or via crowdsourcing). Sources could include Al Jazeera, 
> Associated Press, Arabic Wikinews etc.

A much more affordable variation on this is to find news web sites that
publish in both English and Arabic. The Arabic articles are often
translations from English, but they are professionally translated for
real readers (which partly allays the concern about artificiality that
Ana Frankenberg raised).

You still have the copyright issue that Alexander Yeh brought up. It is
my understanding that if you aligned the articles at the sentence level,
then released only a sorted list of those sentence pairs, you may be
okay. (This is based on the assumption that the value of an article lies
not in the words of each sentence, but in the way those sentences are
put together as paragraphs, the order of the paragraphs, and the
associated images and links.)

But apart from this being a legal grey area, releasing sentences out of
context makes the corpus less useful for many purposes.
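
As a rough illustration of the "sorted list of sentence pairs" idea,
here is a minimal Python sketch. It assumes the alignment has already
been done, so each article arrives as a list of (English, Arabic)
sentence pairs. The function name and the toy data are made up for
illustration, and the code says nothing about whether the result is
legally safe.

```python
from typing import Iterable, List, Tuple

# One aligned sentence pair: (English sentence, Arabic sentence).
SentencePair = Tuple[str, str]


def release_sentence_pairs(
    articles: Iterable[List[SentencePair]],
) -> List[SentencePair]:
    """Flatten per-article pairs into one de-duplicated, sorted list.

    Sorting across all articles discards sentence order and article
    boundaries, so only isolated sentence pairs remain.
    """
    unique = {
        (en.strip(), ar.strip())
        for article in articles
        for en, ar in article
    }
    # Drop pairs where either side is empty after stripping whitespace.
    unique = {pair for pair in unique if pair[0] and pair[1]}
    # Tuples sort by the English side first, then by the Arabic side.
    return sorted(unique)


if __name__ == "__main__":
    # Hypothetical toy input: two "articles", already sentence-aligned.
    toy_articles = [
        [("Thank you", "شكرا"), ("Hello", "مرحبا")],
        [("Hello", "مرحبا")],
    ]
    for en, ar in release_sentence_pairs(toy_articles):
        print(en, "\t", ar)
```

Sorting (rather than, say, keeping the crawl order) is what throws away
the paragraph structure, which is also exactly why the result is less
useful for anything that needs context.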

Anyway, here is a starter list of news sources (additions to this list
would be very welcome!):


BBC (http://www.bbc.co.uk/worldservice/languages/index.shtml)
  http://www.bbc.co.uk/news/
  http://www.bbc.co.uk/arabic/

CNN (English, Spanish, Arabic, Japanese, Turkish)
  http://edition.cnn.com/
  http://arabic.cnn.com/

Deutsche Welle (30 languages)
  http://www.dw.de/
  http://www.dw.de/%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9/s-9106
(Assuming the original is always German, this one is interesting because
both the English and the Arabic will be translations.)

Reuters (about 8 languages)
  http://www.reuters.com/
  http://ara.reuters.com/


HTH,
Darren



-- 
Darren Cook, Software Researcher/Developer

http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)
