[Corpora-List] starting a machine translation project

Mike Maxwell maxwell at ldc.upenn.edu
Wed Sep 13 22:01:14 UTC 2006

zhang min wrote:
> Does anyone know where we can get English-to-Indonesian bilingual corpus? 

Joseph Cathcart asked that question on this list in 2001 (I don't think 
he got any responses, but you might ask him), and Jelita Asian was 
looking for generic corpora in 2004 (not necessarily parallel).

When Bill Poser and I were working at the LDC, we (actually, I think it 
was Bill) looked for parallel text in Indonesian.  Bill noted that there 
was lots of news, mostly monolingual, but that one might be able to 
build a bilingual English-Bahasa Indonesian corpus by extracting 
parallel articles from the following site:
   Tempo Interactive (Indonesian)  http://www.tempo.co.id/
   Tempo Interactive (English) http://www.tempointeractive.com/index,uk.asp

When I tried it just now, the first site redirected me to the second 
(http://www.tempointeractive.com/).  In any case, it is still possible 
to switch between English and Indonesian (as well as Japanese and 
Mandarin; see the menu on the left of their web page).  Whether you 
could find parallel articles depends on how they produce text in the two 
languages (and access to the English archives apparently now requires 
registration).  When we looked into this kind of thing for Hindi, we 
found to our surprise that most bilingual news sites had little or no 
parallel text.  Maybe it's cheaper there to employ separate reporters 
for the two languages than to employ translators.  That, or the market's 
very different for news in Hindi and in English.

Three years ago, when Bill looked, he was able to find at least one 
parallel article at the above site.  Since he doesn't speak Indonesian 
(at least I _think_ he doesn't, although it wouldn't surprise me to hear 
that he was learning it!), I presume it was fairly easy to find.  But 
when I tried an archive search just now, using proper nouns found in 
either an English or an Indonesian article, I couldn't come up with any 
parallel text.  Maybe your luck will be better...
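
If you manage to download a batch of articles in both languages, the proper-noun trick I tried by hand could be automated roughly as follows. This is only a sketch under my own assumptions, not something Bill or I actually ran: it treats any capitalized word that is not sentence-initial as a "proper noun", and the function names and the 0.5 threshold are made up for illustration. Names are usually spelled the same way in English and Indonesian, which is why the overlap can point to candidate pairs; a person would still have to check each pair.

```python
import re


def proper_nouns(text):
    """Return capitalized tokens that are not sentence-initial (a crude heuristic)."""
    nouns = set()
    for sentence in re.split(r"[.!?]\s+", text):
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        # Skip the first token: it is capitalized whether or not it is a name.
        for token in tokens[1:]:
            if token[0].isupper():
                nouns.add(token)
    return nouns


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(english, indonesian, threshold=0.5):
    """Score every English/Indonesian article pair by shared proper nouns.

    Both arguments are dicts mapping an article id to its text. The
    threshold is an arbitrary assumption and would need tuning.
    Returns (score, english_id, indonesian_id) tuples, best first.
    """
    indonesian_nouns = {k: proper_nouns(v) for k, v in indonesian.items()}
    pairs = []
    for en_id, en_text in english.items():
        en_nouns = proper_nouns(en_text)
        for id_id, id_nouns in indonesian_nouns.items():
            score = jaccard(en_nouns, id_nouns)
            if score >= threshold:
                pairs.append((score, en_id, id_id))
    return sorted(pairs, reverse=True)
```

Comparing every article against every other one is quadratic, so on a real archive you would want to restrict candidates first, e.g. to articles published within a day or two of each other.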
	Mike Maxwell
	maxwell at ldc.upenn.edu
