[Corpora-List] starting a machine translation project
Mike Maxwell
maxwell at ldc.upenn.edu
Wed Sep 13 22:01:14 UTC 2006
zhang min wrote:
> Does anyone know where we can get English-to-Indonesian bilingual corpus?
Joseph Cathcart asked that question on this list in 2001 (I don't think
he got any responses, but you might ask him), and Jelita Asian was
looking for generic corpora in 2004 (not necessarily parallel).
When Bill Poser and I were working at the LDC, we (actually, I think it
was Bill) looked for parallel text in Indonesian. Bill noted that there
was lots of news, mostly monolingual, but that one might be able to
build a bilingual English-Bahasa Indonesian corpus by extracting
parallel articles from the following site:
Tempo Interactive (Indonesian) http://www.tempo.co.id/
Tempo Interactive (English) http://www.tempointeractive.com/index,uk.asp
When I tried it just now, the first site redirected to the second
(http://www.tempointeractive.com/). In any case, it is still possible
to switch between English and Indonesian (as well as Japanese and
Mandarin; see the menu on the left of their web page). Whether you
could find parallel articles depends on how they produce text in the two
languages (and access to the English archives apparently now requires
registration). When we looked into this kind of thing for Hindi, we
found to our surprise that most bilingual news sites had little or no
parallel text. Maybe it's cheaper there to employ separate reporters
for the two languages than to employ translators. That, or the market's
very different for news in Hindi and in English.
Three years ago, when Bill looked, he was able to find at least one
parallel article at the above site. Since he doesn't speak Indonesian
(at least I _think_ he doesn't, although it wouldn't surprise me to hear
that he was learning it!), I presume it was fairly easy to find. But
when I tried an archive search just now, using proper nouns found in
either an English or an Indonesian article, I couldn't come up with any
parallel text. Maybe your luck will be better...
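For anyone who wants to automate the proper-noun search, here is a minimal
sketch, assuming the articles in both languages have already been downloaded
as plain text. The function names, the capitalization heuristic, and the 0.5
threshold are all illustrative assumptions, not something used in the search
described above.

```python
import re

# A capitalized word of 3+ ASCII letters: a crude stand-in for a proper noun.
NAME = re.compile(r"[A-Z][A-Za-z]{2,}")

def proper_nouns(text):
    """Collect capitalized words that are not sentence-initial."""
    names = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = sentence.split()
        for word in words[1:]:  # skip the first word of each sentence
            token = word.strip(".,;:!?()\"'")
            if NAME.fullmatch(token):
                names.add(token)
    return names

def overlap(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_pairs(english, indonesian, threshold=0.5):
    """Score every English/Indonesian article pair by shared proper nouns.

    Both arguments map an article id to its text. Returns a list of
    (score, english_id, indonesian_id), best first, keeping only pairs
    at or above the threshold.
    """
    en_names = {k: proper_nouns(v) for k, v in english.items()}
    id_names = {k: proper_nouns(v) for k, v in indonesian.items()}
    pairs = []
    for en_id, en_set in en_names.items():
        for id_id, id_set in id_names.items():
            score = overlap(en_set, id_set)
            if score >= threshold:
                pairs.append((score, en_id, id_id))
    return sorted(pairs, reverse=True)
```

Pairs that score well are only candidates and would still need to be checked
by someone who reads both languages. Capitalized words that differ between the
languages (day and month names, for instance) will pull the score down, so the
threshold probably needs tuning on real articles.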
--
Mike Maxwell
maxwell at ldc.upenn.edu