[Corpora-List] Enquiry about Indonesian corpus
Mike Maxwell
maxwell at ldc.upenn.edu
Tue Mar 16 19:23:29 UTC 2004
Jelita Asian wrote:
> ...A person from
> Linguistic Data Consortium recommend me to contact you to get hold of
> some Indonesian corpus. Do you have any Indonesian corpus with you?
> If not, do you know who we can contact to get hold of it?
I'm not sure who you contacted here at the LDC. About a year ago, we looked
into what was available on-line for Bahasa Indonesia, without actually
creating a corpus. It turns out there is a huge amount of news text, which
you can easily download and turn into a news corpus, if that is the type of
corpus you want. Judging by what we've seen in other languages, you can
doubtless find other genres on-line too. (We were specifically searching
for news.)
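As a rough illustration of the "download and turn into a news corpus" step,
here is a minimal Python sketch. It assumes the article pages have already
been saved as HTML strings; the names html_to_text and build_corpus are
hypothetical, and a real news site would likely need site-specific handling
of menus, ads and character encodings.

```python
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup from one page and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def build_corpus(pages):
    """Turn an iterable of HTML strings into a list of unique plain-text documents."""
    seen = set()
    corpus = []
    for html in pages:
        text = html_to_text(html)
        # Drop empty pages and exact duplicates (e.g. the same story saved twice).
        if text and text not in seen:
            seen.add(text)
            corpus.append(text)
    return corpus
```

Exact-match deduplication is the simplest possible choice here; news sites
often republish near-identical stories, which this sketch would not catch.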
The Tempo Interactive might be a source of parallel bilingual text.
Caution: when we looked at this, it was not apparent whether their English
and Indonesian articles were actually parallel, which is why I say "might".
There are also several on-line dictionaries and a couple of morphological
parsers, although from what little I know of Indonesian, there shouldn't be
too much morphology to worry about.
In summary, if you don't find that anyone else has compiled a corpus, you
could put one together yourselves without too much effort. You might even
find a "market" for it.
Mike Maxwell
Linguistic Data Consortium
maxwell at ldc.upenn.edu