[Corpora-List] Enquiry about Indonesian corpus

Mike Maxwell maxwell at ldc.upenn.edu
Tue Mar 16 19:23:29 UTC 2004


Jelita Asian wrote:
> ...A person from
> Linguistic Data Consortium recommend me to contact you to get hold of
> some Indonesian corpus. Do you have any Indonesian corpus with you?
> If not, do you know who  we can contact to get hold of it?

I'm not sure who you contacted here at the LDC.  About a year ago, we looked
into what was available on-line for Bahasa Indonesia, without actually
creating a corpus.  It turns out there is a huge amount of news text, which
you can easily download and turn into a news corpus, if that is the type of
corpus you want.  Judging by what we've seen in other languages, you can
doubtless find other genres on-line too.  (We were specifically searching
for news.)

Tempo Interactive might be a source of parallel bilingual text.
Caution: when we looked at this, it was not apparent whether their English
and Indonesian articles were actually parallel, which is why I say "might".

There are also several on-line dictionaries and a couple of morphological
parsers, although from what little I know of Indonesian, there shouldn't be
too much morphology to worry about.

In summary, if you don't find that anyone else has compiled a corpus, you
could put one together yourselves without too much effort.  You might even
find a "market" for it.

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu


