[Corpora-List] Web/Corpora Questions

Mon Oct 20 19:14:43 UTC 2003

William Fletcher wrote:
> Personally I believe for the major languages the Web is most useful
> for compiling ad-hoc corpora of texts dealing with specific domains
> or emerging usage, or else for answering specific questions...

A general comment (probably not relevant to the original query, but perhaps
of interest to other readers).  At the Linguistic Data Consortium, we've
been reasonably successful at collecting corpora of non-major languages.  We
found substantial quantities of text (especially news text) for Hindi,
although that is probably not what one would call a minority language in terms
of population.  But even for Cebuano, a Philippine language with about 15
million speakers, we found well over 100K words of text.  I've run trial
searches for Tzeltal (a Mayan language with 100k+ speakers) turning up some
texts (mainly collected by anthropologists).  For Shuar (an indigenous
language of Ecuador, 30k speakers), I was able to come up with a few hits,
although they were pretty much limited to a Bible translation into that
language.

Finding texts on the web in smaller languages is pretty much a hit-(pardon
the pun)-and-miss thing.  Obviously, a lot depends on the number of people
in the country who have web access, although that seems to be less of a
consideration than I would have thought; the other big factor seems to be
the official status of the language in the country. Among Indonesian
languages, for example, it's very difficult to find anything that's not in
Bahasa Indonesia.  In the cases of non-official languages, you often get
more hits outside the country than you do inside--expat populations are
often more likely to have web access and web sites than people inside the
country, from my admittedly limited experience.

Another thing that makes it difficult to track down corpora for smaller
languages is the fact that encodings and even writing systems are not
standardized.  It's not too bad if the language's phonology is
simple--Cebuano, for instance.  But if the language has sounds which are not
readily represented in ASCII characters, it can be more difficult.  You have
to think about how nasalized vowels, for example, might be written--or in
some cases, not written.  (Unicode is nice when it is used, but more often
than not, it isn't.)
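
To make the spelling-variation point concrete, here is a minimal Python sketch
of one way to expand a search term into the forms it might take on a page.  It
is only an illustration, not the procedure used at the LDC: the function name
spelling_variants is made up, and it assumes the word is available in Unicode
to begin with.  It covers only the "diacritic written precomposed, written as
a combining mark, or not written at all" cases; legacy font encodings and
language-specific conventions (a nasalized vowel spelled as vowel plus "n",
say) would need hand-built rules.

```python
import unicodedata


def spelling_variants(word):
    """Return candidate written forms of `word` to try as search queries.

    Hypothetical helper for illustration: yields the precomposed form,
    the decomposed form, and a form with combining marks dropped.
    """
    # Precomposed form, e.g. U+00E3 for a-with-tilde.
    nfc = unicodedata.normalize("NFC", word)
    # Decomposed form: base letter followed by combining mark(s).
    nfd = unicodedata.normalize("NFD", word)
    # Drop nonspacing combining marks (category "Mn") to approximate
    # pages where the diacritic is simply not written.
    stripped = "".join(
        ch for ch in nfd if unicodedata.category(ch) != "Mn"
    )

    variants = []
    for form in (nfc, nfd, stripped):
        if form not in variants:  # keep order, skip duplicates
            variants.append(form)
    return variants


if __name__ == "__main__":
    # A made-up string, not a real word in any particular language.
    for form in spelling_variants("t\u00e3ka"):
        print(ascii(form))
```

A word with no diacritics comes back as a single variant, so the helper costs
nothing for a language like Cebuano and only multiplies queries where the
orthography actually varies.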

I've thought about writing up our experiences in compiling archives for
smaller languages, but I'm not sure what a good forum would be.  And I
probably don't have a good handle on what has already been published on this
topic.

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu