[Corpora-List] Quantitative Corpus Linguistics

Michael Maxwell maxwell at umiacs.umd.edu
Fri Aug 22 15:28:43 UTC 2008


Dom Widdows wrote:
> On 8/21/08, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
>> Not to mention that if you limit yourself to studying things that
>>  require large corpora, you rule out studying perhaps 99% of the
>>  languages in the world.
>
> This I'd disagree with - you can learn things about the structure of
> language in general by considering available large corpora, and use
> this knowledge to try and enhance what you can do with small datasets.
> Linear B was a comparatively small corpus, but using knowledge of
> classical Greek, it could be deciphered. Perhaps this is a canned
> example since the languages are in a sense "the same" - but even for
> completely unrelated languages, a good linguist uses information
> learned about familiar languages to build expertise on language in
> general, and can then apply this expertise and technique to fresh
> languages with small amounts of data.

Maybe I'd better clarify my comment.  For 99% of the world's languages,
there are no gigabyte corpora.  Hence any studies of those languages
will have to use other methods in addition to _pure_ corpus methods.
Certainly looking at related languages (where there are related
languages, which leaves out the ten or fifteen percent of languages that
are isolates) is a valid method, in my view.  (In fact, I would
*love* to see more work in machine learning that bootstrapped off of
related languages...).

For perhaps 30% of the world's languages, the only written (never mind
electronic) corpus is the New Testament of the Bible.  That's a pretty
small corpus, and it's a translation, with all the problems that can
bring.  And for perhaps 50% of the languages, there is no written corpus
whatsoever, because the languages haven't been written down.  (There might
be a word list written down by someone, and that someone might or might
not have had any linguistic training.)  This is true for thousands of
minority spoken languages, and for nearly all sign languages.

I hasten to add that the above percentages are pulled out of my head, and
are therefore wrong.  But I would bet that they're not far wrong.

In sum, if you want to work with 99% of the world's languages, corpus
methods are going to have to be supplemented by something else: knowledge
of related languages, directed elicitation, training of native speakers in
lexicography and linguistics--and of course more corpus collection.  You
can't just "trust the corpus", because there isn't one.  But we don't stop
working on those languages just because we can't use a "pure" corpus
linguistics methodology.

   Mike Maxwell
   CASL/ U MD

