[Corpora-List] LULCL II

Fri Aug 22 16:39:02 UTC 2008

Mike,

It looks like your prayers have been answered!

:-)

Justin Washtell
University of Leeds

Quoting maxwell at umiacs.umd.edu:

> Dom Widdows wrote:
>> On 8/21/08, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
>>> Not to mention that if you limit yourself to studying things that
>>>  require large corpora, you rule out studying perhaps 99% of the
>>>  languages in the world.
>>
>> This I'd disagree with - you can learn things about the structure of
>> language in general by considering available large corpora, and use
>> this knowledge to try and enhance what you can do with small datasets.
>> Linear B was a comparatively small corpus, but using knowledge of
>> classical Greek, it could be deciphered. Perhaps this is a canned
>> example since the languages are in a sense "the same" - but even for
>> completely unrelated languages, a good linguist uses information
>> learned about familiar languages to build expertise on language in
>> general, and can then apply this expertise and technique to fresh
>> languages with small amounts of data.
>
> Maybe I'd better clarify my comment.  For 99% of the world's languages,
> there are no gigabyte corpora.  Hence any studies that rely on large
> corpora for those languages will have to use other methods in addition to
> _pure_ corpus methods.  Certainly looking at related languages (where
> there are related languages, which leaves out the ten or fifteen percent
> of language isolates) is a valid method, in my view.  (In fact, I would
> *love* to see more work in machine learning that bootstrapped off of
> related languages...).
>
> For perhaps 30% of the world's languages, the only written (never mind
> electronic) corpus is the New Testament of the Bible.  That's a pretty
> small corpus, and it's a translation, with all the problems that can
> bring.  And for perhaps 50% of the languages, there is no written corpus
> whatsoever, because the languages haven't been written down.  (There might
> be a word list written down by someone, and that someone might or might
> not have had any linguistic training.)  This is true for thousands of
> minority spoken languages, and for nearly all sign languages.
>
> I hasten to add that the above percentages are pulled out of my head, and
> are therefore wrong.  But I would bet that they're not far wrong.
>
> In sum, if you want to work with 99% of the world's languages, corpus
> methods are going to have to be supplemented by something else: knowledge
> of related languages, directed elicitation, training of native speakers in
> lexicography and linguistics--and of course more corpus collection.  You
> can't just "trust the corpus", because there isn't one.  But we don't stop
> working on those languages just because we can't use a "pure" corpus
> linguistics methodology.
>
>    Mike Maxwell
>    CASL/ U MD
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
