[Corpora-List] Re: Minor(ity) Language

Mike Maxwell maxwell at ldc.upenn.edu
Wed Mar 8 17:20:27 UTC 2006


Chantal ENGUEHARD wrote:
> Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid to
> rate precisely the degree of computerization of any language. This
> grid yields a number (a score on a scale of 20 points).
> If this number is less than 10 points, the language is said to be a
> pi-language (pi being the Greek letter p).
> If this number is more than 14 points, the language is said to be a
> tau-language (tau being the Greek letter t).
> Otherwise the language is said to be a mu-language (mu being the Greek letter
> m).]
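
The thresholds in the quoted note can be sketched as a small Python
function. This is only an illustration of the description above, not
code from Berment's thesis: the name `classify` is hypothetical, and
treating scores of exactly 10 and 14 as mu is an inference from the
wording "less than 10" and "more than 14".

```python
def classify(score: float) -> str:
    """Map a score on the 20-point grid to 'pi', 'mu', or 'tau'.

    Hypothetical helper; boundary handling is inferred from the quoted text.
    """
    if not 0 <= score <= 20:
        raise ValueError("score must be on the 0-20 scale")
    if score < 10:
        return "pi"   # less than 10 points
    if score > 14:
        return "tau"  # more than 14 points
    return "mu"       # everything from 10 to 14 inclusive
```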

This reminds me of a project we (mostly Bill Poser and I) did at the LDC 
a few years back, in which we tried to quantify the resources available 
for languages with at least a million speakers (of which the Ethnologue 
reports something like 330).  We looked on the web for things like 100k 
words of monolingual and bilingual text, bilingual lexicons, 
morphological parsers (where relevant), etc.  We did _not_ try to 
quantify more high-end things, such as syntactic parsers or MT programs 
(although we recorded them if we found them).  Everything was 
text-based: we did not look at speech resources.

A language was scored on each of these categories in a yes/no fashion. 
(It would have been nice to say how much bilingual text there was, 
rather than just more than or less than 100k words, but in many cases 
it's hard enough to find the answer to the yes/no question.)  We then 
built a spreadsheet, with green for 'yes' in a given category and red for 
'no'.  By assigning numerical scores to various categories, we could 
easily sort the list of languages.
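
The scoring and sorting step described above might look roughly like the
following Python sketch. Everything specific in it is an assumption made
for illustration: the category keys, the weights, and the placeholder
language rows are hypothetical, since the post does not say which scores
were assigned or list any survey results.

```python
# Hypothetical weights; the actual scores used in the survey are not given.
WEIGHTS = {
    "monolingual_text_100k": 1,
    "bilingual_text_100k": 2,
    "bilingual_lexicon": 2,
    "morphological_parser": 3,
}


def score(resources: dict) -> int:
    """Sum the weights of the categories marked 'yes' (True).

    A category missing from the row is counted as 'no'.
    """
    return sum(w for cat, w in WEIGHTS.items() if resources.get(cat, False))


def rank(survey: dict) -> list:
    """Return (language, score) pairs, highest score first, ties by name."""
    scored = [(lang, score(res)) for lang, res in survey.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Placeholder rows, not data from the actual survey.
survey = {
    "Language A": {"monolingual_text_100k": True, "bilingual_lexicon": True},
    "Language B": {
        "monolingual_text_100k": True,
        "bilingual_text_100k": True,
        "bilingual_lexicon": True,
        "morphological_parser": True,
    },
    "Language C": {},
}

ranking = rank(survey)
for lang, total in ranking:
    print(f"{lang}: {total}")
```

A category that is not relevant for a language (a morphological parser,
say) is simply counted as 'no' here; a fuller version would need a third
value so that such languages are not penalized.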

In the end, we only had time to do about 150 languages (intentionally 
leaving out MSA, Mandarin Chinese, and most of the European languages, 
even the minor(ity) ones).  When we showed the results to people, they 
thought it was the best thing since sliced bread.  There are lots of 
ways it could be improved if we did it again.  Unfortunately, such a 
survey quickly becomes out of date, and we have not found funding to 
revisit it.

I'll have to see if I can get a copy of Berment's thesis...

    Mike Maxwell
    CASL/ U of Maryland


