[Corpora-List] Re: Minor(ity) Language
Mike Maxwell
maxwell at ldc.upenn.edu
Wed Mar 8 17:20:27 UTC 2006
Chantal ENGUEHARD wrote:
> Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid to
> measure precisely the degree of computerization of any language. This
> grid allows one to calculate a number (a score on a scale of 20 points).
> If this number is less than 10 points, the language is said to be a
> pi-language (pi being the Greek letter p).
> If this number is more than 14 points, the language is said to be a
> tau-language (tau being the Greek letter t).
> Otherwise the language is said to be a mu-language (mu being the Greek letter
> m).]
Reminds me of a project we (mostly Bill Poser and myself) did at the LDC
a few years back, in which we tried to quantify the resources available
for languages with at least a million speakers (of which the Ethnologue
reports something like 330). We looked on the web for things like 100k
words of monolingual and bilingual text, bilingual lexicons,
morphological parsers (where relevant), etc. We did _not_ try to
quantify more high-end things, such as syntactic parsers or MT programs
(although we recorded them if we found them). Everything was
text-based: we did not look at speech resources.
A language was scored on each of these categories in a yes/no fashion.
(It would have been nice to say how much bilingual text there was,
rather than just more than or less than 100k words, but in many cases
it's hard enough to find the answer to the yes/no question.) We then
built a spreadsheet, with green for 'yes' in a given category and red for
'no'. By assigning numerical scores to various categories, we could
easily sort the list of languages.
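
A minimal sketch of that scoring-and-sorting step, for illustration only. The
category names, the weights, and the placeholder languages are all
assumptions: the actual categories were along these lines, but the real
scores and survey data are not given here.

```python
# Hypothetical weights per category; the real values are not stated.
WEIGHTS = {
    "monolingual_text_100k": 1,
    "bilingual_text_100k": 2,
    "bilingual_lexicon": 2,
    "morphological_parser": 3,
}


def score(resources):
    """Sum the weights of every category marked True ('yes')."""
    return sum(w for cat, w in WEIGHTS.items() if resources.get(cat, False))


def rank(survey):
    """Return language names sorted from highest to lowest score."""
    return sorted(survey, key=lambda name: score(survey[name]), reverse=True)


# Placeholder entries, not real survey results.
survey = {
    "Language A": {"monolingual_text_100k": True, "bilingual_lexicon": True},
    "Language B": {
        "monolingual_text_100k": True,
        "bilingual_text_100k": True,
        "bilingual_lexicon": True,
        "morphological_parser": True,
    },
    "Language C": {},
}

for name in rank(survey):
    print(name, score(survey[name]))
```

A missing category is treated as 'no' here, which matches the practical
situation where a resource simply could not be found.
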
In the end, we only had time to do about 150 languages (intentionally
leaving out MSA, Mandarin Chinese, and most of the European languages,
even the minor(ity) ones). When we showed the results to people, they
thought it was the best thing since sliced bread. There are lots of
ways it could be improved if we did it again. Unfortunately, such a
survey quickly becomes out of date, and we have not found funding to
revisit it.
I'll have to see if I can get a copy of Berment's thesis...
Mike Maxwell
CASL/ U of Maryland