[Corpora-List] Re: Minor(ity) Language
Sigrun Helgadottir
sigrunh at lexis.hi.is
Thu Mar 9 08:50:09 UTC 2006
The discussion about "minority languages" on this list puzzles me slightly.
My understanding is that a "minority language" is a language spoken by a
minority. In other words, it describes a relative situation. Swedish is a
minority language in Finland just as Finnish is a minority language in
Sweden. Icelandic is mainly spoken by the 300 thousand or so inhabitants
of Iceland but is certainly not a minority language there. However, it is a
minority language in Canada, for example, where it is spoken by the
descendants of Icelandic immigrants. Polish is not a minority language in
Poland but it is a minority language in Iceland where it is spoken by
Polish immigrants who make up about 1% of the population of Iceland.
Sigrún Helgadóttir
At 12:20 8.3.2006 -0500, Mike Maxwell wrote:
>Chantal ENGUEHARD wrote:
>>Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid
>>for rating precisely how computerized a given language is. This grid
>>yields a number (a score on a scale of 20 points).
>>If this number is less than 10 points, the language is said to be a
>>pi-language (pi being the Greek letter p).
>>If this number is more than 14 points, the language is said to be a
>>tau-language (tau being the Greek letter t).
>>Otherwise the language is said to be a mu-language (mu being the Greek letter
>>m).]
>
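The three-way split in the note above can be written down as a small function. This is only a sketch of the thresholds as quoted (below 10, above 14, otherwise); the function name is made up for illustration, and how the grid itself arrives at the score is not described in this thread.

```python
def classify_by_score(score: float) -> str:
    """Map a score on the 20-point grid to the class named in the quoted note."""
    if not 0 <= score <= 20:
        raise ValueError("score must lie between 0 and 20")
    if score < 10:
        return "pi-language"
    if score > 14:
        return "tau-language"
    # Scores from 10 through 14 inclusive fall in the middle class.
    return "mu-language"


if __name__ == "__main__":
    for s in (8, 10, 14, 14.5):
        print(s, classify_by_score(s))
```

Note that, as quoted, scores of exactly 10 and exactly 14 both land in the mu class.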
>Reminds me of a project we (mostly Bill Poser and myself) did at the LDC a
>few years back, in which we tried to quantify the resources available for
>languages with at least a million speakers (of which the Ethnologue
>reports something like 330). We looked on the web for things like 100k
>words of monolingual and bilingual text, bilingual lexicons, morphological
>parsers (where relevant), etc. We did _not_ try to quantify more high-end
>things, such as syntactic parsers or MT programs (although we recorded
>them if we found them). Everything was text-based: we did not look at
>speech resources.
>
>A language was scored on each of these categories in a yes/no fashion. (It
>would have been nice to say how much bilingual text there was, rather than
>just more than or less than 100k words, but in many cases it's hard enough
>to find the answer to the yes/no question.) We then built a spreadsheet,
>with green for 'yes' in a given category, and red for 'no'. By assigning
>numerical scores to various categories, we could easily sort the list of
>languages.
>
>In the end, we only had time to do about 150 languages (intentionally
>leaving out MSA, Mandarin Chinese, and most of the European languages,
>even the minor(ity) ones). When we showed the results to people, they
>thought it was the best thing since sliced bread. There are lots of ways
>it could be improved if we did it again. Unfortunately, such a survey
>quickly becomes out of date, and we have not found funding to revisit it.
>
>I'll have to see if I can get a copy of Berment's thesis...
>
> Mike Maxwell
> CASL/ U of Maryland
>
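The yes/no scoring and sorting described in the quoted message might look roughly like the sketch below. The category names follow the resources listed in the message, but the weights are hypothetical (the message does not say which numbers were used), and the language names are placeholders rather than survey results.

```python
# Hypothetical weights: the quoted message does not give the actual numbers.
WEIGHTS = {
    "monolingual_text_100k": 1,
    "bilingual_text_100k": 2,
    "bilingual_lexicon": 2,
    "morphological_parser": 3,
}


def resource_score(found: dict) -> int:
    """Sum the weights of every category answered 'yes' for one language."""
    return sum(weight for category, weight in WEIGHTS.items()
               if found.get(category, False))


def rank_languages(survey: dict) -> list:
    """Return (language, score) pairs, best-resourced first, ties by name."""
    scored = [(name, resource_score(found)) for name, found in survey.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Placeholder data, not real findings.
survey = {
    "Language A": {"monolingual_text_100k": True, "bilingual_lexicon": True},
    "Language B": {"monolingual_text_100k": True},
    "Language C": {category: True for category in WEIGHTS},
}

if __name__ == "__main__":
    for name, score in rank_languages(survey):
        print(f"{score:2d}  {name}")
```

A category missing from a language's entry is treated as 'no', which matches the situation where nothing could be found on the web.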