[Corpora-List] Re: Minor(ity) Language

Sigrun Helgadottir sigrunh at lexis.hi.is
Thu Mar 9 08:50:09 UTC 2006


The discussion about "minority languages" on this list puzzles me slightly. 
My understanding is that a "minority language" is a language spoken by a 
minority. In other words it describes a relative situation. Swedish is a 
minority language in Finland just as Finnish is a minority language in 
Sweden. Icelandic is mainly spoken by the 300 thousand or so inhabitants 
of Iceland but is certainly not a minority language there. However, it is a 
minority language in Canada, for example, where it is spoken by the 
descendants of Icelandic immigrants. Polish is not a minority language in 
Poland but it is a minority language in Iceland where it is spoken by 
Polish immigrants who make up about 1% of the population of Iceland.
Sigrún Helgadóttir

At 12:20 8.3.2006 -0500, Mike Maxwell wrote:
>Chantal ENGUEHARD wrote:
>>Note: [In 2004, Vincent Berment defined in his thesis* an evaluation grid to
>>record precisely the degree of computerization of any language. This
>>grid allows one to calculate a number (a score on a scale of 20 points).
>>If this number is less than 10 points, the language is said to be a
>>pi-language (pi being the Greek letter p).
>>If this number is more than 14 points, the language is said to be a
>>tau-language (tau being the Greek letter t).
>>Otherwise the language is said to be a mu-language (mu being the Greek letter
>>m).]
>
>Reminds me of a project we (mostly Bill Poser and myself) did at the LDC a 
>few years back, in which we tried to quantify the resources available for 
>languages with at least a million speakers (of which the Ethnologue 
>reports something like 330).  We looked on the web for things like 100k 
>words of monolingual and bilingual text, bilingual lexicons, morphological 
>parsers (where relevant), etc.  We did _not_ try to quantify more high-end 
>things, such as syntactic parsers or MT programs (although we recorded 
>them if we found them).  Everything was text-based: we did not look at 
>speech resources.
>
>A language was scored on each of these categories in a yes/no fashion. (It 
>would have been nice to say how much bilingual text there was, rather than 
>just more than or less than 100k words, but in many cases it's hard enough 
>to find the answer to the yes/no question.)  We then did a spreadsheet, 
>with green for 'yes' in a given category, and red for 'no'.  By assigning 
>numerical scores to various categories, we could easily sort the list of 
>languages.
>
>In the end, we only had time to do about 150 languages (intentionally 
>leaving out MSA, Mandarin Chinese, and most of the European languages, 
>even the minor(ity) ones).  When we showed the results to people, they 
>thought it was the best thing since sliced bread.  There are lots of ways 
>it could be improved if we did it again.  Unfortunately, such a survey 
>quickly becomes out of date, and we have not found funding to revisit it.
>
>I'll have to see if I can get a copy of Berment's thesis...
>
>    Mike Maxwell
>    CASL/ U of Maryland
>
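
The thresholds in the note quoted above can be written as a small function. This is only a sketch under stated assumptions: the name `berment_class` is hypothetical, the evaluation grid that produces the score is not reproduced here, and scores of exactly 10 or 14 are taken to fall in the "otherwise" (mu) band, as the wording of the note suggests.

```python
def berment_class(score: float) -> str:
    """Map a computerization score on a 20-point scale to pi, mu, or tau.

    Hypothetical helper: only the thresholds come from the quoted note.
    """
    if not 0 <= score <= 20:
        raise ValueError("score must lie between 0 and 20")
    if score < 10:
        return "pi"   # fewer than 10 points
    if score > 14:
        return "tau"  # more than 14 points
    return "mu"       # everything from 10 to 14 inclusive
```

For instance, `berment_class(9.5)` gives "pi", while `berment_class(14)` still gives "mu", because only scores strictly above 14 count as tau.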


