[Corpora-List] Re: Minor(ity) Language

Chris Brew cbrew at acm.org
Thu Mar 9 15:14:47 UTC 2006


Whether a language gets worked on in corpus linguistics/NLP/computational
linguistics depends on at least the following:

- the number of people who speak it
- the total income of people who speak it
- the extent to which computational and/or lexical
   resources exist for it
- the extent to which the people who hold the lexical
   resources make them conveniently available in ways that foster
   research. This also affects the nature of the research: people
   who want to run machine learning algorithms look for different
   kinds of access than those who want to see a small number of
   key examples presented in context
- the level of governmental support, enthusiasm and funding
- the extent to which researchers who choose to work on the
   language are loved and appreciated by the society
- whether language is a significant political issue and, if so, how
- the potential scientific payoff of working on the language
   in question.

Given the number of dimensions involved (I'm sure the above is not
exhaustive), I doubt that it makes any sense to draw hard decision
boundaries between minority/majority, endangered/safe/hegemonic or
indeed any other fixed set of terms. So when we write about our work,
we'll just have to get used to including brief summaries of the
relevant aspects of the language situation. Self-evidently, studying
the Arabic of Dearborn, Michigan or the Spanish of emigre Puerto
Ricans and Mexicans in Lorain County, Ohio is somehow different from
studying those languages in Lebanon, San Juan or Tijuana, but until we
get to specifics we won't want to pick terms that describe the
languages in a hard and fast way.

Chris





On Thu, Mar 09, 2006 at 09:36:06AM -0500, Ed Kenschaft wrote:
> On 3/9/06, Nicholas Sanders <nick at semiotek.org> wrote:
>> But the Polish and Icelandic examples don't fit the model,
>> because they have no official status in the countries cited.
> 
> Correct me if I'm wrong, but I don't think *any* language has official
> status in the United States.  Does that mean we don't have any
> minority (or majority) languages?
> 
> Still, you make a good point.  A language that is clearly not a
> minority language worldwide (e.g. Hindi) might well be a minority
> language in a specific context.  Thus complicating the terminology
> still further.
> 
> On 3/8/06, Mike Maxwell <maxwell at ldc.upenn.edu> wrote:
>> On this side of the Atlantic, the term seems to be "low density
>> languages" ...
> 
> In my circle, the most common term might be "scarce-resource
> languages".  (We got tired of explaining to people that the meaning of
> "low density" had nothing to do with density.)  The term gets at the
> idea that a language might be spoken by a lot of people, but still not
> have a lot of computational resources available (e.g. Hindi, Urdu).
> 
> Cheers.
> 
> --
> Ed Kenschaft
> ekenschaft at gmail.com
> www.umiacs.umd.edu/users/kensch/
> 


