ELL: definitions for various terms re: minority languages

Mon Mar 15 18:55:21 UTC 1999

Date: Mon, 15 Mar 1999 19:55:21 +0100
From: Jeff ALLEN <jeff at elda.fr>
Subject: ELL: definitions for various terms re: minority languages
At 09:57 15/03/99 -0500, Tom Tehan <tsc_msea at SIL.ORG> wrote:

>     I thought I would submit my question to the whole e-list for
>     discussion because I think many subscribers would have interesting
>     thoughts to add. The question has to do with what is a minority
>     language/group, a vernacular language, a threatened language/group, or
>     an endangered language.

The following terms are those that are used quite a bit right now, with my
definitions. I acknowledge that my definitions may not suit the needs of
everyone on this list, but they do allow me to distinguish between the
different sociolinguistic factors at work in my research.

* high-density languages: languages for which there are abundant on-line
electronic resources/data (e.g., English, French).

* low-density languages: languages for which there are very few, if
any, on-line electronic resources/data.

(the density terminology tends to be used by the military)

* sparse-data languages: equivalent to definition of low-density, but it is

* less(er)-common(ly) taught languages: all languages other than French,
English, Spanish and German.

* lesser-used languages: takes into account the 1st and 2nd (3rd, 4th)
languages that are written and spoken in the world, to indicate which
languages are used to what extent and by how many people. The lesser-used
languages are those that rank lower on that scale than the most-used ones.

* low-diffusion languages:  (I just came across the term the other day and
still haven't worked on a definition for it).

* vernacular languages: this tends to be used in sociolinguistic circles to
refer to traditional oral languages.

* high language: In Ferguson terms, the politically and culturally dominant
language in a diglossic situation.

* low language: In Ferguson terms, the politically and culturally
subordinate language in a diglossic situation.

* neglected languages: those languages that could be developed in some way,
but political, financial, economic, and other factors are blocking
the advancement of such development work.

* endangered languages: languages that risk disappearing in the coming

* official language: well this is clear.

* national language: any language that is spoken by a significant part of
the population. This is a difficult term to quantify.

* language vs. dialect vs. patois
   (It all depends on your framework of thinking. The definition of
   language and dialect is completely different if you are a theoretical
   linguist, a dialectologist, a socio- or ethnolinguist, or simply a
   non-linguistics oriented person.  I have learned to define my terminology
   with respect to my audience.  When I change audiences, I often have to
   modify my definitions to adapt to their way of viewing the world).

   And here are some excerpts from various recent papers (as you can tell, I
   have a very good electronic collection of materials).

   Taken from my recent paper:
   ALLEN, Jeffrey. 1998a. Lexical variation in Haitian Creole and orthographic
   issues for Machine Translation (MT) and Optical Character Recognition (OCR)
   applications. Paper presented at the workshop on Embedded MT systems of the
   Association for Machine Translation in the Americas (AMTA) conference,
   Philadelphia, 28 October 1998.


   In this paper, several sociolinguistic and psycholinguistic variables
   pertinent to an adequate linguistic analysis of Haitian Creole are
   presented in order to resolve issues in the development of natural language
   processing (NLP) systems -- including machine translation, speech
   recognition and optical character recognition -- for this language.
   Consideration is taken with regard to the standard vs. non-standard status
   of the language being analyzed for NLP system development.  Such
   extra-linguistic factors in 'vernacular' languages (e.g., Haitian Creole)
   must be evaluated in order to sufficiently provide processing techniques in
   systems for issues of linguistic variation that permeate the entire lexicon
   of such languages.

   Section 1. Standard vs. Vernacular Languages

   The Croatian language is an example of both a) a "low-density" language --
   i.e., language with little or no accessible on-line data -- and b) a less
   commonly taught language[1]. Haitian Creole (henceforth HC) is a + b, yet
   it presents an additional set of issues because it is also c) a
   "vernacular" language[2] that is in the process of standardization and
   normalization. A vernacular language is defined as an "everyday spoken
   language or languages of a community, as contrasted with a standard or
   official language' -- generally, a 'Low' as opposed to a 'High' variety in
   Ferguson's (1959) terms" (Tabouret-Keller et al. 1997, p. 6).

   [1]  References on Less Commonly Taught Languages (LCTLs):

   [2]  HC is an exception in the general definition of vernaculars because
   the language was officialized and given equal status with French in 1986.
   Despite this decree, literacy and education in HC in Haiti is very limited.
    Only one higher education institution (Université Caraïbe) offers classes
    taught in HC.  HC therefore continues to reflect the status of other
   and vernacular languages.

   Taken from:
   LENZO, Kevin, HOGAN, Christopher, and Jeffrey ALLEN. 1998.
   Rapid-Deployment Text-to-Speech in the DIPLOMAT System.  Poster presented
   at the International Conference on Spoken Language Processing.  30 November
   - 4 December 1998, Sydney, Australia.

   Section on data collection:

   Collection --  The difficulty of text corpus collection varies by language.
    For the case of Korean, the collection of texts is a straightforward
    process, since information written in Korean is abundant and available
    from current resources on the Internet.  For this case, texts were obtained
    from Internet broadcasting sources and the selected material did not pose any
    significant difficulty for Korean speakers.
    The task is significantly more difficult for languages that are not widely
    taught, such as Haitian Creole (Allen and Hogan, 1998, Decrozant and Voss,
    1998), because they are "low-density" languages, and there are few
    available documents in electronic form. Finding electronic texts written
    in Creole required about five months of part-time research on the Internet,
    in addition to contacting dozens of non-governmental organizations and
    literacy institutes worldwide that eventually provided electronic versions
    of their texts.
    It is possible to scan and correct texts from paper documents, but our
    experience for Croatian and Haitian Creole was similar to that of
    (Decrozant and Voss, 1998) in that current OCR software packages provide
    poor recognition accuracy on less commonly taught languages for which
    customized character recognition has not been specifically developed. Our
    Creole corpus includes all types of text (e.g., novels, political
    language learning books, literacy primers, religious texts, etc.) that
    have been collected from all available resources whereas the Korean corpus
    remains in domain with abundant amounts of text.

    Taken from:
    Decrozant, Lisa and Clare Voss. 1999.  In ELRA Newsletter. Vol 4 issue 1;
    January 1999, Paris: European Language Resources Association. pp. 10-11.

        As researchers tasked with evaluating machine translation (MT) tools
	for military linguists in the field, we must often work with "less commonly
	taught languages" (LCTLs) for which little readily available on-line
	exists.  While many linguistic resources needed for MT evaluation are
	commonly found in electronic form for the major languages of commerce
	(English, French, Japanese, etc.), this is typically not the case for
	LCTLs[1]. In this brief note, we describe our recent effort transforming
	hardcopy parallel, sentence-aligned text into on-line form.

	[1]The LCTL we discuss here is thus a "low-density" or a
	language, in that few linguistic resources are available on-line.


	My colleague Christopher Hogan chose the term "minority language" for
	Haitian Creole in a recent paper:

	Hogan, Christopher (1998) Embedded Spelling Correction for OCR with an
	Application to Minority Languages. Paper presented at Workshop on
	Embedded MT Systems, in conjunction with the AMTA 98 conference.
	28 October 1998, Langhorne, Pennsylvania.


	I tend to agree more with the terminology in the following two

	SOMERS, Harold.  Language Resources and Minority Languages. In
	Today. Number 5, 1998.   Nottingham, UK: Language Publications
	Ltd. pp. 20-24.

	Paul Baker, Tony McEnery, Mark Sebba, Lou Burnard. Minority Language
	Engineering. In ELRA Newsletter.  Vol.3, Number 4, November 1998

	where Minority languages tend to be multiple in a country where there
	is a different majority official language.  In the UK, English is the
	majority language, but there are many, many minority languages (East
	West Indian, Mid-East, etc).


	This message will certainly stir up some discussion on the topic.



