ELL: definitions for various terms re: minority languages
Jeff ALLEN
jeff at elda.fr
Mon Mar 15 18:55:21 UTC 1999
At 09:57 15/03/99 -0500, Tom Tehan <tsc_msea at SIL.ORG> wrote:
> I thought I would submit my question to the whole e-list for
> discussion because I think many subscribers would have interesting
> thoughts to add. The question has to do with what is a minority
> language/group, a vernacular language, a threatened language/group, or
> an endangered language.
The following terms are those that are used quite a bit right now, with my
definitions. I acknowledge that my definitions may not suit the needs of
everyone on this list, but they do allow me to distinguish between the
different sociolinguistic factors at work in my research.
* high-density languages: languages for which there are abundant on-line
electronic resources/data (e.g., English, French).
* low-density languages: languages for which there are very few, if any,
on-line electronic resources/data.
(The density terminology tends to be used by the military.)
* sparse-data languages: equivalent to low-density languages, but the term
is self-explanatory.
* less(er)-common(ly) taught languages: all languages other than French,
English, Spanish and German.
* lesser-used languages: takes into account the 1st and 2nd (3rd, 4th)
languages that are written and spoken in the world, to indicate which
languages are used, to what extent, and by how many people. The lesser-used
languages are those that rank lower on that scale than the most-used ones.
* low-diffusion languages: (I just came across the term the other day and
still haven't worked on a definition for it).
* vernacular languages: this tends to be used in sociolinguistic circles to
refer to traditional oral languages.
* high language: In Ferguson terms, the politically and culturally dominant
language in a diglossic situation.
* low language: In Ferguson terms, the politically and culturally
subordinate language in a diglossic situation.
* neglected languages: those languages that could be developed in some way,
but for which political, financial, economic, and other factors are
blocking the advancement of such development work.
* endangered languages: languages that risk disappearing in the coming
decades.
* official language: well, this is clear.
* national language: any language that is spoken by a significant part of
the population. This is a difficult term to quantify.
* language vs. dialect vs. patois
(It all depends on your framework of thinking. The definition of
language and dialect is completely different if you are a theoretical
linguist, a dialectologist, a socio- or ethnolinguist, or simply a
non-linguist. I have learned to define my terminology with respect to my
audience. When I change audiences, I often have to modify my definitions
to adapt to their way of viewing the world.)
And here are some excerpts from various recent papers (as you can tell, I
have a very good electronic collection of materials).
Taken from my recent paper:
ALLEN, Jeffrey. 1998a. Lexical variation in Haitian Creole and orthographic
issues for Machine Translation (MT) and Optical Character Recognition (OCR)
applications. Paper presented at the workshop on Embedded MT systems of the
Association for Machine Translation in the Americas (AMTA) conference,
Philadelphia, 28 October 1998.
Abstract:
In this paper, several sociolinguistic and psycholinguistic variables
pertinent to an adequate linguistic analysis of Haitian Creole are
presented in order to resolve issues in the development of natural language
processing (NLP) systems -- including machine translation, speech
recognition and optical character recognition -- for this language.
Consideration is taken with regard to the standard vs. non-standard status
of the language being analyzed for NLP system development. Such
extra-linguistic factors in 'vernacular' languages (e.g., Haitian Creole)
must be evaluated in order to sufficiently provide processing techniques in
systems for issues of linguistic variation that permeate the entire lexicon
of such languages.
Section 1. Standard vs. Vernacular Languages
The Croatian language is an example of both a) a "low-density" language --
i.e., language with little or no accessible on-line data -- and b) a less
commonly taught language[1]. Haitian Creole (henceforth HC) is a + b, yet
it presents an additional set of issues because it is also c) a
"vernacular" language[2] that is in the process of standardization and
normalization. A vernacular language is defined as an "'everyday spoken
language or languages of a community, as contrasted with a standard or
official language' -- generally, a 'Low' as opposed to a 'High' variety in
Ferguson's (1959) terms" (Tabouret-Keller et al. 1997, p. 6).
[1] References on Less Commonly Taught Languages (LCTLs):
http://www.councilnet.org/pages/CNet_Learn_FAQs.html#2
http://bioc09.uthscsa.edu/natnet/archive/ng/95/0021.html
[2] HC is an exception in the general definition of vernaculars because
the language was officialized and given equal status with French in 1986.
Despite this decree, literacy and education in HC in Haiti is very limited.
Only one higher education institution (Université Caraïbe) offers classes
taught in HC. HC therefore continues to reflect the status of other Creole
and vernacular languages.
----
Taken from:
LENZO, Kevin, HOGAN, Christopher, and Jeffrey ALLEN. 1998.
Rapid-Deployment Text-to-Speech in the DIPLOMAT System. Poster presented
at the International Conference on Spoken Language Processing. 30 November
- 4 December 1998, Sydney, Australia.
Section on data collection:
Collection -- The difficulty of text corpus collection varies by language.
For the case of Korean, the collection of texts is a straightforward
process, since information written in Korean is abundant and available from
current resources on the Internet. For this case, texts were obtained from
Internet broadcasting sources and the selected material did not pose any
significant difficulty for Korean speakers.
The task is significantly more difficult for languages that are not widely
taught, such as Haitian Creole (Allen and Hogan, 1998, Decrozant and Voss,
1998), because they are "low-density" languages, and there are few
available documents in electronic form. Finding electronic texts written in
Creole required about five months of part-time research on the Internet, in
addition to contacting dozens of non-governmental organizations and
literacy institutes worldwide that eventually provided electronic versions
of their texts.
It is possible to scan and correct texts from paper documents, but our
experience for Croatian and Haitian Creole was similar to that of
(Decrozant and Voss, 1998) in that current OCR software packages provide
poor recognition accuracy on less commonly taught languages for which
customized character recognition has not been specifically developed. Our
Creole corpus includes all types of text (e.g., novels, political speeches,
language learning books, literacy primers, religious texts, etc.) that have
been collected from all available resources whereas the Korean corpus
remains in domain with abundant amounts of text.
---
Taken from:
Decrozant, Lisa and Clare Voss. 1999. In ELRA Newsletter. Vol 4 issue 1;
January 1999, Paris: European Language Resources Association. pp. 10-11.
Introduction
As researchers tasked with evaluating machine translation (MT) tools
for military linguists in the field, we must often work with "less commonly
taught languages" (LCTLs) for which little readily available on-line text
exists. While many linguistic resources needed for MT evaluation are
commonly found in electronic form for the major languages of commerce
(English, French, Japanese, etc.), this is typically not the case for LCTLs
[1]. In this brief note, we describe our recent effort transforming
hardcopy parallel, sentence-aligned text into on-line form.
[1] The LCTL we discuss here is thus a "low-density" or a "low-diffusion"
language, in that few linguistic resources are available on-line.
______
My colleague Christopher Hogan chose the term "minority language" for
Haitian Creole in a recent paper:
Hogan, Christopher (1998) Embedded Spelling Correction for OCR with an
Application to Minority Languages. Paper presented at Workshop on Embedded
MT Systems, in conjunction with the AMTA 98 conference. 28 October 1998,
Langhorne, Pennsylvania.
---
I tend to agree more with the terminology in the following two articles:
SOMERS, Harold. Language Resources and Minority Languages. In Language
Today. Number 5, 1998. Nottingham, UK: Language Publications Ltd.
pp. 20-24.
Paul Baker, Tony McEnery, Mark Sebba, Lou Burnard. Minority Language
Engineering. In ELRA Newsletter. Vol.3, Number 4, November 1998
where minority languages tend to be several languages within a country
that has a different majority official language. In the UK, English is the
official, majority language, but there are many, many minority languages
(East and West Indian, Mid-East, etc.).
____
This message will certainly stir up some discussion on the topic.
Best,
Jeff
=================================================
Jeff ALLEN - Directeur Technique
European Language Resources Association (ELRA) &
European Language Resources Distribution Agency (ELDA)
(Agence Européenne de Distribution des Ressources Linguistiques)
55, rue Brillat-Savarin
75013 Paris FRANCE
Tel: (+33) (0) 1.43.13.33.33 - Fax: (+33) (0) 1.43.13.33.30
mailto:jeff at elda.fr
http://www.icp.grenet.fr/ELRA/home.html