31.1411, Review: Anthropological Linguistics; Language Documentation; Typology: Jones (2019)
The LINGUIST List
linguist at listserv.linguistlist.org
Tue Apr 21 01:43:26 UTC 2020
LINGUIST List: Vol-31-1411. Mon Apr 20 2020. ISSN: 1069 - 4875.
Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================
Date: Mon, 20 Apr 2020 21:42:56
From: Michael Maxwell [mmaxwell at umd.edu]
Subject: Endangered Languages and New Technologies
Book announced at http://linguistlist.org/issues/30/30-3430.html
EDITOR: Mari C. Jones
TITLE: Endangered Languages and New Technologies
PUBLISHER: Cambridge University Press
YEAR: 2019
REVIEWER: Michael B. Maxwell, University of Maryland
SUMMARY
Twenty years ago, I reviewed a publication coming out of the First
International Conference on Language Resources and Evaluation (LREC). While
that conference ostensibly targeted smaller languages, I remarked in my review
that their notion of ''smaller'' appeared to be restricted to the largest
hundred or so languages of the world, minus English, Modern Standard Arabic,
and Mandarin Chinese.
The situation has improved today, with entire conferences looking at ways to
study and preserve endangered languages, in many cases relying on
computational analysis. Nevertheless, endangered languages are hardly a topic
of great interest in the field of computational linguistics, with a few
exceptions. This book is one of those exceptions. Its preface, by Mari C.
Jones, states that it aims for ''a practicable synthesis of old and new
methodologies'', where ''old'' is presumably pencil-and-paper techniques, and
''new'' is computationally informed (if not necessarily driven) technologies.
The synthesis is viewed from two directions: new technologies for description
and analysis, and how new technologies are being used for language
revitalization (but see my comment at the end of this review).
Nicholas Ostler provides an ''Introduction: Endangered languages in the new
multi-lingual order per genus et differentiam.'' The claim here is that ''the
world will lose its motivation to maintain English as a convenient lingua
franca just as automatic language conversion becomes...realistic.'' Ostler
expresses the hope that automatic language conversion (i.e. Machine
Translation, MT) will extend to smaller languages--but in the end, the hope
that this will breathe new life into those languages has been dashed by their
lack of computer-readable data, particularly parallel text. Ostler briefly
describes approaches to solving this problem, such as finding parallel corpora
between more documented and less documented related languages, so that their
similarities and differences (the ''genus et differentiam'' of the title)
could in theory more easily provide a way to discover the properties of the
less documented language, and thereby build MT systems (presumably by pivoting
on the more documented language, although Ostler does not go into detail
here). There has indeed been some research on this approach (which I will
come back to at the end of this review); but even this is not enough, if only
because not all small languages have better documented relatives. Bilingual
dictionaries could provide another help for MT; but it has been parallel text,
not bilingual dictionaries, that jump-started machine translation in the last
couple of decades. In sum, while I find it quite possible that English will be
displaced as a lingua franca in some future, and that in that future there
will be MT systems for many more languages, in my view that will not come soon
enough to help languages that are endangered today.
Aimée Lahaussois writes about ''The Kiranti comparable corpus: A prototype
corpus for the comparison of Kiranti languages and mythology.'' This is a
study of how interlinear text in three closely related languages of Nepal can
be aligned across documents in the different languages, and then used for
comparative study of the languages. At present, there is only a single
narrative transcribed from a single speaker in each of three languages, but it
provides a proof of concept.
Sjef Barbiers' ''European Dialect Syntax: Towards an infrastructure for
documentation and research of endangered dialects'' argues that since the
boundary between language and dialect is not a principled one, it is
reasonable to make an effort to document endangered dialects. Additionally,
since the variation between dialects is (by definition) smaller than that
between distinct languages, such variation may shed light on individual
parameters of variation, such as syntactic parameters. The age-old issue of
elicitation vs. pure corpus approaches arises here, however, since some syntactic
variation (the examples are from Dutch dialects) is quite rare, making
directed elicitation necessary in order to obtain sufficient examples for
study.
Hugh Patterson writes about ''Keyboard Layouts: Lessons from the Me'phaa and
Sochiapam Chinantec designs.'' Both these Mexican languages are written in
Latin scripts, but with some diacritics or other combining characters that are
not found in Spanish. While smart phone input is mentioned, most of the
attention is given to physical keyboards in Windows and Mac systems (Linux is
mentioned in passing). One of the problems which Patterson discusses is that
of Unicode normalization, although he does not use that term (which is well
known in the literature about Unicode).
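To make the normalization issue concrete, here is a minimal Python sketch (not
taken from the chapter) using the standard library's unicodedata module. The
same visible letter can be stored either as one precomposed code point or as a
base letter plus a combining mark, and the two forms only compare equal after
both are normalized. The helper name same_text is an illustrative assumption.

```python
import unicodedata

# "a with acute accent" stored two ways: one precomposed code point versus
# a base letter followed by a combining acute accent.
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" + COMBINING ACUTE ACCENT


def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# A naive comparison treats the two spellings as different strings.
raw_equal = precomposed == decomposed

# After normalization (NFC composes, NFD decomposes) they match.
normalized_equal = same_text(precomposed, decomposed)

# NFD splits the precomposed letter into base letter + combining mark.
nfd_length = len(unicodedata.normalize("NFD", precomposed))
```

A keyboard layout that emits one form while a dictionary or spell checker
stores the other will produce text that looks identical on screen but fails to
match in searches, which is why a consistent normalization form matters.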
Matt Coler and Petr Homola describe a rule-based machine translation system
they are developing for translating from Aymara into Spanish and English. The
MT system relies on a number of theories, including a version of Lexical
Functional Grammar (LFG) using dependency parsing, and augmented with
additional structures. But the article is too short to provide an
understanding of how all the structures are derived (automatically?) from the
input sentences, and how the MT system pieces together target language output
while referring to these multiple structures. For example, while Aymara is an
agglutinating language, and the MT system must deal with considerable
derivational as well as inflectional morphology, the morphological parser is
mentioned in a single short paragraph, which does not explain how its rules
are written, what technology is used for parsing (a finite state transducer?),
how much ambiguity there is at the morphological output stage, or how the
syntactic parser deals with this ambiguity.
The article closes by claiming a 12.1% Word Error Rate (WER) from Aymara to
English. While WER is sometimes used to evaluate MT systems, BLEU score is
more commonly used for this purpose (whereas WER is used for speech
recognition; see Cer, Manning and Jurafsky 2010 for a discussion of MT
metrics). It is also not clear how the 12.1% WER was measured (on how large a
corpus, for example), much less what it means in this case (for example, the
size of the vocabulary could have an effect on WER).
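For readers unfamiliar with the metric, WER is conventionally the word-level
edit distance (substitutions, deletions and insertions) between a system's
output and a reference, divided by the number of words in the reference. The
Python sketch below illustrates that standard definition only; it is not the
authors' evaluation code, and the function name is invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)
```

Because the score is computed against one particular reference, a translation
that is acceptable but worded differently is penalized, and the figure can
exceed 100% when the output is much longer than the reference. Both points
bear on how a single reported WER should be read.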
Dorothee Beermann describes a data management and analysis system for
endangered languages (although the languages used as examples are not
endangered at present). The system, called TypeCraft (TC), emphasizes the
production and display of interlinear glossed text (IGT) from text data (audio
and video data is apparently planned for the future). TC differs from such
tools as SIL's Fieldworks Language Explorer in that annotation and display can
both be done over the web. Unfortunately, as with the preceding article, many
details are unclear. For example, while collaborative annotation is listed as
supported, and eleven annotators are said to have worked on IGT using this
system, it is unclear whether simultaneous annotation by different annotators
is possible, or whether multiple annotators must coordinate non-overlapping
work times.
This project uses an LFG parser and an HPSG (Head-Driven Phrase Structure
Grammar) parser, such that ''the subsequent linguistic analysis becomes linked
to the material on which it is built.'' But it is unclear how this linking
works: can these parsers be integrated into TC? Or is the data in TC exported
in some form usable by the parser, then the parser is run on that data, and
the result is imported back into TC? Also unclear is whether the annotation
of IGT results in a dictionary of morphemes, and whether the system offers
previous analyses of a particular word when that same wordform is encountered
in later annotation, which would speed up annotation and encourage
consistency.
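To illustrate the last point, the following Python sketch shows one simple way
an annotation tool could offer previous analyses of a wordform. It is purely
hypothetical and says nothing about how TypeCraft actually works; the class
name, its methods, and the placeholder wordforms and glosses in the usage note
are all invented for illustration.

```python
from collections import Counter, defaultdict


class AnalysisMemory:
    """Remembers the glosses annotators have already given each wordform."""

    def __init__(self):
        # Maps a lowercased wordform to a count of each gloss assigned to it.
        self._seen = defaultdict(Counter)

    def record(self, wordform, gloss):
        """Store one annotation decision for later reuse."""
        self._seen[wordform.lower()][gloss] += 1

    def suggest(self, wordform):
        """Return earlier glosses for this wordform, most frequent first."""
        counts = self._seen.get(wordform.lower(), Counter())
        return [gloss for gloss, _ in counts.most_common()]
```

For example, after memory.record("wordA", "house-PL") is called twice and
memory.record("wordA", "home-PL") once, memory.suggest("wordA") lists
"house-PL" before "home-PL", while an unseen wordform yields an empty list.
Ranking by frequency nudges annotators toward the analysis already in use,
which is the consistency benefit mentioned above.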
Russell Hugo presents some ''fundamental questions for endangered language
learning technology projects'', the answers to which should drive projects to
produce pedagogical materials for the teaching of endangered languages in
revitalization projects. He proposes the use of a Learning Management System
(LMS) such as Moodle (https://moodle.org/). While such a system generally
requires internet accessibility, and while such accessibility is increasingly
available, it precludes the use of the system in some parts of the world where
endangered languages are found. Hugo points out that the overwhelming
advantage of developing a language learning curriculum in such a pre-existing
tool is that it removes the need to re-invent the wheel by providing the
software framework. Moreover, software changes; Moodle will not always be the
best tool for teaching languages. But by using a tool like Moodle, which
provides for the export of lessons, one can future-proof the lesson content.
Bernard Bel and Médéric Gasquet-Cyrus argue in favor of not simply preserving
data about endangered languages, but curating it, by which they mean adding at
least enough metadata to make the resources findable, defining usage rights,
and possibly labeling the data (or chunks of data) with location identifiers
and linguistic concepts. (I would have thought that everyone did this, but
apparently not.) They illustrate using their own efforts to document
endangered varieties (dialects) of Occitan, primarily with audio and video,
but also with information about ''informants'' (their term), photographs, and
data about the audio and video collection methods. (The source they point to
for linguistic labels, www.isocat.org, is unfortunately defunct, a problem
that recurs distressingly frequently.) They discuss in some depth the legal
and ethical constraints on the collected data, e.g. protecting ''problematic''
parts of sound files by replacing those parts with humming so as to preserve
the prosody. (Again, details would be helpful: must the prosodically-based
humming be done by humans, or is it possible to generate this automatically?)
Anthony Scott Warren and Geraint Jennings document efforts to preserve
Jèrriais, a language spoken on the island of Jersey between France and
England. The language has been in print for over two centuries, but English
has been taking over domains of use for a hundred years. The authors then
turn to developments of the last few decades which have been used to promote
the use of Jèrriais: initially, internet web pages, and more recently, smart
phones, Twitter, Facebook, YouTube, and so forth. These tools have provided
ways both to promote the use of Jèrriais and to share ideas with groups
trying to maintain other endangered languages.
Tjeerd de Graaf, Cor van der Meer, and Lysbeth Jongbloed-Faber document
efforts to sustain West Frisian (Netherlands). This language has hundreds of
thousands of native speakers, and many more second language speakers. The
language enjoys status as the official second language of the Netherlands, and
has played an official role in primary education for over a century.
Nevertheless, UNESCO considers it to be ''vulnerable''. The article briefly
describes the many things that have been done over the past 20 years to
maintain the language, ranging from TV shows to Twitter accounts. All these
provide a mine of ideas that other languages could try, although I suspect
most truly endangered languages could only wish for the budgets and support
available for West Frisian (and likewise Jèrriais).
Cecilia Odé writes about a project for Tundra Yukaghir (a language of
Siberia). In a predicament much more like that of most other endangered
languages than that of West Frisian, the only fluent speakers of Yukaghir are
elderly, while teachers--although motivated--are not fluent; and support from
the government is less than desired. The project developed an academic
grammar, recordings of the spoken language and of songs, and courseware for
teachers. Recordings were made in both audio and video form, and the
discussion of this work may provide useful ideas to those working with other
endangered languages.
Unlike virtually all other sign languages, American Indian Sign Language
(AISL) served as a language of communication among non-deaf speakers of
diverse, even unrelated, languages. But with the dominance of English, AISL
is disappearing. Jeffrey Davis describes efforts to document and describe
this unique language, combining the digitization of historically collected
materials with ''born-digital'' documentation.
EVALUATION
This book is not intended as a handbook of new technologies for endangered
languages. There are no descriptions here of methods of elicitation suited to
digital methods; no papers about corpus collection or lexicography or
interlinearization or grammatical description. Rather its purpose is to
describe a set of new(ish) ideas in the use of technology to document and
describe endangered languages. Some of these new directions may be fruitful,
while others may prove less so.
Some new directions are not covered at all. For example, there is virtually
no discussion here of the use of machine learning for language documentation.
Examples of ways in which machine learning might be used include Automatic
Speech Recognition (ASR), dictionary induction, parser induction, and the
development of ways to communicate with speakers of endangered languages in
emergency situations--as in work coming out of the US DARPA Low Resource
Languages for Emergent Incidents (LORELEI) project. To be sure, all of these
are experimental technologies: they are anything but mature, and many current
machine learning techniques require larger quantities of data (particularly,
annotated data) than there will ever be for most endangered languages. That
said, there is research into ways to reduce that data requirement, e.g. by
using cross-lingual alignment (as briefly mentioned by Nicholas Ostler in the
introduction to this book), as well as research into ways to collect more
transcribed data (see e.g. Bird 2010 for one such method).
While the chapters are organized into two sections, namely Creating New
Technologies for Endangered Languages, and Applying New Technologies, it is
not clear that the chapters actually fall neatly into this dichotomy. Hugo's
chapter, for example, is in the section on creating new technologies, but it
is actually a call for using an existing technology, Moodle.
The endangered languages used as case studies are oddly skewed: while most
endangered languages are to be found outside of Europe (as a glance at the map
in http://www.endangeredlanguages.com/ will show), four of the eleven
languages discussed in this book are found in Europe, and two more in the
United States or Canada. This is doubtless due to the availability of
speakers of those languages in close proximity to linguists, and perhaps also
to the availability of technology in these regions (something that Bel and
Gasquet-Cyrus allude to).
In sum, if you come to this book expecting a handbook, you will be
disappointed. But if you come looking for new ideas, you may find useful
ones. As for the lack of studies here putting machine learning to work, it
is the nature of books like this to be superseded--and I'm sure any advocate
of documentation and description of endangered languages will join me in
hoping that day will come soon.
REFERENCES
Bird, Steven. 2010. ''A Scalable Method for Preserving Oral Literature from
Small Languages.'' Vol. 6102, pp. 5-14. doi:10.1007/978-3-642-13654-2_2.
Cer, Daniel, Christopher D. Manning, and Daniel Jurafsky. 2010. ''The Best
Lexical Metric for Phrase-Based Statistical MT System Optimization.'' Pp.
555-563 in ACL 2010.
https://nlp.stanford.edu/pubs/best_lexical_metric_statmt.pdf.
ABOUT THE REVIEWER
Michael Maxwell is a research scientist at the University of Maryland, with
experience in language documentation and description, and computational
linguistic methods including computational lexicography and morphological
parsing. In the past, he developed an appreciation for minority and endangered
languages working in Ecuador and Colombia with SIL International, and for
other low density languages while working with the Linguistic Data Consortium
at the University of Pennsylvania.