26.2650, Review: Discipline of Ling; Discourse; Pragmatics; Text/Corpus Ling: Ruhi, Wörner, Schmidt, Haugh (2014)

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Wed May 27 17:55:30 UTC 2015


LINGUIST List: Vol-26-2650. Wed May 27 2015. ISSN: 1069 - 4875.

Subject: 26.2650, Review: Discipline of Ling; Discourse; Pragmatics; Text/Corpus Ling: Ruhi, Wörner, Schmidt, Haugh (2014)

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Sara Couture <sara at linguistlist.org>
================================================================


Date: Wed, 27 May 2015 13:54:27
From: Yolanda Rivera Castillo [riveray at gmail.com]
Subject: Best Practices for Spoken Corpora in Linguistic Research

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36006197

Book announced at http://linguistlist.org/issues/25/25-3308.html

EDITOR: Şükriye Ruhi
EDITOR: Michael Haugh
EDITOR: Thomas Schmidt
EDITOR: Kai Wörner
TITLE: Best Practices for Spoken Corpora in Linguistic Research
PUBLISHER: Cambridge Scholars Publishing
YEAR: 2014

REVIEWER: Yolanda Rivera Castillo, University of Puerto Rico-Rio Piedras Campus

Review's Editor: Helen Aristar-Dry

SUMMARY

The book entitled “Best Practices for Spoken Corpora in Linguistic Research”
describes projects of spoken data documentation, as well as current standards
in the field. It includes projects documenting diatopic varieties of German
and English (such as British, American, and Australian English), French,
Turkish, Czech, Russian, Portuguese, Catalan, Swedish, Danish, Norwegian,
Faroese, and Basque. It represents a variety of databases, from those encoding
formal and academic registers to informal varieties of some languages
(Turkish). The goal of the book is to provide standards on data documentation,
curation, processing, annotation, and preservation, such that researchers can
retrieve information in formats that are standardized, easy to use, and
accessible. It also aims to describe methods of corpus construction and
sharing. It consists of two major sections and fourteen (14) chapters,
including an introductory chapter (Chapter 1) by Şükriye Ruhi, Michael Haugh,
Thomas Schmidt, and Kai Wörner. The introduction summarizes the content and
general goals of the volume, and provides a justification for, and a brief
description of, the chapters that follow.

Since the book includes mostly projects involving spoken corpora, it might be
of interest to linguists studying discourse analysis, language variation,
conversation analysis, phonology, and pragmatics. There is great emphasis on
the creation of metadata, data conservation and segmentation, all key issues
in these fields. Only one chapter (Chapter 7) describes a project on speech
data rather than spoken data (see definition below). 

The first section of the book — entitled “Case Studies on Corpora Design,
Annotation and Analysis” — describes five projects. Chapter 2 (Adriana
Slavcheva and Cordula Maißner) summarizes the main characteristics of the
GeWiss Corpus, a corpus of German academic spoken data. The following chapter
(3) (Şükriye Ruhi and E. Eda Işik Taş) discusses the components of the STC
(Spoken Turkish Corpus) and STCDC (Spoken Turkish Cypriot Dialect Corpus)
corpora encompassing informal spoken varieties of Turkish. These corpora
include information about allophonic variation. Chapter 4 (Theodosia-Soula
Pavlidou, Charikleia Kapellidi, and Eleni Karafoti) describes the Corpus of
Spoken Greek (CSG), which comprises informal conversations (including phone
exchanges) and their orthographic transcriptions. Chapter 5 (Ines Rehbein,
Sören Schalowski and Heike Wiese) explains that KiDKo consists of
syntactically annotated data on the Kiezdeutsch contact variety (German) used
by teenagers in multiethnic communities. In Chapter 6 (Seongsook Choi and
Keith Richards), the last one in this section, the authors explain the
features of the MICASE database, which encodes metadata on conversations
between English speakers. 

The second section of the book — entitled “Discussions on Best Practices in
Spoken Corpora” — includes eight chapters. In Chapter 7 (Pavel Skrelin and
Daniil Kocharov), the project managers produced a speech database for Russian.
Lucie Benešová, Martina Waclawičová, and Michal Křen describe in Chapter 8 a
database on spoken Czech. Chapter 9 (Oliver Ehmer and Camille Martinez)
provides information on a database of naturally occurring spoken French in
twenty-four communicative areas around the world. Chapter 10 (Peter M. Fisher
and Andreas Witt) distinguishes between “data providers”, “data compilers”,
“data curators”, and “data consumers” in data preservation for the Datenbank
für Gesprochenes Deutsch, a project that includes preservation of data from
previous projects on the German language. Chapter 11 (Sebastian Drude, Paul
Trilsbeek, Han Sloetjes, and Dan Broeder) discusses issues of privacy and the
ethical treatment of data providers for the DOBES corpus. Chapter 12
(Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, and Kai Wörner) describes a
multilingual corpus encompassing data from 1999-2011. Chapter 13 (Simon
Musgrave, Andrea C. Schalley, and Michael Haugh) turns to the Australian
National Corpus, which includes a variety of data with different types of
annotation. The authors devised an “interlingua ontology” to represent “the
knowledge embodied by all the annotations of all the collected data” (p. 226).
The last chapter (Thomas Schmidt) sketches the history of annotation
conventions, highlighting the conventions followed by the Hamburg Centre for
Language Corpora (HZSK).

In summary, the book provides an overview of numerous spoken corpus
projects and discusses the main issues related to the standardization,
creation, annotation, copyright, and conservation of these data. It provides
clear explanations for the non-specialist, and discusses key issues of
interest for the specialist as well. It is a good introduction for those
pursuing projects on spoken data documentation. The book aims at reporting on
developing standards and common practices in the field.

EVALUATION

The project descriptions include information on the status of corpus
creation activities, web addresses for corpora, tools and protocols for data
annotation, data curation, and data dissemination. The initial and final
chapters of the book sum up the main issues that constitute the backbone of
the collection. These projects center on annotation standards and tools, a key
issue since the publication of Bird and Liberman’s (2001) paper on the
annotation graph framework. Many of these projects also select the same
annotation tools (such as EXMARaLDA), as well as the same approach to data
encoding, such as XML markup.

The introductory chapter (1) draws a distinction between “speech” and “spoken
corpora”, indicating that the former are intended as tools for phonological
analysis, while the latter have a different range of uses and aim at
representing “language as used by its speakers in naturally occurring
communicative contexts” (p. 3). However, both fields share many of the same
goals and standards. Indeed, some of the contributors to this volume also
contributed to a collection on speech corpora published in the same year
(Durand, Gut and Kristoffersen 2014).

Chapter 1 (pp. 5-6) also makes a distinction between recordings and data,
parallel to the difference between primary and non-primary data described by
Himmelmann (2012, p. 188). The distinction between raw, primary and structural
data is key to language documentation. In fact, as stated by Himmelmann,
linguistic analysis depends on these distinctions; and despite many
misconceptions about the role of documentary linguistics, the creation of
different types of data in this field is key to linguistic analysis: “[…]
documentary linguistics has the important task of making descriptive
generalizations replicable and accountable, and in this sense it provides the
empirical basis for many branches of linguistics” (Himmelmann, 2012, p. 187).

The first part of the book — “Case Studies on Corpora Design, Annotation and
Analysis” — deals with data processing and selection. The description of the
projects shows that the goals of individual projects result in vast
differences among them. For example, the communicative situations documented
vary greatly, ranging from formal academic presentations (Chapter 2) and radio
and telephone communications, to spontaneous conversations recorded without
the interviewers’ influence (Chapter 9, an ecological approach). One important
question not addressed by the editors is which of these approaches is more
effective in representing “natural” exchanges between speakers.

Chapter 3 describes an important issue related to data processing: how to
handle segmentation of spoken corpora. Some projects segment recordings into
equal chunks based on time units. The resulting data might not be suitable
for discourse analysis, since time units replace discourse units and ignore
their boundaries.
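
The effect of time-based segmentation can be illustrated with a minimal
Python sketch. It is not taken from the book or from any of the projects it
describes: the utterances, their time stamps, the five-second chunk length,
and the names Utterance, fixed_chunks and cut_utterances are all hypothetical
and serve only to show how equal time windows can cut across an utterance.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def fixed_chunks(total_duration, chunk_len):
    """Return (start, end) windows of equal length covering the recording."""
    windows = []
    t = 0.0
    while t < total_duration:
        windows.append((t, min(t + chunk_len, total_duration)))
        t += chunk_len
    return windows


def cut_utterances(utterances, windows):
    """Return the utterances that straddle an internal window boundary."""
    boundaries = [end for _, end in windows[:-1]]
    return [u for u in utterances
            if any(u.start < b < u.end for b in boundaries)]


# Invented example data (hypothetical, for illustration only).
utterances = [
    Utterance("A", 0.0, 4.2, "so where did you go yesterday"),
    Utterance("B", 4.5, 11.3, "we went to the market and then to my aunt's"),
    Utterance("A", 11.6, 13.0, "oh nice"),
]

windows = fixed_chunks(total_duration=13.0, chunk_len=5.0)
cut = cut_utterances(utterances, windows)

for u in cut:
    print(f"Speaker {u.speaker} ({u.start}-{u.end} s) is split by a window boundary")
```

In this invented example, speaker B's turn is divided by the window
boundaries, so a chunk-by-chunk analysis would never see the turn as a single
unit. Segmenting at utterance or turn boundaries avoids this, at the cost of
segments of unequal length.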

An additional issue is the amount of data collected by these projects.
Some of the papers do not address how much data are required, or how much
evidence is necessary, to analyze certain aspects of language use or the
linguistic system. Most projects are data driven and collect a large number
of words, but it is not always clear why a specific amount of data is
necessary. An explanation of these issues would help readers make informed
decisions during data collection. Additionally, except for Chapter 5,
there are few references to the role of corpus creation in linguistic
analysis and linguistic theory.

This section of the book also examines issues related to data management, such
as automated transcription and the choice between orthographic and phonetic
transcription. In fact, orthographic transcription dominates in project
annotation.

The second section of the book places individual corpus creation within a
larger context of ethical issues regarding data management, and the
availability of data for future generations. For example, Chapter 11 discusses
ethical issues in data use and accessibility, particularly with the purpose of
shielding data providers. It describes a project that allows four levels of
data accessibility to protect speakers and the copyright of collected data (p.
201). 

Additionally, one of the most important issues discussed in the second section
of the book is data preservation. The role of data curators and the
differences between data migration and emulation are fundamental for those
interested in long-term archiving of corpora. Chapter 10 deals with this issue
and with the issue of software availability for corpora creation. Table 10-1
(pp. 166-167) is remarkably useful as a summary of descriptions of these
tools. As indicated, the book also places great emphasis on data “curation”
and migration from one format to another to preserve these data for future
generations. This issue is critical, particularly in the case of endangered
languages. Bird and Simons (2003) stress this point in corpora creation:

“Funded documentation projects are usually tied to software versions, file
formats, and system configurations having a lifespan of three to five years.
Once this infrastructure is no longer tended, the language documentation is
quickly mired in obsolete technology. The issue is acute for endangered
languages. In the very generation when the rate of language death is at its
peak, we have chosen to use moribund technologies, and to create endangered
data” (p. 557).

On the other hand, with regard to the type of data collected, one wonders
whether the book should place more emphasis on the documentation of
lesser-known languages. The book describes documentation efforts in a variety
of languages. Of these, Faroese is the language with the smallest number of
speakers. Documenting lesser-known languages can contribute to understanding
differences and similarities in conversation exchanges across cultures. The
description of projects on poorly studied languages can provide linguists
interested in documenting these with the knowledge required to start a
project. Documentation and extensive study of poorly studied and endangered
languages are very important for numerous reasons (see Krauss 1992).

Another concern in this section is the incorporation of data from different
fieldwork projects into a single corpus. Chapter 13 describes a project that
intends to create an interlingua ontology to represent “the knowledge embodied
in all the annotations of all collected data”. The authors built this ontology
inductively, since they are collecting materials from many different
projects that followed different methodologies in data gathering.

A few issues are not discussed in detail in the book, such as the role of
native speakers in data collection, segmentation, and annotation, even though
native speakers have participated in the creation of many projects described
in the book.

Finally, this book is an important contribution to the documentation of
ongoing projects on corpus creation. The authors provide detailed
descriptions that offer the reader enough information on current standards,
project content, and the rationale behind decision making in corpus
linguistics. This book is particularly useful for those creating large
databases for a diverse body of languages.

REFERENCES

Bird, Steven and Gary Simons. 2003. Seven Dimensions of Portability for
Language Documentation and Description. Language 79(3). 557-82.

Durand, Jacques, Ulrike Gut, and Gjert Kristoffersen. 2014. The Oxford
Handbook of Corpus Phonology. Oxford: Oxford University Press.

Himmelmann, Nikolaus P. 2012. Linguistic data types and the interface between
language documentation and description. Language Documentation and
Conservation 26. 187-207.

Krauss, Michael. 1992. The World's Languages in Crisis. Language 68(1). 4-10.


ABOUT THE REVIEWER

Yolanda Rivera-Castillo is currently a professor at the University
of Puerto Rico, Río Piedras campus. She has taught at different institutions
in the US, and has chaired linguistics programs. Her research interests include
the study of the Papiamentu prosodic system, as well as nasalization and vowel
harmony in Papiamentu and other Atlantic Creoles. She is currently working on
a project on language documentation and has published papers on Creole
phonology as well as on the Phonology-Syntax interface.


