25.3396, Review: Computational Linguistics; Historical Linguistics: McGillivray (2013)

Thu Aug 28 18:36:28 UTC 2014

LINGUIST List: Vol-25-3396. Thu Aug 28 2014. ISSN: 1069 - 4875.

Moderators: Damir Cavar, Indiana U <damir at linguistlist.org>
            Malgorzata E. Cavar, Indiana U <gosia at linguistlist.org>

Reviews: reviews at linguistlist.org
Anthony Aristar <aristar at linguistlist.org>
Helen Aristar-Dry <hdry at linguistlist.org>
Mateja Schuck, U of Wisconsin Madison

Homepage: http://linguistlist.org

Editor for this issue: Malgorzata Cavar <gosia at linguistlist.org>
================================================================  

Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
          http://multitree.linguistlist.org/

Date: Thu, 28 Aug 2014 14:35:50
From: Onna Nelson [onna.nelson at gmail.com]
Subject: Methods in Latin Computational Linguistics

Book announced at http://linguistlist.org/issues/24/24-4956.html

AUTHOR: Barbara McGillivray
TITLE: Methods in Latin Computational Linguistics
SERIES TITLE: Brill's Studies in Historical Linguistics
PUBLISHER: Brill
YEAR: 2013

REVIEWER: Onna Adele Nelson, University of California, Santa Barbara

SUMMARY

In her book, ''Methods in Latin Computational Linguistics'', Barbara
McGillivray builds on Piotrowski (2012), offering historical linguists
basic training in quantitative and corpus methods, while offering
computational linguists the interesting challenge of exploring historical data
through the use of several case studies. Chapters 1 and 2 give a general
overview of the fields of Latin linguistics, computational linguistics, and
their intersections. Chapter 3 covers the creation of a verb valency lexicon,
which is a valuable resource for future studies. Chapters 4 and 5 cover a case
study in selectional preferences and argument structure; the former details
the linguistic theory while the latter covers the computational and
statistical methods. Chapters 6 and 7 cover another case study on Latin
preverbs; again, the former details the linguistics while the latter details
the computer science and mathematics. Finally, chapter 8 ties everything
together, defining ''Latin computational linguistics'' as a unified field
which needs expertise from a variety of interdisciplinary scholars.

Chapter 1, ''Historical Languages, Corpora, and Computational Methods'',
situates the book for both historical linguists and computational linguists.
This chapter overviews some of the challenges of Latin for computational
linguists, such as the fact that spoken Latin is mostly unknowable, the
dataset is limited because there are no living native speakers, and the
language is morphologically rich with flexible word order. Additionally, the
author introduces the reader to some basic concepts in computational
linguistics, such as corpus annotation, automatic parsing, statistical
significance, and the creation of a well-balanced corpus, explaining how each
of these might benefit Latin scholars. McGillivray defines ''Latin'' and
''language'' as a certain subset of all the available data, for the purposes
of her case studies, and then outlines the remainder of the book by previewing
the case studies covered in later chapters.

Chapter 2, ''Computational Resources and Tools for Latin'', overviews the
currently available corpora and programs for Latin, as well as the steps
necessary to create new tools and resources for Latin. The author points out
that although the Latin Index Thomisticus (Busa 1980) was the first
electronically available corpus in any language, Latin has not kept up with
modern languages such as English, which have the benefit of native speakers
and a market demand for applications such as machine translation, which in
turn drives the field of computational linguistics in those languages. Latin
cannot match English in the scale and availability of resources such as
digitized corpora, automatic annotation tools, part-of-speech taggers,
treebanks, and lexical databases, and the author seeks to partially remedy the
situation through her
work.

Chapter 3, ''Verbs in Corpora, Lexicon ex machina'', exemplifies a
computational approach to Latin which solves one of the problems introduced in
Chapter 2: the lack of a verb valency lexicon. The concepts of verb valency,
transitivity, and semantic roles are introduced. Next, McGillivray discusses
the advantages of a corpus-based distributional approach to semantics over
the traditional lexicographic approach to verb valency, including detailed
usage-based frequency information and the absence of the selection bias that
arises when a lexicographer is forced to choose only one or two examples due to the
space limitations of traditional dictionaries. The chapter then overviews how
to work with the Prague Dependency Treebank using MySQL queries to create the
valency lexicon. One challenge of Latin and verb valency is exemplified by the
fact that Latin allows pro-drop: in order to count all arguments of a verb,
one must account for the subject by extracting the person-marking from the
verb. Once the verb valency lexicon is created, a number of additional studies
can be carried out. For example, the author demonstrates how the valency
lexicon allows one to test diachronic trends, finding that VO word order is
slightly more common in more modern Latin while OV word order is slightly more
common in older Latin.
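
To make the pro-drop point concrete, here is a minimal sketch in Python rather than the MySQL queries described in the book. The toy sentences, the tuple layout, the ''Sb''/''Obj''/''Pred'' relation labels, and the function name valency_frames are all assumptions made for illustration; they are not McGillivray's data, schema, or code.

```python
from collections import Counter, defaultdict

# Hand-built toy sentences for illustration only (not from any real treebank).
# Each token: (token_id, lemma, pos, head_id, relation, is_finite)
SENTENCES = [
    # "puella rosam amat": overt subject and overt object
    [(1, "puella", "NOUN", 3, "Sb", False),
     (2, "rosa", "NOUN", 3, "Obj", False),
     (3, "amo", "VERB", 0, "Pred", True)],
    # "rosam amat": the subject is dropped and only marked on the verb
    [(1, "rosa", "NOUN", 2, "Obj", False),
     (2, "amo", "VERB", 0, "Pred", True)],
    # "venit": a bare finite verb
    [(1, "venio", "VERB", 0, "Pred", True)],
]

# Relations counted as verb arguments in this sketch (an assumption).
ARGUMENT_RELATIONS = {"Sb", "Obj"}


def valency_frames(sentences):
    """Count argument frames per verb lemma, adding an implicit subject
    whenever a finite verb has no overt subject dependent."""
    lexicon = defaultdict(Counter)
    for sentence in sentences:
        for tok_id, lemma, pos, _head, _rel, is_finite in sentence:
            if pos != "VERB":
                continue
            # Collect the argument relations of this verb's direct dependents.
            args = [rel for (_, _, _, head, rel, _) in sentence
                    if head == tok_id and rel in ARGUMENT_RELATIONS]
            # Pro-drop: the person marking on a finite verb stands in for
            # a subject that is not expressed as a separate word.
            if is_finite and "Sb" not in args:
                args.append("Sb(implicit)")
            lexicon[lemma][tuple(sorted(args))] += 1
    return lexicon


lexicon = valency_frames(SENTENCES)
for lemma, frames in sorted(lexicon.items()):
    print(lemma, dict(frames))
```

Without the implicit-subject step, the second and third sentences would be counted as frames lacking a subject, which would understate the valency of both verbs.
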
Chapter 4, ''The Agonies of Choice: Automatic Selectional Preferences'',
outlines a case study which makes use of the valency lexicon created in
Chapter 3. This chapter covers the linguistic background behind concepts such
as selectional preferences, argument structure, semantic features, and
animacy. The benefits of a computational approach are also outlined: manual
coding of these features is costly and time-consuming, but automatic
computational methods can complete this process quickly and accurately.
Semantic similarity can be measured computationally as well, either through
synonym resources such as WordNet or through distributional approaches which
make use of relative frequencies and word collocations. Both of these
approaches, the knowledge-based WordNet and the knowledge-free distributional
approach, are tested against a ''gold standard''. Although normally the ''gold
standard'' is made by native speakers, Latin requires that the gold standard
is made from a separate test corpus.
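
As a rough illustration of the knowledge-free distributional idea, the sketch below compares nouns by the verbs that govern them, using cosine similarity over raw co-occurrence counts. The (verb, noun) pairs are invented, and cosine over raw counts is an assumed stand-in: the review does not say which similarity measure or weighting the book uses.

```python
import math
from collections import Counter

# Invented (verb lemma, object noun lemma) pairs, for illustration only.
PAIRS = [
    ("bibo", "vinum"), ("poto", "vinum"),
    ("bibo", "aqua"), ("bibo", "aqua"), ("poto", "aqua"),
    ("lego", "liber"),
]


def context_vectors(pairs):
    """Map each noun to a Counter of the verbs it occurs with."""
    vectors = {}
    for verb, noun in pairs:
        vectors.setdefault(noun, Counter())[verb] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = context_vectors(PAIRS)
print(round(cosine(vectors["vinum"], vectors["aqua"]), 3))
print(cosine(vectors["vinum"], vectors["liber"]))
```

Nouns that fill the same argument slots of the same verbs end up with similar vectors, which is the intuition behind grouping them into classes of preferred arguments without consulting a resource such as WordNet.
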
Chapter 5, ''A Closer Look at Automatic Selectional Preferences for Latin'',
covers the statistical and computational methods as well as some of the
technical details behind the case study outlined in Chapter 4. This chapter
covers the structure of the synsets found in WordNet, the organization of data
into a matrix of variables, as well as the concepts of vector space and
clustering algorithms. Examples of different clustering algorithms are
illustrated with charts and dendrograms, and the benefits and drawbacks of
various techniques are discussed. Some probabilistic models as well as the
variety of statistical tests carried out on the data are also discussed.

Chapter 6, ''A Corpus-Based Foray into Latin Preverbs'', outlines the typical
corpus-based approach to linguistic hypothesis formation and testing. This
chapter then tackles the Latin preverb system, which is an interesting test
case in diachronic morphosyntax. After covering some of the typological
background of analytic and synthetic languages, as well as the known facts
about the evolution of Latin into the modern Romance languages, this chapter
delves into another case study using Latin corpora, which seeks to replicate
the work done by hand by Bennett (1914). The hypotheses tested include whether
preverbs correspond to various Latin cases or prepositions. A multivariate
analysis is conducted to test the relationship between linguistic features
such as each preverb, the prepositional phrase, features of the noun such as
case or animacy, features of the verbs such as argument structure, selectional
preference, or semantics, as well as other variables such as the author of the
text, the era in which the text was written, and the genre of the text. The
results suggest what was already known: Latin underwent grammaticalization
from an inflectionally rich language to the more analytic Romance languages.
However, the author argues that because this study is replicable,
statistically significant, and does not rely on selection biases inherent in
choosing examples by hand, it is an improvement on Bennett's (1914) work.

Chapter 7, ''Statistical Background to the Investigation on Preverbs'', covers
the statistical and computational side of the study outlined in Chapter 6.
Topics include basic hypothesis testing, the concept that correlation does not
imply causation, and some of the theories and formulae behind linear
regression models, correspondence analysis, multiple correspondence analysis,
and singular value decomposition. The benefits and drawbacks of each approach
are discussed and illustrated with various graphs.

Chapter 8, ''Latin Computational Linguistics'', wraps everything up by
summarizing the main goals and contributions of the book. The author suggests
several lines of inquiry that future Latin computational linguists could take.
McGillivray concludes that computational approaches are an ''unavoidable step
in the digital era'' and advises that all scholars ''have a responsibility to
acquaint themselves with each other's fields'' (p. 216).

EVALUATION

Despite the narrow subfield implied by the title, this book could be of
interest to a wide variety of scholars in the broad discipline of the digital
humanities. Latin scholars can benefit from more efficient data-mining and
analysis, as well as the increased scientific rigor of replicable,
quantitative studies. Corpus and computational linguists benefit by adapting
methods used on the million-word corpora of modern, synchronic languages to
the smaller diachronic corpora available for Latin, while meeting the
computational challenges of an inflectionally rich language with relatively
free word order and no native speakers to test on. Latin, however, is just a
case study: many of the methods and concepts covered in this book are widely
applicable to any diachronic corpus, as with historical or acquisition data,
as well as any small corpus, as with endangered or extinct languages.

Those without a background in Latin, linguistics, computer science, and
statistics may find parts of this book difficult. Latin examples are
occasionally given without a translation, and the statistical formulae are
given with an expectation of at least some prior knowledge of Bayes' theorem.
It is also expected that the reader is familiar with morphosyntax,
particularly the Latin case system. Furthermore, it is important to note that
this is not a ''how-to'' guide to Latin computational linguistics. While
there is some discussion of the programs and packages used, and a few examples
of code or pseudocode, for the most part this book only covers the theoretical
background -- both linguistic and computational -- behind the analyses, not
the practical details of the analyses themselves.

Overall, this book makes a unique contribution to the field, both by expanding
existing Latin resources and by encouraging greater interdisciplinary
research among scholars from such disparate fields as historical linguistics
and computer science.

REFERENCES

Bennett, C.E. 1914. Syntax of early Latin, Volume II: The Cases. Boston: Allyn
and Bacon.

Busa, R. 1974-1980. Index Thomisticus: sancti Thomae Aquinatis operum indices
et concordantiae, in quibus verborum omnium et singulorum formae et lemmata
cum suis frequentiis et contextibus variis modis referuntur quaeque /
consociata plurium opera atque electronico IBM automato usus digessit
Robertus Busa SJ. Stuttgart-Bad Cannstatt: Frommann-Holzboog.

Piotrowski, M. 2012. Natural Language Processing for Historical Texts. Morgan
& Claypool Publishers.

ABOUT THE REVIEWER

Onna Nelson is pursuing a Ph.D. in Linguistics with an emphasis in Cognitive
Science at the University of California, Santa Barbara. Her research uses
corpus methods to explore language use in social media and language
acquisition.
