26.3842, Review: Computational Ling; Text/Corpus Ling: Moisl (2015)

Tue Sep 1 00:48:45 UTC 2015

LINGUIST List: Vol-26-3842. Mon Aug 31 2015. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Sara  Couture <sara at linguistlist.org>
================================================================

Date: Mon, 31 Aug 2015 20:48:25
From: Paul Isambert [zappathustra at free.fr]
Subject: Cluster Analysis for Corpus Linguistics

Book announced at http://linguistlist.org/issues/26/26-413.html

AUTHOR: Hermann L. Moisl
TITLE: Cluster Analysis for Corpus Linguistics
SERIES TITLE: Quantitative Linguistics [QL] 66
PUBLISHER: De Gruyter Mouton
YEAR: 2015

REVIEWER: Paul Isambert, Laboratoire LaTTiCe – CNRS

Reviews Editor: Helen Aristar-Dry

SUMMARY

After a short introductory chapter, in which the author defends the use of quantitative methods in linguistics, and corpus linguistics in particular, Chapter 2, ''Motivation'', illustrates the approach with a small, non-technical example: how to produce scientific hypotheses when data are so numerous that direct inspection is out of the question.  Twenty-four individuals with 12 variables each (variables are phonetic segments) already constitute some challenge, but cluster analysis, by grouping speakers stepwise based on their similarity, distinguishes two main groups, which correspond to different places of residence.

The third chapter, ''Data'', first introduces the corpus on which the entire book is based (the Diachronic Electronic Corpus of Tyneside English, DECTE, see http://research.ncl.ac.uk/decte), and underlines its formal structures: 63 speakers are described by 156 variables, which encode the number of times a given phonetic segment is used.

The bulk of the chapter is then devoted to the mathematics underlying clustering methods. It proceeds first with a crash course in linear algebra (since speakers are vectors with the values of variables as coordinates), with a special emphasis on distance measurement (since clustering works by proximity). Data transformation is then addressed, beginning with standardization (variables should be reduced to the same scale); the author rejects the usual z-standardization in favor of mean-standardization. He then moves to the issue of varying document lengths and advocates normalization (roughly speaking, using relative rather than absolute frequency).
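The contrast between the two standardizations can be made concrete with a minimal Python sketch. It assumes that mean-standardization means dividing each variable by its own mean (the definition is not spelled out above), and the counts and function names are invented for illustration; they are not DECTE data or the author's code.

```python
from statistics import mean, pstdev

def z_standardize(column):
    """Subtract the column mean, then divide by the column standard deviation."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def mean_standardize(column):
    """Divide each value by the column mean (assumed definition)."""
    m = mean(column)
    return [x / m for x in column]

def normalize_by_length(row):
    """Turn one speaker's raw counts into relative frequencies."""
    total = sum(row)
    return [x / total for x in row]

# Hypothetical counts of two phonetic segments for four speakers.
steady = [98, 100, 102, 100]    # varies little relative to its mean
variable = [2, 10, 18, 10]      # varies a lot relative to its mean

for name, column in [("steady", steady), ("variable", variable)]:
    sd_z = pstdev(z_standardize(column))
    sd_mean = pstdev(mean_standardize(column))
    print(name, "sd after z:", round(sd_z, 3), "sd after mean-std:", round(sd_mean, 3))

# Relative frequencies for one speaker sum to 1 whatever the interview length.
print(normalize_by_length([30, 10, 60]))
```

After z-standardization both variables have the same spread, so the difference in variability between them is lost; after division by the mean the more variable segment still stands out, which is presumably the kind of consideration behind the author's preference.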

Data often comprise tens of variables per subject, and that can be a problem: as the number of dimensions grows, the distances between points (subjects) become increasingly similar. For the analysis to remain feasible, variables must be eliminated. The author discusses approaches based on frequency (rare variables are less interesting), variability (variables with little variance do not discriminate much) and nonrandomness (interesting variation is systematic). There is no foolproof method, however, and the researcher's judgement is required.
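The dimensionality problem is easy to observe on random data. The following sketch is only an illustration under stated assumptions: points are drawn uniformly at random, the "relative contrast" measure and the function name are my own choices, and nothing here comes from the book.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min of the distances from one point to all the others."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    first, others = points[0], points[1:]
    dists = [math.dist(first, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_low = relative_contrast(50, 2, rng)     # 2 variables per subject
contrast_high = relative_contrast(50, 500, rng)  # 500 variables per subject
print("contrast with 2 variables:", contrast_low)
print("contrast with 500 variables:", contrast_high)
```

With few variables the nearest and farthest neighbours differ greatly; with hundreds of variables they are almost equally far away, so proximity carries little information for clustering.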

Another approach is to lump variables together instead of eliminating them. Using Pearson's correlation coefficient, covariance can be measured and redundancy identified. Principal Component Analysis then allows the researcher to extract variables that capture most of the variance in the data, the main problem being that those variables aren't readily interpretable (they are mathematical transformations of the original ones).
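As a small illustration of how correlation exposes redundancy, here is a sketch of Pearson's coefficient in plain Python. The three variables are invented for the example and the function is a textbook formula, not the author's implementation.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_dev / (dev_x * dev_y)

# Hypothetical segment counts for five speakers.
seg_a = [10, 20, 30, 40, 50]
seg_b = [21, 39, 62, 79, 101]   # roughly twice seg_a: largely redundant with it
seg_c = [7, 3, 9, 2, 6]         # no obvious relation to seg_a

r_ab = pearson(seg_a, seg_b)
r_ac = pearson(seg_a, seg_c)
print("r(a, b) =", round(r_ab, 3))
print("r(a, c) =", round(r_ac, 3))
```

A coefficient close to 1 or -1 suggests that two variables carry nearly the same information and could be merged or summarized by a single component.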

The rest of the chapter is devoted to other techniques for dimensionality reduction, based on proximities between data objects. They include Multidimensional Scaling and its variants. Different methods are used, depending on the linearity or nonlinearity of the data, so the chapter ends with a discussion of how to identify (non)linearity, addressing the problem of overfitting (when a complex model doesn't reflect the complexity of the data but actually fits noise).

Chapter Four addresses clusters proper, introducing methods to delimit them: nonhierarchical methods partition a data matrix into any number of groups, either by dimensionality reduction (similar to the approaches discussed in the previous chapter), or by iterative modification of an initial clustering (until some optimum is found), or by measuring density (clusters are separated by zones of sparse data).
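The second family, iterative modification of an initial clustering, can be sketched with k-means (Lloyd's algorithm). I am assuming k-means as the representative here, since no specific algorithm is named above; the data points and the deliberately poor starting centroids are invented.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Reassign points to the nearest centroid and recompute centroids until stable."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: a (local) optimum is reached
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(coord) / len(members) for coord in zip(*members)]
    return assignment, centroids

# Six hypothetical speakers described by two variables, in two obvious groups.
points = [[1, 1], [1.5, 2], [1, 1.5], [8, 8], [9, 8.5], [8.5, 9]]

# Both starting centroids are taken from the same group on purpose.
labels, centroids = kmeans(points, [points[0], points[1]])
print(labels)
print(centroids)
```

Even from a bad start the assignment settles after a few passes on data this clean; on real data the result can depend on the initial clustering, which is one reason to rerun with several starts.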

Hierarchical methods construct clusters recursively, joining the two nearest data points into a cluster now treated as a single point, and repeating the process (the result can be pictured as a binary tree). Such methods are exhaustive and easy to understand (and are thus widely used); however, it is not always easy to determine how many clusters the data contain (in other words, which level in the tree is relevant). Also, results depend on the criteria chosen to measure proximity, and with complex data there is no single solution: the researcher must exercise judgement.
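The recursive joining can be sketched in a few lines of Python. This version assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of the possible proximity criteria; the data and the function name are invented for illustration.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []                                   # the tree, recorded bottom-up
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Six hypothetical speakers described by two variables.
points = [[1, 1], [1.5, 2], [1, 1.5], [8, 8], [9, 8.5], [8.5, 9]]

clusters, merges = single_linkage(points, 2)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The `n_clusters` argument is exactly the judgement call mentioned above: the algorithm will happily cut the tree at any level, and choosing another linkage criterion may change which merges happen first.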

Clusters can sometimes be found by algorithms where they do not really exist in the data; also, random data points aggregate ''naturally'', even though there is no structure underlying them (''chance is lumpy,'' according to Abelson, 1995). Hence, clusters must be validated. First, methods are introduced to assess beforehand whether data are significantly nonrandom. Then the author discusses cluster validation and cluster selection (different methods may yield different analyses); this step is essential to avoid misinterpretation of the results, and the author advocates using several concurrent methods.
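To give an idea of what a validation index looks like, here is a sketch of the mean silhouette width, one common measure of how compact and well separated clusters are. Whether this particular index is among those the book covers is not something I can state from the summary above, so it should be read as my own example, with invented data.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width: values near 1 indicate compact, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Six hypothetical speakers described by two variables.
points = [[1, 1], [1.5, 2], [1, 1.5], [8, 8], [9, 8.5], [8.5, 9]]

good = silhouette(points, [0, 0, 0, 1, 1, 1])  # grouping that follows the data
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # arbitrary grouping
print("plausible clustering:", round(good, 3))
print("arbitrary clustering:", round(bad, 3))
```

Comparing such scores across methods and across numbers of clusters is one simple way to follow the advice of using several concurrent analyses rather than trusting a single output.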

Chapter 5, ''Hypothesis generation'', illustrates the preceding theoretical discussion with an extended example. Some of the methods are run on the MDECTE data to answer positively the research question about systematic phonetic variation; the variation is also correlated with sociolinguistic variables.

The last chapter is a literature review of works in quantitative corpus linguistics in general and cluster analysis in particular. The author outlines tens of papers in grammatical research as well as variationist and historical approaches.

The book concludes with a short assessment of cluster analysis as a scientific approach; an appendix lists available software (both proprietary and free) for cluster analysis.

EVALUATION

Quantitative approaches have gained momentum in recent years, and they are less and less confined to the branches of linguistics that have been using them successfully for decades (like phonetics or sociolinguistics). Many books have been written to accompany and/or encourage this major change in modern linguistics (e.g. Baayen, 2008 or Gries, 2009, among many others), but most are generalist works aimed at non-specialist readers.

Hermann Moisl's book, on the other hand, is very specific, and although it targets all readers, it is nonetheless packed with equations, graphs and tables, the kind of paraphernalia that might repel an average linguist before s/he has even read a single sentence. As the author lucidly acknowledges in the Introduction, ''the arts/science divide is still with us, and many professional linguists have little or no background in and sometimes even an antipathy to mathematics and statistics.''

That is unfortunate, because this is an important book; and that is all the more unfortunate as the author couldn't have done otherwise: cluster analysis is a technical matter, and there is little point in introducing it without the necessary mathematical and methodological background. That said, the author has made every effort to make the text accessible, and the mindful reader will be able to absorb the book. More importantly, s/he will be able to reread the book in years to come, and use it as a solid reference.

As for cluster analysis itself, it should be a basic tool in every quantitative linguist's toolbox. Massive data are a blessing of recent years, but they can be a curse too: one can feel overwhelmed and miss their import. With cluster analysis, one can feel confident that some kind of pattern will emerge (to be honest, it depends on the quality of the data, and of the variables one has chosen to encode---admittedly no small task in itself).

I have two main criticisms. First, there is a lot of cluster analysis in this book but, in the end, not much linguistic analysis. The bulk of the volume (Chapters 3 and 4) discusses clusters in general, while Chapter 5 is a worked-out example using the Tyneside corpus, but it mostly addresses technical issues. Second, the discussion is disconnected from any practical consideration relative to software usage, and the reader eager to try his/her hand at cluster analysis in the real world will probably feel disappointed.

Those criticisms, however, might be benign if it is understood that this book is a theoretical introduction to cluster analysis, not a practical one. As such, it will probably best stand the test of time. It is then aimed at experienced instructors and researchers willing to extend their knowledge and unafraid of working out the practicalities by themselves (e.g. because they know R well). 

REFERENCES

Abelson, Robert P. (1995), ''Statistics as Principled Argument,'' L. Erlbaum Associates, Hillsdale.

Baayen, R.H. (2008), ''Analyzing Linguistic Data: A Practical Introduction to Statistics using R,'' Cambridge University Press, Cambridge.

Gries, Stefan Th. (2009), ''Statistics for Linguistics with R,'' Mouton De Gruyter, Berlin.

ABOUT THE REVIEWER

Paul Isambert holds a PhD from the University of Paris 3, France. He is currently working on grammaticalization in French.
