24.2555, Review: Discipline of Linguistics: Gomez (2013)

Mon Jun 24 15:49:36 UTC 2013

LINGUIST List: Vol-24-2555. Mon Jun 24 2013. ISSN: 1069 - 4875.

Moderator: Damir Cavar, Eastern Michigan U <damir at linguistlist.org>

Reviews: Veronika Drake, U of Wisconsin Madison
Monica Macaulay, U of Wisconsin Madison
Rajiv Rao, U of Wisconsin Madison
Joseph Salmons, U of Wisconsin Madison
Mateja Schuck, U of Wisconsin Madison
Anja Wanner, U of Wisconsin Madison
       <reviews at linguistlist.org>

Homepage: http://linguistlist.org

Editor for this issue: Rajiv Rao <rajiv at linguistlist.org>
================================================================  

Date: Mon, 24 Jun 2013 11:49:04
From: Natalia Levshina [natalevs at gmail.com]
Subject: Statistical Methods in Language and Linguistic Research

Book announced at http://linguistlist.org/issues/24/24-330.html

AUTHOR: Pascual Cantos Gomez
TITLE: Statistical Methods in Language and Linguistic Research
PUBLISHER: Equinox Publishing Ltd
YEAR: 2013

REVIEWER: Natalia Levshina, University of Marburg

SUMMARY

The aim of the book, as the author formulates it in the preface, is to
''illustrate with numerous examples how quantitative methods can most
fruitfully contribute to linguistic analysis research'' (p. xi). It introduces
basic and intermediate-level statistical techniques that can be used by
linguists, especially in the domains of applied and corpus linguistics. The
techniques range from basic parametric and non-parametric tests, such as the
chi-squared test, to more advanced multivariate techniques, such as factor
analysis and multiple linear regression. The book explains, step-by-step, the
mathematical and conceptual apparatus behind various statistical methods. It
also contains chapters on fundamental corpus-linguistic topics, namely, word
frequency lists and collocations.

The book consists of six chapters, a list of references and an index. It also
includes a vast appendix, which contains tables with critical values of the
most important statistical distributions and a table with examples of
appropriate statistical tests for different types of variables.

Chapter 1 introduces basic descriptive statistics, such as measures of central
tendency (i.e. mean, median and mode) and dispersion (e.g. range, variance,
standard deviation). It also discusses z-scores and t-scores, which can be used
for data standardization. The author provides detailed explanations of how
these measures can be computed. The chapter also offers a brief introduction
to probability theory and gives examples of different types of distributions.
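
The standardization step mentioned above can be sketched in a few lines. This is only an illustration, not the book's own procedure: the scores are invented, the population standard deviation is assumed, and the T-score rescaling (mean 50, SD 10) is the common convention rather than something confirmed by the review.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and SD 1 (population SD assumed)."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


def t_scores(values):
    """Rescale z-scores to mean 50 and SD 10 (a common convention, assumed here)."""
    return [50 + 10 * z for z in z_scores(values)]


# Invented test scores, for illustration only
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(scores))
print(t_scores(scores))
```

Scores below the mean receive negative z-scores, so the two scales carry the same information and differ only in location and spread.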

In Chapter 2, the reader learns about different types of variables, depending
on their level of measurement, or scale (i.e. interval, ratio, nominal and
ordinal), and their role in a statistical model (i.e. dependent, independent,
moderator, control and intervening). This is the shortest chapter, which
comprises only seven pages.

Chapter 3 discusses univariate and bivariate parametric and non-parametric
tests that can be used to compare two or more groups or investigate
relationships between variables. The chapter begins with an overview of the
most important statistics, where the author explains how to use the tests
appropriately depending on specific research questions and characteristics of
the available data. Parametric tests include the t-test for independent and
paired samples, analysis of variance (ANOVA), Pearson's correlation
coefficient and simple linear regression, while the non-parametric section
deals with the Mann-Whitney U-test, the sign test, the chi-squared test, the
median test and Spearman's rank correlation. The author explains the
underlying assumptions and theoretical principles of each test, and provides
extensive illustrations.

Chapter 4 describes four multivariate statistical methods: cluster analysis in
its hierarchical and non-hierarchical (i.e. k-means) instantiations,
discriminant functions, factor analysis, and multiple linear regression. As in
the previous chapter, the assumptions that should be met are discussed for
each method. The author walks the reader through all the main conceptual steps
of each analysis. Most calculations in this chapter are done by the author
with the help of SPSS.

Chapters 5 and 6 deal with some fundamental issues in corpus linguistics
related to word frequency lists and collocation measures. Chapter 5 is
probably the most heterogeneous one in the book. First, it discusses at length
the usefulness of different ways of sorting frequency lists, and illustrates
Dunning's (1993) method of finding the keywords in a text or corpus. The
'keyness' is determined with the help of the log-likelihood test. The method
is illustrated by computing the keyness of words in one of Barack Obama's
speeches. The reference corpus, which is used to measure the degree of
unexpectedness of the words in Obama's speech, is, somewhat surprisingly, the
British National Corpus. In addition, the author mentions different types of
corpus annotation, and suggests a method of comparing wordlists from different
domains with the help of meta-frequency lists, which are conceptually similar
to the popular Venn diagrams. Next, the author moves on to discuss type and
token distribution in a corpus, as well as Zipf's law. Finally, he describes
how to measure dispersion of a word in a corpus by using Gries' (2008) DP
(i.e. Deviation of Proportions) measure.
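
The keyness computation summarized above can be sketched as follows. This is a hedged illustration using the full 2x2 log-likelihood ratio (G2) statistic; whether the book uses this form or a simplified variant is not stated in the review, and the function name and all counts are invented.

```python
from math import log


def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """G2 for a 2x2 table: the word vs. all other words, target vs. reference corpus."""
    observed = [
        [freq_target, size_target - freq_target],
        [freq_ref, size_ref - freq_ref],
    ]
    total = size_target + size_ref
    row_totals = [size_target, size_ref]
    col_totals = [freq_target + freq_ref, total - freq_target - freq_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total
            if o > 0:  # a cell with zero observations contributes nothing
                g2 += o * log(o / e)
    return 2 * g2


# Invented counts: a word occurring 50 times in a 1,000-word speech
# and 100 times in a 10,000-word reference sample
print(log_likelihood_keyness(50, 1000, 100, 10000))
```

Values above 3.84 exceed the critical value of the chi-squared distribution with one degree of freedom at p = 0.05, which is the usual threshold for treating a word as a keyword candidate.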

Chapter 6 provides the reader with information about concordance,
KWIC (i.e. Key Words In Context) format and collocation. It discusses four
association measures (i.e. mutual information (MI) and its modified version,
MI3, z-score, and log-likelihood) and compares them in a case study of a small
list of collocates. After that, the author introduces the notion of lexical
constellations, which reflect hierarchical and asymmetric relationships
between collocates.

EVALUATION

''Statistical Methods in Language and Linguistic Research'' provides a useful
and accessible introduction to the world of statistics for beginners. The main
advantage of the book, in my opinion, is the fact that it offers a detailed
explanation of classical statistical techniques. The text contains many
examples, which will definitely help a novice to understand the logic behind
the statistical tests. The book can thus be used as a supplement to more
practically oriented textbooks, e.g., Baayen (2008) and Gries (2009, 2013).
Another strong point is the systematic discussion and comparison of parametric
and non-parametric methods offered in Chapter 3. Since linguistic data tend to
deviate from normality, this approach is very welcome.

That being said, there are a few concerns. First, I have some doubts that the
book fully achieves its goal formulated in the preface, namely, to demonstrate
how statistics can contribute to linguistic studies. Unfortunately, the
examples and topics covered in the book are too limited from a theoretical
point of view. Most illustrations come from foreign language acquisition (e.g.
the case studies that compare the effectiveness of different teaching methods,
or determine the weight of factors that influence students' motivation) and
'old-school' corpus linguistics (e.g. keywords and concordance analysis,
automatic text classification, etc.), with all due respect to those domains.
This is a bit odd, since the application of quantitative methods in
contemporary linguistic research has been extremely productive in many areas,
especially within the usage-based paradigm and in variationist research,
psycholinguistics and typology. In addition, the data in examples are often
fictional or come from an unnamed source, especially in the first chapters of
the book.

Second, in the age of the statistical software boom, it is somewhat surprising
to find no practical guidelines regarding how to perform statistical tests
with the help of existing packages (for instance, SPSS, which is extensively
used by the author). After all, these calculations are no longer done with
pencil and paper. It would be useful, therefore, if the book were to contain
at least an appendix with the relevant code.

Another problematic issue is the imprecise use of statistical terminology.
Consider the following small-scale errors: i. Figure 1.12 is called a
histogram (21), but is really a standard x-y plot without bars; ii. The term
'probability ratios' (26) should be replaced by simply 'probabilities' or
'proportions'; iii. The 'independent value' (63) in regression modelling is
normally called the 'intercept'; iv. Gries's (2008) DP measure is not the
'Degree of Dispersion' (183), but rather the 'Deviation of Proportions'; v.
Finally, a normalized version of DP (Lijffijt & Gries 2012) might have been
more appropriate to include.
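
For readers unfamiliar with DP, the measure and its normalized variant can be sketched as below. This is a minimal illustration under stated assumptions: DP is taken as half the summed absolute differences between a word's observed and expected proportions across corpus parts, and the normalized version divides DP by one minus the smallest expected proportion. The function name and the counts are invented.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Return (DP, normalized DP) for a word across corpus parts.

    word_counts: occurrences of the word in each part
    part_sizes:  total number of tokens in each part
    """
    total_freq = sum(word_counts)
    total_size = sum(part_sizes)
    expected = [size / total_size for size in part_sizes]
    observed = [count / total_freq for count in word_counts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm


# Invented corpus of four equally sized parts
print(deviation_of_proportions([10, 10, 10, 10], [100, 100, 100, 100]))  # evenly spread
print(deviation_of_proportions([40, 0, 0, 0], [100, 100, 100, 100]))     # one part only
```

Without normalization, a word confined to a single part of a four-part corpus cannot reach the theoretical maximum of 1, which is the motivation for the corrected measure.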

Some errors are, however, more serious on conceptual grounds: i. “To perform
multiple regression, the variables should either be interval or continuous and
they should be related linearly” (p. 122). This assumption is erroneous. In
fact, there exist perfectly legitimate solutions that enable one to
incorporate categorical predictors (e.g. dummy coding) and non-linear
relationships (e.g. power transformation) in a linear regression model; ii.
The strategy of fitting the regression model described on pp. 131-133 has a
serious flaw. A model with 8 independent variables and only 32 observations
runs a huge risk of overfitting. As a result, such a model cannot be
extrapolated to new data, which makes it useless (see Harrell 2001).
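
Both points can be made concrete with a small sketch. The helper below is hypothetical and only shows how a categorical predictor becomes 0/1 indicator columns that a linear model can use; the teaching-method labels are invented. The final line computes the observations-per-predictor ratio for the model criticized above; rules of thumb in the regression literature often ask for roughly 10 to 20 observations per candidate predictor, which is a general heuristic and not a figure taken from the book.

```python
def dummy_code(values, reference):
    """Turn a categorical predictor into 0/1 indicator columns.

    The reference level is omitted, so it is represented by all zeros.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical predictor: teaching method for six learners
methods = ["audio", "grammar", "task", "audio", "task", "grammar"]
levels, rows = dummy_code(methods, reference="audio")
print(levels)
for method, row in zip(methods, rows):
    print(method, row)

# Observations per predictor in the model discussed above
observations, predictors = 32, 8
print(observations / predictors)
```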

Finally, the book contains a few minor misprints, which can be a source of
confusion for beginners: i. The mean, median and mode should be in the reverse
order in Figure 1.16 (24); ii. Instead of 1000/1650 = 0.91, this calculation
should read 1000/1100 = 0.91 (29); iii. T2 = 1 + 5 + 6 + 7 + 10 = 29, not 27,
as in the text (70); iv. ''screen plot'' (correct: ''scree plot'') (117); v.
''beta-axis'' (correct: ''y-axis'') (122); vi. ''? the slope'' (correct:
''beta the slope'') (122); vii. The formula of (pointwise) mutual information
is not MI = P(w1, w2)/log2 P(w1)*P(w2), but rather MI = log2(P(w1,
w2)/(P(w1)*P(w2))) (Manning & Schütze 1999: 68) (205).
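
The corrected mutual information formula can be checked with a short sketch. The probabilities are estimated here as simple relative frequencies, which is an assumption of this illustration; the function name and the counts are invented.

```python
from math import log2


def pointwise_mi(count_pair, count_w1, count_w2, corpus_size):
    """MI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with relative-frequency estimates."""
    p_pair = count_pair / corpus_size
    p_w1 = count_w1 / corpus_size
    p_w2 = count_w2 / corpus_size
    return log2(p_pair / (p_w1 * p_w2))


# Invented counts in a 1,000-token corpus
print(pointwise_mi(1, 10, 100, 1000))  # co-occurrence at chance level
print(pointwise_mi(8, 10, 100, 1000))  # eight times more often than chance
```

The logarithm applies to the whole ratio, so a pair that co-occurs exactly as often as chance predicts scores approximately zero, and each doubling of the ratio adds one bit.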

REFERENCES

Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction
to Statistics Using R. Cambridge: Cambridge University Press.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and
coincidence. Computational Linguistics 19(1). 61–74.

Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora.
International Journal of Corpus Linguistics 13(4). 403–437.

Gries, Stefan Th. 2009. Statistics for Linguistics with R. A practical
introduction. Berlin: De Gruyter Mouton.

Gries, Stefan Th. 2013. Statistics for Linguistics with R. A practical
introduction. 2nd rev. and ext. ed. Berlin: De Gruyter Mouton.

Harrell, Frank E. Jr. 2001. Regression Modeling Strategies. With Application
to Linear Models, Logistic Regression, and Survival Analysis. New York:
Springer.

Lijffijt, Jefrey, & Stefan Th. Gries. 2012. Correction to “Dispersions and
adjusted frequencies in corpora”. International Journal of Corpus Linguistics
17(1). 147–149.

Manning, Chris & Hinrich Schütze. 1999. Foundations of Statistical Natural
Language Processing. Cambridge, MA: MIT Press.

ABOUT THE REVIEWER

Natalia Levshina is a postdoctoral researcher at the Research Group 'Language
Typology and Quantitative Linguistics' at Philipps University of Marburg,
Germany. She obtained her PhD from the University of Leuven, Belgium, in 2011.
Her thesis was based on multivariate statistical analyses of periphrastic
causatives in Netherlandic and Belgian Dutch. Among her main interests are
multifactorial models of language use and spatial representations of natural
language semantics in Cognitive Linguistics and typology. She has been
teaching courses in Corpus Linguistics and quantitative methods of linguistic
analysis at the University of Jena and the University of Marburg in Germany.
