27.1505, Review: Computational Ling; Text/Corpus Ling: Mikros, Macutek (2015)

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Thu Mar 31 17:58:01 UTC 2016


LINGUIST List: Vol-27-1505. Thu Mar 31 2016. ISSN: 1069 - 4875.

Subject: 27.1505, Review: Computational Ling; Text/Corpus Ling: Mikros, Macutek (2015)

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Sara Couture <sara at linguistlist.org>
================================================================


Date: Thu, 31 Mar 2016 13:57:35
From: Marina Santini [MarinaSantini.MS at gmail.com]
Subject: Sequences in Language and Text

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36104497


Book announced at http://linguistlist.org/issues/26/26-2205.html

EDITOR: George K. Mikros
EDITOR: Ján Mačutek
TITLE: Sequences in Language and Text
SERIES TITLE: Quantitative Linguistics [QL] 69
PUBLISHER: De Gruyter Mouton
YEAR: 2015

REVIEWER: Marina Santini, Uppsala University

Reviews Editor: Helen Aristar-Dry

SUMMARY

The volume “Sequences in Language and Text” is an edited collection of 14
chapters. The book also includes a Foreword by the editors G. Mikros and J.
Mačutek, a Subject Index and an Authors’ Index.

In the Foreword, the editors briefly outline the structure of the volume,
which is roughly divided into theoretically oriented chapters on one hand, and
chapters more focused on real-world problems on the other hand. The aim of the
book is to document “the latest results of the language-in-the-line
quantitative linguistics, an approach that is less prominent to the
language-in-the-mass approach, but that apparently is gaining more and more
visibility.” (p. v). 

1) In the Introduction chapter, G. Altmann describes what sequences are:
“Sequences occur in text, written or spoken, and not in language which is
considered system.  However, the historical development of some phenomenon in
language can be considered sequence, too. ” (p. 1).  Many different forms of
sequences are known from text analyses, but it is not always possible to
explain the rise of a given sequence because “sequences are secondary units
and their establishment is a step in concept formation.” (p. 3).  A building
block in the sequential study of texts is “repetition”. There are several
types of repetition, such as uninterrupted sequences of identical elements,
aggregative repetitions, or cyclic repetitions. Other types of sequences in
text are symbolic sequences (e.g. nominal classes), numerical sequences (e.g.
distances between neighbours), and musical sequences (that, for example,
characterize styles). Textual sequences are supposed to be regulated by laws.
The effort of quantitative linguistics is to establish laws and systems of
laws and to establish theories. But Altmann warns us that an overall theory
does not exist. Instead, quantitative linguists “look at the running text, try
to capture its properties and find the links among these sequential
properties.” (p. 6).

2) In the chapter “Linguistic Analysis Based on Fuzzy Similarity Models”, S.
Andreev and V. Borisov discuss the relevance of building fuzzy similarity
models for linguistic analysis. These models have a complex structure and are
characterized by a hierarchy of interconnected characteristics. The models aim
at solving a wide range of tasks under conditions of uncertainty, such as the
estimation of the degree of similarity of the original text and its
translations; the estimation of the similarity of parts of the compared texts;
and the analysis of individual style development in fragments of texts. The
authors apply the models to the original poem by Coleridge “The Rime of the
Ancient Mariner” and to two Russian translations -- one by Gumilev and the
other by Levik -- and provide a detailed qualitative interpretation of the
numerical results (shown in Table 6 and in the charts in Figures 7, 8, 9 and
10) returned by a similarity model based on parts of speech of the original
text by Coleridge.

3) In the third chapter “Textual navigation and autocorrelation”, F. Bavaud,
C. Cocco, and A. Xanthos introduce a unified formalism for textual
autocorrelation. The term “textual autocorrelation” is defined as “the
tendency for neighbouring textual positions to be more or less similar than
randomly chosen positions” (p. 54). The presented approach is applicable to
sequences and useful for text analysis, and is based on two factors:
neighbourhood-ness between textual positions and (dis-)similarity between
positions. The authors present and discuss case studies to illustrate the
flexibility of their approach by addressing lexical, morpho-syntactic and
semantic properties of a text. The case studies include the autocorrelation
between lines of a play (i.e. “Sganarelle ou le cocu imaginaire” by Molière),
free navigation within documents (example text: “Sganarelle ou le cocu
imaginaire” by Molière), hypertext navigation (example text: the
“WikiTractatus”) and semantic autocorrelation (example text: “The masque of
the red death” by Poe). In the conclusions, the authors argue that their
approach is applicable “to any form of sequence and text analysis that can be
expressed in terms of dissimilarity between positions or types, especially
semantically-related problems” (p. 54).
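To make the underlying intuition concrete, here is a minimal sketch of my own
(not the authors' formalism or code): neighbouring positions in a sequence are
compared with arbitrary position pairs under a user-supplied dissimilarity,
and a ratio below 1 indicates positive autocorrelation.

```python
# Illustrative sketch of textual autocorrelation (my simplification, not
# the authors' formalism): compare the mean dissimilarity of neighbouring
# positions with the mean dissimilarity over all position pairs.
from itertools import combinations

def neighbour_vs_global(values, dissim):
    """Ratio of mean neighbour dissimilarity to mean pairwise
    dissimilarity; a value below 1 suggests positive autocorrelation."""
    nb = [dissim(a, b) for a, b in zip(values, values[1:])]
    allp = [dissim(a, b) for a, b in combinations(values, 2)]
    return (sum(nb) / len(nb)) / (sum(allp) / len(allp))

# Toy word-length sequence in which similar values cluster locally:
seq = [1, 1, 2, 2, 5, 5, 6, 6]
ratio = neighbour_vs_global(seq, lambda a, b: abs(a - b))
print(ratio < 1)  # neighbours are more similar than random pairs
```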

4) In the chapter “Menzerath-Altmann law versus random model”, the authors --
M. Benešová and R. Čech -- build three random models to show that the MAL
(Menzerath-Altmann law) governs human language behaviour. Menzerath's law, or
the Menzerath-Altmann law (named after Paul Menzerath and Gabriel Altmann), is
a linguistic law according to which an increase in the size of a linguistic
construct results in a decrease in the size of its constituents, and vice
versa. For example, the longer a sentence (measured in terms of the number of
clauses), the shorter the clauses (measured in terms of the number of words);
or the longer a word (in syllables or morphs), the shorter its syllables or
morphs (in sounds). The authors argue that the MAL is a law that governs human
language but not randomness. In order to support this claim, they present
three random models, each of which takes into account different text
characteristics and defines randomness differently. “The results returned by
the experiment show that the data generated by the three random models does
not fulfill the MAL.” (p. 65). These results support the claim that randomness
and human language are governed by different laws. In all three random models
both the number of constructs and the number of constituents correspond to a
real text (i.e. the essay “The power of the powerless” by Havel). The
constructs are represented by sentences and the constituents by clauses. The
sentence is defined as a sequence of words which ends with a full stop, the
clause as a unit containing a predicate represented by a finite verb, and the
word is defined graphically as a sequence of letters between spaces.
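The MAL tendency can be checked on toy data in a few lines. The following is
my own hedged sketch (not the authors' models): group clause lengths by
sentence length (in clauses) and inspect whether the mean clause length
decreases as the sentence grows.

```python
# Hedged sketch (not the authors' code): checking the Menzerath-Altmann
# tendency on toy data -- mean clause length (in words) should decrease
# as sentence length (in clauses) grows.
from collections import defaultdict

# Toy data: each sentence is given as the list of its clause lengths.
sentences = [
    [7],            # one clause of 7 words
    [6, 5],         # two clauses
    [5, 4, 4],      # three clauses
    [4, 4, 3, 3],   # four clauses
]

def mean_constituent_length(constructs):
    """Mean constituent size for each construct size (in constituents)."""
    by_size = defaultdict(list)
    for parts in constructs:
        by_size[len(parts)].extend(parts)
    return {n: sum(p) / len(p) for n, p in sorted(by_size.items())}

curve = mean_constituent_length(sentences)
print(curve)  # mean clause length decreases with sentence length
```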

5) In “Text length and the lambda frequency structure of a text”, R. Čech
presents a study that shows the dependence of the lambda indicator (a measure
of the frequency structure of a text) on text length. The frequency structure
of a text is accounted for by methods such as the type-token relation and
other vocabulary richness measures, which are affected by text length.
Although normalization methods have been suggested to address this problem, it
seems that when certain languages are analyzed separately (namely Czech and
English in the present study), a dependence of lambda on text length emerges.
The chapter presents a method for the empirical determination of the interval
in which lambda should be independent of text length. The author argues that
within this interval, lambda can be safely used for the comparison of genres,
authorship, style, etc.

6) In “Linguistic Motifs”, R. Köhler presents a new unit, called “motif”, that
“can give information about the sequential organization of a text with respect
to any linguistic unit and to any of its properties – without relying on a
specific linguistic approach or grammar” (p. 89). Köhler defines the linguistic
motif as “the longest continuous sequence of equal or increasing values
representing a quantitative property of a linguistic unit”. Linguistic motifs
are subdivided into L-motifs (i.e. a continuous series of equal or increasing
length values), F-motifs (i.e. a continuous series of equal or increasing
frequency values), P-motifs (i.e. a continuous series of equal or increasing
polysemy values) and T-motifs (i.e. a continuous series of equal or increasing
politextuality values). The author uses one of the end-of-year speeches of the
Italian presidents, in which words have been replaced by their lengths
measured in syllables, to show what L-motifs are. Then he converts the same
speech into R-motifs using POS instead of words. In both cases, the result of
the fitting is excellent. The advantages of motifs are that segmentation is
unambiguous, exhaustive, and scalable with respect to granularity and, last
but not least, that motifs show a rank-frequency distribution of the
Zipf-Mandelbrot type, that is, they behave like other more traditional
linguistic units. Interestingly, motifs provide a means to analyse texts in
their sequential structure with respect to all kinds of linguistic units and
properties; even categorical properties can be studied in this way. The author
admits that the full potential of the proposed approach has not been explored
yet.
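Köhler's definition makes L-motifs easy to compute. The following is a small
sketch under my reading of the definition (a new motif starts at each decrease
in the value sequence); it is an illustration, not code from the chapter.

```python
def l_motifs(values):
    """Segment a sequence of values into L-motifs: maximal runs of equal
    or increasing values; a new motif starts whenever a value decreases."""
    motifs, current = [], []
    for v in values:
        if current and v < current[-1]:
            motifs.append(current)
            current = []
        current.append(v)
    if current:
        motifs.append(current)
    return motifs

# Word lengths (in syllables) of an invented sentence:
print(l_motifs([1, 2, 2, 3, 1, 1, 4, 2]))
# → [[1, 2, 2, 3], [1, 1, 4], [2]]
```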

7) In “Linguistic Modelling of Sequential Phenomena: The role of laws”, R.
Köhler and A. Tuzzi argue that data alone does not give an answer to a
research question. It is, instead, a theoretically grounded hypothesis, tested
on appropriate data, that produces new knowledge. For a linguistically
meaningful and valid analysis of linguistic objects, linguistic models are
required, with their laws of language and texts. The authors illustrate the
usefulness of linguistic laws in a practical example (p. 111). They use a
corpus of 63 end-of-year messages delivered by all the presidents of the
Italian Republic over the period from 1949 to 2011. Since the corpus is a set
of texts representing an Italian political-institutional discourse, the
authors set the hypothesis that the temporal behavior of the frequency of a
word is discourse specific. Since a ready-made model of this kind of
phenomenon is not available, they use the Piotrowski law. It turns out that
some selected words follow the logistic growth function that is typical of
this law.

8) The chapter “Menzerath-Altmann Law for Word Length” by J. Mačutek and G.
Mikros resumes the investigation of motifs. The authors emphasize that motifs
are relatively new linguistic units that make possible an in-depth
investigation of sequential properties of texts. For instance, a word length
motif is a continuous series of equal or increasing word lengths (often
measured in syllables, morphemes or other length units) for which the MAL is
valid. As explained earlier, the MAL describes the relation between the size
of the construct (e.g., a word) and its constituents (e.g., syllables) and states
that the larger the construct (the whole), the smaller its constituents
(parts). The authors use a corpus of Modern Greek literature and also randomly
generated data. For their data the following is true: the longer the motif (in
number of words), the shorter the mean length of words (in the number of
syllables). This chapter provides another confirmation that word-length motifs
behave in the same way as other more traditional linguistic units. It remains
an open question whether the parameters of the MAL can be used as
characteristics of languages, genres or authors. If the answer is positive,
they might be applied to language classification, authorship attribution and
similar fields. 

9) In the chapter “Is the Distribution of L-Motifs Inherited from the Word
Length Distribution?” J. Milička points out that word length sequences can be
successfully analyzed by means of L-motifs, which he considers to be a very
promising attempt to discover syntagmatic relations of word lengths in a text.
An L-motif is a “text segment which, beginning with the first word of the
given text, consists of word lengths which are greater or equal to the left
neighbor.” The main advantage of such segmentation is that it can be applied
iteratively, i.e. L-motifs of the L-motifs can be obtained (so called LL
motifs). Although applying the method several times may result in unintuitive
sequences, these sequences follow lawful patterns and they can be useful for
practical application, such as automatic text classification. However, even if
L-motifs follow lawful patterns, this does not imply that L-motifs reflect
syntagmatic relations, since these could be inherited from the word length
distribution in a text. In order to prove that L-motifs reflect syntagmatic
relations of word lengths, the author tests the following null hypothesis:
“the distribution of L-motifs measured on the text T is the same as the
distribution of L-motifs measured on a pseudo text T’. The pseudo text T’ is
created by the random transposition of all tokens of the text T within the
text T.” (footnote 4).  The author’s hypothesis is tested on three Czech texts
and six Arabic texts. The null hypothesis is rejected for the L-motifs (all
texts) and for the LL-motifs (all texts except one), but it is not rejected
for L-motifs of higher order (LLL-motifs, etc.) in Czech, nor for LLL-motifs
in Arabic (except in one text). In conclusion, the experiment carried out by
the author shows that L-motifs can be useful for examining syntagmatic
relations in most cases.
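The iterative construction (LL-motifs, LLL-motifs, and so on) can be sketched
as follows. This is my own illustration of the procedure described above, in
which each round replaces the sequence by the lengths of its L-motifs.

```python
def l_motifs(seq):
    """Maximal runs of equal or increasing values (L-motifs)."""
    out, cur = [], []
    for v in seq:
        if cur and v < cur[-1]:
            out.append(cur)
            cur = []
        cur.append(v)
    if cur:
        out.append(cur)
    return out

def iterated_motif_lengths(word_lengths, depth):
    """Apply L-motif segmentation `depth` times; after each round the
    sequence is replaced by the lengths (in elements) of its motifs."""
    seq = list(word_lengths)
    for _ in range(depth):
        seq = [len(m) for m in l_motifs(seq)]
    return seq

wl = [1, 2, 2, 3, 1, 1, 4, 2, 2, 1, 3]  # toy word-length sequence
print(iterated_motif_lengths(wl, 1))  # L-motif lengths  → [4, 3, 2, 2]
print(iterated_motif_lengths(wl, 2))  # LL-motif lengths → [1, 1, 2]
```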

10) In “Sequential structures in ‘Dalimil’s Chronicle’”, A. Pawłowski and M.
Eder carry out a quantitative analysis of style variation. They focus on the
difference between orality and literacy. The objective of their study is to
investigate the phenomenon of “prosaisation”, which was put forward by
Woronczak in 1963, by means of tests performed on a variety of sequential text
structures in the Chronicle of Dalimil, the first chronicle written in the
Czech language at the beginning of the 14th century. The following data are
analyzed: a series of chapter lengths (in letters, syllables and words); a
series of verse lengths (in letters, syllables and words); alternations and
correlations of rhyme pairs; quantity-based series of syllables (binary
coding); and stress-based series of syllables (binary coding). In their tests,
they verify the presence of latent rhythmic patterns and this partially
confirms the hypothesis advanced by Woronczak. However, it also appears that
the bare opposition of orality vs. literacy does not suffice to explain the
quite complex stylistic shift in the Chronicle. 

11) In “Comparative Evaluation of String Similarity Measures for Automatic
Language Classification”, T. Rama and L. Borin present “the first known
attempt to apply more than 20 different similarity (or distance) measures to
the problem of genetic classification of languages on the basis of
Swadesh-style core vocabulary lists” (p. 189). The Swadesh list is a
compilation of basic concepts and it is used in historical-comparative
linguistics. The authors present experiments performed on the Automated
Similarity Judgment Program (ASJP) database that contains 40-item word lists
of all the world's languages. The authors examine the various measures in two
respects, namely: (1) the capability of distinguishing related and unrelated
languages and (2) the performance as measures for internal classification of
related languages. Results show that string similarity measures (i.e.
sequence-based measures) do not contribute to improving internal
classification, but help in discriminating related languages from unrelated
ones.
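One of the simplest measures in this family is the length-normalized
Levenshtein distance (LDN). The sketch below is my illustration of the general
idea applied to word-list items, not the authors' implementation.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ldn(a, b):
    """Length-normalized Levenshtein distance, in [0, 1]."""
    return levenshtein(a, b) / max(len(a), len(b))

# Look-alike word-list items score low, unrelated ones high:
print(ldn("hand", "hant"))  # → 0.25
print(ldn("hand", "ruka"))  # → 1.0
```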

12) In the chapter “Predicting Sales Trends. Can sentiment analysis on social
media help?”, V. Rentoumi, A. Krithara, and N. Tzanos present a two-stage
approach based on tweets’ sequential data for the prediction of sales trends
in products. In this approach the sentiment values expressed through the
tweets’ sequences are taken into account. The authors present experiments
based on a structured model, namely Conditional Random Fields (CRF), and
emphasize the benefits of their approach with respect to other approaches that
are based on bag-of-words representations. A CRF is an undirected graph model
that specifies joint probabilities for possible sequences given an
observation. The motivation for using CRF for a sentiment analysis task such
as predicting sales trends based on social network data is based “on the
principle that the meaning a sentence can imply, is tightly bound to the
ordering of its constituent words” (p. 205). The CRF model was trained using
tweets derived from Sanders’ collection (a corpus of 2471 manually annotated
tweets). The test set was a corpus of four subsets of tweets on four different
topics, namely: iPad, Sony Xperia, Samsung Galaxy and Kindle Fire. The authors
show that since the CRF exploits structural information concerning the tweet
data, it can capture non-local dependencies that play an important role in the
task of sales prediction, thus confirming the assumption that the
representation of structural information, exploited by the CRF, simulates the
semantic and sequential structure of the data.

13) In “Where Alice meets Little Prince”, A. Rovenchak describes a method to
analyze text using an approach inspired by a statistical-mechanical analogy.
This method is applied to study translations of two novels: “Alice in
Wonderland” and “The Little Prince”. The method is based on a model where a
set of parameters can be obtained to describe texts. These parameters are
related to the grammar type, intended as the “analyticity level of a language”
(p. 217).
Results confirm that there exists a correlation between the level of language
analyticity and the values of parameters calculated using the proposed
approach. More specifically, the presented study shows that, within the same
language, “the dependence on a translator of a given text appears much weaker
than the dependence on the text genre” (p. 228). To date, however, the exact
attribution of a language with respect to parameter values has not been
provided, since the influence of genre has not been studied in detail. 

14) In “A Probabilistic Model for the Arc Length in Quantitative Linguistics”,
P. Zörnig argues that the arc length measure is a good alternative to the
usual statistical measures of variation. The author illustrates the formulae
for the two most important characteristics of the random variable arc length,
namely expectation and variance. Using the sequential representation of texts
-- in the form (x1, …, xn), where xi represents the length of the i-th word of
a text -- he studies the sequences for 32 texts in 14 languages.

EVALUATION

This collection makes a good contribution to quantitative linguistics and to
computational linguistics in general. 

The volume presents an interesting set of models, laws and experiments that
focus on linguistic sequences. Linguistic sequences are intended as linguistic
units based on linear (or syntagmatic) sequence of symbols. The chapters in
the book present several linguistic sequences and the different aspects that
highlight their properties, the laws they are governed by, and their potential
in linguistic and textual analysis. Some linguistic sequences are well-known,
for example the type-token relation that accounts for the lexical richness of
a text. Other linguistic sequences – such as ‘motifs’ -- are more recent and
are introduced and explained in this volume. 

Linguistic sequences and the laws they are governed by appear to have a high
potential for many areas in computational linguistics and linguistic
applications. For instance, I am thinking about the possible use of linguistic
motifs for text classification, automatic genre identification, stylometry,
authorship attribution and the like. Motifs seem to have two important
qualities: on the one hand, they are easy to compute and extract automatically
(I would call them easily-extractable, computationally-light features) and on
the other hand, they are linguistically motivated (which is not always the
case with light features such as character n-grams). Since their potential has
not been explored to date, it is worth including them in the list of
linguistic features that can be used in future experiments in text analysis
and classification. 

The idea of discovering mathematically-based language laws and systems of
laws may also have good potential in computational modelling (although to date
empirical studies based on these laws still seem to be limited). The concept
of law is understood as “the class of law hypotheses which have been deduced
from theoretical assumptions, are mathematically formulated, are interrelated
with other laws in the field, and have sufficiently and successfully been
tested on empirical data” (Wikipedia). As mentioned above, a law like the
Menzerath-Altmann law states that the sizes of the constituents of a
construction decrease with increasing size of the construction under study.
For example, the longer a sentence (measured in terms of the number of
clauses), the shorter the clauses (measured in terms of the number of words).
Intuitively, this mathematically-based law could be exploited not only in
quantitative linguistics, but also in related areas such as distributional
semantics and machine learning for language technology.

Although the volume is certainly valuable, I personally missed a few elements
that would have given me a more comprehensive view of the added value of the
book. For instance, I missed a general overview describing the
state-of-the-art of Quantitative Linguistics (QL), its main purposes and
motivations, and the importance of linguistic sequences and laws in this
context. QL is considered to be a subfield of general linguistics and, more
specifically, of mathematical linguistics. QL is related to Computational
Linguistics, Corpus Linguistics and Applied Linguistics and partially overlaps
with them. Therefore, it would have been helpful to point out explicitly its
specificity and its similarities to and differences from the neighboring
fields.

I also missed an abstract at the beginning of each chapter delivering quick
information about the aim, the motivation and the results of the study
presented in the chapter itself. Since the volume is an edited collection, the
presentation style of the different chapters varies a lot, and in some
chapters the identification of purpose, motivation and results was not
straightforward. An abstract would probably have helped gather these elements
more quickly. 

In conclusion, the volume is good reading not only for linguists working
within QL, but also for computational linguists and language technologists who
are interested in exploring and experimenting with new features and with
language laws that could help model language applications.

REFERENCES

Köhler, R., Altmann, G., and Piotrowski, R. (eds.) (2005). Quantitative
Linguistik / Quantitative Linguistics -- Ein internationales Handbuch / An
International Handbook. De Gruyter Mouton.

IQLA - International Quantitative Linguistics Association
(http://www.iqla.org) 

Journal of Quantitative Linguistics 
(http://www.tandfonline.com/toc/njql20/current) 

Quantitative linguistics entry from Wikipedia
(https://en.wikipedia.org/wiki/Quantitative_linguistics) 

Pawłowski, Adam (1999). Language in the Line vs. Language in the Mass: On the
Efficiency of Sequential Modelling in the Analysis of Rhythm. Journal of
Quantitative Linguistics, Volume 6, Number 1, April 1999, pp. 70-77.


ABOUT THE REVIEWER

I am a computational linguist with a strong interest in textual and linguistic
features, machine learning and intensive textual data processing. My personal
challenge is to extract ''contextualized'' information from big unstructured
textual data leveraging on the concept of ''genre''. The word ''genre'' means
''type of text''. Nowadays all kinds of businesses, enterprises and customer
care services produce huge amounts of data in the form of many different
''genres'', i.e. emails, memos, notes from call-centers, news, user groups,
chats, reports, tweets, Facebook pages, blogs, forums, marketing material and
so on. All these textual genres contain valuable but unstructured data. The
exploitation of unstructured data is recognized as a challenge in information
technology that engenders a huge economic loss and poor decision-making.
Computational linguistics and machine learning can certainly help meet the
challenge.













