31.1728, Review: Computational Linguistics; Semantics; Syntax: Parmentier, Waszczuk (2019)

Fri May 22 02:32:43 UTC 2020

LINGUIST List: Vol-31-1728. Thu May 21 2020. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================

Date: Thu, 21 May 2020 22:31:58
From: Viatcheslav Yatsko [iatsko at gmail.com]
Subject: Representation and parsing of multiword expressions

Book announced at http://linguistlist.org/issues/30/30-2630.html

EDITOR: Yannick Parmentier
EDITOR: Jakub Waszczuk
TITLE: Representation and parsing of multiword expressions
SUBTITLE: Current trends
SERIES TITLE: Phraseology and Multiword Expressions
PUBLISHER: Language Science Press
YEAR: 2019

REVIEWER: Viatcheslav Yatsko, Katanov State University of Khakasia

SUMMARY

This book is a collection of papers created within Workgroup 2 of the European
PARSEME COST Action. The aim of the workgroup was to develop techniques and
methodologies for the detection, representation, and parsing of multiword
expressions (MWEs). From the linguistic point of view, multiword expressions
are constructions whose components have lost their original lexical meaning,
though they admit of some variation, cf. ''to take a haircut'', meaning ''to
agree to accept less money for something'', which admits of modifiers (to take
a serious/70% haircut) and different verb forms (takes/is taking/has taken a
haircut).

The book comprises a Preface and 10 chapters grouped into three parts: MWE
representation (Chapters 1-5), MWE parsing (Chapters 6-8), and Multilingual NL
applications for MWEs (Chapters 9 and 10).

In the Preface, Yannick Parmentier and Jakub Waszczuk, the editors of the
book, substantiate the relevance of the investigations presented in the book
to the field of automatic natural language processing, pointing to the
idiosyncratic properties of MWEs and their frequency of occurrence (MWEs cover
up to 30% of all words in human language utterances). They are right, because
tokenization based on the traditional bag-of-words approach usually fails to
detect phraseological constructions, which can negatively affect the results
of text processing. Obviously, if the MWE mentioned above is divided into
separate words, its meaning, as well as the meaning of a part of the text or
even of the whole text, will be lost. Hence the importance of specific
techniques for MWE detection, representation, and processing.
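
The tokenization problem can be illustrated with a minimal Python sketch. It
is not taken from the book: the function names and the one-entry lexicon are
hypothetical, and the merge step assumes contiguous, exact matches only.

```python
def naive_tokenize(text):
    """Split on whitespace, so every word becomes its own token."""
    return text.lower().split()


def merge_mwes(tokens, lexicon):
    """Greedily merge the longest lexicon entry found at each position.

    `lexicon` is a set of token tuples (a hypothetical MWE lexicon).
    """
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No lexicon entry starts here: keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


LEXICON = {("take", "a", "haircut")}  # toy lexicon, an assumption

tokens = naive_tokenize("Investors take a haircut")
print(tokens)                       # the idiom is split into three words
print(merge_mwes(tokens, LEXICON))  # the idiom survives as one token
```

Exact matching of this kind misses the variants mentioned above (''takes a
serious haircut''), which is precisely why more flexible representations are
needed.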

Part I of the book opens with the chapter ''Lexical encoding formats for
multiword expressions. The challenge of 'irregular' regularities'' written by
Timm Lichte, Simon Petitjean, Agata Savary, and Jakub Waszczuk. The chapter
gives a clear and logical interpretation of the notions of regularity and
irregularity in terms of set theory. A property ''p'' is considered regular
with respect to a set of objects ''E'' if ''p'' is shared by at least two
members of ''E''. If it is associated with only one member of ''E'', the
property is irregular. A regular property is non-trivially regular if it is
shared by a proper subset of ''E'' (at least two members, but not all of
them), and trivially regular if it is shared by all objects in ''E''.
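
These definitions can be restated as a short Python sketch. The function name,
the ''absent'' label for a property that no object has, and the toy feature
annotations are my own assumptions for illustration; they are not the authors'
data or notation.

```python
def classify_property(prop, annotations):
    """Classify `prop` with respect to the set E of annotated objects.

    `annotations` maps each object in E to the set of properties it has.
    """
    count = sum(1 for props in annotations.values() if prop in props)
    if count == 0:
        return "absent"            # not covered by the definitions above
    if count == 1:
        return "irregular"         # associated with only one member of E
    if count == len(annotations):
        return "trivially regular"  # shared by all objects in E
    return "non-trivially regular"  # shared by a proper subset of E


# Toy annotations (hypothetical, for illustration only).
E = {
    "kick the bucket": {"multiword", "verbal"},
    "spill the beans": {"multiword", "verbal", "passivizable"},
    "pull strings": {"multiword", "verbal", "passivizable"},
    "by and large": {"multiword", "unlike coordination"},
}

for p in ("multiword", "verbal", "unlike coordination"):
    print(p, "->", classify_property(p, E))
```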

This interpretation proposed by the authors reminded me of the methodologies
developed within the scope of cluster theory that also involve correlation
between properties and objects. In cluster theory, a cluster is defined as ''a
set of objects that share some property'' [1, p. 495]. A somewhat similar
methodology is used in componential analysis, where semantic features are
assigned to some linguistic units (usually words) to distinguish between words
with similar meaning. Table 1 (p. 5) is similar to the tables used in
distributional analysis. The authors should have provided references to embed
their research in a larger theoretical framework. I was surprised not to find
such references.

Most of the chapter focuses on various encoding formats that may be used to
represent the structure of MWEs.

The second chapter, ''Verbal multiword expressions: Idiomaticity and
flexibility'', is written by Tali Arad Greshler, Nurit Melnik, and Shuly
Wintner. The authors focus on the work of Nunberg et al., who differentiated
between decomposable and non-decomposable MWEs based on the degree of their
flexibility. Having analyzed the results of psycholinguistic experiments and
investigations of MWEs in languages other than English, the authors of this
chapter conclude that the correlation between the decomposability of MWEs and
their transformational flexibility is language-specific, since different
languages admit of different variants of such correlation. They consider the
notion of decomposability fuzzy and difficult to apply to the classification
of idioms. The authors suggest an alternative categorization of MWEs based on
the notions of FIGURATION and TRANSPARENCY. They conjecture that
transformational productivity depends on transparency and figurativeness: the
more transparent and figurative an idiom is, the more transformationally
productive it is. Fifteen verbal MWEs were selected, and examples of
variations in their structure were retrieved from a billion-token Hebrew
corpus.

Assessing the approach suggested by Greshler et al., I'd like to note the
following. 1) I don't see any essential difference between this conception and
Nunberg's conception, which they criticize. Both are pragmatic in character,
being speaker-focused, and, as such, both need experimental data to test their
validity. Criticizing Nunberg's conception, the authors refer to
psycholinguistic data, but they never provide any experimental data to
substantiate their own conclusions. For me, ''shoot the breeze'' is figurative
and transparent in the same way as ''saw logs'', because it creates in my mind
a ''vivid picture'' of a person who speaks so fast that words leave his mouth
like shots from a gun. Perhaps other speakers' perception will be different.
Without psycholinguistic data the authors' concepts of figurativeness and
transparency remain fuzzy and unconvincing. 2) As the analysis is limited to
15 phrases, the authors, as they admit, were not able to obtain reliable
statistical data to corroborate the dependency between transparency and
flexibility. 3) The analysis is limited to verbal idiomatic expressions that,
apparently, are flexible by nature. It is not clear whether the concepts of
transparency and figurativeness can be applied to other types of idioms that
do not feature verbal components. 4) The analysis of the productivity of
idioms performed by the authors is interesting and may be used in other
investigations.

The third chapter, entitled ''Multiword expressions in an LFG grammar for
Norwegian'', is written by Helge Dyvik, Gyri Smordal Losnegaard, and Victoria
Rosen. It focuses on the methods for representing MWEs in NorGram, a
computational grammar of Norwegian developed on the basis of
Lexical-Functional Grammar. The authors distinguish between fixed, semi-fixed,
and syntactically flexible MWEs. LFG analysis involves two levels of syntactic
representation: constituent structure (c-structure) and functional structure
(f-structure). First, the c-structure is built with the help of phrase
structure rules and the lexicon; the f-structure is then derived from the
c-structure. The chapter gives a detailed description of the methodologies for
representing the three types of MWEs in NorGram. The authors distinguish
between eight main types of complementation patterns of phrasal verbs and
discuss specific realizations of these patterns.

On the whole, this chapter provides valuable material about the integration of
MWEs in LFG analysis that may be of interest for the investigation of MWEs not
only in Norwegian but also in other languages.

The fourth chapter, ''Issues in Parsing MWEs in an LFG/XLE framework'', is
authored by Stella Markantonatou, Niki Samaridi, and Panagiotis Minos. It
deals with a system for parsing Modern Greek multiword expressions with
LFG/XLE grammars. The general idea that underlies the chapter is the
differentiation between fixed and flexible parts of MWEs, the former treated
as words with spaces (WWS), i.e. single syntactic and semantic units. The
system comprises four modules, viz. a part-of-speech tagger, a lexicographic
tool for the formal description of MWEs, a filter, and LFG/XLE grammars.

The idea of differentiating between flexible and fixed parts of MWEs is sound,
but its realization is far from perfect. Modern Greek is a morphologically
rich language with relatively free word order, and it is not clear why the
authors decided to employ LFG, which was developed for English, a language of
a different type with poor morphology and rigid word order. They point to some
problems they faced in applying LFG (pp. 111-113) without providing convincing
arguments for choosing LFG.

The scheme that represents the parsing system's architecture (p. 110) is
inadequate, as it lacks any preprocessing module that performs lexical and
syntactic decomposition, as well as the formatter module, to which the authors
refer (p. 119) without giving any description. A general requirement for a
paper that hinges upon the functioning of a computer system is that the system
be represented in a data flow diagram. This chapter lacks such a diagram. The
screenshots that illustrate the functioning of the main modules (figures 3-5)
are of low quality, and some of their parts are not discernible. These
drawbacks significantly diminish the scientific quality of the chapter.

The fifth chapter, ''Multiword expressions in multilingual applications within
the Grammatical Framework'', written by Krasimir Angelov, focuses on the ways
MWEs are represented in the Grammatical Framework, a programming language for
developing multilingual applications, such as machine translation and question
answering systems. It describes methods for encoding linguistic units in
Grammatical Framework. The author suggests factorization as a methodology for
analyzing MWEs with non-compositional meaning.

It was difficult for me to grasp the aim of the paper because the author
constantly refers to difficulties that Grammatical Framework faces, coming to
the conclusion that ''translation via the vanilla resource grammar is far from
perfect'' (p. 144), and ''current case by case solution does not scale well
for open domain applications'' (ibid.). The so-called ''factorization'' is
illustrated by examples of sentences that don't have any idiomatic
expressions; moreover, the statement that the translation is non-compositional
is incorrect, because the German equivalent for ''My name is John'' may be
''Mein Name ist John'', which is quite acceptable in modern German. Generally,
many languages that have simple verbal predicates to express this idea also
have predicative variants, cf. ''me llamo Alex'' and ''mi nombre es Alex'' in
Spanish. The author didn't provide any evidence of the usefulness of
Grammatical Framework for the interpretation of MWEs.

The sixth chapter, ''Statistical MWE-aware parsing'', written by Matthieu
Constant, Guelsen Eryigit, Mike Rosner, and Gerold Schneider, focuses on
different approaches that have been developed for statistical MWE-aware
parsing. The chapter opens with a brief overview of the main approaches to
statistical parsing, statistical and dependency formalisms, and
transition-based and graph-based approaches to dependency parsing. It then
outlines chunking, subtree, and multilayer representations of MWEs.
Identification of MWEs may be performed before or after sentence parsing, so
there are two main approaches, based on preprocessing and post-processing
respectively. The preprocessing approach involves two types of methodologies:
concatenation, when an MWE is identified as a single token during
tokenization, and substitution, when the MWE is replaced by its head word. The
authors show the advantages and disadvantages of these methodologies.
Discussing post-processing approaches, they soundly distinguish between MWE
identification and discovery. Identification is the process of recognizing
MWEs in context, while discovery aims at creating a lexicon of MWE types from
some other lexicon. The chapter demonstrates how the use of T-score and Yule's
K filters allows effective recognition of MWEs by estimating the degree of
non-modifiability of candidate expressions. Precision, recall and F-score
metrics are used to evaluate this type of parser. In the case of dependency
parsing, the number of dependencies produced by the parser should equal the
total number of dependencies in the gold-standard parse tree. Common metrics
to evaluate this type of parser include the percentage of tokens with the
correct head and the percentage of tokens with the correct head and dependency
label. The authors show how the identification of MWEs affects the quality of
parsing.
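
The two dependency metrics just mentioned are commonly called the unlabeled
and labeled attachment scores (UAS and LAS). A minimal Python sketch of how
they are computed follows; the function name, the input format, and the toy
parse are my own assumptions, not material from the chapter.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for one sentence.

    `gold` and `predicted` are lists with one (head_index, label) pair per
    token; head index 0 stands for the artificial root.
    """
    if len(gold) != len(predicted):
        raise ValueError("both parses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    n = len(gold)
    # UAS: share of tokens whose head is correct, labels ignored.
    correct_heads = sum(1 for (gh, _), (ph, _) in zip(gold, predicted)
                        if gh == ph)
    # LAS: share of tokens whose head and dependency label are both correct.
    correct_both = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct_heads / n, correct_both / n


# Toy analysis of "She took a haircut" (tokens numbered from 1).
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 0.75, LAS = 0.50
```

In the toy parse the determiner is attached to the wrong head and the object
carries the wrong label, so three of four heads and two of four head-label
pairs are correct.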

This chapter is a substantial review that provides useful information about
MWE representation, orchestration, and external resource integration. It can
be of interest to many experts in the field of natural language processing.

Chapter Seven, entitled ''Investigating the effect of automatic MWE
recognition on CCG parsing'', is written by Miryam de Lhoneux, Omri Abend, and
Mark Steedman. It focuses on the impact of MWE recognition on parsing with
Combinatory Categorial Grammar (CCG). CCG is a strongly lexicalized formalism
that allows for dealing with long-range dependencies and presents syntax and
lexicon as interacting modules. The chapter opens with a review of experiments
that prove the positive effect of correct MWE recognition on syntactic
parsing. To test how MWE recognition affects CCG parsing, the authors suggest
first recognizing MWEs in the unlabeled version of CCGbank, and then
collapsing MWEs to one lexical item in the annotated version of the treebank
and in the unlabeled test data. The experiments conducted by the authors
involved matching the results obtained on the annotated treebank and the
unlabeled test data against reference data. The results of the experiments
show that automatic MWE recognition has a positive impact on parsing accuracy
and produces a good training effect. The results also show that collapsing MWE
units to one token is most useful for MWEs made up of proper nouns.

This chapter provides valuable and far broader insight than existing works
into the impact of automatic MWE recognition on parsing quality. The authors
developed an original experimental methodology using a whole array of tools to
distinguish, for the first time, between parsing and training effects. This
methodology can be used to assess not only CCG parsing, but also parsing
within the scope of other formalisms.

The eighth chapter, ''Multilingual parsing and MWE detection'', is written by
Vasiliki Foufi, Luka Nerima, and Eric Wehrli. It focuses on collocations
consisting of content words (in contrast to stop words). The authors argue
that the identification of collocations and parsing are interrelated
processes. The common approach of treating MWEs as words-with-spaces doesn't
work well as far as collocations are concerned, because they have high
morphosyntactic flexibility. A separate section of the chapter is devoted to
the Fips parser, a multilingual parser that works on a manually built lexicon
designed to detect collocations of various types, including nominal and verbal
ones. Thanks to its built-in anaphora resolution module, it copes with the
recognition of pronominal substitution and can detect collocations whose
elements are separated by many intervening words. The parser processed the
English corpus first with the collocation detection module switched off and
then with this module switched on. It turned out that collocation knowledge
significantly improves part-of-speech recognition.

In this chapter the authors have succeeded in demonstrating the close
interrelation between collocation identification and syntactic parsing.
Provided that collocation identification is a part of the parsing process, it
can improve parsing quality by resolving lexical and syntactic ambiguities.

The ninth chapter, ''Extracting and aligning multiword expressions from
parallel corpora'', written by Nasredine Semmar, Christophe Servan, Meriama
Laib, Dhouha Bouamor, and Morgane Marchand, addresses the task of extracting
and aligning MWEs from parallel corpora. The authors adopt the classification
of Sag et al. (2002), according to which MWEs are divided into lexicalized and
institutionalized. The former are classified into semi-fixed, fixed, and
syntactically flexible expressions. Semi-fixed expressions include
non-decomposable idioms, compound nominals and proper names. Syntactically
flexible ones comprise verb-particle constructions and decomposable idioms.
Institutionalized phrases include anti-collocations.

It should be noted at once that putting fixed expressions between semi-fixed
and syntactically flexible ones (fig. 1, p. 242) is not quite logical.
Arranged according to the degree of flexibility, the order should be
''fixed'' - ''semi-fixed'' - ''syntactically flexible'', or ''syntactically
flexible'' - ''semi-fixed'' - ''fixed''. The specific expressions that the
authors give to exemplify different types of MWEs are not well chosen. Stating
that fixed expressions do not admit of morphological and syntactic variations,
they give such examples as ''nest of vipers'' and ''pomme de terre'', which
actually can be used in the plural form and cannot be considered fixed.
Exemplifying semi-fixed expressions, the authors again give the ''pomme de
terre'' phrase, pointing to the fact that it can take the plural ending (p.
243). Having included anti-collocations into institutionalized phrases (fig.
1, p. 242), they state that institutionalized phrases ''often refer to
'collocations'...'' (p. 244). Why these phrases are termed
''anti-collocations'' remains completely unclear.

The main part of the chapter falls into two distinct sections. The first one
deals with MWE extraction and alignment. The other section hinges upon the
impact of MWE alignment on the Moses machine translation system. The authors
suggest three methods for evaluating this impact: the ''corpus'' method, the
''table'' method, and the ''feature'' method. It turned out that the best
improvement was achieved by using the ''feature'' method.

The main drawback of the main part of the chapter is that the authors often do
not give information about the source material they use. They exemplify the
statistical approach by English and equivalent French sentences (Table 1, p.
246) without giving any information about them. Why did the authors select
these specific sentences? Where were the sentences taken from? Are the
sentences representative for the given task? The authors didn't provide
information to answer these questions. The same goes for the 12 phrases in
Table 3 (p. 249), the text material in Table 4 (p. 252), figures 3 and 4 (p.
253), and figure 5 (p. 255). The lack of information about source material
significantly diminishes the scientific quality of the chapter and undermines
the validity of the experimental results.

The last chapter, ''Cross-lingual linking of multi-word entities and
language-dependent learning of multiword entity patterns'', written by
Guillaume Jacquet, Maud Ehrmann, Jakub Piskorski, Hristo Tanev, and Ralf
Steinberger, deals with the recognition of names of organizations (NOOs) in
''Europe Media Monitor'' (EMM), a meta-news platform that gathers about
300,000 news articles per day in about 70 languages. Recognition of NOOs
presents difficulties because of the large number of acronyms that have to be
associated with long forms. Long (expanded) forms may differ in length (cf.
''Space Station'' and ''International Space Station'') and may take
inflections. Thus, one acronym may correspond to several expanded variants. As
EMM is very large, using traditional linguistic tools such as POS tagging,
which underlies parsing, was problematic, and the authors decided to develop
an original methodology that does not rely on them. The authors developed four
aggregation methods: monolingual expansion aggregation, multilingual expansion
aggregation, aggregation based on similar tokens, and aggregation based on
translated tokens. To the last two methods they applied cosine and CombMNZ
similarity measures. The efficiency of the developed methods was assessed
against a gold standard in terms of precision and recall. Multilingual
expansion aggregation showed the best result. A special section of the chapter
focuses on the task of learning the structural patterns of MWEs to facilitate
the recognition of new, previously unseen MWEs. To collect source material the
authors used BabelNet, a semantic network that contains about 7.7 million
named-entity-related synsets. They developed a metalanguage to encode the
structural patterns of NOOs. Each pattern includes a natural language unit
(surface form) and a token class element. Based on the combination of these
parameters, the authors performed filtering to significantly reduce the number
of patterns. To assess the quality of NOO recognition, the authors matched
their patterns against two existing named-entity annotated corpora and
obtained promising results.
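
To give an idea of what a token-based cosine measure looks like, here is a
minimal Python sketch that compares two expanded forms by their word counts.
It only illustrates the general technique: the function name and the
unweighted counts are my assumptions, and the chapter's actual term weighting
and its CombMNZ variant are not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(name_a, name_b):
    """Cosine of the angle between the token-count vectors of two names."""
    a = Counter(name_a.lower().split())
    b = Counter(name_b.lower().split())
    # Dot product over the tokens the two names have in common.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty name is similar to nothing
    return dot / (norm_a * norm_b)


# Two expanded forms mentioned in the review.
score = cosine_similarity("Space Station", "International Space Station")
print(round(score, 3))  # 2 / sqrt(2 * 3), roughly 0.816
```

Variants whose score exceeds some threshold could then be aggregated under one
entity; the choice of the threshold is an empirical matter.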

This chapter is an example of research that relies on term weighting and
similarity metrics without using sophisticated and resource-consuming
linguistic techniques, such as POS tagging and parsing. This approach
resembles in a way the one I suggested earlier [2]. The authors have done a
lot of work developing numerous methodologies for the extraction and
recognition of names of organizations that may be of interest to researchers
investigating problems in named-entity processing.

EVALUATION

The book comprises chapters that differ in size and quality, the longest being
Chapter 3 (40 pages) and the shortest Chapter 5 (20 pages). The latter reads
as a stand-alone paper rather than a chapter and falls outside the book's
scope.

MWE is an umbrella term used to denote various linguistic units that can be
classified according to semantic, syntactic, pragmatic and functional
criteria. According to the functional criterion, parenthetical and connective
constructions (''on the one hand'', ''because of'') can be distinguished;
numerical expressions, light verbs (''give a laugh'', ''have a meal''), and
verbs with postpositions (''take off'') are expressions that can be
differentiated by the syntactic criterion; according to the semantic
criterion, multiword expressions that denote one object fall into one group
(''New York'', ''hot dog''); greetings, farewells, and metaphoric
constructions (''as thin as a rail'') may be differentiated by the pragmatic
criterion. I give this brief classification (based on an idea of the authors
of Chapter 7) to show how vast the domain of multiword expressions research
is. And I can't say that the book gives a full picture of this domain. It does
not contain a single comment on the apparent metaphorical nature of pragmatic
phraseological constructions that are used to produce a stylistic effect on
the listener. Metaphor processing [3] has been developing intensively during
the last decades and can provide valuable information about the structure of
these constructions that might have been of use to the authors of the second
chapter, as they use ''figuration'' as a distinctive feature of such
constructions. The chapters of the book rely heavily on existing tools for MWE
processing instead of creating new ones. The only exception is Chapter 8.
Nevertheless, many chapters present original experimentation methodologies
that may be of interest to experts and researchers in various fields of
natural language processing.

Acknowledgement

This review was written thanks to the support of the Russian Foundation for
Basic Research, grant 20-07-00124.

REFERENCES 

1. Tan, P. N. et al. (2005). Cluster analysis: basic concepts and algorithms.
URL: https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf

2. Yatsko, V. A. (2013). The algorithms for proper names recognition. In:
Nauchno-technicheskay informatsia. Series 2, no. 5, pp. 34-39. (In Russian).

3. Shutova, E. et al. (2013). Statistical metaphor processing. In:
Computational linguistics, vol. 39, no. 2, pp. 301-353. URL:
https://www.aclweb.org/anthology/J13-2003.pdf

ABOUT THE REVIEWER

Viatcheslav Yatsko is an independent researcher, ScD, an expert in
computational linguistics http://yatsko.zohosites.com/
