LINGUIST List: Vol-27-2006. Mon May 02 2016. ISSN: 1069 - 4875.

Subject: 27.2006, Review: Computational Ling; General Ling: Harispe, Ranwez, Janaqi, Montmain (2015)

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Sara Couture <sara at linguistlist.org>
================================================================


Date: Mon, 02 May 2016 12:09:21
From: Emiel van Miltenburg [emiel.van.miltenburg at vu.nl]
Subject: Semantic Similarity from Natural Language and Ontology Analysis

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36138197


Book announced at http://linguistlist.org/issues/26/26-4929.html

AUTHOR: Sébastien Harispe
AUTHOR: Sylvie Ranwez
AUTHOR: Stefan Janaqi
AUTHOR: Jacky Montmain
TITLE: Semantic Similarity from Natural Language and Ontology Analysis
SERIES TITLE: Synthesis Lectures on Human Language Technologies
PUBLISHER: Morgan & Claypool Publishers
YEAR: 2015

REVIEWER: Emiel van Miltenburg, Vrije Universiteit Amsterdam

Reviews Editor: Helen Aristar-Dry

INTRODUCTION

In the words of the authors, “this book proposes an extended introduction to
semantic measures targeting both students and domain experts” (p. xiii) in the
field of Natural Language Processing. The emphasis of the book is on semantic
measures of similarity and relatedness, as derived from natural language (text
corpora) and knowledge bases (WordNet, domain ontologies, thesauri and
encyclopedias). 

SUMMARY

Chapter 1 -- introduction to semantic measures.
This chapter opens with an overview of application areas for ‘semantic
measures’, after which the authors briefly discuss psychological models of
similarity (spatial models, feature models, alignment models, and
transformational models). Following this, the authors formally define the
notions of ‘semantic measures’ (an umbrella term covering all measures that
quantify some semantic relation), ‘relatedness’ and ‘similarity’, leading up
to a classification of semantic measures according to the following four
aspects:

“1. The type of elements that the measure aims to compare.
2. The semantic proxies used to extract the semantics required by the measure.
3. The semantic evidence and assumptions considered during the comparison.
4. The canonical form adopted to represent an element and how to handle it.”
(p. 22)

The second aspect is the main one used to structure the book, with Chapter 2
devoted to measures using unstructured or semi-structured texts, and Chapter 3
devoted to knowledge-based measures. The other aspects are treated as
secondary and are referenced in the discussion of the relevant measures.

Chapter 2 -- corpus-based semantic measures.
This chapter starts with the observation that most corpus-based semantic
measures are either implicitly or explicitly based on the ‘distributional
hypothesis’: the idea that words occurring in similar contexts convey similar
meanings. (The authors restrict themselves to distributional measures that
compare words, excluding measures that compare texts or sentences.) It
continues with a general description of how count-based distributional models
work. Next, the chapter discusses the meaning of words and the difference
between syntagmatic and paradigmatic contexts, after which the authors
introduce the field of distributional semantics. This explanation is followed
by an overview of different ‘distributional measures’. The authors briefly
mention the set-based approach (e.g. Bollegala et al. 2007) and the
probabilistic approach (e.g. Dagan et al. 1999), but the emphasis is on ‘the
geometric or spatial approach’. The authors highlight LSA, ESA, HAL, Schütze’s
Word Space, Random Indexing and COALS (Deerwester et al. 1990; Gabrilovich and
Markovitch 2007; Lund and Burgess 1996; Schütze 1993; Kanerva et al. 2000;
Rohde et al. 2006) as the most popular approaches. The chapter closes with a
list of the main advantages and limitations of corpus-based measures, and a
final summary.
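
To make the count-based models discussed in this chapter concrete, here is a
minimal sketch (not taken from the book, and not any specific published
measure): it builds a word-word co-occurrence matrix from an invented toy
corpus, using the whole sentence as the context window, and compares words
with cosine similarity in the resulting vector space.

# Minimal count-based distributional model (illustrative sketch only).
# The toy corpus and sentence-wide context window are invented; real
# systems use large corpora and weighting schemes such as PMI.
from collections import Counter
from itertools import combinations
import math

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
]

# Symmetric word-word co-occurrence counts within each sentence.
cooc = Counter()
for sentence in corpus:
    for w1, w2 in combinations(sentence, 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

vocab = sorted({w for s in corpus for w in s})

def vector(word):
    # Row of the co-occurrence matrix for this word.
    return [cooc[(word, other)] for other in vocab]

def cosine(u, v):
    # The standard geometric similarity measure in vector-space models.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vector("cat"), vector("dog")))     # distributionally similar
print(cosine(vector("cat"), vector("cheese")))  # less similar contexts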

Chapter 3 -- knowledge-based semantic measures.
This is the longest chapter in the book. It starts by explaining the idea of
representing ontologies as graphs, introducing the necessary formal notation.
The authors then make a distinction between cyclic and acyclic graphs, and
explore different semantic measures for cyclic graphs. Before discussing
semantic measures for acyclic graphs, however, the authors take a detour to
discuss graph properties that can be used to compute semantic measures. This
discussion is followed by an extensive overview of pairwise and groupwise
semantic similarity measures that make use of structured taxonomies. The
authors dedicate a short section to other knowledge-based measures before
providing a list of the main advantages and limitations of knowledge-based
measures. Finally, the authors dedicate a section to hybrid approaches that
mix knowledge-based and corpus-based approaches, followed by a short
conclusion.
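
To illustrate the kind of pairwise, taxonomy-based measure surveyed in this
chapter, the sketch below computes a Wu-Palmer-style similarity over a small
hand-built taxonomy, using the depth of the lowest common subsumer. The
taxonomy and all node names are invented for the example; the measures in
the book operate over full ontologies such as WordNet or domain ontologies.

# Wu-Palmer-style similarity over an invented toy taxonomy (sketch only).
PARENT = {
    "cat": "feline", "feline": "carnivore", "dog": "canine",
    "canine": "carnivore", "carnivore": "mammal", "mammal": "animal",
    "salmon": "fish", "fish": "animal", "animal": None,
}

def ancestors(node):
    # Path from a node up to the root (inclusive).
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    return len(ancestors(node))  # the root has depth 1

def lcs(a, b):
    # Lowest common subsumer: the deepest ancestor shared by a and b.
    shared = set(ancestors(a)) & set(ancestors(b))
    return max(shared, key=depth)

def wu_palmer(a, b):
    # sim(a, b) = 2 * depth(lcs(a, b)) / (depth(a) + depth(b))
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

print(wu_palmer("cat", "dog"))     # share 'carnivore': 0.6
print(wu_palmer("cat", "salmon"))  # share only 'animal': 0.25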

Chapter 4 -- methods and datasets for the evaluation of semantic measures.
This chapter provides an overview of the datasets that have been used to
evaluate semantic measures. It starts with a general introduction to
semantic measure evaluation, and goes on to explain different criteria that
one may apply in an evaluation (e.g. the computational complexity of the
evaluation). These criteria may help researchers select the right dataset to
evaluate their own semantic measure. After discussing direct versus indirect
evaluation strategies, the authors first list a large number of datasets, and
then provide additional details for each of them (e.g. how the evaluation
data was created). The chapter closes with a discussion, noting that more
research is needed to determine how to better evaluate semantic measures.
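
The most common form of direct evaluation discussed here correlates a
measure’s scores with human similarity ratings, typically using a rank
correlation coefficient. The sketch below shows that recipe with invented
word pairs, ratings and scores; it is not tied to any particular dataset or
measure from the book.

# Direct evaluation sketch: rank-correlate a measure's scores with human
# similarity ratings. All pairs, ratings and scores below are invented.
from scipy.stats import spearmanr

human_ratings = {
    ("car", "automobile"): 9.5,
    ("car", "bicycle"): 5.0,
    ("car", "banana"): 1.0,
}

def my_measure(w1, w2):
    # Stand-in for any semantic measure returning a similarity score.
    toy_scores = {("car", "automobile"): 0.9,
                  ("car", "bicycle"): 0.6,
                  ("car", "banana"): 0.1}
    return toy_scores[(w1, w2)]

pairs = list(human_ratings)
gold = [human_ratings[p] for p in pairs]
pred = [my_measure(*p) for p in pairs]

rho, _ = spearmanr(gold, pred)
print("Spearman correlation:", round(rho, 2))  # 1.0: identical ranking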

Chapter 5 -- conclusion and research directions.
This chapter first summarizes what has been covered in the preceding chapters,
and then presents several suggestions for future research. These are:
 “* Better characterize semantic measures and their semantics; 
* Provide theoretical and software tools for the study of semantic measures; 
* Standardize ontology handling; 
* Improve models for compositionality; 
* Study current models of semantic measures w.r.t. language specificities; 
* Promote interdisciplinary efforts; 
* Study algorithmic complexity of measures; 
* Support context-specific selection of semantic measures.” (pp. 159-160)
It is beyond the scope of this summary to cover all these suggestions in more
detail, but I would like to commend the authors on their extensive list of
suggestions for further research.

The book has four appendices, for which the titles are mostly
self-explanatory:
Appendix A -- examples of syntagmatic contexts (5 pages).
Appendix B -- a brief introduction to Singular Value Decomposition (2 pages).
Appendix C -- a brief overview of other models for representing units of
language (7 pages). This appendix covers two different kinds of language
models: n-gram models and Neural Network Language Models (NNLMs), discussing
only the basic intuitions behind these models. The appendix closes with a
discussion of compositionality in distributional semantics (the idea of
building a sentence representation by mathematically combining word vectors),
offering some references for further exploration of the subject (a minimal
sketch of additive composition follows after this list).
Appendix D -- software tools and source code libraries (9 pages).
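
As a minimal illustration of the compositionality idea touched on in
Appendix C, the sketch below builds phrase representations by simply adding
word vectors and compares the results with cosine similarity. The
three-dimensional vectors and phrases are invented toy data, and additive
composition is only the simplest of the approaches the appendix points to.

# Additive composition sketch: phrase vector = sum of its word vectors.
# The toy 3-dimensional vectors are invented; real models learn vectors
# from corpora.
import numpy as np

word_vectors = {
    "black": np.array([0.1, 0.8, 0.2]),
    "dark":  np.array([0.2, 0.7, 0.1]),
    "cat":   np.array([0.9, 0.1, 0.3]),
    "dog":   np.array([0.8, 0.2, 0.4]),
}

def phrase_vector(phrase):
    # Compose a phrase representation by summing its word vectors.
    return sum(word_vectors[w] for w in phrase.split())

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(phrase_vector("black cat"), phrase_vector("dark dog")))
print(cosine(phrase_vector("black cat"), word_vectors["dog"]))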

EVALUATION

Production

Production-wise, there are two minor issues with this book. First, the style
of the figures is not uniform, and the images frequently suffer from
compression artifacts. Second, because the authors are not native speakers of
English, the prose is sometimes a bit awkward (e.g. “the hypothesis that has
been aforementioned”, p. 42) and the style is at times rather dense.

Coverage & audience

The authors note that “[c]ommunities of Psychology, Cognitive Sciences,
Linguistics, Natural Language Processing, Semantic Web, and Biomedical
informatics are among the most active contributors” to the field (pp. 2-3).
But in their discussion of semantic similarity, they restrict themselves
mostly to the latter three. This is of course excusable for a book in a series
on human language technologies, but readers mostly interested in Psychology,
Cognitive Science or Linguistics-related aspects of similarities should look
elsewhere. The authors note that their section on the Psychology of similarity
is based on (Hahn 2011), but readers interested in a written overview could
also consult (Hahn and Heit 2001), which not only covers the same ground but
also makes the connection with later work in distributional semantics. For an
introduction to distributional semantics and Latent Semantic Analysis, see
(Landauer and Dumais 1997). (For a book-length introduction to distributional
semantics, see Widdows 2004.) One might accompany this with Gärdenfors’ (2000)
seminal work on Conceptual Spaces, or either one of  (Margolis and Laurence
1999) or (Murphy 2002) for a general overview of theories of conceptual
representation. More experimental (and cross-cultural) work has been done by
Malt et al. (1999), Khetarpal et al. (2010) and others (see also their
references). A book connecting this body of work with current advances in
human language technologies is yet to be written. 

The core of the book (Chapters 2 and 3) is about corpus-based and
ontology-based semantic measures. The expertise of the authors clearly lies in
the field of ontology analysis. This book can be best understood as an attempt
to put their knowledge (as exemplified in Chapter 3) in a broader perspective.
That also explains why the 29-page chapter on corpus-based semantic measures
has a relatively narrow focus “for the sake of clarity and due to space
constraints”, while the chapter on knowledge-based semantic measures runs to
72 pages. The extensive coverage of these measures does make Chapter
3 a solid reference for ontology-related matters.

So what about the coverage of corpus-based semantic measures? Chapter 2
provides a decent introduction to what Baroni et al. (2014) call ‘count
models’ (i.e. distributional models that build up a matrix of (word-document
or word-word) co-occurrence counts, and then transform that matrix to get
vector representations corresponding to word meanings), but even the newest
model the authors discuss in detail is almost ten years old. Meanwhile,
‘predict models’ (based on neural networks) have taken the field by storm
since Mikolov et al. (2013) released their word2vec tool. Participants at
EMNLP 2015 (Empirical Methods in Natural Language Processing) were even joking
that the ‘E’ in EMNLP now stands for ‘embeddings’! Right now, predict models
are only covered in Appendix C, which almost seems to have been added as an
afterthought. Given the ubiquity of these models, this is a mistake. Other
recent developments, such as multimodal distributional semantics (MDS; Bruni
et al. 2014; Baroni 2015), aren’t even mentioned. This is a missed opportunity
for a textbook that wishes to “stimulate creativity toward the development of
new approaches” (p. xiv), because MDS leads us to rethink what we mean by the
‘context’ of a word, and to discuss whether distributional models are (or
could be made) biologically plausible. This is an exciting avenue of research
in Cognitive Science that sadly isn’t touched upon. The second chapter, then, is
a no-frills introduction to the very basics of distributional semantics. It is
still usable in a classroom context, but might be supplemented with additional
literature. With respect to neural network language models, Yoav Goldberg’s
excellent ‘A Primer on Neural Network Models for Natural Language Processing’
(draft available through: http://u.cs.biu.ac.il/~yogo/nnlp.pdf) provides a good
introduction. If Goldberg’s introduction is too much, Chris Olah’s blog
(http://colah.github.io/) has some very accessible explanations of language
models and deep learning. (Also see the reading list at
http://deeplearning.net/reading-list/)
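
For readers who want to see what a ‘predict model’ looks like in practice,
the sketch below trains a tiny word2vec model with gensim and queries it for
similarities. It assumes gensim version 4 or later (earlier versions use
different parameter names), and the toy corpus is invented; real models need
far more data.

# Training a small 'predict model' with gensim's word2vec implementation
# (assumes gensim >= 4.0; toy corpus invented for illustration).
from gensim.models import Word2Vec

sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
]

model = Word2Vec(sentences, vector_size=25, window=2, min_count=1, epochs=50)

# Learned word vectors live in model.wv; similarity is cosine similarity.
print(model.wv.similarity("cat", "dog"))
print(model.wv.most_similar("cat", topn=3))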

The remaining chapters give a good overview of the field, and include all the
necessary references to embark on a research project on semantic similarity.
Despite its shortcomings, this book does make a solid reference work on
knowledge-based similarity measures, and it provides a good overview of the
evaluation protocols that are currently available.

REFERENCES

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! a
systematic comparison of context-counting vs. context-predicting semantic
vectors. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Vol. 1, pp. 238-247).

Baroni, M. (2015). Grounding Distributional Semantics in the Visual World.
Language and Linguistics Compass.

Bollegala, D., Matsuo, Y., & Ishizuka, M. (2007). An Integrated Approach to
Measuring Semantic Similarity between Words Using Information Available on the
Web. In HLT-NAACL (pp. 340-347).

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional
semantics. Journal of Artificial Intelligence Research (JAIR), 49, 1-47.

Dagan, I., Lee, L., & Pereira, F. C. (1999). Similarity-based models of word
cooccurrence probabilities. Machine Learning, 34(1-3), 43-69.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman,
R. A. (1990). Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6), 391-407.

Gärdenfors, P. (2000). Conceptual spaces: The geometry of thought. Cambridge,
MA: MIT Press.

Gabrilovich, E., & Markovitch, S. (2007). Computing Semantic Relatedness Using
Wikipedia-based Explicit Semantic Analysis. In IJCAI (Vol. 7, pp. 1606-1611).

Hahn, U. (2011). What makes things similar? Invited talk at the 1st
International Workshop on Similarity-Based Pattern Analysis. URL:
http://videolectures.net/simbad2011_hahn_similar/

Hahn, U., & Heit, E. (2001). Semantic similarity, cognitive psychology of. In
N. J. Smelser & P. B. Baltes (Eds.), International Encyclopedia of the Social
& Behavioral Sciences (pp. 13878-13881). Oxford: Pergamon.

Kanerva, P., Kristofersson, J., & Holst, A. (2000). Random indexing of text
samples for latent semantic analysis. In Proceedings of the 22nd annual
conference of the cognitive science society (Vol. 1036). Mahwah, NJ: Erlbaum.

Khetarpal, N., Majid, A., Malt, B. C., Sloman, S., & Regier, T. (2010).
Similarity judgments reflect both language and cross-language tendencies:
Evidence from two semantic domains. In S. Ohlsson, & R. Catrambone (Eds.),
Proceedings of the 32nd Annual Conference of the Cognitive Science Society
(pp. 358-363). Austin, TX: Cognitive Science Society.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The
latent semantic analysis theory of acquisition, induction, and representation
of knowledge. Psychological review, 104(2), 211.

Lund, K., & Burgess, C. (1996). Hyperspace analogue to language (HAL): A
general model of semantic representation. Brain and Cognition, 30(3), 5.

Malt, B. C., Sloman, S. A., Gennari, S., Shi, M., & Wang, Y. (1999). Knowing
versus naming: Similarity and the linguistic categorization of artifacts.
Journal of Memory and Language, 40(2), 230-262.

Margolis, E., & Laurence, S. (1999). Concepts: Core readings. Cambridge, MA:
MIT Press.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of
word representations in vector space. In Proceedings of Workshop at ICLR.

Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT Press.

Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. (2006). An improved model of
semantic similarity based on lexical co-occurrence. Communications of the ACM,
8, 627-633.

Schütze, H. (1993). Word space. In Advances in Neural Information Processing
Systems 5.

Widdows, D. (2004). Geometry and meaning (Vol. 773). Stanford: CSLI
Publications.


ABOUT THE REVIEWER

Emiel van Miltenburg is a PhD candidate working at the Vrije Universiteit
Amsterdam, under the supervision of Piek Vossen. His research interests
include conceptual representation, semantic similarity, pragmatics and natural
language processing.




