29.3243, Review: Computational Linguistics; Text/Corpus Linguistics: Ide, Pustejovsky (2017)

The LINGUIST List linguist at listserv.linguistlist.org
Wed Aug 22 17:49:04 UTC 2018


LINGUIST List: Vol-29-3243. Wed Aug 22 2018. ISSN: 1069 - 4875.

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Jeremy Coburn <jecoburn at linguistlist.org>
================================================================


Date: Wed, 22 Aug 2018 13:48:37
From: Emmanuel Schang [emmanuel.schang at univ-orleans.fr]
Subject: Handbook of Linguistic Annotation

 
Discuss this message:
http://linguistlist.org/pubs/reviews/get-review.cfm?subid=36364037


Book announced at http://linguistlist.org/issues/28/28-2969.html

EDITOR: Nancy Ide
EDITOR: James Pustejovsky
TITLE: Handbook of Linguistic Annotation
PUBLISHER: Springer
YEAR: 2017

REVIEWER: Emmanuel Schang, University of Orléans

SUMMARY

This handbook is edited by Nancy Ide (Vassar College) and James Pustejovsky
(Brandeis Univ.) and gathers 54 chapters in two volumes, for a total of 1459
pages. The first volume (438 pages) collects papers on methodological and
theoretical aspects (15 chapters plus an introduction), while the second
volume presents detailed case studies. It is aimed at a broad audience of
students and scholars in linguistics and/or computer science, and assumes no
programming background.

Volume 1 ('The Science of Annotation') starts with an introduction by Nancy
Ide, which gives an overview of both volumes and describes the context in
which linguistic annotation emerged as an important matter for linguistics
and natural language processing (NLP). Initially, linguistic annotations were
conceived to validate corpus linguistic theories. But, as Nancy Ide notes,
''Over the past three decades, advances in computing power
and storage together with development of robust methods for automatic
annotation have made linguistically-annotated data increasingly available in
ever growing quantities. As a result, these resources now serve not only
linguistic studies, but also the field of natural language processing (NLP),
which relies on linguistically-annotated text and speech corpora to evaluate
new human language technologies and crucially, to develop reliable statistical
models for training these technologies''. She indicates that ''the goal of
this volume is to provide a comprehensive survey of the development and
state-of-the-art for linguistic annotation of language resources, including
methods for annotation scheme design, annotation creation, physical format
considerations, annotation tools, annotation use, evaluation, etc.''

The volume continues with theoretical papers ['Designing Annotation Schemes:
From Theory to Model' (James Pustejovsky, Harry Bunt and Annie Zaenen);
'Designing Annotation Schemes: From Model to Representation' (Nancy Ide,
Christian Chiarcos, Manfred Stede and Steve Cassidy); and 'Community Standards
for Linguistically-Annotated Resources' (Nancy Ide, Nicoletta Calzolari,
Judith Eckle-Kohler, Dafydd Gibbon, Sebastian Hellmann, Kiyong Lee, Joakim
Nivre and Laurent Romary)], which offer both an overview of the field and a
historical perspective on this recent domain. Chapter 1 presents the MATTER
methodology (an acronym for Model, Annotate, Train, Test, Evaluate, Revise), which
aims at improving the design of annotation schemes in a back-and-forth
exchange between the data and the model. Chapter 2 presents an overview of the
representation formats and discusses the issues related to the choice of
format (from SGML to XML and TEI) and Chapter 3 presents the history and key
concepts of the standards for linguistic resources (ISO, TEI, LAF, etc.).

This volume also collects chapters on annotation tools and procedures [Overview
of Annotation Creation: Processes and Tools (Mark A. Finlayson and Tomaž
Erjavec); The Evolution of Text Annotation Frameworks (Graham Wilcock);
Tools for Multimodal Annotation (Steve Cassidy and Thomas Schmidt);
Collaborative Web-Based Tools for Multi-layer Text Annotation (Chris Biemann,
Kalina Bontcheva, Richard Eckart de Castilho, Iryna Gurevych and Seid Muhie
Yimam)]. In particular, Section 4 of ''Overview of Annotation Creation:
Processes and Tools'' surveys the features of a large number of annotation
tools, which is very handy. G. Wilcock's chapter is more technical and will be
more useful to computer engineers than to linguists. It mainly discusses the
difference between an annotation pipeline and an annotation framework.

The following chapters focus on techniques and methods: Iterative Enhancement
(Markus Dickinson and Dan Tufiş); Crowdsourcing (Massimo Poesio, Jon
Chamberlain and Udo Kruschwitz); Machine Learning for Higher-Level Linguistic
Tasks (Anna Rumshisky and Amber Stubbs); Sustainable Development and
Refinement of Complex Linguistic Annotations at Scale (Dan Flickinger, Stephan
Oepen and Emily M. Bender); Linguistic Annotation in/for Corpus Linguistics
(Stefan Th. Gries and Andrea L. Berez). The chapter by Poesio et al. is
dedicated to crowdsourcing (web collaboration for annotation) and
'games-with-a-purpose'. Delegating the linguistic annotation task to unknown
contributors (be they gamers seeking enjoyment or remote workers, as with
Amazon Mechanical Turk) is not a harmless choice, and this chapter honestly
weighs the pros and cons of these techniques.

Two chapters take on the difficult matter of evaluation [Inter-annotator
Agreement (Ron Artstein) and Ongoing Efforts: Toward Behaviour-Based Corpus
Evaluation (Takenobu Tokunaga)]. Ron Artstein raises the issue of the
reliability of the annotation and clearly explains the philosophy and the math
behind measures of inter-annotator agreement. The chapter is punctuated by
useful technical reminders and tips. The author has made a considerable effort
to remain clear and accessible to non-specialists. Tokunaga presents a different
and complementary approach, which is based on the analysis of the annotator's
behavior during the annotation task.
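
As a rough illustration of the kind of chance-corrected measure that
inter-annotator agreement work relies on, here is a minimal Python sketch of
Cohen's kappa for two annotators. The choice of this particular coefficient,
the function name cohen_kappa and the toy part-of-speech labels are
assumptions made for this illustration; they are not taken from the handbook.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (A_o - A_e) / (1 - A_e), where A_o is the observed agreement
    and A_e is the agreement expected by chance, estimated from each
    annotator's own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: for each label, the product of the two
    # annotators' relative frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one single, identical label everywhere:
        # the coefficient is undefined (division by zero).
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Hypothetical toy data: ten tokens tagged N(oun) or V(erb) by two annotators.
annotator_a = ["N", "N", "N", "N", "V", "V", "V", "V", "N", "V"]
annotator_b = ["N", "N", "N", "V", "V", "V", "V", "N", "N", "V"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this toy example the two annotators agree on 8 items out of 10, while
chance alone would give 0.5, so kappa comes out at roughly 0.6: noticeably
lower than the raw agreement of 0.8, which is exactly the point of correcting
for chance.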

Finally, this volume ends with a paper discussing the links between linguistic
theory and corpus-based studies [Developing Linguistic Theories Using
Annotated Corpora (Marie-Catherine de Marneffe and Christopher Potts)].
 
Volume 2 ('Case Studies') gathers 39 chapters which describe corpus-based
projects. The chapters provide both an overview of the content (purpose and
method) of each project and the 'lessons learned' by its team. These case
studies offer an opportunity to evaluate the design of the experiments and
the annotation schemes. Among the projects, one can cite MULTEXT-East,
OntoNotes, ISO-TimeML and several treebanks (Prague Dependency Treebank,
German Treebank, Sinica Treebank and Hindi/Urdu Treebank), to name but a few.
The reader will find here a description of the projects mentioned in Volume 1
and can go back and forth between the two volumes. Here are two examples:

- ISO-TimeML is mentioned many times in Volume 1 as an example of the MATTER
methodology. The reader will find a precise description of the project and the
annotation scheme in a dedicated chapter (pp. 941-968);

- the reader who is interested in crowdsourcing annotation projects can
navigate between a theoretical paper in Volume 1 and a project on Named Entity
Recognition using crowdsourcing.

EVALUATION

With its 54 chapters, this handbook covers the wide field of linguistic
annotation (and linguistic resource creation). Interestingly, this book
reverses the usual perspective in which just one chapter is dedicated to
linguistic annotation in an NLP handbook (see Palmer and Xue (2010) for
instance).

In recent years we have seen substantial growth in Machine Learning (ML)
techniques, and NLP has increasingly become a matter for engineers, to the
detriment of linguists. But ML techniques crucially require resources
(annotated corpora). The building of reliable resources is thus an important
matter that cannot be neglected or relegated to a subsidiary rank.

In this context, this book is an important effort towards giving linguistic
annotation full attention.  

Here, the annotation work in its various facets is foregrounded, and the
technical or practical tools stay in the background (in Volume 1). In Volume
2, the major projects and resources are detailed, and one can appreciate that
the choice of projects is well balanced between Europe and the USA.

The chapters on method and theory are written by renowned specialists, and the
case studies provide the reader with interesting lessons learned. The authors
had to follow common guidelines, which gives Volume 2 a certain consistency
despite the great diversity of the field. This makes the handbook interesting
for both computer scientists and linguists. Both will find a rich variety of
examples and technical information (tools, methods, etc.). Of course, certain
chapters about tools or machine learning are aimed more at computer
scientists than at linguists, but overall this book can be read by linguists
without specific technical skills, except perhaps a basic knowledge of XML
and document formats. Each chapter can be read alone, as is usually the case
in handbooks, but this sometimes leads to repetitions. For instance, the
MATTER cycle is presented several times: p. 22, p. 170 and p. 335. This
probably could have been avoided, but it is not a major flaw, since these
repetitions are drowned in the mass of information provided in these
chapters. Incidentally, an index would have been useful: it would have made
searching for a technical term easier.

For the reader who is still reluctant to take an interest in corpus
linguistics, I recommend starting with De Marneffe and Potts' chapter at the
end of Volume 1. In Section 2 (Intuition and Experiment, Corpora and
Experimental Methods, Competence and performance...), they provide a clever
review of the arguments for and against corpus linguistics and argue that
''corpus, introspective, and psychological methods all complement each other;
far from being in tension methodologically or philosophically, they can be
brought together to strengthen linguistic theory and increase its scope and
scientific relevance'' (p. 431).

For the enthusiastic reader willing to start a first project in linguistic
annotation, I also recommend reading Gries (2013), Reinhart (2015) and
Pustejovsky & Stubbs (2012). Indeed, this handbook will give you all you need
to design your annotation scheme and assess its quality, but the correct
interpretation of your results requires a prior (basic) knowledge of
statistics (power curve, confidence intervals, etc.), which falls outside the
scope of this book.

To summarize, this book undoubtedly belongs in every linguistics department
library as a major reference on linguistic annotation. Its price probably
puts it out of reach of individual linguists in most parts of the world (the
number of pages has its price), but since linguistic annotation projects are
meant to be carried out by teams rather than by individuals, this is not a
serious problem.

REFERENCES

Gries, S. T. (2013). Statistics for Linguistics with R: A Practical
Introduction. Walter de Gruyter.

Palmer, M., & Xue, N. (2010). Linguistic annotation. In Handbook of
Computational Linguistics and Natural Language Processing.

Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine
Learning: A Guide to Corpus-Building for Applications. O'Reilly Media.

Reinhart, A. (2015). Statistics Done Wrong: The Woefully Complete Guide. No
Starch Press.


ABOUT THE REVIEWER

Emmanuel Schang is an associate professor in syntax at the University of
Orléans (France). He's in charge of the SEEPiCLa (Structure, Emergence and
Evolution of Pidgin and Creole Languages) International Research Group (CNRS)
and he has led several projects on linguistic annotation.




