15.1589, Review: Computational Linguistics: Abeillé (2003)



LINGUIST List:  Vol-15-1589. Tue May 18 2004. ISSN: 1068-4875.

Subject: 15.1589, Review: Computational Linguistics: Abeillé (2003)

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Ogasawara <naomi at linguistlist.org>
 ==========================================================================
What follows is a review or discussion note contributed to our Book
Discussion Forum.  We expect discussions to be informal and
interactive; and the author of the book discussed is cordially invited
to join in.

If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for review." Then contact
Sheila Dooley Collberg at collberg at linguistlist.org.

=================================Directory=================================

1)
Date:  Tue, 18 May 2004 18:01:06 -0400 (EDT)
From:  Verginica Mititelu <vergi at racai.ro>
Subject:  Treebanks: Building and Using Parsed Corpora

-------------------------------- Message 1 -------------------------------

Date:  Tue, 18 May 2004 18:01:06 -0400 (EDT)
From:  Verginica Mititelu <vergi at racai.ro>
Subject:  Treebanks: Building and Using Parsed Corpora

EDITOR: Abeillé, Anne
TITLE: Treebanks
SUBTITLE: Building and Using Parsed Corpora
SERIES: Text, Speech and Language Technology, volume 20
PUBLISHER: Kluwer Academic Publishers
YEAR: 2003
Announced at http://linguistlist.org/issues/15/15-176.html


Verginica Barbu Mititelu,
Institute for Artificial Intelligence, Romanian Academy

The book is a collection of 21 papers on building and using parsed
corpora, most of them previously presented at workshops and
conferences (ATALA, LINC, LREC, EACL).

The objective of the book, as stated in the Introduction, is to
present an overview of the work being done in the field of treebanks,
the results achieved so far and the open questions. The intended
audience is linguists, including computational linguists,
psycholinguists, and sociolinguists.

The book is organized in two parts: Building treebanks (15 chapters,
pp. 1-277) and Using treebanks (6 chapters, pp. 279-389), each of
them having subparts. It also contains a preface (p. xi), an
introduction (pp. xiii-xxvi), a list of contributing authors and
their affiliations (pp. 391-397), and an index of topics (pp. 399-405).

The organization of the Introduction (written by Anne Abeillé)
mirrors the structure of the whole book, namely it has two parts,
entitled Building treebanks and Using treebanks, respectively. After
making the terminological distinction between tagged corpora and
parsed corpora (or treebanks), the author explains why treebanks are
needed and gives a general presentation of the topics covered by the
papers in the volume, stressing the fact that the problems
encountered for each language are, to a great extent, the same, which
accounts for a certain redundancy among the papers collected in this
volume.

PART I. BUILDING TREEBANKS

The chapters of the first part are grouped according to the language
or language family for which the approaches to building treebanks are
presented: the first four chapters are dedicated to English
treebanks, the next two to German ones; there are two papers on
Slavic treebanks, four on Romance parsed corpora, and the last three
chapters of the first part address treebanks for other languages
(Mandarin Chinese, Japanese, Turkish).

ENGLISH TREEBANKS

Chapter 1. The Penn Treebank: an Overview.
Ann Taylor, Mitchell Marcus, and Beatrice Santorini.

The authors present the annotation schemes and the methodology used
during the eight-year Penn Treebank project. The part-of-speech (POS)
tagset is based on that of the Brown Corpus, but adjusted to serve
the stochastic orientation of the Penn Treebank and its concern with
sparse data, and reduced to eliminate lexical and syntactic
redundancies. More than one tag can be associated with a word, thus
avoiding arbitrary decisions. POS tags also contain syntactic
functions, thus serving as a basis for syntactic bracketing. The
bracketing style was modified during the project from a skeletal
context-free bracketing with limited empty categories and no
indication of non-contiguous structures and dependencies to a style
of annotation which aimed at clearly distinguishing between arguments
and adjuncts of a predicate, recovering the structure of
discontiguous constituents, and making use of null elements and
coindexing to deal with wh-movement, passives, and the subjects of
infinitival constructions. The first objective was not always easy to
achieve via structural differences, which is why a set of easily
identifiable roles was defined, although these too sometimes proved
difficult to apply. The Penn Treebank (PTB) project also produced
disfluency annotation of transcribed conversations, labeling complete
and incomplete utterances, non-sentence elements (fillers, explicit
editing terms, discourse markers, coordinating conjunctions) and
restarts (with or without repair).

For all three annotation schemes a two-step methodology was adopted:
an automatic step (the PARTS and Brill taggers for POS tagging, the
Fidditch deterministic parser for syntactic bracketing, and a simple
Perl script identifying common non-sentential elements) followed by
hand correction.

Chapter 2. Thoughts on Two Decades of Drawing Trees.
Geoffrey Sampson.

The author develops the idea that the annotation of both written and
(transcribed) oral corpora exposes the deficiencies of theoretical
linguistics and may even contradict some widely accepted conventional
linguistic wisdom. For instance, sentences of the form subject +
intransitive verb are rather infrequent in English corpora, contrary
to what can be found in some linguistics textbooks.

Chapter 3. Bank of English and beyond.
Timo Järvinen.

The aim of this paper is twofold. On the one hand, the author
describes the four modules (pre-processing, i.e. segmentation and
tokenization; POS assignment; POS tagging; functional analysis) of
the English Constraint Grammar (ENGCG) system used for annotating
corpora for compiling the second edition of the Collins COBUILD
Dictionary of English; on the other hand, he describes the
methodology adopted given the huge amount of data to be processed:
manual inspection was possible only for random fragments of the data,
and automatic methods were created for monitoring them.

As clearly stated, the CG system was chosen for its morphological
accuracy. However, syntactic ambiguity remained too high. That is why
Järvinen argues for a Functional Dependency Grammar (FDG) parser,
which deals better with long-distance dependencies, ellipsis and
other complex phenomena. He points out the need for deep parsing,
instead of shallow parsing, his reasons being, besides the lower
ambiguity, the practical orientation of the former.

Chapter 4. Completing Parsed Corpora.
Sean Wallis.

A more challenging title for this paper could have been: ''Do we need
linguists for constructing treebanks?'' To answer this question,
S. Wallis starts by giving us a brief overview of the phases of the
annotation employed on the International Corpus of English - British
Component (ICE-GB) and by pointing out that the use of two parsers
(i.e., TOSCA and the Survey parser) increased the number of
inconsistencies in the corpus, hence the necessity of
post-correction. He provides two arguments against Sinclair (1992),
who found human annotators a source of errors in the treebank.

In order to ensure the cleanliness of the parsed corpus, one has two
problems to solve: decision (i.e. the correctness of the analysis)
and consistency (of the analysis throughout the corpus). S. Wallis
draws a distinction between longitudinal correction (that is, working
through a corpus sentence-by-sentence until it is completed) and
transverse correction (i.e. working through a corpus
construction-by-construction), bringing arguments in favor of the
latter: it is less time-consuming and allows control of both the
accuracy of the analysis and its consistency. The price paid is
difficulty in implementation and in managing the process. But once
the grammatical query facility (Fuzzy Tree Fragments) is created, it
can be used not only for correction but also for searching and
browsing the corpus for linguistic queries, so the tool has a
post-project use as well.

As clearly stated in the Critique section of Wallis's paper, the
question formulated above receives an affirmative answer if the final
aim of the corpus is not a study of the parser performance, but of
language variation.

GERMAN TREEBANKS

Chapter 5. Syntactic Annotation of a German Newspaper Corpus.
Thorsten Brants, Wojciech Skut, Hans Uszkoreit.

This paper is a presentation of the syntactic annotation of the NEGRA
newspaper corpus. Language-specific reasons (free word order, among
others), corpus structure (frequently elliptical constructions) and
the characteristics of the formalism contributed to the choice of
Dependency Grammar for the annotation. However, it was modified so as
to take advantage of phrase-structure grammar, too: flat structures,
no empty categories, treatment of the head as a grammatical function
expressed by labeling rather than by the syntactic structure,
allowance of crossing branches (which give rise to a large number of
errors), a more explicit annotation of grammatical functions, and
encoding of predicate-argument information.

A characteristic of this project is the interactive annotation
process, which makes use of the TnT statistical tagger and second
order Markov models for POS tagging. Syntactic structure is built
incrementally, using cascaded Markov models. A graphical user
interface allows for manual tree manipulation and runs taggers and
parsers in the background. Human annotators need to concentrate only
on the problematic cases, which are assigned different probabilities
by the statistical tagger and parser. Accuracy is ensured by having
two different annotators annotate the same set of sentences.
Differences are discussed and, once agreement is reached, the
corresponding modifications are applied to the annotation.
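
To make the division of labor concrete, here is a minimal Python
sketch of confidence-based triage between an automatic tagger and a
human annotator. It is only an illustration of the general idea, not
the NEGRA tools: the tagger interface (a function returning a tag and
a probability) and the 0.95 threshold are assumptions.

def triage(tokens, tag_with_confidence, threshold=0.95):
    """Accept high-confidence automatic tags; queue the rest for a human."""
    accepted, to_review = [], []
    for token in tokens:
        tag, probability = tag_with_confidence(token)
        if probability >= threshold:
            accepted.append((token, tag))
        else:
            to_review.append((token, tag, probability))
    return accepted, to_review

# Toy stand-in for a statistical tagger; 'bank' is deliberately uncertain.
def toy_tagger(token):
    lexicon = {"the": ("DET", 0.99), "dog": ("NN", 0.97), "bank": ("NN", 0.55)}
    return lexicon.get(token, ("NN", 0.40))

accepted, to_review = triage(["the", "dog", "bank"], toy_tagger)
print(accepted)    # tags kept as proposed
print(to_review)   # tags routed to the human annotator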

The design of the corpus and the annotation scheme make it usable for
different linguistic investigations and also for training taggers and
chunkers.

Chapter 6. Annotation of Error Types for German Newsgroup Corpus.
Markus Becker, Andrew Bredenkamp, Berthold Crysmann, Judith Klein.

This paper presents the error annotation work carried out in the
development of controlled language and grammar checking applications
for German. The corpus in the FLAG project consisted of email
messages (as they present the required characteristics: high error
density, accessibility, electronic availability). Their annotation
proceeded in three phases: the development of a typology of
grammatical errors in the target language (German), manual annotation
on paper, and annotation by means of computer tools.

The first phase relied on traditional grammar books and its outcome
was a type hierarchy of possible errors, also containing error
domains (i.e. the relations between the affected words), useful in
guiding the detection of errors. Although the hierarchy was
fine-grained, in the annotation process only a pool of 16 error types
was to be detected and classified. After being manually annotated,
the same set of sentences was annotated in turn with two tools:
Annotate and DiET. The annotation with the former has a tree format:
the nodes are the error types, and the edges carry descriptive
information on these types, thus yielding a rich representation of
the structure of errors in terms of relations. However, this
representation is built bottom-up, the error type being added last.
DiET offers a better method for configuring an annotation schema,
which is why the annotation was performed with this latter tool. The
overwhelming majority of errors were orthographical (83%), followed,
at a great distance, by grammatical ones (16%).

TREEBANKS FOR SLAVIC LANGUAGES

Chapter 7. The Prague Dependency Treebank.
Alena Böhmová, Jan Hajič, Eva Hajičová, Barbora Hladká.

For the annotation of the Czech newspaper corpus, a 3-level structure
was used. At the morphological level, the automatic analyzer ideally
produces, for each token in the input data, the lemma and the
associated MTag. Whenever more than one lemma and/or MTag is
produced, manual disambiguation is needed. For the analytical
(syntactic) level of annotation the dependency structure was used. It
is based on a dependency/determination relation. Solutions were found
for problematic structures, such as coordination, ellipsis,
ambiguity, and apposition. Two modes of annotation were employed:
first, manual annotation; then the Collins parser was trained on such
annotated data and used further to generate the structure, while
syntactic functions continued to be assigned manually. The separately
produced morphological and analytical syntactic annotations are then
merged, any discrepancies being resolved manually. The third level of
annotation, the tectogrammatical one, describes the meaning of
sentences by means of tectogrammatical functions and the information
structure of sentences. Analytic trees are transduced to
tectogrammatical ones in two phases: an automatic one (which makes
the necessary changes to syntactic trees, such as merging auxiliary
nodes with verbs) and a manual one.

Chapter 8. An HPSG-Annotated Test Suite for Polish.
Małgorzata Marciniak, Agnieszka Mykowiecka, Adam Przepiórkowski, Anna Kupść.

The aim of the paper is to present the construction of a test suite
for Polish, consisting of written sentences, both correct and
incorrect, the latter being manually annotated with correctness
markers. Each of these two types is further classified into three
subgroups, according to complexity. Moreover, each sentence is hand
annotated with the list of linguistic phenomena it displays, chosen
from nine groups of hierarchies of such phenomena. Sentences are
annotated with attribute-value matrices (AVMs), whose content is
restricted by an HPSG signature. The result is a database of
sentences, the correct ones augmented with their HPSG structures, and
a database of wordforms. The aim of the former database is to
evaluate computational grammars for Polish.

TREEBANKS FOR ROMANCE LANGUAGES

Chapter 9. Developing a Syntactic Annotation Scheme and Tools for a
Spanish Treebank.
Antonio Moreno, Susana López, Fernando Sánchez, Ralph Grishman.

The paper reports on building an annotated Spanish corpus, based on
newspaper articles. Problems specific to Spanish are presented:
dealing with multiword constituents and with amalgams or portmanteau
words, with null subjects and ellipses, ''se''-constructions,
etc. There are three levels of annotation: syntactic categories,
syntactic functions, and morpho-syntactic features together with some
semantic features. The annotation and debugging tools are also
presented in the paper, along with some error statistics, the current
state of the Spanish treebank and plans for future development.

Chapter 10. Building a Treebank for French.
Anne Abeillé, Lionel Clément, François Toussenel.

A newspaper corpus, representative of contemporary written French,
was subjected to automatic tagging (segmentation with special
attention to compounds, tagging relying on a trigram method, and
retagging making use of contextual information) and parsing (surface
and shallow annotation, theory-neutral, with the aim of identifying
sentence boundaries and limited embedding). Each annotation with
morphosyntax, lemmas (based on lexical rules), compounds and sentence
boundaries was followed by manual validation. The resulting treebank
was used for evaluating lemmatizers and for training taggers.

Chapter 11. Building the Italian Syntactic-Semantic Treebank.
Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta
Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli,
Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili,
Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana,
Fabio Pianesi, Rodolfo Delmonte.

The paper presents the syntactic-semantic annotation of a balanced
corpus and of a specialized one. Four levels of annotation were
adopted: morpho-syntactic annotation (POS, lemma, morpho-syntactic
features); syntactic annotation, made up of constituency annotation
(identification of phrase boundaries and labeling of constituents)
and functional annotation (with functional relations); and
lexico-semantic annotation (distinguishing among single lexical
items, semantically complex units and title sense units;
specification of senses for each word, relying on ItalWordNet, along
with other lexico-semantic information, such as figurative usage,
idiomatic expressions, etc.). The first two types of annotation were
performed semi-automatically, while the other two were performed
manually. There are two innovations brought about by this treebank:
sense tagging (which amounts to a semantic annotation of the corpus)
and two distinct layers of syntactic annotation, the constituency and
the functional ones, motivated by language-specific phenomena (such
as free constituent order and the pro-drop property) and by further
uses of the resulting treebank, which is thus compatible with
different approaches to syntax.

In the second part of the article the annotation tool, GesTALt, is
presented: its component applications and the architecture of the
tool. At the end, the uses of the obtained data are presented:
improvement of a translation system, enrichment of dictionaries, and
improvement at the level of analysis.

Chapter 12. Automated Creation of a Medieval Portuguese Partial Treebank.
Vitor Rocio, Mário Amado Alves, J.  Gabriel Lopes, Maria Francisca
Xavier, Gracia Vicente.

The novelty of the approach presented in this paper lies in applying
tools and resources developed for Contemporary Portuguese to the
annotation of a corpus of Medieval Portuguese. The differences
between these two stages of the language are presented.

The neural-network based POS tagger was trained on a set of words
manually tagged for each of the texts in the Medieval Portuguese
corpus. It was then used to extract a dictionary and to tag the rest
of the texts. Manual correction followed. For the lexical analysis, a
morphocentric lexical knowledge-base (LKB) was used. The lexical
analyzer uses as input the output from the POS tagger and applies to
it the knowledge in the LKB. Its output serves as input for the
syntactic analyzer.

The authors present the resources used and the adaptations required to
deal with the corpus. A similar method for dealing with corpora of
other Romance languages is envisaged.

TREEBANKS FOR OTHER LANGUAGES

Chapter 13. Sinica Treebank.
Keh-Jiann Chen, Chi-Ching Lou, Ming-Chung Chang, Feng-Yi Chen,
Chao-Jan Chen, Chu-Ren Huang, Zhao-Ming Gao.

The paper reports on the construction of a treebank for Mandarin
Chinese, relying on the Sinica Corpus, which was already annotated
when the treebank project started, so its resources could be reused
for the latter. The authors provide reasons for their choice of the
grammar formalism used for the representation of lexico-grammatical
information, namely Information-based Case Grammar. They also present
the concepts they work with: the principles of inheritance, the
phrasal categories, etc.

The Sinica treebank is not merely a syntactically annotated corpus,
but also a semantically annotated one, containing thematic
information. The automatic annotation process was followed by manual
checking, as in most cases. The language-specific phenomena (for
instance, constructions with nominal predicates) are given a short
presentation, along with the solutions adopted in the annotation
process.

The treebank aims to serve as a reliable resource for (theoretical)
linguists, but not only for them, so tools for extracting information
from it were developed.

Chapter 14. Building a Japanese Parsed Corpus.
Sadao Kurohashi, Makoto Nagao.

The morphological and syntactic annotation of a Japanese newspaper
corpus is presented in this paper. It was developed in parallel with
the improvement of the morphological analyzer JUMAN and of the
dependency structure analyzer KNP (chosen in accordance with the
characteristics of Japanese). The dependency relation is defined on
the bunsetsu, the traditional Japanese linguistic unit. The free word
order of Japanese raises a problem which remains unsolved:
predicate-argument relations in embedded sentences.

Chapter 15. Building a Turkish Treebank.
Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür.

The aims in building the Turkish treebank are for it to be
representative and to contain all the information relevant for its
potential users.

There are two levels of annotation: morphological and syntactic. Both
take into consideration the characteristics of Turkish, especially
its rich inflectional and derivational morphology. Thus, each word is
annotated for each of its morphemes, as this information may be
necessary for syntax. The free word order and the discontinuities
favor the use of the dependency framework. Its typical problems (the
pro-drop phenomenon, verb ellipsis, etc.) are presented together with
the solutions adopted in the annotation process.

PART II. USING TREEBANKS

Chapter 16. Encoding Syntactic Annotation.
Nancy Ide, Laurent Romary.

The emergence of treebanks, along with the proliferation of
annotation schemes, triggered the need for a general framework to
accommodate these annotation schemes and the different theoretical
and practical approaches. The general framework (built within XCES)
presented in this paper is an abstract model, independent of theory
and tagset, that can be instantiated in different ways, according to
the annotator's approach and goal. This abstract model uses two
knowledge sources: a Data Category Registry (an inventory of data
categories for syntactic annotation) and a meta-model (a
domain-dependent abstract structural framework for syntactic
annotation). Two other sources are used for the project-specific
formats of the annotation scheme: the Data Category Specification
(DCS) (the description of the set of data categories used within a
certain annotation scheme) and the Dialect Specification (defining
the project-specific format for syntactic annotation). By combining
the meta-model with the DCS, a virtual annotation markup language
(AML) can be defined for comparing annotations, for merging them or
for designing tools for visualization, editing, extraction, etc. A
concrete AML results from the combination of a virtual AML and a
Dialect Specification. The abstract model ensures the coherence and
consistency of the annotation schemes.

Chapter 17. Parser Evaluation.
John Carroll, Guido Minnen, Ted Briscoe.

The emergence of syntactic parsers triggered the need for methods of
evaluating them; in fact, this has become a genuine branch of NLP
research. In this paper we are presented with a corpus annotation
scheme that can be used for the evaluation of syntactic parsers. The
scheme makes use of a grammatical relation hierarchy, containing
types of syntactic dependencies between heads and dependents. Based
on the EAGLES lexicon/syntax standards (Barnett et al. 1996), this
hierarchy aims at being language- and application-independent.

The authors present a 10,000-word corpus, semi-automatically marked
up. For its evaluation three measures are calculated: precision (the
number of bracketing matches with respect to the total number of
bracketings returned by the parser), recall (the number of bracketing
matches with respect to the number of bracketings in the corpus) and
F-score (a measure combining the previous two: (2 x precision x
recall)/(precision + recall)). This last measure can be used to
illustrate the parser's accuracy. The evaluation of grammatical
relations provides information about the levels of precision and
recall for groups of relations or for single relations; thus, it is
useful for indicating the areas where more effort should be
concentrated for improvement.
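
For concreteness, the three measures can be computed as in the
following Python sketch. The representation of bracketings as sets of
(label, start, end) spans is an assumption of the illustration, not
the authors' implementation.

def evaluate(answer_brackets, gold_brackets):
    """Precision, recall and F-score over sets of labeled bracketings."""
    matches = len(answer_brackets & gold_brackets)
    precision = matches / len(answer_brackets) if answer_brackets else 0.0
    recall = matches / len(gold_brackets) if gold_brackets else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score

# Toy example: bracketings as (label, start, end) spans.
gold = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
answer = {("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}
print(evaluate(answer, gold))   # (0.666..., 0.666..., 0.666...)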

Chapter 18. Dependency-based Evaluation of MINIPAR.
Dekang Lin.

The author presents a dependency-based method for evaluating parser
performance. To represent a dependency tree he makes use of a set of
tuples, one for each node in the tree, specifying the word, its
grammatical category, its head (if there is one, along with the
word's position with respect to this head) and its relationship with
the head (again, if applicable). To perform the evaluation,
dependency trees are generated for both the parser-generated trees
(called here answers) and the manually constructed trees (called
keys) and compared on a word-by-word basis. Importantly, a selective
evaluation is also possible: one can measure the parser's performance
with respect to a certain type of dependency relation or even to a
certain word. Two scores are calculated: recall and precision.
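
The tuple-based comparison can be sketched as follows in Python; the
exact tuple layout and the set representation are assumptions of this
illustration rather than Lin's actual format.

def dependency_scores(answer, key, relation=None):
    """Precision and recall over dependency tuples; restricting the
    comparison to one relation type gives the selective evaluation."""
    if relation is not None:
        answer = {t for t in answer if t[4] == relation}
        key = {t for t in key if t[4] == relation}
    matches = len(answer & key)
    precision = matches / len(answer) if answer else 0.0
    recall = matches / len(key) if key else 0.0
    return precision, recall

# Tuples: (position, word, category, head_position, relation);
# None marks the root, which has no head or relation.
key = {(1, "dogs", "N", 2, "subj"), (2, "bark", "V", None, None)}
answer = {(1, "dogs", "N", 2, "obj"), (2, "bark", "V", None, None)}
print(dependency_scores(answer, key))           # overall precision, recall
print(dependency_scores(answer, key, "subj"))   # selective: subjects only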

The author goes on to present MINIPAR, a principle-based,
broad-coverage English parser (Berwick et al. 1991). The
dependency-based method presented above is used for evaluating this
parser. One interesting outcome of this evaluation is that the parser
performs better on longer sentences than on shorter ones. This may be
a consequence of the parser having been trained on press reportage,
with its long sentences, while the shorter sentences are found in
fiction, the genre against which the parser is tested.

GRAMMAR INDUCTION WITH TREEBANKS

Chapter 19. Extracting Stochastic Grammars from Treebanks.
Rens Bod.

The assumption (see Scha 1990, 1992, Bod 1992, 1995, 1998)
constituting the basis of this article is that ''human language
perception and production processes may very well work with
representations of concrete past language experiences, and that
language processing models could emulate this behavior if they
analyzed new input by combining fragments of representations from
annotated corpus''. So the idea is to use an already annotated corpus
as a stochastic grammar. The idea is not new, but the aim of the
article is to answer the question: to what extent can constraints be
imposed on the subtrees used without decreasing the performance of
the parser?

The results reported here were obtained using a data-oriented parsing
(DOP) model (presented in section 2 of the paper) which was applied
to two corpora of phrase structure trees: the Air Travel Information
System (ATIS) corpus and the Wall Street Journal (WSJ) part of the
PTB. The conclusion drawn from the experiments is that almost all
constraints decrease the performance of the model: the most probable
parse (which takes overlapping subtrees into consideration) gives
better results than the most probable derivation (which does not);
the larger the subtrees, the better the predictions (as larger
subtrees capture more dependencies than small ones); the larger the
lexical context (up to a certain depth, which seems to be
corpus-specific), the better the accuracy (as more lexical
dependencies are taken into account); low-frequency subtrees make an
important contribution to parse accuracy (as they tend to be larger,
thus containing more lexical/structural context useful for further
parsing); and the use of subtrees with non-headwords has a positive
impact on the performance of the model (as they contain syntactic
relations for those non-headwords, which cannot be found in other
subtrees).
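
To make the notion of a depth constraint on subtrees concrete, the
following toy Python sketch enumerates the fragments of a
phrase-structure tree up to a given depth. The tree representation
and the depth convention are assumptions of the illustration, not
Bod's implementation.

from itertools import product

def cut_or_expand(node, depth_left):
    """All ways to realize `node` inside a fragment: either cut here
    (leaving a bare label as a substitution site) or expand further."""
    label, children = node
    yield label                          # cut: substitution site
    if children and depth_left > 0:
        for expansion in expand(node, depth_left):
            yield expansion

def expand(node, depth_left):
    """Fragments rooted in `node` that expand it at least one level."""
    label, children = node
    child_options = [list(cut_or_expand(c, depth_left - 1)) for c in children]
    for combo in product(*child_options):
        yield (label, list(combo))

# Toy tree: (label, children); words are leaves.
tree = ("S", [("NP", [("John", [])]),
              ("VP", [("V", [("sleeps", [])])])])
for fragment in expand(tree, 2):         # all fragments of depth <= 2
    print(fragment)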

Chapter 20. A Uniform Method for Automatically Extracting Stochastic
Lexicalized Tree Grammars from Treebanks and HPSG.
Günter Neumann.

As the title states, the paper presents a uniform method for the
automatic extraction of stochastic lexicalized tree grammars (SLTGs)
from treebanks (allowing corpus-based analysis of grammars) and from
HPSG (allowing extraction of domain-independent and
phenomena-oriented subgrammars), with the future aim of merging the
two SLTGs in order to improve the coverage of treebank grammars on
unseen data and to ease the adaptation of treebanks to new domains.

The major operation in the extraction of an SLTG is a recursive
top-down tree decomposition according to the head principle, so that
each extracted tree is automatically lexically anchored. The path
from the lexical anchor to the root of the tree is called a
head-chain. Two additional operations are involved: each subtree of
the head-chain is copied and the copied tree is processed
individually by the decomposition operation, thus allowing a phrase
to occur both in head and in non-head positions; and for each SLTG
tree having a modifier phrase attached, a new tree is created with
the modifier unattached, thus allowing the extracted grammar to
recognize sentences with fewer or no modifiers compared to the seen
ones. The result is an SLTG, which is processed by a two-phase
stochastic parser. The rest of the paper describes the extraction of
SLTGs from the PTB and from the NEGRA treebank, on the one hand, and
from a set of parse trees produced with an English HPSG, on the
other, as well as some experimental results on the use of an
extracted SLTG.
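
A minimal Python sketch of the core decomposition step may help
here. The tree representation, in which each node records which of
its children is the head, is an assumption of the illustration, and
the two additional operations described above (copying head-chain
subtrees and unattaching modifiers) are left out for brevity.

def decompose(node, grammar):
    """Return the elementary tree whose head-chain starts at `node`;
    every non-head subtree is cut off, replaced by a substitution node
    ('@' + label), and decomposed recursively into `grammar`."""
    label, head_index, children = node
    if not children:                          # lexical anchor reached
        return label
    kept = []
    for i, child in enumerate(children):
        if i == head_index:
            kept.append(decompose(child, grammar))     # follow the head-chain
        else:
            kept.append("@" + child[0])                # substitution node
            grammar.append(decompose(child, grammar))  # new elementary tree
    return (label, kept)

# Toy tree: (label, index_of_head_child, children); leaves are words.
tree = ("S", 1, [("NP", 0, [("John", None, [])]),
                 ("VP", 0, [("V", 0, [("sees", None, [])]),
                            ("NP", 0, [("Mary", None, [])])])])
grammar = []
grammar.append(decompose(tree, grammar))
for elementary_tree in grammar:
    print(elementary_tree)   # each tree is anchored by John, Mary, or sees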

Chapter 21. From Treebank Resources to LFG F-Structures.
Anette Frank, Louisa Sadler, Josef van Genabith, Andy Way.

This paper presents two methods for automatic f-structure
annotation. The first one consists in extracting a context-free
grammar (CFG) from a treebank, following Charniak (1996). A set of
regular-expression-based annotation principles is then developed and
applied to the CFG, resulting in an annotated CFG. The annotated
rules are rematched against the treebank trees, the result being
f(unctional)-structures. The second method uses flat tree
descriptions. Annotation principles define projection constraints
which associate partial c(onstituent)-structures with their
corresponding partial f-structures. When these principles are applied
to the flat set-based encoding of treebank trees, they induce the
f-structures. The two methods are characterized by robustness, due to
the following facts: the principles are partial, underspecified and
match unseen configurations; partial annotations are generated
instead of failure; and the constraint solver copes with conflicting
information.
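
A toy Python sketch of the first method's annotation step is given
below. The rule encoding, the single principle shown and the
ASCII-rendered LFG equations are all illustrative assumptions, not
the authors' actual principles.

import re

# One illustrative principle: in a rule "S -> ... NP ... VP ...",
# annotate the NP as the subject and the VP as the head. The equations
# use an ASCII rendering of the LFG up/down arrows ('^' and '!').
PRINCIPLES = [
    (r"^S -> .*NP.*VP.*$", {"NP": "(^ SUBJ)=!", "VP": "^=!"}),
]

def annotate(rule):
    """Attach f-structure equations to a CFG rule if a principle matches."""
    for pattern, equations in PRINCIPLES:
        if re.match(pattern, rule):
            for category, equation in equations.items():
                rule = rule.replace(category, f"{category}[{equation}]", 1)
            return rule
    return rule                          # no principle matched: leave as is

print(annotate("S -> NP VP"))
# S -> NP[(^ SUBJ)=!] VP[^=!]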

DISCUSSION

Although this was not the objective of the book, its first part can
be used as a textbook for those venturing to construct a treebank. As
the papers focus on different types of languages, the grammatical
phenomena they display and different ways of dealing with them, they
can serve as a repository of solutions to various problems
encountered when designing a corpus, establishing an annotation
scheme for a treebank, or developing annotation tools. The style in
which the papers were written is helpful in this respect: they are
clear and accessible, and the information is introduced gradually.
The second part of the book has a narrower intended audience than the
first, due to the technical details involved in the presentation of
different applications in computational linguistics: lexicon
induction (Järvinen), grammar induction (Frank et al., Bod), parser
evaluation (Carroll et al.), checker evaluation (Becker et al.).

REFERENCES

Barnett, R., N. Calzolari, S. Flores, P. Hellwig, P.  Kahrel,
G. Leech, M. Melera, S. Montemagni, J. Odijk, V.  Pirrelli,
A. Sanfilippo, S. Teufel, M. Villegas, L. Zaysser (1996) EAGLES
Recommendations on Subcategorisation. Report of the EAGLES Working
Group on Computational Lexicons,
ftp://ftp.ilc.pi.cnr.it/pub/eagles/lexicons/synlex.ps.gz.

Berwick, R.C., S.P. Abney, C. Tenny (Eds.) (1991) Principle-Based
Parsing: Computation and Psycholinguistics.  Kluwer Academic
Publishers.

Bod, R. (1992) Data Oriented Parsing (DOP), Proceedings COLING '92,
Nantes, France.

Bod, R. (1995) Enriching Linguistics with Statistics: Performance
Models of Natural Language, ILLC Dissertation Series 1995-14,
University of Amsterdam.

Bod, R. (1998) Spoken Dialogue Interpretation with the DOP Model,
Proceedings COLING-ACL'98, Montreal, Canada.

Charniak, E. (1996) Tree-bank Grammars. Proceedings of the Thirteenth
National Conference on Artificial Intelligence (AAAI-96),
pp. 1031-1036. MIT Press.

Scha, R. (1990) Taaltheorie en Taaltechnologie; Competence en
Performance, in Q.A.M. de Kort and G.L.J. Leerdam (Eds.),
Computertoepassingen in de Neerlandistiek, Almere: Landelijke
Vereniging van Neerlandici (LVVN-jaarboek).

Scha, R. (1992) Virtuele Gramatica's en Creatieve Algoritmen,
Gramma/TTT 1(1).

Sinclair, J. (1992) The automatic analysis of corpora. In J. Svartvik
(Ed.) Directions in Corpus Linguistics. Proceedings of Nobel
Symposium 82. Berlin: Mouton de Gruyter, pp. 379-397.

ABOUT THE REVIEWER

Verginica Barbu Mititelu is a researcher at the Romanian Institute
for Artificial Intelligence and a PhD candidate at the University of
Bucharest. She has been involved in the development of a treebank for
Romanian for a very short period of time.


---------------------------------------------------------------------------

If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.

---------------------------------------------------------------------------
LINGUIST List: Vol-15-1589


