[Corpora-List] Syntactic zeros in a corpus: possible solutions
Mikhail Kopotev
mihail.kopotev at helsinki.fi
Mon Aug 28 11:23:32 UTC 2006
Dear list members,
Thanks to all who answered me.
To summarize the answers, here are some possible solutions.
Syntactic zeros are, without doubt, a matter of the theory used to
annotate the material. Opinions range from a “complete” list of
syntactic zeros to the denial of the phenomenon altogether. Since our
corpus (like many other corpora) is used by teachers, interpreters,
students and others who may not be familiar with modern syntactic
theories, we should consider a more “traditional” annotation scheme. In
other words, in syntactic annotation we should follow the principle
formulated by G. Leech: a “consensual, theory-neutral analysis of the
data”. For Russian the matter seems even more complicated than for
English, because at least three predominant theories circulate in
Russian linguistics. All three postulate syntactic zeros, and all three
have different lists of them.
Thus, since the theoretical question has no common answer, I think it
would be better to stop discussing it, so as not to start a flame war
here. Let’s assume that “the Holy Grail” does exist, at least within
certain theoretical frameworks. So, how do we locate it?
Two approaches seem to be relevant in this respect.
1. A tag-based approach postulates a list of zeros or (pre)formulated
rules according to which an NLP system can (automatically or manually)
recognize a zero element and insert a special “zero” tag into the text.
This is, in fact, a commonly used way to work with zeros. Its advantages
are a systematic method of annotation that can be presented in a
user-friendly form, and the possibility (?) of tuning a system to
recognize clauses that contain zeros.
Its weakness is that a user has to be familiar with (and, in all
probability, has to agree with) the theory the annotation scheme is
based on. Since a theory-neutral annotation scheme does not exist, such
a corpus will be a battlefield rather than a place to search and collect
material.
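To make the tag-based idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy tagset ("N", "PRON", "V"), the "ZERO-V" tag, the placeholder token "0", and the single rule (a clause with a nominal but no finite verb receives a zero verb after its first nominal). A real scheme would take its list of zeros and its rules from whichever theory the annotators adopt.

```python
# Hypothetical toy tagset; not taken from any existing corpus.
NOMINAL_TAGS = {"N", "PRON"}
FINITE_VERB_TAG = "V"
ZERO_TOKEN = ("0", "ZERO-V")  # placeholder form plus an assumed "zero" tag


def insert_zero_verb(clause):
    """Return a copy of a clause (a list of (form, tag) pairs) with a
    zero-verb token inserted after the first nominal, provided the clause
    contains no finite verb. Otherwise return the clause unchanged."""
    tags = [tag for _, tag in clause]
    if FINITE_VERB_TAG in tags:
        return list(clause)
    for i, tag in enumerate(tags):
        if tag in NOMINAL_TAGS:
            return list(clause[:i + 1]) + [ZERO_TOKEN] + list(clause[i + 1:])
    return list(clause)  # no nominal found: the rule does not apply


# Transliterated Russian "on student" ('he [is a] student').
print(insert_zero_verb([("on", "PRON"), ("student", "N")]))
# → [('on', 'PRON'), ('0', 'ZERO-V'), ('student', 'N')]
```

The rule is hard-wired into the annotation, which is exactly where the theoretical commitment enters: a user who rejects the zero copula gets it in the text anyway.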
2. A search-based approach relies on a query language that allows users
to search for clauses NOT containing certain elements (such as
{SELECT “all clauses” FROM “the text” WHERE “verb” <> “y”} for verb
ellipsis). The more accurate and clear the annotation is, the more
usable this approach becomes. Its advantage is a theory-independent
search (to be more precise, users can search according to their own
theoretical background). The main disadvantage is that a query will
return (a lot of) irrelevant examples. Another weakness is that in a
rather big corpus such a query takes a long time to respond, but that is
a technical problem, not a linguistic one.
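The pseudo-query above could be realized roughly as follows. This is only a sketch under stated assumptions: the corpus is stored one token per row in a hypothetical table tokens(clause_id, form, tag), the tagset is a toy one, and SQLite (via Python's standard sqlite3 module) stands in for whatever query engine a corpus would actually use. Note that "a clause with no verb" is a condition on the whole clause, so it is expressed with NOT EXISTS rather than with a simple inequality on a single token.

```python
import sqlite3


def clauses_without_tag(rows, tag):
    """Return the sorted ids of clauses in which no token carries `tag`.
    `rows` is an iterable of (clause_id, form, tag) triples."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tokens (clause_id INTEGER, form TEXT, tag TEXT)")
    con.executemany("INSERT INTO tokens VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT DISTINCT t.clause_id
        FROM tokens AS t
        WHERE NOT EXISTS (
            SELECT 1 FROM tokens AS v
            WHERE v.clause_id = t.clause_id AND v.tag = ?
        )
        ORDER BY t.clause_id
        """,
        (tag,),
    ).fetchall()
    con.close()
    return [clause_id for (clause_id,) in result]


# Toy data in transliterated Russian; the tags are hypothetical.
ROWS = [
    (1, "on", "PRON"), (1, "chitaet", "V"),  # 'he reads': overt verb
    (2, "on", "PRON"), (2, "student", "N"),  # 'he [is a] student': no overt verb
    (3, "da", "PART"),                       # 'yes': no verb either
]

print(clauses_without_tag(ROWS, "V"))  # → [2, 3]
```

Clause 3 illustrates the disadvantage mentioned above: it matches the query although few users would count it as a case of verb ellipsis, so the hits still have to be filtered by hand.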
Of course, it is possible to create a corpus that integrates both
approaches.
Any comments will be warmly appreciated.
Mikhail Kopotev
Researcher
Department of Slavonic
and Baltic Languages and Literatures
University of Helsinki