[Corpora-List] Syntactic zeros in a corpus: possible solutions

Mikhail Kopotev mihail.kopotev at helsinki.fi
Mon Aug 28 11:23:32 UTC 2006


Dear list members,

Thanks to all who answered me.

Summarizing the answers, I will provide some possible solutions.

Syntactic zeros are, without doubt, a matter of the theory we use to
annotate the material. Opinions range from a “complete” list of
syntactic zeros to denial of the phenomenon altogether. Since our corpus
(like many other corpora) is used by teachers, interpreters, students
and others who may not be familiar with modern syntactic theories, we
should consider a more “traditional” annotation scheme. In other words,
in syntactic annotation we should follow the principle formulated by
G. Leech: “consensual, theory-neutral analysis of the data”. For Russian
the matter seems even more complicated than for English, because at
least three predominant theories circulate in Russian linguistics. All
three postulate syntactic zeros, and each has a different list of them.
Thus, since the theoretical question has no commonly accepted answer, I
think it would be better to stop discussing it here, to avoid a flame
war. Let’s assume that “the Holy Grail” does exist, at least within
certain theoretical frameworks. So, how do we locate it?

Two approaches seem to be relevant in this respect.
1. A tag-based approach postulates a list of zeros or (pre)formulated
rules, according to which an NLP system can (automatically or manually)
recognize a zero element and insert a special “zero” tag into the text.
This is, in fact, a commonly used way of working with zeros. Its
advantages are a systematic way of annotating that can be presented in a
user-friendly form, and the possibility (?) of tuning a system to
recognize clauses that contain zeros.
Its weakness is that a user must be familiar (and, in all probability,
must agree) with the theory the annotation scheme is based on. Since no
theory-neutral annotation scheme exists, such a corpus will be a
battlefield rather than a place to search for and collect material.
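
To make the tag-based approach concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration only:
the tagset (PRON, NOUN, VERB), the “zero” tag ZERO-COP, and the single
naive rule are hypothetical and do not come from any existing annotation
scheme or system.

```python
from typing import List, Tuple

# (word form, part-of-speech tag); the tagset is hypothetical
Token = Tuple[str, str]

# Hypothetical "zero" tag marking an unexpressed copula
ZERO_COPULA: Token = ("Ø", "ZERO-COP")


def insert_zero_copula(clause: List[Token]) -> List[Token]:
    """Insert a zero-copula token into a verbless clause.

    Naive illustrative rule: a clause with no VERB tag and at least two
    nominal tokens (NOUN or PRON) is treated as containing a zero copula,
    which is placed right after the first nominal token.
    """
    if any(pos == "VERB" for _, pos in clause):
        return list(clause)  # an overt verb is present, nothing is inserted

    nominal_positions = [
        i for i, (_, pos) in enumerate(clause) if pos in {"NOUN", "PRON"}
    ]
    if len(nominal_positions) < 2:
        return list(clause)  # the rule does not apply

    first = nominal_positions[0]
    return clause[: first + 1] + [ZERO_COPULA] + clause[first + 1 :]


# "Он студент" ("He [is] a student"): no overt verb, two nominal tokens
tagged = insert_zero_copula([("Он", "PRON"), ("студент", "NOUN")])
print(tagged)
```

The rule is hard-coded into the annotation, which is exactly the
weakness noted above: a user who does not accept a zero copula in such
clauses still gets the tag.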

2. A search-based approach relies on a query language that lets users
search for clauses NOT containing certain elements (such as
{SELECT “all clauses” FROM “the text” WHERE “verb” <> “y”} for verb
ellipsis). The more accurate and clear the annotation is, the more
usable this approach becomes. Its advantage is theory-independent search
(to be more precise, users can search according to their own theoretical
background). The main disadvantage is that a query will return (a lot
of) irrelevant examples. Another weakness is that in a fairly large
corpus such a query takes a long time to run, but that is a technical
problem, not a linguistic one.
Of course, it is possible to create a corpus that integrates both 
approaches.
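
The search-based approach in point 2 can be sketched in the same way.
The example below uses Python's standard sqlite3 module; the table
layout (one row per token, with clause_id, form and pos columns), the
tagset and the three sample clauses are assumptions made for
illustration, not the structure of any real corpus.

```python
import sqlite3

# In-memory toy corpus: one row per token (hypothetical schema and tagset)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (clause_id INTEGER, form TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [
        (1, "Он", "PRON"), (1, "читает", "VERB"), (1, "книгу", "NOUN"),
        (2, "Он", "PRON"), (2, "студент", "NOUN"),
        (3, "Тишина", "NOUN"),
    ],
)


def clauses_without(pos_tag):
    """Return the ids of clauses that contain no token with the given tag."""
    rows = conn.execute(
        """
        SELECT DISTINCT clause_id FROM tokens AS t
        WHERE NOT EXISTS (
            SELECT 1 FROM tokens AS v
            WHERE v.clause_id = t.clause_id AND v.pos = ?
        )
        ORDER BY clause_id
        """,
        (pos_tag,),
    )
    return [row[0] for row in rows]


# Clauses with no verb: candidates for verb ellipsis or a zero copula
print(clauses_without("VERB"))
```

No zero is annotated here, so the query is theory-independent, but it
returns every verbless clause. Whether a one-word clause such as
“Тишина” (“Silence.”) is a relevant hit is left to the user, which is
the noise problem mentioned above.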

Any comments will be warmly appreciated.

Mikhail Kopotev
Researcher
Department of Slavonic
and Baltic Languages and Literatures
University of Helsinki


