[Corpora-List] Syntactic zeros in a corpus: possible solutions

Mon Aug 28 16:45:27 UTC 2006

While I would agree that syntactic zeros are not theory-neutral and that 
it is necessary to itemize them in some way, I don't think the tag-based 
approach is so dire as suggested.  The parser that I use, developed by 
Ned Irons (a co-inventor of syntax-directed compiling), envisions their 
recognition in quite a regular fashion.  The parser is an augmented 
transition network, in which a key part of a transition to a next parse 
state is "additional processing".  The additional processing takes two 
primary forms: (1) tests that particular conditions are met (e.g., 
subject-verb agreement) and (2) annotations to be attached to 
(potential) nodes of the parse tree.  Among the many annotations, there 
is one labeled "filler".  Attached to this label are sublabels, giving 
specifications of whether the filler should be optional, an object, or 
an adjective (say a question filler).  In the tests that are performed, 
checks are made on whether the fillers are indeed filled elsewhere in 
the sentence (and perhaps they're not, but rather constitute an 
elision).  It seems to me that this approach can be primarily 
data-driven.  If the parser doesn't grok a sentence, there is a good chance that 
syntactic zeros are present and the grammar needs to be modified 
accordingly.  (In parsing hundreds of thousands of sentences, where I 
generally only have time to get an "impression" of what's going wrong, 
these cases seem quite prevalent.  Unfortunately, I can't give a more 
precise estimate.)
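
To make the "filler" idea concrete, here is a minimal sketch in Python.  It 
is not the Irons parser or its actual data structures; the names Filler, 
Node and unfilled, and the sublabels used, are assumptions made up for 
illustration.  It only shows the general shape of the test: a node carries 
filler annotations, and a required filler that nothing else in the sentence 
fills is flagged as a candidate zero.

```python
from dataclasses import dataclass, field


@dataclass
class Filler:
    """A 'filler' annotation; the sublabels here are illustrative only."""
    role: str               # e.g. "subject", "verb", "object"
    optional: bool = False  # an optional filler may legitimately stay empty


@dataclass
class Node:
    """A (potential) parse-tree node carrying filler annotations."""
    label: str
    fillers: list = field(default_factory=list)
    filled: set = field(default_factory=set)  # roles filled elsewhere in the sentence


def unfilled(node):
    """Return the roles of required fillers that nothing filled: candidate zeros."""
    return [f.role for f in node.fillers
            if not f.optional and f.role not in node.filled]


# "Mary bought a book and John a magazine": the second conjunct has a
# subject but no verb, so the required "verb" filler stays open.
clause = Node(
    "clause",
    fillers=[Filler("subject"), Filler("verb"), Filler("adverbial", optional=True)],
    filled={"subject"},
)
print(unfilled(clause))  # ['verb']
```

An empty result means every required filler was found; a non-empty one is 
the kind of case where either an elision is present or the grammar needs 
another look.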

	Ken

Mikhail Kopotev wrote:

> Dear List-members.
> 
> Thanks to all who answered me.
> 
> Summarizing the answers, I will provide some possible solutions.
> 
> Syntactic zeros are, without doubt, a question of the theory we use to 
> annotate material. Opinions range from a “complete” list of syntactic 
> zeros to the denial of the phenomenon. Since our corpus (like many other 
> corpora) is used by many users, such as teachers, interpreters, and 
> students, who might not be familiar with modern syntactic theories, we 
> should consider a more “traditional” annotation scheme. In other words, 
> in syntactic annotation we should follow the principle formulated by 
> G. Leech: “consensual, theory-neutral analysis of the data”. In the case 
> of Russian the matter seems to be even more complicated than for English, 
> because there are at least three predominant theories circulating in 
> Russian linguistics. All three postulate syntactic zeros, and all three 
> have different lists of them.
> Thus, since the theoretical question has no agreed answer, I think it 
> would be better to stop discussing it, so as not to start a flame war 
> here. Let’s assume that “the Holy Grail” does exist, at least within 
> certain theoretical frameworks. So, how do we locate it?
> 
> Two approaches seem to be relevant in this respect.
> 1. A tag-based approach postulates a list of zeros or (pre)formulated 
> rules, according to which an NLP system can (automatically or manually) 
> recognize a zero element and insert a special “zero”-tag into the text. 
> This is, in fact, a commonly used way to work with zeros. Its advantages 
> are a systematic way of annotating that can be presented in a 
> user-friendly form, and the possibility (?) of tuning a system to 
> recognize clauses that contain zeros.
> Its weakness is that a user must be familiar with (and, in all 
> probability, agree with) the theory the annotation scheme is based on. 
> Since a theory-neutral annotation scheme does not exist, such a corpus 
> will be a battlefield rather than a place to search and collect material.
> 
> 2. A search-based approach relies on a query language that lets users 
> search for clauses NOT containing certain elements (such as 
> {SELECT “all clauses” FROM “the text” WHERE “verb” <> “y”} for verb 
> ellipsis). The more accurate and clear the annotation is, the more usable 
> this approach becomes. Its advantage is theory-independent search (to be 
> more precise, users can search according to their own theoretical 
> background). The main disadvantage is that a query will return (a lot 
> of) irrelevant examples. Another weakness is that in a fairly large 
> corpus such a query takes a long time to answer, but that is a technical 
> problem, not a linguistic one.
> Of course, it is possible to create a corpus that integrates both 
> approaches.
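
The search-based approach in point 2 can be sketched in a few lines of 
Python.  This is only an illustration under simple assumptions: each clause 
is a list of (word, tag) pairs, the tag names and the function 
clauses_without are made up for the example, and no real corpus query 
language is implied.

```python
def clauses_without(clauses, tag):
    """Return the clauses in which no token carries the given tag."""
    return [c for c in clauses if all(t != tag for _, t in c)]


# A toy "corpus": each clause is a list of (word, part-of-speech) pairs.
corpus = [
    [("John", "NOUN"), ("reads", "VERB"), ("books", "NOUN")],
    [("Mary", "NOUN"), ("magazines", "NOUN")],  # verb ellipsis
    [("Chapter", "NOUN"), ("one", "NUM")],      # a heading, not an ellipsis
]

hits = clauses_without(corpus, "VERB")
for clause in hits:
    print(" ".join(word for word, _ in clause))
```

The query returns both the elliptical clause and the heading, which is the 
"irrelevant examples" problem mentioned above: the absence of a verb tag 
says nothing about why the verb is absent.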
> 
> Any comments will be warmly appreciated.
> 
> Mikhail Kopotev
> Researcher
> Department of Slavonic
> and Baltic Languages and Literatures
> University of Helsinki
> 
> 
> 
> 

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com