Corpora: New Parser

Gojol gojol at sunu.rnc.ro
Wed Dec 6 13:51:00 UTC 2000


   Dear Colleagues,

   Those interested in a new parser (based on an original
philosophy), briefly introduced below, are invited to contact
me personally (gojol at sunu.rnc.ro). Any suggestions,
comparisons with existing parsers, etc. will be welcome.
Thank you,
                    Vlad V. Gojol

............................................................

   After learning from a 46,000-word POS-tagged corpus and a
32,000-word parsed (treebank) corpus, a 2,000-word text
(included in neither corpus) is parsed (tagging excluded) in
18 seconds on a 200 MHz machine, with 4% incomplete trees.
For these declared failures, the parser still provides
well-formed trees sufficient for a subsequent translator. The
extracted grammar has about 12,000 rules. The Negra corpus of
German is used. After learning from a 17,000-word parsed
corpus and the same 46,000-word POS-tagged corpus, a
2,000-word text included in the former (but excluded from the
latter) is processed in 4 seconds with no incomplete trees;
including the text in the parsed corpus guarantees that the
grammar is complete relative to it (i.e. contains all the
rules necessary for its correct parsing). The extracted
grammar in this case has about 7,000 rules. Parsing is 2-3
times slower on the English corpus Susanne. The system is
language independent and supports wide characters.
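
The announcement does not say how the grammar is extracted or
weighted. As a rough illustration only, the sketch below uses
a common technique that is assumed here and not confirmed by
the author: read context-free rules off bracketed trees and
estimate each rule's probability by relative frequency. It
also shows what "complete relative to a text" means in
practice. All names (parse_tree, productions, extract_grammar,
is_complete_for, TOY_TREEBANK) and the bracketed tree format
are hypothetical.

```python
from collections import Counter

# Hypothetical two-sentence treebank in bracketed notation (an assumption;
# real corpora such as Negra use their own export formats).
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VB barks)))",
    "(S (NP (NN dogs)) (VP (VB bark)))",
]


def parse_tree(text):
    """Turn a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def productions(tree):
    """Yield (lhs, rhs) rules, skipping tag -> word nodes (tagging is separate)."""
    label, children = tree
    if all(isinstance(child, str) for child in children):
        return  # preterminal node: nothing to yield
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def extract_grammar(bracketed_trees):
    """Map each rule to its relative frequency among rules with the same lhs."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for text in bracketed_trees:
        for lhs, rhs in productions(parse_tree(text)):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def is_complete_for(grammar, bracketed_trees):
    """True if every rule needed by the given trees is present in the grammar."""
    return all(
        rule in grammar
        for text in bracketed_trees
        for rule in productions(parse_tree(text))
    )


grammar = extract_grammar(TOY_TREEBANK)
```

A text drawn from the training treebank is complete by
construction, which mirrors the second experiment above; a
held-out text may need rules the grammar lacks, which is one
plausible source of incomplete trees in the first experiment.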
   The parser may accept a set of rules intended to refine
the statistical grammar deduced from the corpus. Moreover,
it can take a context-free grammar alone as input (in which
case it ceases to be a statistical parser), but in this
operating mode it requires much time and memory (during
learning, not during parsing as such) if the grammar is
oversized. The statistical grammar is refined not by simply
adding the proposed rules, but by modifying the corpus, so
that all the real contexts possible for those rules are
exploited.


