Corpora: New parsing technology

Vlad V. Gojol gojol at rnc.ro
Mon Jan 21 17:08:38 UTC 2002


   Dear list members,

   I would like to announce the new version of my parser. It is
based on a part-of-speech tagger that those who tested it consider
the best for German (it is currently licensed in Germany). On the
Negra corpus it has an error rate of 2%, compared to the 3.4%
reported for DFKI's TnT under comparable conditions. On the
English Susanne corpus, with the tagset reduced to a manageable
size, the error rate drops to 1.5%.
   According to a Sigparse official, the parser is the only
statistical one that is competitive in accuracy with the top
grammar-based parsers. A comparison with one such state-of-the-art,
web-testable parser (regarded by many as the best on the market)
showed a slight advantage for mine in accuracy and a large one
(about ten times) in speed. This was achieved by training on quite
small corpora: for German, the first 2,000 sentences (35,000 words)
of Negra (from which 10,000 rules were derived); for English,
60,000 words of Susanne (20,000 rules). The software is actually a
parser generator, which allows a specific parser to be created for
any language in a short time: just the time needed to annotate a
text equivalent to 40-60 paperback pages (i.e. a couple of months
of student-level work). At present it runs without any lexicon
(except the small one extracted from the respective corpora:
33,000 word forms for German, 12,000 for English). There are three
output files: treebank, dependency-oriented and graphic. It is
licensed, or in the process of being licensed, at several
institutes and universities in Switzerland, Italy and Germany.
   Linux and Windows demos for German and English are available
on request at gojol at sunu.rnc.ro; each demo runs for a limited
period (three days). Open for discussion: building versions for
other tagsets or languages (French, Spanish, Italian), and
extending the system for integration into specific customer
applications.
   Please reply only to me personally (at gojol at sunu.rnc.ro).
   Regards,
             Dr.ing. Vlad Gojol


