Soft: 2 announcements

alexis nasr alexis.nasr at
Tue Feb 20 15:46:47 UTC 2001

/1  Bigram Statistics Package (v0.3)



A newly improved version of the Bigram Statistics Package (v0.3) is
now available!

This is an easy-to-use suite of Perl tools for counting and analyzing
bigrams in text. It includes a number of the standard tests of
association, and lets users easily implement others without knowing
very much about Perl at all. Users can also perform pairwise
comparisons of the lists of bigrams found by these tests.

This is free software. Get it (or more information) at:

Coming Soon - Unicode support!

# Ted Pedersen                   #
# Department of Computer Science                      tpederse at #
# University of Minnesota Duluth                                         #
# Duluth, MN 55812                                        (218) 726-8770 #


   Dear Readers,

   Those interested in the two pieces of NLP software presented below
are welcome to contact me directly (gojol at).
   Thank you,
      Vlad Gojol

Senior Research Engineer
Institutul National de Informatica
Bucuresti, Romania



   I have created a part-of-speech tagger with an unusual capacity for dealing
with large contexts, especially for German. I used Negra (seemingly the
best-known German corpus, with a freely obtainable licence). The tagger
currently reputed to be the most accurate for German is perhaps TnT. It
reports an error rate of 3.4% on this corpus. But I have found a systematic
error in Negra: all occurrences of the auxiliary verbs are tagged as
auxiliary (VAFIN), though in 50% of the cases they function as finite
verbs (VVFIN). I corrected part of the corpus (ca. 40,000 tokens). In this
more correct environment (where the error rate of TnT would probably be
around 4.5%), my tagger gets 1.7%.
   On another German corpus (I call it X), with comparable contents
(newspaper articles) and tagset, but with an attached EXTERNAL lexicon
(i.e. not extracted from the corpus), the error rate is 2.4%.
   I also used Susanne (the only English corpus I could get for free). The
reported error rate for TnT is 3.8%. Mine is 2.8%. On the "A" texts, which
are journalistic and therefore the most comparable to those in Negra, it is
2.3%. By restricting the tagset to a more normal size (ca. 100 tags,
determined as optimal after many test runs), it is 1.3%.
   Initially I had used a Romanian corpus, with an error rate of 0.9%
(compared to 1.7%, 2.5% and 4.2% obtained by the Xerox, Birmingham and Brill
taggers respectively).

   The speed is comparable to that of TnT and can be adjusted by parameter
settings, in inverse proportion to the accuracy (but without affecting it
much).
   The incremental operating mode and the segmentation of the data structures
allow it to run on computers with very little memory.
   There is the advantage of an intuitive output (no hostile binary matrix),
in a form analogous to the input of some expert systems.
   Alternative taggings are output, with scores: unlike with other taggers,
they do not refer to individual words, but to whole sentence parts
(representing, in a way, phrase surfaces of minimum energy).
   Special facilities exist, such as virtual tags, or context
essentialisation (which yields the minimal set of contexts characteristic of
a certain linguistic style, useful not only for maximum accuracy and speed),
etc.
   Recently added features: wide-character support, the possibility of
being called as a simple library routine, and doing without files
altogether (a complex file system is emulated in memory). For example
(for the second feature above), you can create an instance of the tagger
(let's say for English with the Lancaster tagset), call it to tag a certain
text buffer (writing the resulting tags into another buffer) and finally
kill the instance, all without using any disk file. The first two steps
look like this:
      t = GojolTagger_new("lancaster");
      error_code = GojolTagger_tag(t,input_buffer,output_buffer);

   Everything is built on two essentially new concepts: organicity and
context propagation. I have not published anything about them, in order to
preserve their commercial appeal. The accuracy, comparable to that of manual
tagging, led me to find many errors in the corpora used: 98 in Negra, 36 in
Susanne. Prof. G. Sampson replied gratefully, saying that it was the first
time somebody had reported more than 2 errors, and that my findings make a
new version of Susanne necessary. The handling of very large contexts could
even change the design of current tagsets, by cancelling some unnatural
decisions (motivated only by the inability of existing taggers to see beyond
a 3-token neighborhood), such as those concerning auxiliary verbs,
participles etc., thus removing some of the burden from the subsequent
stages of text processing.

   It is written in C (Linux). Demos for German (Negra) and English
(Susanne) are available.



   After learning from a 46,000-word POS-tagged corpus and a 32,000-word
parsed (treebank) corpus, a 2,000-word text (not included in either of the
two corpora) is parsed (tagging excluded) in 6 seconds (on a 200 MHz
machine) with 2% incomplete trees; the extracted grammar has ca. 12,000
rules. Even for these declared failures, well-formed trees are provided that
are sufficient for a subsequent translator. The Negra corpus of German was
used.
   After learning from a 17,000-word parsed corpus and from the same
46,000-word POS-tagged one, a 2,000-word text included in the first (but
excluded from the second), which guarantees that the grammar is complete
relative to it (i.e. contains all the rules necessary for its correct
parsing), is processed in 2 seconds with no incomplete trees; the extracted
grammar has ca. 7,000 rules.
   The system is language independent: for English, on the Susanne corpus,
comparable results are obtained. To have an acceptable parser for any other
language, you essentially need only a corpus of 30,000 tagged words, of
which only 20,000 need to be parsed as well; for optimal results, 50,000
and 30,000 respectively.

   The parser can accept a set of rules intended to modify the statistical
grammar deduced from the corpus. The statistical grammar is refined not by
simply adding the proposed rules, but by modifying the corpus, so as to
exploit all the real contexts possible for them. Moreover, the parser can
take as input a context-free grammar alone (in which case it ceases to be a
statistical parser), but in this operating mode it requires a lot of time
and memory (during learning, not during parsing as such) if the grammar is
oversized.

   Semantic processing could easily be inserted at rule reduction points.
In fact, this generalized parser can also work as a compiler generator: by
appending specific semantic routines, you get efficient compilers for C,
Pascal etc. This versatile system has more than 40 parameters which tune
the accuracy and speed according to the target language sample. The output
is in treebank format and, optionally, in graphical form (with the trees
actually drawn).

   Linux demos exist for German and English. As only the minimal
definition of C is used, it is easily adaptable to any machine (for other
Unix-like operating systems, a simple recompilation would probably be
sufficient).

Message distributed by the Langage Naturel list <LN at>
Information, subscription:
English version          :
Archives                 :
