12.511, Software: Language Independent Part-of-Speech Tagger

The LINGUIST Network linguist at linguistlist.org
Fri Feb 23 17:39:40 UTC 2001


LINGUIST List:  Vol-12-511. Fri Feb 23 2001. ISSN: 1068-4875.

Subject: 12.511, Software: Language Independent Part-of-Speech Tagger

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	Lydia Grebenyova, EMU		Jody Huellmantel, WSU
	James Yuells, WSU		Michael Appleby, EMU
	Marie Klopfenstein, WSU		Ljuba Veselinova, Stockholm U.

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: James Yuells <james at linguistlist.org>

=================================Directory=================================

1)
Date:  Mon, 19 Feb 2001 15:46:43 +0000
From:  "Vlad V. Gojol" <gojol at sunu.rnc.ro>
Subject:  Language Independent Part-of-Speech Tagger

-------------------------------- Message 1 -------------------------------

Date:  Mon, 19 Feb 2001 15:46:43 +0000
From:  "Vlad V. Gojol" <gojol at sunu.rnc.ro>
Subject:  Language Independent Part-of-Speech Tagger


   Dear Readers,

   Those interested in the two pieces of NLP software presented below
are welcome to contact me directly (gojol at sunu.rnc.ro).
   Thank you,
               Dr.ing. Vlad Gojol

--------------------------------
Senior Research Engineer
Institutul National de Informatica
Bucuresti, Romania


............................................................................

   LANGUAGE INDEPENDENT PART-OF-SPEECH TAGGER

   I created a part-of-speech tagger with an unusual capacity for dealing
with large contexts, especially for German. I used Negra (seemingly the
best-known German corpus, with a freely obtainable licence). The tagger
currently reputed to be the most accurate for German is perhaps TnT, which
reports an error rate of 3.4% on this corpus. However, I found a systematic
error in Negra: all occurrences of the auxiliary verbs are tagged as
auxiliary (VAFIN), though in 50% of the cases they function as finite full
verbs (VVFIN). I corrected part of the corpus (ca. 40,000 tokens). In this
more correct environment (where the error rate of TnT would probably be
around 4.5%), my tagger's error rate is 1.7%.
   On another German corpus (I call it X), with comparable contents
(newspaper articles) and tagset, but with an attached EXTERNAL lexicon
(i.e. not extracted from the corpus), the error rate is 2.4%.
   I also used Susanne (the only English corpus I could get for free). The
reported error rate for TnT is 3.8%; mine is 2.8%. On the "A" texts, which
as journalistic texts are the most comparable with those in Negra, it is
2.3%. With the tagset restricted to a more usual size (ca. 100 tags,
determined as optimal after many test runs), it is 1.3%.
   Initially I had used a Romanian corpus, with an error rate of 0.9%
(compared to 1.7%, 2.5% and 4.2% obtained by the Xerox, Birmingham and
Brill taggers respectively).
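
   The percentages above are per-token error rates. As a minimal sketch of
how such figures can be computed and compared (the function names and the
array-of-strings representation are assumptions made for illustration, not
part of the announced software):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fraction of tokens whose predicted tag differs from the gold tag.
   Both arrays hold n NUL-terminated tag strings (e.g. "VAFIN"). */
double tag_error_rate(const char *const gold[], const char *const pred[],
                      size_t n)
{
    size_t errors = 0;
    assert(n > 0);                  /* an empty sample has no error rate */
    for (size_t i = 0; i < n; i++) {
        if (strcmp(gold[i], pred[i]) != 0)
            errors++;
    }
    return (double)errors / (double)n;
}

/* Relative error reduction of an improved error rate over a baseline.
   Both rates must be in the same unit (fractions or percentages). */
double relative_error_reduction(double baseline, double improved)
{
    assert(baseline > 0.0);
    return (baseline - improved) / baseline;
}
```

   With the Susanne figures quoted above, relative_error_reduction(3.8, 2.8)
comes to roughly 0.26, i.e. about a quarter fewer errors than the reported
TnT result.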

   The speed is comparable to that of TnT and can be adjusted by parameter
settings, trading speed against accuracy (though accuracy is not affected
much).
   The incremental operating mode and the segmentation of the data
structures allow running on computers with very little memory.
   The output is intuitive (no hostile binary matrix), in a form analogous
to the input of some expert systems.
   Alternative taggings are output with scores: unlike with other taggers,
they refer not to individual words but to whole sentence parts
(representing, in a sense, phrase surfaces of minimum energy).
   Special facilities exist, such as virtual tags and context
essentialisation (which yields the minimal set of contexts characteristic
of a certain linguistic style, useful not only for maximum accuracy and
speed), etc.
   Recently added features: wide-character support, the possibility of
being called as a simple library routine, and operation without files (a
complex file system is emulated in memory). For example (for the second
feature above), you can create an instance of the tagger (say, for English
with the Lancaster tagset), call it to tag a certain text buffer (writing
the resulting tags into another buffer) and finally destroy the instance,
all without using any disk file:
      t = GojolTagger_new("lancaster");
      error_code = GojolTagger_tag(t,input_buffer,output_buffer);
      GojolTagger_free(t);
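
   The three calls above follow a common C pattern: an opaque handle that
is created, used on in-memory buffers, and freed. The sketch below
illustrates that pattern only. ToyTagger and its functions are hypothetical
stand-ins, not the announced library: the types, the error-code convention
and the explicit output capacity are all assumptions, and the stub emits a
placeholder tag instead of doing any real tagging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an opaque tagger handle; not the real library. */
typedef struct ToyTagger {
    char tagset[32];
} ToyTagger;

/* Create an instance for a named tagset; returns NULL on failure. */
ToyTagger *ToyTagger_new(const char *tagset)
{
    ToyTagger *t = malloc(sizeof *t);
    if (t == NULL || strlen(tagset) >= sizeof t->tagset) {
        free(t);
        return NULL;
    }
    strcpy(t->tagset, tagset);
    return t;
}

/* Write one placeholder tag per whitespace-separated token of `in` into
   `out` (capacity out_size). Returns 0 on success, -1 if `out` is too
   small. A real tagger would choose tags from context; this stub emits
   "X" for every token. */
int ToyTagger_tag(const ToyTagger *t, const char *in, char *out,
                  size_t out_size)
{
    size_t used = 0;
    int in_token = 0;
    assert(t != NULL);
    if (out_size == 0)
        return -1;
    for (; *in != '\0'; in++) {
        int is_space = (*in == ' ' || *in == '\n' || *in == '\t');
        if (!is_space && !in_token) {   /* first character of a token */
            if (used + 2 >= out_size)   /* need "X ", plus the final NUL */
                return -1;
            out[used++] = 'X';
            out[used++] = ' ';
        }
        in_token = !is_space;
    }
    out[used] = '\0';
    return 0;
}

/* Release the instance; no files are involved at any point. */
void ToyTagger_free(ToyTagger *t)
{
    free(t);
}
```

   The explicit out_size argument is an addition made here so that the stub
can refuse to overrun the output buffer; the announced call passes only the
two buffers.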

   Everything is built on two essentially new concepts: organicity and
context propagation. I have not published anything about them, to preserve
their commercial appeal. The accuracy, comparable to that of manual
tagging, led me to find many errors in the corpora used: 98 in Negra and 36
in Susanne. Prof. G. Sampson replied gratefully, saying that it was the
first time somebody had reported more than 2 errors, and that my findings
make a new version of Susanne necessary. The handling of very large
contexts could even change the design of current tagsets, by cancelling
some unnatural decisions (motivated only by the inability of existing
taggers to see beyond a 3-token neighborhood), such as those concerning
auxiliary verbs, participles etc., thus removing some burden from the
subsequent stages of text processing.
   It is written in C (Linux). Demos for German (Negra) and English
(Susanne) are available.

..............................................................................

   LANGUAGE INDEPENDENT STATISTICAL PARSER

   After learning from a 46,000-word POS-tagged corpus and a
32,000-word parsed (treebank) corpus, a 2,000-word text (not
included in either corpus) is parsed (tagging excluded) in
6 seconds (on a 200 MHz machine) with 2% incomplete trees;
even for these declared failures, well-formed trees
sufficient for a subsequent translator are provided. The
extracted grammar has ca. 12,000 rules. The Negra corpus of
German was used. After learning from a 17,000-word parsed
corpus and the same 46,000-word POS-tagged one, a 2,000-word
text included in the former (but excluded from the latter),
which guarantees that the grammar is complete relative to it
(i.e. contains all the rules necessary for its correct
parsing), is processed in 2 seconds with no incomplete
trees; the extracted grammar has ca. 7,000 rules. The system
is language independent: for English, on the Susanne corpus,
comparable results are obtained. To get an acceptable parser
for any other language, you essentially need only a corpus
of 30,000 tagged words, of which 20,000 are also parsed; for
optimal results, 50,000 and 30,000 respectively.
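   This announcement does not say how the statistical grammar
is derived from the treebank. One textbook approach, shown
here purely as an assumption and not as the parser's actual
method, is relative-frequency estimation: each rule's weight
is its count divided by the count of all rules sharing its
left-hand side. The Rule type and rule_probability are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One rewrite observed in a treebank, e.g. lhs "NP", rhs "ART NN". */
typedef struct {
    const char *lhs;
    const char *rhs;
} Rule;

/* Relative-frequency estimate: how often `lhs` was rewritten as `rhs`
   among all observed rewrites of `lhs`. Returns 0.0 if `lhs` was never
   observed. */
double rule_probability(const Rule *observed, size_t n,
                        const char *lhs, const char *rhs)
{
    size_t lhs_count = 0, rule_count = 0;
    assert(observed != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(observed[i].lhs, lhs) != 0)
            continue;               /* different left-hand side */
        lhs_count++;
        if (strcmp(observed[i].rhs, rhs) == 0)
            rule_count++;
    }
    return lhs_count == 0 ? 0.0 : (double)rule_count / (double)lhs_count;
}
```

   A linear scan keeps the sketch short; for a grammar of
several thousand rules, a hash table keyed on the left-hand
side would be the more practical choice.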
   The parser can accept a set of rules intended to modify
the statistical grammar deduced from the corpus. Moreover,
it can take a context-free grammar as its only input (in
which case it ceases to be a statistical parser), but in
this operating mode it requires much time and memory (during
learning, not during parsing itself) if the grammar is
oversized. The statistical grammar is refined not by simply
adding the proposed rules, but by modifying the corpus, so
as to exploit all the real contexts possible for them.
   Semantic processing could easily be inserted at rule
reduction points. In fact, this generalized parser can also
work as a compiler generator: by appending specific semantic
routines, you get efficient compilers for C, Pascal etc.
This versatile system has more than 40 parameters that tune
accuracy and speed according to the target language sample.
The output is in treebank format and, optionally, in
graphical form (with the trees actually drawn).
   Linux demos exist for German and English. As only the
minimal definition of C is used, it is easily adaptable to
any machine (for other Unix-like operating systems, a simple
recompilation would probably be sufficient).

---------------------------------------------------------------------------
LINGUIST List: Vol-12-511


