Corpora: Parsing morphologically rich languages
Gabriel Pereira Lopes
gpl at
Tue Jan 16 14:43:44 UTC 2001
Dear Alexander,
Working with morphologically rich languages does not necessarily require
a large number of POS tags, since a large tagset would in turn require very
large hand-tagged corpora in order to train taggers and have text
automatically tagged.
We work with approximately 40 tags for Portuguese, and that does not
prevent our parser from starting with automatically POS-tagged text and
finding out the required morpho-syntactic information. Of course, we are
working with a kind of PROLOG (DyALog) that enables the construction of
quite efficient chart parsers (250 words/sec.) that can moreover be used
for fault finding and fault repairing. Have a look at the web page by
Vitor Rocio ( from where you can pick
up some of our publications (Publicações) on this subject matter. See:
Vitor Rocio and J.G.P. Lopes. 1999. "An infra-structure for diagnosing
causes for partially parsed natural language input". In: ACTAS-I VI
Simposio Internacional de Comunicación Social (Proceedings of the 6th
International Symposium on Social Communication). Santiago de Cuba,
January 25-28, 1999. Santiago de Cuba: Editorial Oriente
(ISBN 959-11-0250-X). pp.
V. Rocio and J.G.P. Lopes. 1998. "Partial Parsing, Deduction and
Tabling". In: B. Lang (ed.) Actes des premières Journées sur la
Tabulation en Analyse Syntaxique et Déduction, April 2-3, 1998. Paris.
Rocquencourt, France: INRIA. pp. 52-59.
You may ask Vitor Rocio for the following paper:
V. Rocio, E. de la Clergerie and J.G.P. Lopes. 2001. "Tabulation for
multi-purpose partial parsing". Grammars. 4.1. Kluwer Academic
Publishers. (to appear).
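As a rough illustration of the point about a small tagset: the sketch below
is not our DyALog parser, only a toy in Python under stated assumptions. It
supposes that the tagger assigns coarse tags (DET, N) and that gender and
number are looked up in a lexicon and checked by unification at parse time.
The lexicon entries, the feature names and the function names are all
hypothetical.

```python
# Toy sketch (hypothetical names and data, not the DyALog implementation):
# coarse POS tags come from the tagger; fine-grained morpho-syntactic
# features are recovered from a lexicon while parsing.

# (word form, coarse tag) -> list of possible feature bundles
LEXICON = {
    ("o", "DET"): [{"gender": "m", "number": "sg"}],
    ("as", "DET"): [{"gender": "f", "number": "pl"}],
    ("livro", "N"): [{"gender": "m", "number": "sg"}],
    ("casas", "N"): [{"gender": "f", "number": "pl"}],
    # An ambiguous form keeps both readings until context decides.
    ("estudante", "N"): [
        {"gender": "m", "number": "sg"},
        {"gender": "f", "number": "sg"},
    ],
}


def analyses(form, tag):
    """Return the feature bundles for a tagged token.

    Unknown tokens get one empty bundle, i.e. no constraints.
    """
    return LEXICON.get((form.lower(), tag), [{}])


def unify(a, b):
    """Merge two feature bundles; return None if any feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def noun_phrase_features(det, noun):
    """Return every agreeing feature bundle for a DET + N pair."""
    results = []
    for det_feats in analyses(*det):
        for noun_feats in analyses(*noun):
            unified = unify(det_feats, noun_feats)
            if unified is not None:
                results.append(unified)
    return results


if __name__ == "__main__":
    print(noun_phrase_features(("as", "DET"), ("casas", "N")))
    print(noun_phrase_features(("o", "DET"), ("estudante", "N")))
    print(noun_phrase_features(("o", "DET"), ("casas", "N")))
```

With this arrangement the tagset seen by the tagger stays small, and
disagreement (as in the last call) shows up as an empty result that a
fault-finding step could then inspect.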
The work on POS tagging, using a neural-net based POS-tagger generator,
requires just a small hand-tagged corpus (5,000 words were enough) and a
large lexicon. The precision we obtained was approximately 94% for very
badly written Portuguese (without diacritics) and 98% for more carefully
written text (for this experiment we used 20,000 words of hand-corrected,
automatically POS-tagged text). This work was used for
extracting subcategorization patterns. The literature we produced on
this subject matter, written in English, can be found in:
Marques and Lopes and Coelho. (2000b). "Mining Subcategorization
Information by Using Multiple Feature Loglinear Models". In Paola
Monachesi (ed.) Computational Linguistics in the Netherlands 1999:
selected papers from the Tenth CLIN Meeting. Amsterdam-Atlanta, GA 2000:
Rodopi. Electronic version:
Marques and Lopes and Coelho (1998a). Learning Verbal Transitivity
using LogLinear Models. In: Claire Nédelec and Céline Rouveirol (eds.).
Machine Learning: ECML-98, 10th European Conference on Machine
Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings. Lecture Notes
in Artificial Intelligence 1398. Berlin: Springer Verlag. pp. 19-24.
Marques and Lopes and Coelho (1998b). Using Loglinear Clustering for
subcategorization identification. In: J. Zytkow and M. Quafafou (eds.)
Principles of Data Mining and Knowledge Discovery, 2nd European
Symposium, PKDD'98, Nantes, France September, 1998, Proceedings. Lecture
Notes in Artificial Intelligence 1510. Berlin: Springer Verlag. pp.
Marques and J.G.P. Lopes. 1996. "Using Neural Nets for Portuguese
Part-of-Speech Tagging". In: Proceedings of the Fifth International
Conference on The Cognitive Science of Natural Language Processing
Dublin City University, September 2-4, 1996.
Best regards,
Gabriel Pereira Lopes
"Alexander Mikhailian
> Hello,
> I am looking for references to syntactic parsers
> that deal with morphologically rich flexive languages.
> In particular, I am interested in :
> 1. Approaches to deal with the number of POS tags
> (terminals) that would supposedly be larger
> than for English or French, e.g. if one tries
> to build a list of POS tags for a morphologically
> rich language in order to follow approaches
> developed for English, this list may easily grow up
> to thousands of entries which implies that grammars
> using such a huge list of terminals would be quite
> complicated.
> 2. Approaches to deal with the free or loosely
> restricted order of words that is often proper to
> morphologically rich languages and which requires
> different parsing techniques than for English,
> where a common shift/reduce parser is often sufficient.
> Thanks in advance,
> --
> Alexander Mikhailian