Corpora: Parsing morphologically rich languages

Gabriel Pereira Lopes gpl at di.fct.unl.pt
Tue Jan 16 14:43:44 UTC 2001


Dear Alexander,

Working with morphologically rich languages does not necessarily require
a large number of POS tags. A large tagset would in turn require very
large hand-tagged corpora in order to train taggers and have text
automatically POS-tagged.

We work with approximately 40 tags for Portuguese, and that does not
prevent our parser from starting with automatically POS-tagged text and
finding the required morpho-syntactic information. Of course, we are
working with a kind of PROLOG (DyALog) that enables the construction of
quite effective chart parsers (250 words/sec.) that, moreover, can be
used for fault finding and fault repairing. Have a look at the web page
of Vitor Rocio (http://pc-gpl.di.fct.unl.pt/~vjr/), from where you can
pick up some of our publications (Publicações) on this subject matter.
See:

Vitor Rocio and J.G.P. Lopes. 1999. "An infra-structure for diagnosing
    causes for partially parsed natural language input". In: ACTAS-I VI
    Simposio Internacional de Comunicación Social (Proceedings of the
    6th International Symposium on Social Communication). Santiago de
    Cuba, January 25-28, 1999. Santiago de Cuba: Editorial Oriente
    (ISBN 959-11-0250-X). pp. 550-554.
and

V. Rocio and J.G.P. Lopes. 1998. "Partial Parsing, Deduction and
    Tabling". In: B. Lang (ed.) Actes des premières Journées sur la
    Tabulation en Analyse Syntaxique et Déduction, April 2-3, 1998.
    Paris. Rocquencourt, France: INRIA. pp. 52-59.

You may ask Vitor Rocio for the following paper:

V. Rocio, E. de la Clergerie and J.G.P. Lopes. 2001. "Tabulation for
    multi-purpose partial parsing". Grammars. 4.1. Kluwer Academic
    Publishers. (to appear).
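
As a rough illustration of the point above, that a small tagset need not
stop a parser from recovering morpho-syntactic information, here is a
minimal Python sketch. It is not our DyALog parser. The toy lexicon, the
tag names DET and N, and the functions readings, unify and
noun_phrase_features are all hypothetical, made up for this example. The
idea is only that the tagger supplies a coarse tag, while features such
as gender and number are looked up in a lexicon and checked for
agreement at parse time.

```python
# Hypothetical sketch, NOT the DyALog parser mentioned in this message.
# Coarse POS tags come from the tagger; fine-grained features come from
# a lexicon and are checked by simple feature unification.

# Toy lexicon (assumed entries): (word form, coarse tag) -> feature readings.
LEXICON = {
    ("a", "DET"): [{"gender": "f", "number": "sg"}],
    ("os", "DET"): [{"gender": "m", "number": "pl"}],
    ("casa", "N"): [{"gender": "f", "number": "sg"}],
    ("livros", "N"): [{"gender": "m", "number": "pl"}],
    # A noun with two possible genders keeps both readings.
    ("estudante", "N"): [
        {"gender": "m", "number": "sg"},
        {"gender": "f", "number": "sg"},
    ],
}


def readings(word, tag):
    """Return the feature readings of a tagged word.

    Words missing from the lexicon get one unconstrained reading.
    """
    return LEXICON.get((word.lower(), tag), [{}])


def unify(a, b):
    """Unify two flat feature dictionaries; return None on a clash."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def noun_phrase_features(tagged):
    """For a tagged [determiner, noun] pair, return the feature sets
    on which both words agree (an empty list means no agreement)."""
    (det, det_tag), (noun, noun_tag) = tagged
    if det_tag != "DET" or noun_tag != "N":
        return []
    results = []
    for det_reading in readings(det, det_tag):
        for noun_reading in readings(noun, noun_tag):
            unified = unify(det_reading, noun_reading)
            if unified is not None:
                results.append(unified)
    return results
```

With only the two coarse tags, noun_phrase_features([("a", "DET"),
("estudante", "N")]) keeps just the feminine singular reading, and
("os", "DET") followed by ("casa", "N") yields no reading at all. A
failure of that kind is the sort of thing a parser can use as a starting
point for fault finding.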

Our work on POS tagging, which uses a neural-net based POS-tagger
generator, requires just a small hand-tagged corpus (5,000 words were
enough) and a large lexicon. The precision we obtained was approximately
94% for very badly written Portuguese (without diacritics) and 98% for
more carefully written text (for this experiment we used 20,000 words of
hand-corrected, automatically POS-tagged text). This work was used for
extracting subcategorization patterns. The literature we produced on
this subject matter, written in English, can be found in:

Marques and Lopes and Coelho. (2000b). "Mining Subcategorization
Information by Using Multiple Feature Loglinear Models". In Paola
Monachesi (ed.) Computational Linguistics in the Netherlands 1999:
selected papers from the Tenth CLIN Meeting. Amsterdam-Atlanta, GA 2000:
Rodopi. Electronic version:
http://www-uilots.let.uu.nl/publications/clin1999/papers.html.

Marques and Lopes and Coelho (1998a). "Learning Verbal Transitivity
using LogLinear Models". In: Claire Nédelec and Céline Rouveirol (eds.).
Machine Learning: ECML-98, 10th European Conference on Machine
Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings. Lecture
Notes in Artificial Intelligence 1398. Berlin: Springer Verlag. pp.
19-24.

Marques and Lopes and Coelho (1998b). "Using Loglinear Clustering for
subcategorization identification". In: J. Zytkow and M. Quafafou (eds.)
Principles of Data Mining and Knowledge Discovery, 2nd European
Symposium, PKDD'98, Nantes, France, September 1998, Proceedings. Lecture
Notes in Artificial Intelligence 1510. Berlin: Springer Verlag. pp.
379-387.

Marques and J.G.P. Lopes. 1996. "Using Neural Nets for Portuguese
Part-of-Speech Tagging". In: Proceedings of the Fifth International
Conference on The Cognitive Science of Natural Language Processing.
Dublin City University, September 2-4, 1996.


Best regards,

Gabriel Pereira Lopes


Alexander Mikhailian wrote:

> Hello,
>
> I am looking for references to syntactic parsers
> that deal with morphologically rich flexive languages.
>
> In particular, I am interested in :
>
> 1. Approaches to deal with the number of POS tags
> (terminals) that would supposedly be larger
> than for English or French, e.g. if one tries
> to build a list of POS tags for a morphologically
> rich language in order to follow approaches
> developed for English, this list may easily grow up
> to thousands of entries which implies that grammars
> using such a huge list of terminals would be quite
> complicated.
>
> 2. Approaches to deal with the free or loosely
> restricted order of words that is often proper to
> morphologically rich languages and which requires
> different parsing techniques than for English,
> where a common shift/reduce parser is often sufficient.
>
> Thanks in advance,
>
> --
> Alexander Mikhailian


