[Corpora-List] Morphology of Spanish, Portuguese, Italian - Summary
Sergei A. Koval
skoval at online.ru
Fri Mar 28 14:33:33 UTC 2003
Dear colleagues,
Some ten days ago I posted to this list an enquiry about morphological resources for Spanish, Portuguese and Italian.
I was interested in any data freely available on the Internet that would cover the inventory of inflectional classes (or paradigms) for those three languages, in any format (be it lists of desinences for cut-and-paste implementations, or lexica implemented as finite-state machines).
I am very much obliged to all who sent me very useful links and suggestions. Nearly all of those replies were sent directly to me, and I now have the pleasure of summarising them for the Corpora list.
******************
Spanish morphology
******************
What could be described more aptly by the term "cut-and-paste" morphology than the Porter stemmer and ISPELL affix files? No wonder, then, that Mike Maxwell (maxwell at ldc.upenn.edu) guided me to the ISPELL site,
which has a Spanish section at
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Spanish-dicts
Jonathan Young pointed me to the Spanish verb conjugator compjugador by Daniel M. German at http://compjugador.sourceforge.net/ .
Of particular interest for my purposes are the data files in the archive text-compjugador-0.1.tar.gz available there, which support the functionality described as follows:
"It is able to conjugate all the verbs in the official Spanish (as in the Diccionario de la Real
Academia). It contains close to 10,000 verbs"
Another valuable link from Jonathan Young was to the software Verba (Ilya Braud & Perry Rapp), which adds to input files HTML-style markup supporting float-overs with translations. It is at
http://www.geocities.com/getverba/verba.html . To date Verba works with the Latin-English, Spanish-English, and English-Spanish language pairs. What was important for me:
"It understands simple Latin and Spanish inflections, recognizing, for example, "stellarum" (Latin), or "hablaste" (Spanish)."
The Spanish morphological data can be downloaded separately, as the file spdata.2003-01-03.zip.
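To make the idea of recognising a form like "hablaste" concrete, here is a minimal sketch of ending-based analysis for the regular Spanish preterite. It is not Verba's method or data format, which I have not reproduced here; the names PRETERITE, LEXICON and analyse are hypothetical, and the three-verb lexicon is only a stand-in for a real verb list.

```python
# Regular preterite endings for the three Spanish conjugation classes.
PRETERITE = {
    "ar": ["é", "aste", "ó", "amos", "asteis", "aron"],
    "er": ["í", "iste", "ió", "imos", "isteis", "ieron"],
    "ir": ["í", "iste", "ió", "imos", "isteis", "ieron"],
}
PERSONS = ["1sg", "2sg", "3sg", "1pl", "2pl", "3pl"]

# Hypothetical toy lexicon of infinitives (assumption: a real system
# would load several thousand verbs from a data file).
LEXICON = {"hablar", "comer", "vivir"}


def analyse(form):
    """Return (infinitive, person) pairs compatible with the given form."""
    results = []
    for verb_class, endings in PRETERITE.items():
        for person, ending in zip(PERSONS, endings):
            if form.endswith(ending):
                # Cut the ending off and paste the infinitive ending on.
                lemma = form[: -len(ending)] + verb_class
                # The lexicon check filters out impossible candidates.
                if lemma in LEXICON:
                    results.append((lemma, person))
    return results


print(analyse("hablaste"))
```

Without the lexicon check every matching ending would yield a candidate, so the verb list does most of the disambiguation in this kind of approach.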
*********************
Portuguese morphology
*********************
David Matos (David.Matos at ACM.org) directed me to the Linguateca site:
http://www.linguateca.pt/
which is an invaluable collection of links to various resources on the Portuguese language.
The pages of this site contain, among others, such sections as "Ajuda a redaccao" (which includes references to the ISPELL dictionaries for the Portuguese of Portugal and for Brazilian Portuguese), "Componentes basicos de um sistema de Processamento de Linguagem Natural: analisadores ou geradores da lingua", and "Conjugadores verbais", as well as links to numerous "Dicionarios gerais", among which I may well find further accounts of the Portuguese inflectional system of the kind I am looking for.
With great interest and gratitude I received from Viviane Orengo (V.Orengo at mdx.ac.uk)
the data accompanying her own stemming algorithm for Portuguese. As she wrote:
"It contains the most common Portuguese suffixes and rules to remove them."
The information I asked about in my original enquiry is contained in the C++ source code included with those data.
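As an illustration of what "suffixes and rules to remove them" can look like, here is a minimal sketch of plural reduction for Portuguese. The rules, the exception list and the rule format (suffix, replacement, minimum stem length) are my own assumptions chosen for the example; they are not taken from Viviane Orengo's data.

```python
# Hypothetical plural-reduction rules, tried in order (longest suffix first):
# (suffix, replacement, minimum length of the remaining stem).
PLURAL_RULES = [
    ("ões", "ão", 3),   # balões  -> balão
    ("ães", "ão", 1),   # pães    -> pão
    ("ais", "al", 2),   # animais -> animal
    ("s", "", 2),       # casas   -> casa
]

# Hypothetical exception list: words ending in -s that are not plurals.
EXCEPTIONS = {"lápis", "ônibus"}


def strip_plural(word):
    """Apply the first rule whose suffix matches and whose stem is long enough."""
    if word in EXCEPTIONS:
        return word
    for suffix, replacement, min_stem in PLURAL_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if len(stem) >= min_stem:
                return stem + replacement
    return word


for w in ["balões", "pães", "animais", "casas", "lápis"]:
    print(w, "->", strip_plural(w))
```

Rules of this kind inevitably over-apply to some words, which is why a minimum stem length and an exception list are included in the sketch.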
******************
Italian morphology
******************
Surprisingly enough (at least for me), I did not get much for Italian.
Following Mike Maxwell's links I arrived at the ISPELL dictionary and affix file for Italian:
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Italian-dicts
...and that's it! (The only exception was the data supplied by one contributor, who asked me not to disclose his details for fear of spam.)
Thanks a lot again to all who supplied these precious links!
Sergei Koval
Doctoral Researcher
St. Petersburg State University
skoval at online.ru
or
englearner at yahoo.com
Mike Maxwell of the Linguistic Data Consortium
maxwell at ldc.upenn.edu
assured me that it is no big sweat to type in all the endings from a common Spanish dictionary or introductory grammar since, for verbs for example, there are only three basic paradigms plus fifty-odd "irregular verbs".
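A minimal sketch of that suggestion, assuming only the present indicative is needed: the three regular paradigms are typed in as ending lists, and irregular verbs are stored as full overrides. The names PRESENT_ENDINGS, IRREGULAR_PRESENT and conjugate_present are hypothetical, and only one irregular verb is listed by way of example.

```python
# Present indicative endings for the three regular paradigms
# (1sg, 2sg, 3sg, 1pl, 2pl, 3pl).
PRESENT_ENDINGS = {
    "ar": ["o", "as", "a", "amos", "áis", "an"],
    "er": ["o", "es", "e", "emos", "éis", "en"],
    "ir": ["o", "es", "e", "imos", "ís", "en"],
}

# Irregular verbs are listed in full; a complete table would hold
# the fifty-odd irregular verbs mentioned above.
IRREGULAR_PRESENT = {
    "ser": ["soy", "eres", "es", "somos", "sois", "son"],
}


def conjugate_present(infinitive):
    """Return the six present indicative forms of a Spanish verb."""
    if infinitive in IRREGULAR_PRESENT:
        return list(IRREGULAR_PRESENT[infinitive])
    stem, verb_class = infinitive[:-2], infinitive[-2:]
    if verb_class not in PRESENT_ENDINGS:
        raise ValueError("not a Spanish infinitive: " + infinitive)
    # Cut off the infinitive ending, paste on each personal ending.
    return [stem + ending for ending in PRESENT_ENDINGS[verb_class]]


print(conjugate_present("hablar"))
print(conjugate_present("ser"))
```

Stem-changing verbs such as "pensar" would come out wrong here unless they are added to the override table or handled by further rules.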