Here is a summary of the answers I received about existing linguistic
resources for Spanish.
Thanks to:
Gerardo Arrarte
Fernando Sanchez Leon
Ruthanna Barnett
Alice Carlberger
Rodrigo Santurio
James L. Fidelholtz
Cesar Romani
Joerge Koch
Jose L. Rodrigo
Martin Beaumont Franowsky
Steve Helmreich
Eduardo A. Martinez Labrada
Mon Alameda
Erik Oltmans
...and many more
- ------------------------------------------------------------------
The Instituto Cervantes, a Spanish public body devoted mainly to
promoting the Spanish language and the culture of the
Spanish-speaking peoples around the world, carries out various
activities to encourage research on the Spanish language.
Among other activities in the field of Language Technology, we are
setting up an office whose aim will be to promote the Language
Industries as applied to Spanish. To that end, we consider it
essential to collect and disseminate information on ongoing
activities and on the linguistic resources available at different
research centres.
So far, we have carried out a survey of Spanish corpora, existing or
under development, at Spanish research centres, and have compiled
the results in a 56-page report that I will be glad to send you. In
the future, we plan to extend this inventory with data on other
types of linguistic resources and with data from projects under way
in other countries.
Gerardo Arrarte Carriquiry
Programas de Tecnologia Linguistica
Instituto Cervantes
Libreros, 23
E-28801 ALCALA DE HENARES (Madrid)
E-mail: g.arrarte at cervantes.es
Tel: +34 1 885 62 03
Fax: +34 1 883 50 10
- ------------------------------------------------------------------
The ITU corpus is available as part of the ECI (European Corpus
Initiative) corpus, which can be obtained through ELSNET. The address is
as follows:
email: elsnet at let.ruu.nl
mail : OTS, Trans 10, 3512 JK, Utrecht, The Netherlands
tel : +31 30 53 6039
fax : +31 30 53 6000
www : http://www.cogsci.ed.ac.uk/elsnet/home.html
It is a trilingual corpus (Spanish, English, French). The version we are
preparing includes hand-corrected morphosyntactic tagging of 1 million
words of the corpus. This version will be in the public domain from
October of this year.
The Spanish version of the Xerox tagger will also be in the public domain
on that date.
We have other corpora in our laboratory, as you will have seen on the
CORPORA list (I include part of an announcement):
There are some Spanish corpora that you can retrieve from our
laboratory. They are all documented. The corpora can be downloaded from
the following address:
Host: lola.lllf.uam.es
Login: anonymous
Password: <send your e-mail address>
At this moment, we have a corpus of spoken Spanish in orthographic transcription
Directory: pub/corpus/oral
And a corpus of written Spanish texts from Argentina and Chile
Directory: pub/corpus/argentina
All the corpora include texts in one of the topics you are interested
in. Note that the oral corpus is compressed using the UNIX command
'compress', while the other two are .zip files produced with DOS compression
utilities (take a look at the README files).
Fernando Sanchez Leon
fsanchez at ccuam3.uam.es
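The retrieval steps in the announcement above could be scripted roughly as
follows. This is only a minimal sketch, assuming Python's standard ftplib
module. The helper names unpack_command and fetch_directory are
hypothetical, the host and directories are the ones quoted above and may
no longer be reachable, and the announcement does not give file names, so
the code simply downloads whatever the server lists.

```python
from ftplib import FTP
from pathlib import PurePosixPath

HOST = "lola.lllf.uam.es"  # host quoted in the announcement
DIRECTORIES = ["pub/corpus/oral", "pub/corpus/argentina"]


def unpack_command(filename):
    """Return the command suggested by a file's suffix, or None.

    '.Z' is the suffix written by the UNIX 'compress' utility;
    '.zip' archives need an unzip tool. Other files are left alone.
    """
    suffix = PurePosixPath(filename).suffix
    if suffix == ".Z":
        return ["uncompress", filename]
    if suffix.lower() == ".zip":
        return ["unzip", filename]
    return None


def fetch_directory(directory, email, host=HOST):
    """Download every entry of one directory via anonymous FTP.

    Assumption: the directory holds only plain files (a RETR on a
    subdirectory would fail).
    """
    downloaded = []
    with FTP(host) as ftp:
        # Anonymous login, with the e-mail address as the password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        for name in ftp.nlst():
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            downloaded.append(name)
    return downloaded
```

Calling fetch_directory(DIRECTORIES[0], "you at example.org") and then
passing each returned name to unpack_command would tell you which tool to
run on it; the README files remain the authoritative guide.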
NOTE: More information about the Xerox tagger is available from:
email: lexical at crl.nmsu.edu
ftp://clr.nmsu.edu
Ftp Directory: members-only/tools/ling-analysis/syntax/xerox-tagger/
This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen
at Xerox, was written in ANSI Common Lisp. It was developed in Franz
Allegro Common Lisp version 4.1 on SunOS 4.x and Macintosh Common Lisp
2.0p2. The following is provided: source code, a tokenizer for plain
ASCII English, an English lexicon induced from the Brown corpus, a table
mapping word suffixes to likely ambiguity classes, and an HMM trained on
the odd-numbered sentences in the Brown corpus. More Info: info/XEROX.
ftp://parcftp.xerox.com/pub/tagger
If you need to install Common Lisp to run it, several good free
implementations exist.
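To illustrate the kind of model the description refers to, here is a
minimal sketch of HMM decoding with the Viterbi algorithm. It is written
in Python, not the tagger's Common Lisp, and it is not the Xerox code: the
function name, the three-tag tag set and every probability below are
invented for illustration only.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under an HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability, floored so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are made up).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.2, "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"el": 0.5, "la": 0.5},
    "NOUN": {"casa": 0.4, "gato": 0.4},
    "VERB": {"casa": 0.2, "come": 0.8},
}
```

With this toy model, "casa" is ambiguous between NOUN and VERB, and the
transition from DET resolves it:
viterbi(["la", "casa"], TAGS, START, TRANS, EMIT) gives ["DET", "NOUN"].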
- --------------------------------------------------------------------
European Corpus Initiative corpora available on CD-ROM:
Information technology, EU, 26,000 words
El Diario Sur, local newspaper from Malaga, belongs to a national publisher,
in existence for 40 years.
Different writing styles, 500,000 words.
Telecommunication user manual, several 100,000 words.
Xerox ScanWorx user manual, 45,000 words.
Civil law, Switzerland, 600,000 words.
Minimally processed by ECI; contains errors and duplication, but the CLEAN
and F files are clean(?)
El Diario Vasco, newspaper
CLEAN files, news, few errors, 300,000 words
FC files, 177,000 words
The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50: over 4 million
words of clean, high-quality written text.
Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley
The EU MULTEXT project is collecting a corpus that will contain parallel
texts from the European Parliament and financial newspaper articles
(Spanish from the Expansion newspaper). Licence agreements for these data
are still being finalized.
The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR: speech and text corpora,
lexicons, NLP programs and tools, and related databases and systems.
Multilingual Web pages: http://www.XX.relator.research.ec.org (XX =
two-letter country codes of the EU countries such as de, uk, etc.). Only
speech materials.
Alice Carlberger
alice at speech.kth.se
- --------------------------------------------------------------------
We have been working on a Spanish to English Machine Translation
system and so have access to a large corpus of Spanish text and have
developed a tagger for general newspaper articles. Although the
tagger uses proprietary information (Collins Spanish-English on-line
dictionary), we will shortly make the results available on-line. That
is, you will be able to e-mail Spanish texts and they will be returned
tagged with part of speech.
Steve Helmreich
shelmrei at crl.nmsu.edu
- --------------------------------------------------------------------
CMSFI52 at vmesa.cpd.uniovi.es
- --------------------------------------------------------------------
You may find the list Terminometro electronico en espanhol useful.
The list address is LATIN-TE at FRMOP11.CNUSC.FR
The list server is LISTSERV at FRMOP11.CNUSC.FR
Martin Beaumont Franowsky
- --------------------------------------------------------------------
El Colegio de México has long been running a project (the
Diccionario del español de México) whose principal investigator is
Luis Fernando Lara. He has an Internet account, but I don't have the
address at hand, so here is his snail-mail address:
Dr. Luis Fernando Lara
El Colegio de México
Camino al Ajusco
México, D. F.
They have compiled frequency counts from a corpus of approximately 2
million words (if I remember correctly), and they have a program that
assigns words to their part of speech.
James L. Fidelholtz
jfidel at udlapvms.pue.udlap.mx
jfidel at unm.edu
- --------------------------------------------------------------------
We work with large language corpora, and have built tools for extracting
linguistic information:
- a program for automatic search and extraction of lemmas with their
  context: REAL
- a program for segmentation and morphological tagging of lemmas: SMORPH.
Jose L. Rodrigo
jose at gril.univ-bpclermont.fr
34 Av. Carnot, F - 63037 Clermont-Ferrand Cedex
rodrigo at eucmax.sim.ucm.es
Facultad de Filologia
Universidad Complutense de Madrid
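A keyword-in-context search of the general kind described for REAL could
look like the sketch below. This is not REAL itself, whose interface is
not described in the message. The function name kwic is hypothetical, and
the sketch matches surface word forms rather than lemmas, since lemma
lookup would need a lexicon.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Contexts are up to `width` words on each side. Matching is
    case-insensitive and works on word forms, not lemmas.
    """
    # \w matches accented letters and n-tilde in Python 3 strings.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits
```

For example, kwic("El gato come pescado y el perro mira al gato negro",
"gato", width=2) returns two hits, one for each occurrence of "gato".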
- --------------------------------------------------------------------
You might want to check out the AGFL Grammar WorkLab which
also contains a small grammar for the Spanish Noun Phrase.
The author, Paula Maria Santalla, can be contacted through
paula at cs.kun.nl. The URL of the AGFL home page is:
Erik Oltmans
Department of Computer Science
University of Nijmegen
Nijmegen, The Netherlands
- --------------------------------------------------------------------
The Autonomous University of Nuevo Leon College of Medicine,
Monterrey, Mexico, and California State University at
Fullerton (CSUF) make available "Spanish 92" (the 2,000 most
frequent words of Spanish), based on ESPAÑOL 92 (E92), a
computational linguistic analysis of a million-word corpus of
contemporary Spanish carried out between 1986 and 1992 under a
grant from the Secretariat of Public Education of the Mexican
government.
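A frequency list of this general kind can be derived from any plain-text
corpus. The sketch below is not the E92 procedure, which the announcement
does not describe; the function name frequency_list and the tokenization
rule (runs of letters, lower-cased) are assumptions made for illustration.

```python
import re
from collections import Counter


def frequency_list(text, top_n=2000):
    """Return the `top_n` most frequent word forms with their counts."""
    # [^\W\d_]+ matches runs of letters only (accented letters included),
    # so digits and underscores are not counted as words.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(top_n)
```

For instance, frequency_list("de la casa de la calle de Madrid", top_n=2)
returns [("de", 3), ("la", 2)].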
"Spanish 92" is available from the ftp server at CSUF:
ftp wintermute.fullerton.edu
user> anonymous
pw> username at host.domain
FTP> cd /pub/research/chandler
Prof. R. M. Chandler-Burns
College of Medicine
Autonomous University of Nuevo Leon
Monterrey, MEXICO
Gabriel Amores
Departamento de Lengua Inglesa
Universidad de Sevilla
Prof. Chandler-Burns's address is rchandlr at ccr.dsi.uanl.mx
- --------------------------------------------------------------------
email: lexical at crl.nmsu.edu
ftp://clr.nmsu.edu
Parallel Text in English and Spanish
Pan American Health Organization
Ftp Directory: members-only/corpora/PAHO/
The Pan American Health Organization (PAHO), Conferences and General
Services Division, has kindly allowed this group of sample parallel
texts to be released for NLP research purposes. There are 180 pairs
of text, 360 individual files, which amount to about 8 Mb of data.
The documents cover the general domains of Public Health and Latin
America, but vary greatly in content and in length. Some are short
memos or letters, most are longer reports and conference proceedings.
The Spanish documents do contain the Spanish character encoding.
Other formatting commands, such as tabs, centering, italicizing, etc.
have been removed. Special thanks to Dr. Marjorie Leon for her
assistance in making these texts available.
- --------------------------------------------------------------------
The PAPPI System: A Principle-Based Parser
Announcing the first public release of PAPPI, a Prolog-based
natural language parser for theories in the Principles-and-
Parameters framework. PAPPI is designed to run on Sun
SPARCstations with Quintus Prolog. The PAPPI system includes:
* An X-Window system-based user interface to the
underlying Prolog-based parser.
* A sample implementation of classic GB-theory, based on
the theory described in Lasnik and Uriagereka's textbook
"A Course in GB Syntax". The implementation also includes
sets of example sentences and sample parameterization for
six languages. Currently, these are English, Japanese,
Dutch, French, Spanish and German. (This software was
recently demoed at COLING '94.)
PAPPI is a parser that is designed to be a high-level research
tool for experimenting with and learning about linguistic
theory. This release represents just one possible instantiation
within the Principles-and-Parameters framework. Users are
encouraged to experiment with and modify the sample principles.
The PAPPI system represents code written to support research
work. It is still very much under development. Alternate
theories (and more sophisticated parsing models) will be made
publicly available at a later stage. Upcoming releases may
also support other platforms and may not need Quintus Prolog.
This is free software developed at the NEC Research Institute,
Inc., an institute for conducting long-term, fundamental
research in computer and physical sciences. Comments and
suggestions for improvement to the system will be gratefully
accepted! I would also like to hear from those interested in
extending the system. The PAPPI project also welcomes unencumbered
software contributions, including (but not limited to) support
for additional languages, theory and debugging tools.
The system is available for anonymous ftp as:
[Note: X is an alphabetic character denoting the current
minor release.]
A .gz compressed version of the same tar file is also
available as:
This version is recommended for those installations
having GNU gzip.
Current requirements:
Sun SPARCstation
SunOS 4.1.3 or 5.3 (aka Solaris 2.3)
Quintus Prolog 3.1.4 or 3.1.1 (June 1992)
Approx. 35MB of disk space (55-70MB to install)
Contact address:
Dr. Sandiway Fong
NEC Research Institute, Inc.
Princeton NJ 08540
Email: sandiway at research.nj.nec.com
Fax: (609) 951-2482
- --------------------------------------------------------------------
Please send any other information about Spanish resources to
my e-mail address (I will no longer be subscribed to the list).
Thank you very much!
Pablo Accuosto
Facultad de Ingenieria
Universidad de la Republica
Montevideo - Uruguay
e-mail: accuosto at fing.edu.uy
