6.990, Sum: Recursos para el espanol (spanish resources)

The Linguist List linguist at tam2000.tamu.edu
Thu Jul 20 06:07:45 UTC 1995


---------------------------------------------------------------------------
LINGUIST List:  Vol-6-990. Thu Jul 20 1995. ISSN: 1068-4875. Lines:  481
 
Subject: 6.990, Sum: Recursos para el espanol (spanish resources)
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
 
Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: dizdar at tam2000.tamu.edu (Ann Dizdar)
 
---------------------------------Directory-----------------------------------
1)
Date:  Wed, 19 Jul 1995 22:14:00 GMT
From:  accuosto at fing.edu.uy (Pablo Accuosto)
Subject:  Sum: Recursos para el espanol (spanish resources)
 
---------------------------------Messages------------------------------------
1)
Date:  Wed, 19 Jul 1995 22:14:00 GMT
From:  accuosto at fing.edu.uy (Pablo Accuosto)
Subject:  Sum: Recursos para el espanol (spanish resources)
 
Aqui envio un resumen de respuestas acerca de recursos linguisticos existentes
para el espanol.
 
Here is a summary of the replies I received about available Spanish resources.
 
Gracias a / Thanks to:
 
Gerardo Arrarte
Fernando Sanchez Leon
Ruthanna Barnett
Alice Carlberger
Rodrigo Santurio
James L. Fidelholtz
Cesar Romani
Joerge Koch
Jose L. Rodrigo
Martin Beaumont Franowsky
Steve Helmreich
Eduardo A. Martinez Labrada
Mon Alameda
Erik Oltmans
 
...and many more
 
- ------------------------------------------------------------------
 
The Instituto Cervantes, a Spanish public body devoted mainly to
promoting the Spanish language and the culture of Spanish-speaking
peoples around the world, carries out various activities intended
to foster research on the Spanish language.

Among other activities in the field of Language Technology, we are
setting up an office whose aim will be to promote the Language
Industries as applied to Spanish.  To that end, we consider it
essential to collect and disseminate information on ongoing
activities and on the linguistic resources available at different
research centres.

So far, we have carried out a survey of Spanish corpora, existing
or under development, at Spanish research centres, and we have
compiled the results of this survey in a 56-page report that I
would be glad to send you.  In the future, we plan to extend this
inventory with data on other types of linguistic resources, as
well as with data from ongoing projects in other countries.
 
.................................................................
: Gerardo Arrarte Carriquiry          :  E-mail:                :
: Programas de Tecnologia Linguistica :  g.arrarte at cervantes.es :
: Instituto Cervantes                 :                         :
: Libreros, 23                        :  Tel:  +34 1 885 62 03  :
: E-28801  ALCALA DE HENARES (Madrid) :  Fax:  +34 1 883 50 10  :
.................................................................
 
 
- ------------------------------------------------------------------
 
 
The ITU corpus is available as part of the ECI (European Corpus
Initiative) corpus, which can be obtained through ELSNET. The address is
as follows:
 
email:   elsnet at let.ruu.nl
mail :   OTS, Trans 10, 3512 JK, Utrecht, The Netherlands
tel  :   +31 30 53 6039
fax  :   +31 30 53 6000
www  :   http://www.cogsci.ed.ac.uk/elsnet/home.html
 
It is a trilingual corpus (Spanish, English, French). The version we are
preparing includes hand-corrected morphosyntactic tagging of 1 million
words of the corpus. This version will be in the public domain from
October of this year.

Likewise, the Spanish version of the Xerox tagger will also be in the
public domain on that date.

In our laboratory we have other corpora, as you will have seen on the
CORPORA list (I include part of an announcement in English):
 
There are some Spanish corpora that you can retrieve from our
laboratory. They are all documented. The corpora can be downloaded from
the following address:
 
Host:   lola.lllf.uam.es
Login:  anonymous
Password: <send your e-mail address>
 
At this moment, we have a corpus of spoken Spanish in orthographic
transcription
 
Directory:      pub/corpus/oral
 
And a corpus of written Spanish texts from Argentina and Chile
 
Directory:      pub/corpus/argentina
                pub/corpus/chile
 
All the corpora include texts in one of the topics you are interested
in. Note that the oral corpus is compressed using the UNIX command
'compress', while the other two are .zip files produced with DOS
compression utilities (take a look at the README files).
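
The retrieval steps above can be scripted. What follows is a minimal sketch in Python using the standard ftplib module. It assumes the host still accepts anonymous logins, which may well no longer be true. The function names, the local "corpora" folder and the idea of choosing an unpacking method from the file extension are illustrative assumptions, not part of the announcement.

```python
from ftplib import FTP, error_perm
from pathlib import Path

HOST = "lola.lllf.uam.es"  # host named in the announcement; may no longer exist
DIRECTORIES = ["pub/corpus/oral", "pub/corpus/argentina", "pub/corpus/chile"]


def unpack_hint(filename):
    """Suggest how to unpack a downloaded file, judging only by its extension."""
    if filename.endswith(".Z"):
        return "uncompress"  # output of the UNIX 'compress' command
    if filename.lower().endswith(".zip"):
        return "unzip"
    return "none"


def fetch_directory(ftp, directory, dest):
    """Download every retrievable file in `directory` into the local folder `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    ftp.cwd("/" + directory)
    saved = []
    for name in ftp.nlst():
        target = dest / Path(name).name
        try:
            with open(target, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
        except error_perm:
            target.unlink()  # e.g. `name` was a subdirectory, not a file
            continue
        saved.append(target)
    return saved


def fetch_all(email):
    """Log in anonymously, giving `email` as the password, and fetch the corpora."""
    with FTP(HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        for directory in DIRECTORIES:
            local = Path("corpora") / Path(directory).name
            for path in fetch_directory(ftp, directory, local):
                print(path, "->", unpack_hint(path.name))
```

Nothing touches the network on import; calling `fetch_all("you@example.org")` would attempt the download. Python's standard library has no reader for 'compress' (.Z) output, so those files would still need an external `uncompress`.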
 
 
Fernando Sanchez Leon
fsanchez at ccuam3.uam.es
 
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
 
NOTE: More information about the XEROX tagger is available from:
 
CONSORTIUM FOR LEXICAL RESEARCH
email: lexical at crl.nmsu.edu
ftp://clr.nmsu.edu
 
Ftp Directory:     members-only/tools/ling-analysis/syntax/xerox-tagger/
 
This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen
at Xerox, was written in ANSI Common Lisp. It was developed in Franz
Allegro Common Lisp version 4.1 on SunOS 4.x and in Macintosh Common
Lisp 2.0p2. The following is provided: source code, a tokenizer for
plain ASCII English, an English lexicon induced from the Brown corpus,
a table mapping word suffixes to likely ambiguity classes, and an HMM
trained on the odd-numbered sentences in the Brown corpus. More Info:
info/XEROX.
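
Since the tagger is described as HMM-based, a toy example may help show how such a model chooses tags. The sketch below is a plain first-order Viterbi decoder in Python. It is not the Xerox code (which is Common Lisp and works with ambiguity classes); the tag set, every probability, and the name `viterbi` are invented purely for illustration.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"el": 0.9},
    "NOUN": {"perro": 0.5, "ladra": 0.1},
    "VERB": {"ladra": 0.6, "perro": 0.05},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability, with a small floor standing in for unseen events.
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["el", "perro", "ladra"], TAGS, START_P, TRANS_P, EMIT_P))
```

With these made-up numbers the decoder prefers DET NOUN VERB for "el perro ladra", because the DET-to-NOUN and NOUN-to-VERB transitions outweigh the alternative readings.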
 
or:

ftp://parcftp.xerox.com/pub/tagger

If you need to install Common Lisp to run it, several good free implementations
can be found at
http://www.cs.rochester.edu/users/staff/miller/alu.html.
 
 
- --------------------------------------------------------------------
 
 
European Corpus Initiative corpora available on CD-ROM:
 
ECI1/MUL06/MSP06/SPA16A:
Information technology, EU, 26,000 words
 
ECI1/SPA02A-J:
El Diario Sur, a local newspaper from Malaga that belongs to a national
publisher and has been in existence for 40 years.
Different writing styles, 500,000 words.
 
ECI2/MUL04/MSP04A-J:
Telecommunication user manual, several hundred thousand words.
 
ECI2/MUL09/SPA19A:
Xerox ScanWorx user manual, 45,000 words.
 
ECI2/MUL12/MSP12/MSP12A-C:
Civil law, Switzerland, 600,000 words.
 
ECI4/SPA03:
Minimally processed by ECI; contains errors and duplication, but the CLEAN and
FC files are clean(?)

El Diario Vasco, newspaper
CLEAN files, news, few errors, 300,000 words
FC files, 177,000 words
 
 
The national newspaper ABC has just released a CD-ROM with last year's literary
supplement, which can be purchased for under $50. It holds over 4 million words
of clean, high-quality written text.
 
Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley
 
The EU MULTEXT project is collecting a corpus that will contain parallel texts
from the European Parliament and financial newspaper articles (the Spanish ones
from the newspaper Expansion). Licence agreements for these data are still
being finalized.
 
The RELATOR language resources server supports distribution of NLP resources.
Currently available through RELATOR are speech and text corpora, lexicons, NLP
programs and tools, and related databases and systems.
 
ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator
 
Multilingual Web pages: http://www.XX.relator.research.ec.org (XX = two-letter
country codes of the EU countries such as de, uk, etc.) Only speech materials.
 
Alice Carlberger
alice at speech.kth.se
 
- --------------------------------------------------------------------
 
We have been working on a Spanish to English Machine Translation
system and so have access to a large corpus of Spanish text and have
developed a tagger for general newspaper articles.  Although the
tagger uses proprietary information (Collins Spanish-English on-line
dictionary), we will shortly make the results available on-line.  That
is, you will be able to e-mail Spanish texts and they will be returned
tagged with part of speech.
 
Steve Helmreich
shelmrei at crl.nmsu.edu
 
- --------------------------------------------------------------------
 
HELLO;
I AM THE CO-AUTHOR OF A FREQUENCY DICTIONARY OF SPANISH.
...
MON ALAMEDA
CMSFI52 at vmesa.cpd.uniovi.es
 
- --------------------------------------------------------------------
 
You may find the list Terminometro electronico en espanhol useful.

The list address is LATIN-TE at FRMOP11.CNUSC.FR
The list server is LISTSERV at FRMOP11.CNUSC.FR
 
Martin Beaumont Franowsky
BEAUMONT at DESCO.ORG.PE
 
- --------------------------------------------------------------------
 
El Colegio de México has long been working on the Diccionario del
español de México, a project whose principal investigator is Luis
Fernando Lara.  He has an Internet account, but I do not have it to hand,
so here is his snail-mail address:
        Dr. Luis Fernando Lara
        DEM
        El Colegio de México
        Camino al Ajusco
        México, D. F.
        MÉXICO.
They have produced frequency counts from a corpus of approximately 2
million words (if I remember correctly), and they have a program that
assigns words to their part of speech.
 
James L. Fidelholtz
jfidel at udlapvms.pue.udlap.mx
jfidel at unm.edu
 
- --------------------------------------------------------------------
 
We work with large language corpora, and we have built tools for
extracting linguistic information:

- REAL, a program for automatic search and extraction of lemmas with their context
- SMORPH, a program for segmentation and morphological tagging of lemmas.
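
A search "with context" of the kind mentioned for REAL is commonly shown as a keyword-in-context (KWIC) listing. The following Python sketch illustrates the idea only. It is not REAL: it matches surface word forms rather than lemmas, and the function name `kwic`, the tokenization rule and the window size are assumptions made for the example.

```python
import re


def kwic(text, keyword, window=3):
    """Return (left, hit, right) context triples for each occurrence of `keyword`."""
    # \w+ is Unicode-aware for str patterns, so accented letters stay inside words.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


sample = "La lengua española es una lengua romance."
for left, hit, right in kwic(sample, "lengua", window=2):
    print(f"{left:>10} [{hit}] {right}")
```

A lemma-based search would first map each token to its lemma (for instance with a morphological analyser) and compare lemmas instead of surface forms.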
 
Jose L. Rodrigo
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
jose at gril.univ-bpclermont.fr
GRIL : GROUPE DE RECHERCHE DANS LES INDUSTRIES DE LA LANGUE
UNIVERSITE BLAISE PASCAL - CLERMONT II
34 Av. Carnot, F - 63037 Clermont-Ferrand Cedex
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rodrigo at eucmax.sim.ucm.es
Facultad de Filologia
Universidad Complutense de Madrid
 
- --------------------------------------------------------------------
 
You might want to check out the AGFL Grammar WorkLab which
also contains a small grammar for the Spanish Noun Phrase.
The author, Paula Maria Santalla, can be contacted through
paula at cs.kun.nl. The URL of the AGFL home page is:
 
http://www.cs.kun.nl/agfl/
 
Erik Oltmans
Department of Computer Science
University of Nijmegen
Nijmegen, The Netherlands
http://www.cs.kun.nl/agfl/eriko
 
- --------------------------------------------------------------------
 
The Autonomous University of Nuevo Leon College of Medicine,
Monterrey, Mexico, and California State University at
Fullerton (CSUF) make available "Spanish 92" (the 2,000 most
frequent words of Spanish), based on ESPA~NOL 92 (E92), a
computational linguistic analysis of a million-word corpus of
contemporary Spanish carried out between 1986 and 1992 under a
grant from the Secretariat of Public Education of the Mexican
government.
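
A frequency list such as "Spanish 92" comes down to counting word forms and keeping the top ranks. Here is a minimal Python sketch, assuming a plain-text corpus and a naive letters-only tokenization. The function name `top_words` is hypothetical, and this is not the procedure actually used for E92.

```python
from collections import Counter
import re


def top_words(text, n=2000):
    """Return the `n` most frequent lower-cased word forms with their counts."""
    # [^\W\d_]+ matches runs of letters only (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(n)


print(top_words("El perro y el gato y el loro", n=2))
```

On a real corpus the text would be read from files, and decisions such as whether to merge inflected forms under one lemma would change the ranking considerably.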
 
 
"Spanish 92" is available from the ftp server at CSUF:
 
 
ftp wintermute.fullerton.edu
user> anonymous
  pw> username at host.domain
 FTP> cd /pub/research/chandler
 
 
Prof. R. M. Chandler-Burns
College of Medicine
Autonomous University of Nuevo Leon
Monterrey, MEXICO
 
Forwarded by:
 
 
Gabriel Amores
Departamento de Lengua Inglesa
Universidad de Sevilla
 
NOTE:

Prof. Chandler-Burns's address is rchandlr at ccr.dsi.uanl.mx
 
- --------------------------------------------------------------------
 
CONSORTIUM FOR LEXICAL RESEARCH
email: lexical at crl.nmsu.edu
ftp://clr.nmsu.edu
 
 
Parallel Text in English and Spanish
Pan American Health Organization
 
Ftp Directory: members-only/corpora/PAHO/
 
The Pan American Health Organization (PAHO), Conferences and General
Services Division, has kindly allowed this group of sample parallel
texts to be released for NLP research purposes.  There are 180 pairs
of texts, 360 individual files, which amount to about 8 MB of data.
The documents cover the general domains of Public Health and Latin
America, but vary greatly in content and in length.  Some are short
memos or letters; most are longer reports and conference proceedings.
The Spanish documents keep their Spanish character encoding.
Other formatting commands, such as tabs, centering, italicizing, etc.
have been removed.  Special thanks to Dr. Marjorie Leon for her
assistance in making these texts available.
 
- --------------------------------------------------------------------
 
                The PAPPI System: A Principle-Based Parser
 
 
        Announcing the first public release of PAPPI, a Prolog-based
        natural language parser for theories in the Principles-and-
        Parameters framework. PAPPI is designed to run on Sun
        Sparcstations with Quintus Prolog. The PAPPI system includes:
 
        * An X-Window system-based user interface to the
          underlying Prolog-based parser.
 
        * A sample implementation of classic GB-theory, based
          on theory described in Lasnik and Uriagereka's textbook
          "A Course in GB Syntax". The implementation also includes
          sets of example sentences and sample parameterization for
          six languages. Currently, these are English, Japanese,
          Dutch, French, Spanish and German. (This software was
          recently demoed at COLING '94.)
 
        PAPPI is a parser that is designed to be a high-level research
        tool for experimenting with and learning about linguistic
        theory. This release represents just one possible instantiation
        within the Principles-and-Parameters framework. Users are
        encouraged to experiment with and modify the sample principles.
 
        The PAPPI system represents code written to support research
        work. It is still very much under development.  Alternate
        theories (and more sophisticated parsing models) will be made
        publicly available at a later stage. Upcoming releases may
        also support other platforms and may not need Quintus Prolog.
 
        This is free software developed at the NEC Research Institute,
        Inc., an institute for conducting long-term, fundamental
        research in computer and physical sciences. Comments and
        suggestions for improvement to the system will be gratefully
        accepted! I would also like to hear from those interested in
        extending the system. The PAPPI project also welcomes unencumbered
        software contributions, including (but not limited to) support
        for additional languages, theory and debugging tools.
 
        The system is available for anonymous ftp as:
 
                external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.Z
 
        [Note: X is an alphabetic character denoting the current
         minor release.]
 
        A .gz compressed version of the same tar file is also
        available as:
 
                external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.gz
 
        This version is recommended for those installations
        having GNU gzip.
 
        Current requirements:
 
                Sun Sparcstation
                SunOS 4.1.3 or 5.3 (aka Solaris 2.3)
                Quintus Prolog 3.1.4 or 3.1.1 (June 1992)
                Approx. 35MB of disk space (55-70MB to install)
 
        Contact address:
 
                Dr. Sandiway Fong
                NEC Research Institute, Inc.
                Princeton NJ 08540
                USA
                Email: sandiway at research.nj.nec.com
                Fax: (609) 951-2482
 
- --------------------------------------------------------------------
 
Cualquier otra informacion sobre recursos para el espanol, por
favor envienla a mi direccion de e-mail (no voy a estar suscrito
a la lista).
 
Please send any other information about Spanish resources to
my e-mail address (I will no longer be subscribed to the list).
 
Muchas gracias !!
Thank you very much !!
 
Pablo Accuosto
Facultad de Ingenieria
Universidad de la Republica
Montevideo - Uruguay
 
e-mail: accuosto at fing.edu.uy
 
 
------------------------------------------------------------------------
LINGUIST List: Vol-6-990.


