6.977, Sum: Spanish corpora
The Linguist List
linguist at tam2000.tamu.edu
Mon Jul 17 18:29:37 UTC 1995
---------------------------------------------------------------------------
LINGUIST List: Vol-6-977. Mon Jul 17 1995. ISSN: 1068-4875. Lines: 188
Subject: 6.977, Sum: Spanish corpora
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
Assoc. Editor: Ljuba Veselinova <lveselin at emunix.emich.edu>
Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
Ann Dizdar <dizdar at tam2000.tamu.edu>
Annemarie Valdez <avaldez at emunix.emich.edu>
Editor for this issue: dizdar at tam2000.tamu.edu (Ann Dizdar)
---------------------------------Directory-----------------------------------
1)
Date: Fri, 14 Jul 1995 11:48:49 BST
From: albert at incyta.es (Albert Llorens)
Subject: summary
---------------------------------Messages------------------------------------
Dear all,
I am sending you a summary of the answers I received to my query on Spanish
corpora. My apologies for the repetitions: I haven't had the time to really
"summarize".
Ta.
Yours,
Albert Llorens
Spanish-English Development Group
Incyta, S.A.
c. Lluis Muntadas 5
08940 Cornella de Llobregat
Barcelona
Spain
e-mail: albert at incyta.es
___________________________________________________________________________
There's a CD-ROM published by the European Corpus Initiative which includes
a number of texts in several European languages. Among others, it includes
EEC law in Spanish, English and Portuguese, and a Xerox manual in English
and Spanish.
A somewhat more detailed account of the contents of this CD-ROM follows:
EUROPEAN CORPUS INITIATIVE CORPORA AVAILABLE ON CD-ROM:
ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs
to national publisher, in existence for 40 years. Different writing
styles, 500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several hundred
thousand words.
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words.
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words.
ECI4/SPA03: El Diario Vasco, newspaper. Minimally processed by ECI;
contains errors and duplication, but the CLEAN and FC files seem to be
clean.
CLEAN files: news, few errors, 300,000 words
FC files: 177,000 words
___________________________________________________________________________
Apart from the ECI CD-ROM there are the following corpora available:
ftp lola.lllf.uam.es /pub/corpus/argentina 2 million words
/pub/corpus/chile 2 million words
Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:
The CRATER Project: the ITU corpus is in the process of postediting.
This trilingual (French/English/Spanish) corpus has more than 3 million
words and consists of the so-called "White Book on Telecommunications"
released by the International Telecommunication Union. Fernando et al.
are working with a 1-million-word subcorpus, which will also be
postedited. This corpus, along with the tagger developed for it
and all the resources associated with the tagger,
will be in the public domain in October 1995. There is a lexicon of
over 35,000 words (full forms, not lemmas), part-of-speech annotated, that
can be used as a starting point in lexicon-building tasks.
The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. It contains over
4 million words of clean, high-quality written text.
Archivo Digital de Manuscritos y Textos Españoles available on
CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of
California, Berkeley.
The EU MULTEXT Project is collecting a corpus which will contain
parallel texts from the European Parliament and financial newspaper
articles (Spanish from the Expansion newspaper). Licence agreements
for these data are still being finalized.
The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR are speech and text
corpora, lexicons, NLP programs and tools, and related databases and
systems.
ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator
Multilingual Web pages: http://www.XX.relator.research.ec.org
(XX = two-letter country codes of the EU countries such as de, uk,
etc.) Only speech materials.
A paper by Briscoe et al. reports a 17,000-word tagged corpus. (This is all
the info I have on this paper.)
ftp://parcftp.xerox.com/pub/tagger
Spanish tagger, implemented in Common Lisp. Comes with documentation,
works very well. If you need to install Common Lisp to run it, several
good free implementations are listed at
http://www.cs.rochester.edu/users/staff/miller/alu.html.
___________________________________________________________________________
A last report.
> 1. /pub/corpus/: a. Oral corpus of Spanish (7 MB, about 2,000,000 words)
> b. Some written corpora of South American Spanish
>
> 2. The LDC is the best source, but joining costs money.
>
> 3. The Oxford Text Archive
> 13 Banbury Road
> Oxford OX2 6NN
> fax: +44 865 273275
>
> Catalogue of over 1300 titles, available in paper
> or electronic form on the Oxford VAX Cluster as OX$DOC:TEXTARCHIVE.LIST and
> OX$TEXTARCHIVE.SGML, from various ListServers, e.g., LISTSERV at BROWNVM (send
> the mail message GET HUMANIST FILELIST for details), by anonymous FTP from
> Internet site ota.ox.ac.uk (163.1.2.4) in the directory pub/ota/public.
> Also, wherever you are, you can send a note to ARCHIVE at VAX.OXFORD.AC.UK
> specifying which form you want.
>
> Spanish
>
> a. Literary works, poems.
>
> 4. 1,066,108 words (approx.)
> Origin: Grupo EUROTRA, Universidad Autonoma de Madrid
> Contact: Manuel Campos, eurotrac at ccuam3.sdi.uam.es or
> Fernando Sanchez Leon, Laboratorio de L
> Available: Publicly via anonymous ftp, node lola.lllf.uam.es,
> directory pub/corpus
> Contents: transcriptions of spoken language (conferences, conversations, etc.)
>
> 5. 121,051 words (approx.)
> Origin: CHILDES (Child Language Data Exchange System) database, Carnegie
> Mellon Univ.
> Contact: Brian MacWhinney, brian at andrew.cmu.edu
> Available: Publicly, after prior communication with Brian MacWhinney
> Contents: Database of corpora of parent-child and child-child interactions
> from children speaking.
>
> 6. 9,000,000 words (approx.)
> Origin: This is the European Corpus Initiative Multilingual Corpus I CD-ROM
> Cost: 20 Pounds
> Contact: eucorp at cogsci.ed.ac.uk
> Available: All use of this corpus is subject to a licence agreement
> The CD-ROM is available in the US from the Linguistic Data Consortium (LDC),
> for members of the LDC or those making a bulk purchase, and otherwise from
> ELSNET, 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND. The cost from ELSNET
> is 20 UK Pounds plus postage, handling and tax where applicable. Ordering
> procedure is detailed in
>
> http://www.cogsci.ed.ac.uk/elsnet/eci.html
>
> 7. University of Barcelona: spoken corpus
___________________________________________________________________________
------------------------------------------------------------------------
LINGUIST List: Vol-6-977.