Corpora: Summary: Catalan & Galician corpora

Alex Franz alex at google.com
Fri May 18 23:06:37 UTC 2001


Thank you very much to the following people for their extremely
helpful responses to my query about Catalan and Galician corpora:

Ines Diz
Linda Oxnard
Shlomo Izre'el
Joan Soler i Bou
Claus Pusch
Teresa Cabre
Mary Taffet
Lluis de Yzaguirre
Jorge Vivaldi
Lidia Lluis
Araceli Alonso

Below is a summary of the information that I received.

--Alex

---
The "Ramón Piñeiro Centre for Humanities Research" is developing a
Galician corpus.

You can find information at:

http://www.cirp.es/WXN/wxn/frames/proxectos.html
---
I know of three universities doing corpus research in Catalonia:
Universitat de Barcelona (UB), Universitat Pompeu Fabra (UPF) and
Universitat Politecnica de Catalunya (UPC).

You might want to take a look at the multilingual corpus which the UPF have
put together (texts in Catalan, Spanish, French, English and German) in
specialised areas (law, environment, medicine, economy and IT).

http://www.iula.upf.es/corpus/corpus.htm

Take a look at http://www.iula.upf.es/corpus/eticca.htm for the tools and
demos which they have put on line.

At http://nipadio.lsi.upc.es/cgi-bin/demo/demo.pl you will find some demos
for corpus tools in Catalan, Spanish and English which the UPC have put on
line.

Also, the UB are working on a number of corpora (including an oral one of
colloquial Catalan and a written one of contemporary Catalan).  I'm not
sure of the exact URL and their server seems to be down at the moment, but
the group is called Lincat and they are at www.ub.es.

Finally, you might be interested to know that there is an automatic
language identifier which includes Catalan at
http://odur.let.rug.nl/~vannoord/TextCat/Demo/
I have used this with reasonable success to do focused web crawling for
Catalan pages.
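
For readers who want to see how such an identifier can work, here is a
minimal sketch in Python. It is not TextCat itself. It assumes the
character n-gram ranking approach (the "out-of-place" measure of Cavnar
and Trenkle) that TextCat is generally described as implementing. The
function names and the three one-sentence training samples are
illustrative assumptions only; a usable identifier needs far larger
samples per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Map the most frequent character n-grams of `text` to their rank."""
    # Normalise whitespace and pad so that word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (assumption: far too small for real use).
SAMPLES = {
    "ca": "Aquest és un text d'exemple escrit en llengua catalana "
          "per provar l'identificador.",
    "es": "Este es un texto de ejemplo escrito en lengua española "
          "para probar el identificador.",
    "en": "This is a sample text written in the English language "
          "to test the identifier.",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

In a focused crawl, one would call identify() on the text of each
fetched page and keep the page, and follow its links, only when the
result is "ca".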
---
Take a look at http://www.uni-tuebingen.de/romanistik/zfk/oller.html
---
The Institut d'Estudis Catalans (IEC) has developed a corpus of
contemporary Catalan of 52 million words. The corpus was conceived
specifically as a reference corpus for dictionary-making, and this is in
fact how the IEC uses it internally. The corpus is also accessible via the
internet at http://pdl.iec.es/. At the same address you can find
documentation on the corpus structure. Keep in mind that the corpus browser
limits the results to 100 instances per query. One of the first results of
this corpus has been the publication of a frequency dictionary of Catalan,
which provides, on paper and in electronic form, the statistical
information obtained from the lemmas of the corpus.

The IEC also has the PAROLE Catalan corpus, a 21-million-word corpus
developed within the PAROLE project, financed by the European Commission.
---
I suppose that you are looking for
contemporary language samples and not for historical texts.

I am more familiar with resources for Catalan, so I will concentrate on
this language.

As far as web-based written texts (for non-linguistic purposes) are
concerned, the best platform to start from is <http://www.lincaweb.es>
(or maybe it's <http://www.lincaweb.com>). This gives thousands of links
to Catalan web sites, official, commercial and private ones.

Among the resources you might find particularly useful are the online
versions of Catalan newspapers and weekly magazines. These are also
accessible from <http://www.lincaweb.es>, but here are some direct links
(the newspapers come from different parts of Catalan-speaking Spain,
most from Catalonia, one from the Valencian Land, one from the Balearic
Islands):

El Periódico:           <http://www.elperiodico.com>
Avui:                   <http://www.avui.com>
Diari de Tarragona:     <http://www.diaridetarragona.com>
Diari de Girona:        <http://www.diaridegirona.es>
Diari El Segre:         <http://www.diarisegre.com>
Diari de Balears:       <http://www.diaridebalears.com>
Regió7:                 <http://www.regio7.com>
El Temps:               <http://www.eltemps.com>

A good starting point for collecting a web corpus may also be the
Catalan news portal Partal at <http://www.partal.com> (or, again, it may
be <http://www.partal.es>). They now have two versions, one for
Catalonia, the other for the Land of València.
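
As a rough illustration of how a portal page can be turned into a seed
list for such a web corpus, the sketch below collects the external
hosts that a page links to, using only the Python standard library. The
class and function names and the inline sample page are hypothetical.
Fetching the real portal page, and respecting its terms of use, is left
out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def external_sites(html, base_url):
    """Return the distinct external hosts linked from `html`, in order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    hosts = []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto: and similar links
        if parsed.netloc != base_host and parsed.netloc not in hosts:
            hosts.append(parsed.netloc)
    return hosts


# Hypothetical portal page, for illustration only.
PAGE = (
    '<a href="/about.html">About</a> '
    '<a href="http://www.avui.com/">Avui</a> '
    '<a href="http://www.eltemps.com/index.html">El Temps</a> '
    '<a href="http://www.avui.com/cultura">Cultura</a> '
    '<a href="mailto:info@example.org">Mail</a>'
)
SEEDS = external_sites(PAGE, "http://www.partal.com/")
```

Each host in SEEDS can then serve as a starting point for a crawl of
that site.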

As far as official texts are concerned, a sample of the Official
Bulletins of the Catalan regional government (Diari Oficial de la
Generalitat de Catalunya) is available on-line (<http://www.gencat.es>,
then choose link _DOGC_). All the issues of this Bulletin, containing
mainly legal texts, are also available for purchase on CD-ROM. See
<http://www.gencat.es/nov_edit/> (actually, this is the page for new
books published by the Government's publication service, but there will
be a link back to the catalogue). If I remember correctly, there are also
CD-ROMs available with the minutes of the Parliament sessions, but I have
not seen these yet, and furthermore these CDs are quite costly.

Now for academic corpora:

A couple of years ago, the Institut d'Estudis Catalans published a
huge two-volume frequency dictionary for literary and non-literary
language (one volume each), both mainly based on written sources. These
books come with a CD-ROM, but it contains only the frequency lists in
a form that can be consulted in different ways, alas not the text
resources the lists are based on. But perhaps these are available or
accessible directly at the Institut d'Estudis Catalans. You might go to
<http://www.iec.es> and look for a mail link.

The University of Barcelona (<http://www.ub.es>) is working on an
excellent, but hitherto unpublished reference corpus for Catalan,
"Corpus de Català Contemporani de la Universitat de Barcelona",
including written, semi-spontaneous and spontaneous oral texts.
This corpus should be available on CD-ROM or on the web soon. You can
read an introductory text to the oral corpus in the online version of
the 'Zeitschrift für Katalanistik' at:
<http://www.uni-tuebingen.de/romanistik/zfk.html> (then choose issue 13,
Article "El COC del CUB"). You might also contact Emili Boix
<boix at lincat.ub.es> or Núria Alturo <alturo at lincat.ub.es> who are both
working on this corpus.

The University Pompeu Fabra of Barcelona (<http://www.upf.es>) and the
Catalunya Ràdio station are working on a hypermedia compilation of oral
Catalan texts. Its primary goal will be the pronunciation training of
radio speakers, but it should also be made available for research and
teaching purposes through the web. For this project, called
DOPO, you might contact Oriol Camps of Catalunya Ràdio
<ocamps.n at catradio.com> or Lluís de Yzaguirre of UPF <de_yza at upf.es>.

The Institut d'Estudis Catalans, again, collected oral texts in the
1960s and 1970s for a language atlas project, "Atles Lingüístic del
Domini Català", which are now being processed electronically. CD-ROM
versions of both the recordings and their transcriptions have been
announced,
but up to now only a selection of transcripts has been published in book
form, together with an audio cassette containing the recordings. Please
contact Mar Massanell of the Universitat Oberta de Catalunya
<mmassanell at campus.uoc.es> to see when / if / where the electronic
version of this (dialect) corpus is available.

I know of some more spoken language corpora published in printed form,
sometimes with the recordings on audio tape, but I do not know if this
is of interest to you. As far as Galician is concerned, it might be
useful to ask my colleague Johannes Kabatek at Tübingen University, who
has published a corpus of spoken Galician (he will certainly give it to
you in electronic form) and who is very well informed about Galician
research and corpus projects; his mail is <kabatek at uni-tuebingen.de>.
---

The most important Catalan corpus (60 million words) is online at

http://www.iec.es:120/

You will find Catalan newspapers under TACTWEB at

http://www.iula.upf.es/altres/evt/CECA1.htm

(5 million words in 125 single-day files).
---
You will find information about an LSP corpus as well as NLP tools for
Catalan at the following URL: http://www.iula.upf.es/corpus/corpusuk.htm
This corpus is being compiled at the Institute for Applied Linguistics at
Pompeu Fabra University in Barcelona.
---
Here are some addresses where you can find information about
Galician corpora, whether in Galician or in Galician-Portuguese.

http://www.uvigo.es/webs/h06/weba573/persoal/henr/recurs/bibl1.htm

CORPUS DE REFERENCIA DO GALEGO ACTUAL
http://www.cirp.es/WXD/wxd/prox/prxCorg1996.html

Corpus documentale latinum Gallaeciae
http://www.cirp.es/WXD/wxd/prox/prxCLat1998.html
---


