[Corpora-List] Spanish corpus
Steven Bird
sb at csse.unimelb.edu.au
Wed Oct 31 20:48:54 UTC 2007
On 11/1/07, Mario Crespo Miguel <mario.crespo at uca.es> wrote:
> Dear all,
>
> I wonder if anyone on the list knows of a syntactically tagged
> corpus of Spanish that is available for research purposes.
> Thank you very much in advance,
NLTK includes the CESS-ESP Treebank, with 6030 parsed sentences,
distributed with permission of Dr Toni Martí at the University of Barcelona.
For details, please see:
http://nltk.svn.sourceforge.net/viewvc/*checkout*/nltk/trunk/nltk/data/corpora/cess_esp/README
NLTK includes a corpus reader with methods for iterating over the
words, tagged words, sentences, and parsed sentences of the corpus,
e.g.:
>>> import nltk
>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.cess_esp.sents()
[['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', '-Fpa-',
'EDF', '-Fpt-', 'anunci\xf3', 'hoy', ',', 'jueves', ',', 'la',
'compra', 'del', '51_por_ciento', 'de', 'la', 'empresa', 'mexicana',
'Electricidad_\xc1guila_de_Altamira', '-Fpa-', 'EAA', '-Fpt-', ',',
'creada', 'por', 'el', 'japon\xe9s', 'Mitsubishi_Corporation', 'para',
'poner_en_marcha', 'una', 'central', 'de', 'gas', 'de', '495',
'megavatios', '.'], ['Una', 'portavoz', 'de', 'EDF', 'explic\xf3',
'a', 'EFE', 'que', 'el', 'proyecto', 'para', 'la', 'construcci\xf3n',
'de', 'Altamira_2', ',', 'al', 'norte', 'de', 'Tampico', ',',
'prev\xe9', 'la', 'utilizaci\xf3n', 'de', 'gas', 'natural', 'como',
'combustible', 'principal', 'en', 'una', 'central', 'de', 'ciclo',
'combinado', 'que', 'debe', 'empezar', 'a', 'funcionar', 'en',
'mayo_del_2002', '.'], ...]
>>> print(nltk.corpus.cess_esp.parsed_sents()[0])
(S
  (sn-SUJ
    (espec.ms (da0ms0 El))
    (grup.nom.ms
      (ncms000 grupo)
      (s.a.ms (grup.a.ms (aq0cs0 estatal)))
      (sn
        (grup.nom.ms
          (np00000 Electricité_de_France)
          (sn (grup.nom.ms (Fpa -Fpa-) (np00000 EDF) (Fpt -Fpt-)))))))
  (grup.verb (vmis3s0 anunció))
  (sadv-CCT
    (grup.adv (rg hoy) (sn (Fc ,) (grup.nom.ms (W jueves)) (Fc ,))))
  (sn-CD
    (espec.fs (da0fs0 la))
    (grup.nom.fs
      (ncfs000 compra)
      (sp
        (prep (spcms del))
        (sn
          (grup.nom.ms
            (Zp 51_por_ciento)
            (sp
              (prep (sps00 de))
              (sn
                (espec.fs (da0fs0 la))
                (grup.nom.fs
                  (ncfs000 empresa)
                  (s.a.fs (grup.a.fs (aq0fs0 mexicana)))
                  (sn
                    (grup.nom.fs
                      (np00000 Electricidad_Águila_de_Altamira)
                      (sn
                        (grup.nom.fs
                          (Fpa -Fpa-)
                          (np00000 EAA)
                          (Fpt -Fpt-)))))
                  (S.NF.P
                    (Fc ,)
                    (participi (aq0fsp creada))
                    (sp-CAG
                      (prep (sps00 por))
                      (sn
                        (espec.ms (da0ms0 el))
                        (grup.nom.ms
                          (s.a.ms (grup.a.ms (aq0ms0 japonés)))
                          (np00000 Mitsubishi_Corporation))))
                    (sp-CC
                      (prep (sps00 para))
                      (S.NF.C
                        (infinitiu (vmn0000 poner_en_marcha))
                        (sn-CD
                          (espec.fs (di0fs0 una))
                          (grup.nom.fs
                            (ncfs000 central)
                            (sp
                              (prep (sps00 de))
                              (sn
                                (grup.nom.ms
                                  (ncms000 gas)
                                  (sp
                                    (prep (sps00 de))
                                    (sn
                                      (espec.mp (Z 495))
                                      (grup.nom.mp
                                        (ncmp000 megavatios))))))))))))))))))))
  (Fp .))
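To make the bracketed notation above easier to read, here is a minimal
sketch in plain Python (it does not use NLTK) that turns such a string
into nested tuples and collects the (word, tag) pairs at the leaves.
The helper names parse_bracketed and tagged_leaves are hypothetical,
and FRAGMENT is a shortened piece of the subject noun phrase from the
tree above, not the full sentence. The sketch assumes that words never
contain literal parentheses, which holds here because the corpus writes
them as -Fpa- and -Fpt-.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    # Pad the parentheses with spaces so that split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # first token after "(" is the node label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    return read_node()


def tagged_leaves(node):
    """Return (word, tag) pairs, where tag is the label directly above a word."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(tagged_leaves(child))
    return pairs


# Shortened fragment of the subject noun phrase shown above (illustrative only).
FRAGMENT = (
    "(sn-SUJ (espec.ms (da0ms0 El)) "
    "(grup.nom.ms (ncms000 grupo) (s.a.ms (grup.a.ms (aq0cs0 estatal)))))"
)

tree = parse_bracketed(FRAGMENT)
pairs = tagged_leaves(tree)
print(pairs)
# [('El', 'da0ms0'), ('grupo', 'ncms000'), ('estatal', 'aq0cs0')]
```

In practice the corpus reader's own methods for tagged words and
parsed sentences should be preferred; the sketch is only meant to show
how the labels and words in the printed tree relate to each other.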
To download NLTK, please visit http://nltk.org/index.php
Steven Bird
http://www.csse.unimelb.edu.au/~sb/
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora