[Corpora-List] Searching for NE annotated portuguese corpora...

sandra at icmc.usp.br sandra at icmc.usp.br
Wed Sep 8 11:21:29 UTC 2004


Thamar,

Have a look at the Lácio-Web Project:
http://www.nilc.icmc.usp.br/lacioweb/english/index.htm

where you can download the MAC-MORPHO corpus and use the tools associated
with it. This may be of some use to you, as MAC-MORPHO contains
1,167,183 words of journalistic text extracted from ten sections of the daily
newspaper Folha de São Paulo (1994), and the tagset
(http://www.nilc.icmc.usp.br/lacioweb/english/ConjEtiquetas.htm) uses
additional tags besides the traditional POS ones.

There is some more information about it below. I hope this helps.

Sandra Aluísio
NILC - University of São Paulo
http://www.nilc.icmc.usp.br/nilc/index.html

------
MAC-MORPHO is available for download in two versions:

1) A version for linguistic research, for instance with frequency counters and
concordancers. This format preserves all tags included in MAC-MORPHO's Tagging
Manual. Some files also contain XML tags for filename, title, subtitle,
paragraph, and sentence, which were generated by the “Palavras” parser. You may
also download this version by separate sections or by individual texts.

2) A version suitable for training taggers. This version does not contain the
tags that indicate that the material has not been tagged (<NA> ...</NA>); it
does not contain the XML tags for filename, title, subtitle, paragraph, and
sentence, which were generated by the “Palavras” parser; it does not contain
complementary tags for foreign words (EST), apposition (AP), data (DAD),
telephone number (TEL), date (DAT) and time (HOR). Multiwords are separated; for
example:

a) the proper name, which in the research format is shown as
“Rio=de=Janeiro_NPROP” has been separated into three parts, one in each line,
with the same tags: “Rio_NPROP de_NPROP Janeiro_NPROP”;

b) the prepositional phrase, which in the research format is shown as
“apesar=de_PREP” has been separated into two parts, one in each line, with the
same tags: “apesar_PREP de_PREP”.


These changes have increased the size of the corpus to 1,221,468 words.
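
The multiword separation described in a) and b) can be sketched in a few lines
of Python. This is only an illustration, not the script used to build the
corpus: the function name split_multiword is hypothetical, and it assumes that
"=" joins the parts of a multiword and that the last "_" separates the word
from its tag, as in the two examples above.

```python
def split_multiword(token):
    """Split a research-format token such as 'Rio=de=Janeiro_NPROP'
    into one token per part, each carrying the same tag."""
    # Assumption: the tag follows the last underscore in the token.
    word, sep, tag = token.rpartition("_")
    if not sep:
        return [token]  # no tag found, leave the token untouched
    # Assumption: '=' only occurs as the multiword joiner.
    parts = [p for p in word.split("=") if p]
    if len(parts) <= 1:
        return [token]  # not a multiword
    return [f"{part}_{tag}" for part in parts]


if __name__ == "__main__":
    for tok in ["Rio=de=Janeiro_NPROP", "apesar=de_PREP"]:
        # One part per line, as in the training version.
        print("\n".join(split_multiword(tok)))
```

Since each multiword becomes several tokens, a conversion of this kind increases
the token count, which is consistent with the larger size reported above.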

------------


Quoting Thamar Solorio <thamy at inaoep.mx>:

> Hi!
> I've been searching for Portuguese corpora annotated with Named
> Entities. So far I've only found raw corpora and portals to Portuguese
> analyzers such as the one from the VISL project, but it is only for
> online use and it does not provide NE classification.
> So, if anyone knows of an available Portuguese corpus tagged with NE,
> I'd appreciate it if you let me know.
>
> Thanks!
>
> Thamar Solorio
> Coord. Ciencias Computacionales
> Instituto Nacional de Astrofísica, Óptica y Electrónica
> Luis Enrique Erro #1, Tonantzintla, Puebla
> México
>
> http://ccc.inaoep.mx/~thamy
>
>
>
>


