[Corpora-List] Lácio-Web Project --- First release

Sandra Maria Aluísio sandra at icmc.usp.br
Tue Jan 20 20:25:16 UTC 2004


Dear colleagues,


We are pleased to announce the first release of the Lácio-Web website, which provides corpora of Brazilian Portuguese and software tools for computational linguistic processing. 

Six corpora will be available by the end of the Lácio-Web Project in May 2004. This first release makes two of them available: a version of Lácio-Ref, which can be searched and used to generate subcorpora, and MAC-Morpho, which can be downloaded. To obtain the first public release, please visit the webpage at 

http://www.nilc.icmc.usp.br/lacioweb


Further details of the two corpora being released are given below. General information is available on the webpage above.

 

Lácio-Ref 

This version of the reference corpus has 4,156,816 words. It comprises texts from five genres (news, scientific, prose, poetry and drama), several text types (such as reports, papers, chronicles and letters), various domains (such as education, engineering and politics) and different media (magazines, Internet pages and books). Lácio-Ref is available for research, and the subcorpora generated from it can be downloaded in two formats: one with XML headers containing bibliographic data, and another with the title, subtitles, authorship and plain text. 

MAC-Morpho 

MAC-Morpho has 1,167,183 words from the 1994 issues of the newspaper Folha de São Paulo. It was tagged with the Palavras parser by Eckhard Bick (http://visl.hum.sdu.dk), and the tags were mapped to the tagset of the Lácio-Web project. The morphosyntactic tags have been manually revised. MAC-Morpho is available for download in two formats: 

1) for linguistic research, for example with frequency counters and concordancers. 


2) for training taggers, as it allows the tagset to be altered. For instance, some subspecification of the tags has been removed and multiword items have been separated. These changes increased the size of the corpus to 1,221,468 words. 
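As a rough illustration of why separating multiword items raises the word count, here is a minimal Python sketch. The underscore-joined token representation, the sample tokens and the function name split_multiword are assumptions made for this example only; they are not the documented MAC-Morpho format or any Lácio-Web tool. The two corpus sizes are the figures reported above.

```python
# Hypothetical sketch: the "_"-joined representation of multiword items is an
# assumption for illustration, not the actual MAC-Morpho file format.

def split_multiword(tokens, joiner="_"):
    """Return a new token list in which each multiword item is split into its parts."""
    result = []
    for token in tokens:
        # A token without the joiner yields itself; empty parts are dropped.
        result.extend(part for part in token.split(joiner) if part)
    return result

# Invented sample sentence with two multiword items.
sample = ["apesar_de", "a", "chuva", ",", "São_Paulo", "parou"]
separated = split_multiword(sample)
print(len(sample), len(separated))  # token counts before and after splitting

# Corpus sizes as stated in the announcement.
original_size = 1_167_183
training_size = 1_221_468
added = training_size - original_size
growth_pct = 100 * added / original_size
print(f"{added} additional tokens ({growth_pct:.2f}% growth)")
```

On the reported figures, the training format is 54,285 tokens larger than the research format, an increase of roughly 4.65%.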



The Lácio-Web Project will also make computational linguistics tools available. This first release includes frequency counters and concordancers, which give users a quick view of the subcorpora they generate. New tools, such as morphosyntactic taggers, will be made available in the future. 


Cordially,

 

Lácio-Web Team 
