[Corpora-List] Project Lácio-Web --- Second Release

Sandra Maria Aluísio sandra at icmc.usp.br
Tue Jun 29 20:16:51 UTC 2004


Dear colleagues,


We are pleased to announce the second release of the Lácio-Web webpage. Lácio-Web is a project aimed at providing corpora for Brazilian Portuguese and software tools for computational linguistic processing. 

 

As a result of the first release, launched on January 20th, two corpora were made available:

- a version of Lácio-Ref (a reference corpus with 4,156,816 words) comprising five genres of text (informative, scientific, prose, poetry and drama), for research and for building subcorpora, and

- MAC-MORPHO, a POS-annotated corpus with 1,167,183 words, from the newspaper Folha de São Paulo, 1994.

 

For the second release, Lácio-Ref has been enhanced with texts from the following genres: legal, scientific, informative and instructional. The Lácio-Ref Corpus consists of 4,278 files with 8,291,818 words at the time of its second release.

A parallel corpus, Par-C, has also been made available, with 646 text files in English and 646 in Portuguese from Revista Pesquisa Fapesp. The total number of words in the parallel corpus is 893,283.

Apart from these corpora, a tool to build English-Portuguese comparable corpora for the legal genre has also been made available. To support it, a reference corpus of English legal texts (Ref-Ig) has been compiled. It contains 29 texts with a total of 61,149 words, and will be enlarged in the future.

 

All in all, Lácio-Web contains 5,708 files with a total of 10,413,524 words.

 

The project also makes available several computational linguistic tools, such as frequency counters, concordancers and three POS taggers trained on the MAC-Morpho corpus: MXPOST, TreeTagger and Brill TBL.

 

These new facilities are available from the project webpage:


http://www.nilc.icmc.usp.br/lacioweb



 

Cordially,

 

Lácio-Web Team 

 