[Corpora-List] General Italian wordlist

Adam Kilgarriff adam at lexmasterclass.com
Thu Nov 15 07:54:24 UTC 2007


Wordlists for the following languages (of wordforms and of lemmas) from
corpora loaded into the Sketch Engine are available at
http://sketchengine.co.uk

	Web corpora: English French German Italian Japanese Spanish 
	Non-web corpora: Chinese English Slovene Portuguese

Web corpora are, for general-language research purposes, usually better than
newspaper corpora (the usual alternative), as they cover a much wider range
of text types (see various studies by Serge Sharoff).

For Italian we have a 2-billion-word web corpus prepared by Marco Baroni; see
Baroni and Kilgarriff (2006), "Large linguistically-processed Web corpora for
multiple languages", Proc. EACL, Trento, Italy:
<http://kilgarriff.co.uk/Publications/2006-BaroniKilg-EACL-DeWAC.pdf>

Adam Kilgarriff

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Emiliano Guevara
Sent: 14 November 2007 23:08
To: CORPORA
Subject: Re: [Corpora-List] General Italian wordlist

Dear Jane,

unfortunately, there are still no freely available or freely
manipulable "general" corpora of Italian comparable to the BNC (I
suppose what you mean is a reference corpus, balanced according to
genre and medium, large enough to be representative of the whole
language, etc.).

I guess the best you can get is wordlists generated either from web
corpora or from large unbalanced corpora such as the "La Repubblica"
corpus (check
http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica).

The good news is: you can get all of this right at Bologna University!

I'll be happy to help you with any of these alternatives, and
possibly also to find a better way to do the keyword search beyond
what WSTools has to offer (when you start playing with several
million words, WSTools just chokes...).
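
As a rough illustration of a keyword search done outside WSTools, here is a
minimal Python sketch that scores each word of a study wordlist against a
reference wordlist using the log-likelihood statistic, one common keyness
measure. The function names, the min_freq cut-off and the toy frequencies
are assumptions made up for the example; real wordlists would first have to
be loaded from whatever file format they come in.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """Log-likelihood score for a word occurring a times in a study
    corpus of n1 tokens and b times in a reference corpus of n2 tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study, reference, min_freq=2):
    """Return (word, score) pairs for words that are relatively more
    frequent in the study wordlist, highest score first."""
    n1 = sum(study.values())
    n2 = sum(reference.values())
    scored = []
    for word, a in study.items():
        if a < min_freq:
            continue  # hypothetical cut-off to skip very rare words
        b = reference.get(word, 0)
        # Keep only "positive" keywords (over-represented in the study corpus).
        if a / n1 > b / n2:
            scored.append((word, log_likelihood(a, b, n1, n2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only.
    novels = Counter({"mare": 50, "il": 1000, "nave": 30})
    general = Counter({"mare": 100, "il": 200000, "nave": 40, "governo": 500})
    for word, score in keywords(novels, general):
        print(f"{word}\t{score:.2f}")
```

Since this only needs the two frequency lists in memory, not the corpora
themselves, it should cope with reference lists built from very large
corpora.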

Cheers,

Emiliano



On 14 Nov 2007, at 16:12, jane.johnson at libero.it wrote:

> Similar to the BNC_World.lst for use with the Keyword tool of the  
> WordSmith suite, I am looking for a wordlist generated from a  
> general corpus of contemporary Italian  to create a Keyword list  
> for a selection of Italian novels. Can anyone point me in the right  
> direction? thanks
> Jane Johnson
> University of Bologna
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia

Homepage: http://morbo.lingue.unibo.it/

E-mail:   emiliano.guevara at unibo.it
           emiguevara at gmail.com
****************************************

