[Corpora-List] Incidence of MWEs

Tue Mar 14 14:03:51 UTC 2006

A student of mine, Chun-Yu Kit, who did a thesis at Sheffield about
six years ago, applied Rissanen's MDL (Minimum Description Length)
algorithm to English corpora as the first stage in a machine learning
project to derive grammars. What MDL does is decide which selection
of English phrases, taken from a large corpus and put in a phrase
lexicon, will minimise the length of the whole object (corpus +
lexicon) taken together. This algorithm is extraordinarily effective
at selecting, unsupervised and unseeded, a plausible phrase inventory
for the language, based only on co-occurrence in the corpus.
Yorick Wilks
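
The idea described above (choose the phrase lexicon that minimises the
combined length of corpus plus lexicon) can be illustrated with a toy
sketch. This is not Kit's actual method: the greedy pair-merging search,
the unigram code for the corpus, the flat BITS_PER_CHAR lexicon cost, and
all function names are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Assumption: every character of a lexicon entry costs a flat 5 bits to spell out.
BITS_PER_CHAR = 5.0


def description_length(tokens):
    """Two-part cost in bits: lexicon spelling cost plus corpus coding cost."""
    counts = Counter(tokens)
    total = len(tokens)
    # Corpus cost: each unit is coded with -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: characters of each entry plus one terminator symbol.
    lexicon_bits = sum((len(unit) + 1) * BITS_PER_CHAR for unit in counts)
    return corpus_bits + lexicon_bits


def merge(tokens, pair):
    """Rewrite the corpus so each non-overlapping occurrence of pair is one unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_phrases(tokens, max_rounds=50):
    """Greedily add the phrase that most reduces total description length."""
    tokens = list(tokens)
    best = description_length(tokens)
    phrases = []
    for _ in range(max_rounds):
        pairs = Counter(zip(tokens, tokens[1:]))
        chosen, chosen_tokens = None, None
        for pair, n in pairs.items():
            if n < 2:  # a pair seen once cannot pay for its lexicon entry
                continue
            trial = merge(tokens, pair)
            bits = description_length(trial)
            if bits < best:
                best, chosen, chosen_tokens = bits, pair, trial
        if chosen is None:  # no merge shortens the description: stop
            break
        tokens = chosen_tokens
        phrases.append(" ".join(chosen))
    return phrases, tokens, best
```

Calling learn_phrases on a whitespace-tokenised corpus returns the phrases
admitted to the lexicon, the corpus rewritten in terms of those units, and
the final description length in bits. A phrase is admitted only when the
bits it saves in coding the corpus outweigh the bits needed to store it in
the lexicon, which is the trade-off the message describes.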

On 14 Mar 2006, at 13:30, Chris Butler wrote:

> Dear David,
>
> As Adam Kilgarriff makes clear, the answer depends crucially on exactly
> what you're looking for, and the decisions you make about what to
> include. For estimates and discussion, you might like to look at the
> following:
>
> Altenberg, B (1998) On the phraseology of spoken English: the evidence
> of recurrent word combinations. In A P Cowie (ed.) Phraseology. (Oxford
> Studies in Lexicography and Lexicology), pp101-122. Oxford: Oxford
> University Press.
>
> Biber, D et al (1999) Longman Grammar of Spoken and Written English,
> pp990-1024.
>
> Butler, C S (1997) Repeated word combinations in spoken and written
> text: some implications for Functional Grammar. In C S Butler, J H
> Connolly, R A Gatward and R M Vismans (eds.) A Fund of Ideas: Recent
> Developments in Functional Grammar. (Studies in Language and Language
> Use 31), pp60-77. Amsterdam: IFOTT, University of Amsterdam.
>
> Wray, A M (2002) Formulaic Language and the Lexicon. Cambridge:
> Cambridge University Press, especially Chapters 2 and 3.
>
> Best wishes,
>
> Chris Butler
> Honorary Professor, Centre for Applied Language Studies,
> University of Wales Swansea
>
> ----- Original Message -----
> From: "David Brooks" <D.J.Brooks at cs.bham.ac.uk>
> To: "Corpora List" <corpora at uib.no>
> Sent: Tuesday, March 14, 2006 12:42 PM
> Subject: [Corpora-List] Incidence of MWEs
>
>
>
>> Dear Corpora-folk,
>>
>> I was wondering if anyone has estimated the incidence of multi-word
>> expressions in language. I know that empirical estimates are tied to
>> particular corpora, but does anyone have an account of MWEs for
>> particular corpora, so that "ball-park" figures of the proportion of
>> MWEs can be estimated?
>>
>> Better yet, can anyone give me a good reference for the incidence of
>> MWEs?
>>
>> Regards,
>> David
>> -- 
>> David Brooks
>> http://www.cs.bham.ac.uk/~djb