[Corpora-List] Incidence of MWEs
Yorick Wilks
yorick at dcs.shef.ac.uk
Tue Mar 14 14:03:51 UTC 2006
A student of mine, Chun-Yu Kit, who did a thesis at Sheffield about
six years ago, applied Rissanen's MDL (Minimum Description Length)
algorithm to English corpora as the first stage in a machine learning
project to derive grammars. What MDL does is decide which selection
of English phrases, taken from a large corpus and put in a phrase
lexicon, will minimise the length of the whole object (corpus +
lexicon) taken together. The algorithm is extraordinarily effective
at selecting, unsupervised and unseeded, a plausible phrase
inventory for the language, based only on cooccurrence in the corpus.
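
To make the idea concrete, here is a minimal sketch of the "corpus +
lexicon" trade-off. It is not the thesis implementation or Rissanen's
full formulation: the greedy longest-match segmentation, the flat
per-character lexicon cost (BITS_PER_CHAR), the toy corpus and every
function name below are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Assumed flat cost, in bits, of spelling out one character of a lexicon entry.
BITS_PER_CHAR = 5.0


def segment(tokens, phrases):
    """Greedy left-to-right, longest-match segmentation into lexicon units."""
    max_len = max((len(p) for p in phrases), default=1)
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(tokens[i:i + n])
            # Single words are always allowed; longer units must be in the lexicon.
            if n == 1 or cand in phrases:
                units.append(cand)
                i += n
                break
    return units


def description_length(tokens, phrases):
    """Bits to encode the segmented corpus plus bits to spell out the lexicon."""
    units = segment(tokens, phrases)
    counts = Counter(units)
    total = len(units)
    # Corpus cost: each unit costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: every distinct unit is spelled out once (+1 for a separator).
    lexicon_bits = sum(BITS_PER_CHAR * (len(" ".join(u)) + 1) for u in counts)
    return corpus_bits + lexicon_bits


def select_phrases(tokens, candidates):
    """Keep each candidate phrase only if it shortens the total description."""
    chosen = set()
    best = description_length(tokens, chosen)
    for cand in candidates:
        trial = description_length(tokens, chosen | {cand})
        if trial < best:
            chosen.add(cand)
            best = trial
    return chosen, best


# Toy corpus (invented for this example).
sentence = "of course we agree and of course they agree but of course you know"
tokens = sentence.split() * 20

baseline = description_length(tokens, set())
chosen, best = select_phrases(tokens, [("of", "course"), ("they", "know")])
print(chosen, round(baseline, 1), round(best, 1))
```

On this toy corpus the frequent pair "of course" is kept because
storing it once in the lexicon shortens the encoded corpus by more than
the entry costs, while "they know", which never occurs as a unit, is
rejected. A real system would also need a way to propose candidate
phrases and a more careful coding scheme.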
Yorick Wilks
On 14 Mar 2006, at 13:30, Chris Butler wrote:
> Dear David,
>
> As Adam Kilgarriff makes clear, the answer depends crucially on
> exactly what you're looking for, and the decisions you make about
> what to include. For estimates and discussion, you might like to
> look at the following:
>
> Altenberg, B (1998) On the phraseology of spoken English: the
> evidence of recurrent word combinations. In A P Cowie (ed.)
> Phraseology. (Oxford Studies in Lexicography and Lexicology),
> pp101-122. Oxford: Oxford University Press.
>
> Biber, D et al (1999) Longman Grammar of Spoken and Written English,
> pp990-1024.
>
> Butler, C S (1997) Repeated word combinations in spoken and written
> text: some implications for Functional Grammar. In C S Butler,
> J H Connolly, R A Gatward and R M Vismans (eds.) A Fund of Ideas:
> Recent Developments in Functional Grammar. (Studies in Language and
> Language Use 31), pp60-77. Amsterdam: IFOTT, University of Amsterdam.
>
> Wray, A M (2002) Formulaic Language and the Lexicon. Cambridge:
> Cambridge University Press, especially Chapters 2 and 3.
>
> Best wishes,
>
> Chris Butler
> Honorary Professor, Centre for Applied Language Studies,
> University of Wales Swansea
>
> ----- Original Message -----
> From: "David Brooks" <D.J.Brooks at cs.bham.ac.uk>
> To: "Corpora List" <corpora at uib.no>
> Sent: Tuesday, March 14, 2006 12:42 PM
> Subject: [Corpora-List] Incidence of MWEs
>
>
>
>> Dear Corpora-folk,
>>
>> I was wondering if anyone has estimated the incidence of multi-word
>> expressions in language. I know that empirical estimates are tied to
>> particular corpora, but does anyone have an account of MWEs for
>> particular corpora, so that "ball-park" figures of the proportion of
>> MWEs can be estimated?
>>
>> Better yet, can anyone give me a good reference for the incidence
>> of MWEs?
>>
>> Regards,
>> David
>> --
>> David Brooks
>> http://www.cs.bham.ac.uk/~djb
>>
>>
>
>
>