[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Fri Mar 17 15:38:26 UTC 2006


This gives me a good excuse to announce that we are planning something 
very similar to what Diana proposes as the "more interesting one" for 
the next, XML-only, release of the BNC.

As current users of the BNC will know, in that corpus the MWEs 
recognised by CLAWS are treated as if they were "words". So, for 
example, we see things like "<w AVO>in spite of" alongside "<w PRP>in <w 
NN1>spite".  This has led to some discontent, both from people who want 
to decide for themselves what counts as a MWE and from people who want 
to treat the components of MWEs in the same way as other words.  There's 
no denying the usefulness of the information in the CLAWS groupings, 
however, so we don't want to do without it.

The current plan is to introduce a new tag "<mw>" to mark all 
CLAWS-identified multiword units, within which the orthographically 
distinct components will be tagged, as elsewhere, with <w> tags. So we 
will see something like

<mw AVO>
    <w PRP>in</w>
    <w NN1>spite</w>
    <w PRP>of</w>
</mw>

This means that a simple-minded query for the word "spite" will find all 
occurrences. It also means that more interesting queries like "what 
parts of speech contribute to the POS of a multiword unit" are feasible.
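
As a rough illustration of the two kinds of query, here is a minimal Python sketch. It is not the BNC's actual schema: the message writes the tags in shorthand (<mw AVO>, <w PRP>), so the attribute name "pos", the wrapper elements <text> and <s>, and the sample sentences are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The attribute name "pos" and the <text>/<s> wrappers
# are assumptions; the tag values are copied as written in the message.
SAMPLE = """
<text>
  <s>
    <mw pos="AVO">
      <w pos="PRP">in</w>
      <w pos="NN1">spite</w>
      <w pos="PRP">of</w>
    </mw>
    <w pos="AT0">the</w>
    <w pos="NN1">rain</w>
  </s>
  <s>
    <w pos="AJ0">pure</w>
    <w pos="NN1">spite</w>
  </s>
</text>
"""


def find_word(root, form):
    """Return every <w> whose text is `form`, inside or outside an <mw>."""
    return [w for w in root.iter("w") if (w.text or "").strip() == form]


def component_pos_by_unit(root):
    """Map each multiword-unit POS to a count of its component POS sequences."""
    table = {}
    for mw in root.iter("mw"):
        parts = tuple(w.get("pos") for w in mw.findall("w"))
        table.setdefault(mw.get("pos"), Counter())[parts] += 1
    return table


root = ET.fromstring(SAMPLE.strip())
hits = find_word(root, "spite")        # both occurrences of "spite"
table = component_pos_by_unit(root)    # unit POS -> component POS sequences
```

Because every component is an ordinary <w> element, the word query needs no special case for multiword units, and the second query only has to walk the children of each <mw>.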

Contrary to rumour, the new tag is not named after Martin Wynne.

Lou



Santos Diana wrote:
> Even though this is obviously true, I suppose the original query by David Brooks was not how linguistic analysis should proceed -- a very interesting issue, of course -- but (correct me if I am wrong) how the existing annotated corpora or treebanks have dealt with the question of identifying (or not) MWEs, since he wanted to use existing treebanks like SUSANNE or the Penn Treebank to induce parsers (or do parser evaluation).
> 
> So -- and while fully agreeing with e.g. Adam Kilgarriff's post on the general lack of consensus about what a MWE is -- I think that some more positive answers (although not solutions) could be given:
> 
> For Portuguese annotated corpora (AC/DC project and further projects) we decided to encode both individual POS -- if that was at all possible -- and their MWE equivalents (if they had been so found by the parser).
> 
> For example (I present here fake examples in English; real Portuguese ones can be read in the papers below, along with more discussion on parser evaluation and annotated corpora encoding :-)
> 
> "round table"
> 
> would be encoded (in IMS Corpus Workbench syntax) as
> <mwe pos="N">
> round	ADJ
> table	N
> </mwe>
> 
> or probably a more interesting one like NPs with adverbial meaning
>  
> <mwe pos="ADV">
> night	N
> and	CONJ
> day	N
> </mwe>
> 
> This allows people to choose which (or both) kinds of information they will use.
> 
> I don't know whether the encoders of other (syntactically annotated) corpora have used this two-view treatment of MWEs, but I would like to have your feedback, as well as any information on how other existing treebanks have dealt with this problem.
> 
> Best regards,
> Diana
> 
> Diana Santos & Eckhard Bick. "Providing Internet access to Portuguese corpora: the AC/DC project". In Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis & Gregory Stainhauer (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000) (Athens, 31 May-2 June 2000), pp. 205-210. http://www.linguateca.pt/Diana/download/SantosBickLREC2000.pdf
> 
> Diana Santos & Caroline Gasperin. "Evaluation of parsed corpora: experiments in user-transparent and user-visible evaluation". In Manuel González Rodrigues & Carmen Paz Suarez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), Paris: ELRA, pp. 597-604. http://www.linguateca.pt/Diana/download/SantosGasperinLREC2002.pdf
> 
> Diana Santos & Susana Inácio. "Annotating COMPARA, a grammar-aware parallel corpus". To appear in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006) (Genoa, Italy, 22-28 May 2006). http://www.linguateca.pt/Diana/download/SantosInacioLREC2006.pdf
> 
> 
> 
> 
>>-----Original Message-----
>>From: owner-corpora at lists.uib.no 
>>[mailto:owner-corpora at lists.uib.no] On Behalf Of Chris Butler
>>Sent: 17 March 2006 10:21
>>To: corpora at uib.no
>>Subject: [Corpora-List] Re: [Corpora-list] Incidence of MWEs
>>
>>There is now a considerable body of theoretical linguistic 
>>work which underlies the position taken by Rob Freeman, i.e. 
>>that we should build our linguistic models on the basis of 
>>generalisations over usage. I am referring to the so-called 
>>'usage-based model' represented by the work of Langacker in 
>>Cognitive Grammar, much work in Construction Grammar (e.g. 
>>that of Goldberg, Croft), and also the work of scholars such 
>>as Bybee, Hopper, Thompson, Barlow and Kemmer.
>>
>>Chris Butler
>>Honorary Professor, Centre for Applied Language Studies, 
>>University of Wales Swansea, UK
>>
>>
>>
> 
> 
> 
> 


