[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Fri Mar 17 15:10:05 UTC 2006

Even though this is obviously true, I suppose the original query by David Brooks was not about how linguistic analysis should proceed -- a very interesting issue, of course -- but (correct me if I am wrong) about how existing annotated corpora or treebanks have dealt with the question of identifying (or not) MWEs, since he wanted to use existing treebanks like SUSANNE or the Penn Treebank to induce parsers (or do parser evaluation).

So -- and while fully agreeing with e.g. Adam Kilgarriff's post on the general lack of consensus about what an MWE is -- I think that some more positive answers (although not solutions) could be given:

For Portuguese annotated corpora (the AC/DC project and further projects) we decided to encode both the POS of the individual words -- where that was at all possible -- and the POS of the MWE as a whole (if the parser had identified it as one).

For example (I present here made-up examples in English; real Portuguese ones can be found in the papers below, along with more discussion of parser evaluation and annotated corpus encoding :-)

"round table"

would be encoded (in IMS Corpus Workbench syntax) as
<mwe pos="N">
round	ADJ
table	N
</mwe>

or, probably more interesting, an NP with adverbial meaning:

<mwe pos="ADV">
night	N
and	CONJ
day	N
</mwe>

This allows people to choose which kind of information to use, or to use both.
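
To make the two views concrete, here is a minimal Python sketch of how a user might read such an encoding and pick either the word-level or the MWE-level analysis. It is only an illustration under assumptions: the function names (read_vertical, token_view, mwe_view) are hypothetical, the trailing token "work" is made up to show a non-MWE word, and the code reads the plain text format shown above directly rather than going through any IMS Corpus Workbench tool.

```python
import re

# Made-up sample in the format shown above: one token per line, word and POS
# separated by a tab, MWEs wrapped in <mwe pos="..."> ... </mwe>.
# The final token "work" is a hypothetical addition, not from the post.
SAMPLE = """\
<mwe pos="ADV">
night\tN
and\tCONJ
day\tN
</mwe>
work\tV
"""

MWE_OPEN = re.compile(r'<mwe pos="([^"]+)">')


def read_vertical(text):
    """Return a list of units (mwe_pos_or_None, [(word, pos), ...])."""
    units, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = MWE_OPEN.fullmatch(line)
        if match:
            # Start collecting the tokens of a new MWE.
            current = (match.group(1), [])
        elif line == "</mwe>":
            units.append(current)
            current = None
        else:
            word, pos = line.split()
            if current is not None:
                current[1].append((word, pos))
            else:
                # A token outside any MWE is a unit of its own.
                units.append((None, [(word, pos)]))
    return units


def token_view(units):
    """Word-level view: every token with its individual POS."""
    return [tok for _, toks in units for tok in toks]


def mwe_view(units):
    """MWE-level view: each MWE collapsed into one item with the MWE's POS."""
    out = []
    for mwe_pos, toks in units:
        if mwe_pos is None:
            out.extend(toks)
        else:
            out.append((" ".join(word for word, _ in toks), mwe_pos))
    return out


units = read_vertical(SAMPLE)
print(token_view(units))
print(mwe_view(units))
```

Someone inducing a parser from word-level tags would use the first view; someone who wants "night and day" treated as a single adverbial would use the second. Neither choice is forced by the encoding.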

I don't know whether the encoders of other (syntactically annotated) corpora have used this two-view encoding of MWEs, but I would like to have your feedback, as well as any information on how other existing treebanks have dealt with this problem.

Best regards,
Diana

Diana Santos & Eckhard Bick. "Providing Internet access to Portuguese corpora: the AC/DC project". In Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis & Gregory Stainhauer (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000) (Athens, 31 May-2 June 2000), pp. 205-210. http://www.linguateca.pt/Diana/download/SantosBickLREC2000.pdf

Diana Santos & Caroline Gasperin. "Evaluation of parsed corpora: experiments in user-transparent and user-visible evaluation". In Manuel González Rodrigues & Carmen Paz Suarez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), Paris: ELRA, pp. 597-604. http://www.linguateca.pt/Diana/download/SantosGasperinLREC2002.pdf

Diana Santos & Susana Inácio. "Annotating COMPARA, a grammar-aware parallel corpus". To appear in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006) (Genoa, Italy, 22-28 May 2006). http://www.linguateca.pt/Diana/download/SantosInacioLREC2006.pdf

> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Chris Butler
> Sent: 17. mars 2006 10:21
> To: corpora at uib.no
> Subject: [Corpora-List] Re: [Corpora-list] Incidence of MWEs
> 
> There is now a considerable body of theoretical linguistic 
> work which underlies the position taken by Rob Freeman, i.e. 
> that we should build our linguistic models on the basis of 
> generalisations over usage. I am referring to the so-called 
> 'usage-based model' represented by the work of Langacker in 
> Cognitive Grammar, much work in Construction Grammar (e.g. 
> that of Goldberg, Croft), and also the work of scholars such 
> as Bybee, Hopper, Thompson, Barlow and Kemmer.
> 
> Chris Butler
> Honorary Professor, Centre for Applied Language Studies, 
> University of Wales Swansea, UK