[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Santos Diana Diana.Santos at sintef.no
Fri Mar 17 20:58:35 UTC 2006


Dear Afsaneh,

The point made was to annotate MWEs both as such and as individual words.
It had nothing to do with their being contiguous or not, even though the examples I presented were contiguous.

In fact, it might not have been obvious either that both examples were meant to illustrate cases where the expression could also have a literal meaning (and therefore would not receive the additional MWE tag), as in "I bought a round table" or "I like day and night".

(round table - probably not a good example in English - came from the Portuguese "mesa redonda" which in its non-literal meaning denotes a debate with many participants (at least more than 2).)
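
As a minimal sketch of this dual annotation (an illustration only, not the actual AC/DC or IMS Corpus Workbench tooling), the Python below keeps a word-level POS tag on every token and adds an MWE layer only when the non-literal reading applies. The names Token, MweSpan and to_vertical are hypothetical, and the one-token-per-line output merely imitates the examples quoted further down in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str
    pos: str  # word-level POS, always present


@dataclass
class MweSpan:
    start: int  # index of the first token in the MWE
    end: int    # index one past the last token
    pos: str    # POS of the expression as a whole


def to_vertical(tokens: List[Token], mwes: List[MweSpan]) -> str:
    """Render one token per line, wrapping each MWE span in <mwe> tags."""
    opens = {m.start: m for m in mwes}
    closes = {m.end for m in mwes}
    lines = []
    for i, tok in enumerate(tokens):
        if i in closes:
            lines.append("</mwe>")
        if i in opens:
            lines.append(f'<mwe pos="{opens[i].pos}">')
        lines.append(f"{tok.form}\t{tok.pos}")
    if len(tokens) in closes:  # an MWE that ends with the last token
        lines.append("</mwe>")
    return "\n".join(lines)


words = [Token("round", "ADJ"), Token("table", "N")]

# Literal reading ("I bought a round table"): word-level tags only.
print(to_vertical(words, []))

# Non-literal reading: same word-level tags, plus the MWE layer.
print(to_vertical(words, [MweSpan(0, 2, "N")]))
```

Because the word-level tags are identical in both readings, a user who ignores the MWE layer sees the same tokens either way.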

There are parsing formalisms that do allow discontinuous constituents, as well as treebanks (such as Floresta Sintactica, www.linguateca.pt/Floresta/ ) which encode discontinuous constituents. In that case you may simply use the dual approach.
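
One way to picture the dual approach for a non-contiguous expression such as "give <somebody> a call" is standoff annotation: every token keeps its own tag, and the MWE is recorded as a set of token positions that need not be adjacent. The sketch below is hypothetical; the names Mwe, mwe_forms and intervening, and the POS labels, are assumptions for illustration and not the format of any particular treebank.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word form, word-level POS)


@dataclass(frozen=True)
class Mwe:
    members: FrozenSet[int]  # positions of the tokens that belong to the MWE
    pos: str                 # POS of the expression as a whole


def mwe_forms(tokens: List[Token], mwe: Mwe) -> List[str]:
    """Return the member word forms in sentence order."""
    return [tokens[i][0] for i in sorted(mwe.members)]


def intervening(tokens: List[Token], mwe: Mwe) -> List[str]:
    """Return forms between the first and last member that are outside the MWE."""
    lo, hi = min(mwe.members), max(mwe.members)
    return [tokens[i][0] for i in range(lo, hi + 1) if i not in mwe.members]


sentence = [("give", "V"), ("Mary", "N"), ("a", "DET"), ("call", "N")]
give_a_call = Mwe(members=frozenset({0, 2, 3}), pos="V")

print(mwe_forms(sentence, give_a_call))    # ['give', 'a', 'call']
print(intervening(sentence, give_a_call))  # ['Mary']
```

The argument ("Mary") stays outside the expression while remaining an ordinary token in the sentence, so both views are still available.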

For a general method of dealing with MWEs in MT, see my old paper, which may give you some ideas on how to deal with the kinds of expressions you have in mind in a practical application:

Santos, Diana. "Lexical gaps and idioms in Machine Translation", Hans Karlgren (ed.), Proceedings of COLING'90 (Helsinki, August 1990), Vol 2, pp.330-5. http://www.linguateca.pt/Diana/download/Santos1990COLING.pdf

Diana

> -----Original Message-----
> From: Afsaneh Fazly [mailto:afsaneh at cs.toronto.edu] 
> Sent: 17. mars 2006 19:18
> To: Lou Burnard
> Cc: Santos Diana; Chris Butler; corpora at uib.no
> Subject: Re: [Corpora-List] Re: [Corpora-list] Incidence of MWEs
> 
> 
> Tagging multiword units (MWUs) using such an approach would 
> be of great use.  However, one issue still remains unsolved, 
> and that is the syntactic flexibility of some multiword 
> units.  As I mentioned in my previous email, this is 
> especially a problem with verb--noun MWUs, such as "give a 
> call", "take a walk", etc.  It would be nice if we could also 
> come up with a way for representing such cases.  For example, 
> in "give <somebody> a call", "<somebody>" is an argument of 
> the complex verb "give a call", and should NOT be included 
> inside the <mw> tag containing "give a call".  This is of 
> course a more general issue with non-contiguous MWUs, and 
> not specific to a few examples.  Unfortunately, I do not have 
> a solution for this problem.  Hopefully, we will find a 
> solution (or at least a partial one), now that the issue has 
> been raised.
> 
> Regards,
> 
> Afsaneh Fazly
> ==================================================================
> PhD student, Computational Linguistics Group, University of 
> Toronto www.cs.toronto.edu/~afsaneh 
> ==================================================================
> 
> On Fri, 17 Mar 2006, Lou Burnard wrote:
> 
> > This gives me a good excuse to announce that we are planning something
> > very similar to what Diana proposes as the "more interesting one" for
> > the next, XML only, release of the BNC.
> >
> > As current users of the BNC will know, in that corpus the MWEs
> > recognised by CLAWS are treated as if they were "words". So, for
> > example, we see things like "<w AVO>in spite of" alongside "<w PRP>in
> > <w NN1>spite".  This has led to some discontent, both from people who
> > want to decide for themselves what counts as a MWE and from people who
> > want to treat the components of MWEs in the same way as other words.
> > There's no denying the usefulness of the information in the CLAWS
> > groupings, however, so we don't want to do without it.
> >
> > The current plan is to introduce a new tag "<mw>" to mark all
> > CLAWS-identified multiword units, within which the orthographically
> > distinct components will be tagged, as elsewhere, with <w> tags. So we
> > will see something like
> >
> > <mw AVO>
> >     <w PRP>in</w>
> >     <w NN1>spite</w>
> >     <w PRP>of</w>
> > </mw>
> >
> > This means that a simple-minded query for the word "spite" will find
> > all occurrences. It also means that more interesting queries like
> > "what parts of speech contribute to the POS of a multiword unit" are
> > feasible.
> >
> > Contrary to rumour, the new tag is not named after Martin Wynne.
> >
> > Lou
> >
> >
> >
> > Santos Diana wrote:
> > > Even though this is obviously true, I suppose the original query by
> > > David Brooks was not how linguistic analysis should proceed -- a very
> > > interesting issue, of course -- but (correct me if I am wrong) how the
> > > existing annotated corpora or treebanks have dealt with the question of
> > > identifying (or not) MWEs, since he wanted to use existing treebanks
> > > like SUSANNE or the Penn Treebank to induce parsers (or do parser
> > > evaluation).
> > >
> > > So -- and while fully agreeing with e.g. Adam Kilgarriff's post on
> > > the general lack of consensus about what a MWE is -- I think that
> > > some more positive answers (although not solutions) could be given:
> > >
> > > For Portuguese annotated corpora (AC/DC project and further
> > > projects) we decided to encode both individual POS -- if that was at
> > > all possible -- and their MWE equivalents (if they had been so found
> > > by the parser).
> > >
> > > For example (I present here fake examples in English, real 
> > > Portuguese ones can be read in the papers below, as well as more 
> > > discussion on parser evaluation and annotated corpora encoding:-)
> > >
> > > "round table"
> > >
> > > would be encoded (in IMS Corpus Workbench syntax) as
> > >
> > > <mwe pos="N">
> > > round	ADJ
> > > table	N
> > > </mwe>
> > >
> > > or probably a more interesting one like NPs with adverbial meaning
> > >
> > > <mwe pos="ADV">
> > > night	N
> > > and 	CONJ
> > > day	N
> > > </mwe>
> > >
> > > This allows people to choose which kind of information (or both)
> > > they will use.
> > >
> > > I don't know if other encoders of (syntactically annotated) corpora
> > > have used this dual view of MWEs, but I would like to have your
> > > feedback, as well as any information on how other existing treebanks
> > > have dealt with this problem.
> > >
> > > Best regards,
> > > Diana
> > >
> > > Diana Santos & Eckhard Bick. "Providing Internet access to
> > > Portuguese corpora: the AC/DC project". In Maria Gavrilidou, George
> > > Carayannis, Stella Markantonatou, Stelios Piperidis & Gregory
> > > Stainhauer (eds.), Proceedings of the Second International
> > > Conference on Language Resources and Evaluation (LREC 2000) (Athens,
> > > 31 May-2 June 2000), pp. 205-210.
> > > http://www.linguateca.pt/Diana/download/SantosBickLREC2000.pdf
> > >
> > > Diana Santos & Caroline Gasperin. "Evaluation of parsed corpora: 
> > > experiments in user-transparent and user-visible evaluation". In 
> > > Manuel González Rodrigues & Carmen Paz Suarez Araujo (eds.), 
> > > Proceedings of LREC 2002, the Third International Conference on 
> > > Language Resources and Evaluation (Las Palmas de Gran Canaria, 
> > > Spain, 29-31 May 2002), Paris: ELRA, pp. 597-604. 
> > > http://www.linguateca.pt/Diana/download/SantosGasperinLREC2002.pdf
> > >
> > > Diana Santos & Susana Inácio. "Annotating COMPARA, a grammar-aware
> > > parallel corpus". To appear in Proceedings of the 5th International
> > > Conference on Language Resources and Evaluation (LREC'2006)
> > > (Genoa, Italy, 22-28 May 2006).
> > > http://www.linguateca.pt/Diana/download/SantosInacioLREC2006.pdf
> > >
> > >
> > >
> > >
> > >>-----Original Message-----
> > >>From: owner-corpora at lists.uib.no
> > >>[mailto:owner-corpora at lists.uib.no] On Behalf Of Chris Butler
> > >>Sent: 17. mars 2006 10:21
> > >>To: corpora at uib.no
> > >>Subject: [Corpora-List] Re: [Corpora-list] Incidence of MWEs
> > >>
> > >>There is now a considerable body of theoretical linguistic work 
> > >>which underlies the position taken by Rob Freeman, i.e.
> > >>that we should build our linguistic models on the basis of 
> > >>generalisations over usage. I am referring to the so-called 
> > >>'usage-based model' represented by the work of Langacker in 
> > >>Cognitive Grammar, much work in Construction Grammar (e.g.
> > >>that of Goldberg, Croft), and also the work of scholars such as 
> > >>Bybee, Hopper, Thompson, Barlow and Kemmer.
> > >>
> > >>Chris Butler
> > >>Honorary Professor, Centre for Applied Language Studies, 
> University 
> > >>of Wales Swansea, UK
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > >
> >
> >
> >
> 


