[Corpora-List] Re: [Corpora-list] Incidence of MWEs

Afsaneh Fazly afsaneh at cs.toronto.edu
Fri Mar 17 18:18:24 UTC 2006


Tagging multiword units (MWUs) using such an approach would be of great
use.  However, one issue remains unsolved: the syntactic flexibility of
some multiword units.  As I mentioned in my previous email, this is
especially a problem with verb--noun MWUs such as "give a call", "take
a walk", etc.  It would be nice if we could also come up with a way of
representing such cases.  For example, in "give <somebody> a call",
"<somebody>" is an argument of the complex verb "give a call", and
should NOT be included inside the <mw> tag containing "give a call".
This is of course a more general issue with non-contiguous MWUs, and
not specific to a few examples.  Unfortunately, I do not have a
solution for this problem.  Hopefully we will find a solution (or at
least a partial one), now that the issue has been raised.
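
To make the problem concrete, here is a minimal sketch in Python.  It is
purely illustrative: the token list, the index list and the helper names
are assumptions made up for this example, not part of CLAWS or of the
proposed BNC markup.  It shows that any single contiguous span covering
"give ... a call" necessarily swallows the argument, whereas a list of
component positions does not.

```python
# Illustrative only: tokens, indices and helper names are hypothetical
# and do not come from any existing tagging scheme.
tokens = ["give", "somebody", "a", "call"]

# Positions of the tokens that belong to the MWU "give a call".
mwu_indices = [0, 2, 3]


def contiguous_span(indices):
    """Smallest contiguous index range covering all MWU components,
    i.e. what a single enclosing element would have to wrap."""
    return list(range(min(indices), max(indices) + 1))


def intervening(indices):
    """Positions inside the span that are not MWU components."""
    members = set(indices)
    return [i for i in contiguous_span(indices) if i not in members]


span_words = [tokens[i] for i in contiguous_span(mwu_indices)]
mwu_words = [tokens[i] for i in mwu_indices]
argument_words = [tokens[i] for i in intervening(mwu_indices)]

print(span_words)      # tokens a single enclosing tag would wrap
print(mwu_words)       # tokens that actually form the MWU
print(argument_words)  # material captured only because of contiguity
```

Whether such position lists could be expressed in the proposed <mw>
markup is exactly the open question.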

Regards,

Afsaneh Fazly
==================================================================
PhD student, Computational Linguistics Group,
University of Toronto
www.cs.toronto.edu/~afsaneh
==================================================================

On Fri, 17 Mar 2006, Lou Burnard wrote:

> This gives me a good excuse to announce that we are planning something
> very similar to what Diana proposes as the "more interesting one" for
> the next, XML only, release of the BNC.
>
> As current users of the BNC will know, in that corpus the MWEs
> recognised by CLAWS are treated as if they were "words". So, for
> example, we see things like "<w AVO>in spite of" alongside "<w PRP>in <w
> NN1>spite".  This has led to some discontent, both from people who want
> to decide for themselves what counts as a MWE and from people who want
> to treat the components of MWEs in the same way as other words.  There's
> no denying the usefulness of the information in the CLAWS groupings,
> however, so we don't want to do without it.
>
> The current plan is to introduce a new tag "<mw>" to mark all
> CLAWS-identified multiword units, within which the orthographically
> distinct components will be tagged, as elsewhere, with <w> tags. So we
> will see something like
>
> <mw AVO>
>     <w PRP>in</w>
>     <w NN1>spite</w>
>     <w PRP>of</w>
> </mw>
>
> This means that a simple-minded query for the word "spite" will find all
> occurrences. It also means that more interesting queries like "what
> parts of speech contribute to the POS of a multiword unit" are feasible.
>
> Contrary to rumour, the new tag is not named after Martin Wynne.
>
> Lou
>
>
>
> Santos Diana wrote:
> > Even though this is obviously true, I suppose the original query by David Brooks was not how linguistic analysis should proceed -- a very interesting issue, of course -- but (correct me if I am wrong) how the existing annotated corpora or treebanks have dealt with the question of identifying (or not) MWEs, since he wanted to use existing treebanks like SUSANNE or the Penn Treebank to induce parsers (or do parser evaluation).
> >
> > So -- and while fully agreeing with e.g. Adam Kilgarriff's post on the general lack of consensus about what a MWE is -- I think that some more positive answers (although not solutions) could be given:
> >
> > For Portuguese annotated corpora (AC/DC project and further projects) we decided to encode both individual POS -- if that was at all possible -- and their MWE equivalents (if they had been so found by the parser).
> >
> > For example (I present here fake examples in English; real Portuguese ones can be found in the papers below, along with more discussion of parser evaluation and annotated corpus encoding :-)
> >
> > "round table"
> >
> > would be encoded (in IMS Corpus Workbench syntax) as
> > <mwe pos="N">
> > round	ADJ
> > table	N
> > </mwe>
> >
> > or probably a more interesting one like NPs with adverbial meaning
> >
> > <mwe pos="ADV">
> > night	N
> > and	CONJ
> > day	N
> > </mwe>
> >
> > This allows people to choose which kind of information (or both) they will use.
> >
> > I don't know whether the encoders of other (syntactically annotated) corpora have used this two-view encoding of MWEs, but I would like to have your feedback, as well as any information on how other existing treebanks have dealt with this problem.
> >
> > Best regards,
> > Diana
> >
> > Diana Santos & Eckhard Bick. "Providing Internet access to Portuguese corpora: the AC/DC project". In Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis & Gregory Stainhauer (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000) (Athens, 31 May-2 June 2000), pp. 205-210. http://www.linguateca.pt/Diana/download/SantosBickLREC2000.pdf
> >
> > Diana Santos & Caroline Gasperin. "Evaluation of parsed corpora: experiments in user-transparent and user-visible evaluation". In Manuel González Rodrigues & Carmen Paz Suarez Araujo (eds.), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation (Las Palmas de Gran Canaria, Spain, 29-31 May 2002), Paris: ELRA, pp. 597-604. http://www.linguateca.pt/Diana/download/SantosGasperinLREC2002.pdf
> >
> > Diana Santos & Susana Inácio. "Annotating COMPARA, a grammar-aware parallel corpus". To appear in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006) (Genoa, Italy, 22-28 May 2006). http://www.linguateca.pt/Diana/download/SantosInacioLREC2006.pdf
> >
> >
> >
> >
> >>-----Original Message-----
> >>From: owner-corpora at lists.uib.no
> >>[mailto:owner-corpora at lists.uib.no] On Behalf Of Chris Butler
> >>Sent: 17 March 2006 10:21
> >>To: corpora at uib.no
> >>Subject: [Corpora-List] Re: [Corpora-list] Incidence of MWEs
> >>
> >>There is now a considerable body of theoretical linguistic
> >>work which underlies the position taken by Rob Freeman, i.e.
> >>that we should build our linguistic models on the basis of
> >>generalisations over usage. I am referring to the so-called
> >>'usage-based model' represented by the work of Langacker in
> >>Cognitive Grammar, much work in Construction Grammar (e.g.
> >>that of Goldberg, Croft), and also the work of scholars such
> >>as Bybee, Hopper, Thompson, Barlow and Kemmer.
> >>
> >>Chris Butler
> >>Honorary Professor, Centre for Applied Language Studies,
> >>University of Wales Swansea, UK
> >>