[Corpora-List] SemCor: extrapolating Brown data to larger corpora

Adam Kilgarriff adam at lexmasterclass.com
Tue Feb 14 06:25:53 UTC 2006


Mark,

I'd be very skeptical of any such extrapolation.  The senses that happen to
come up when the numbers are so small (usually single figures) are just
arbitrary, and don't sustain extrapolation, even before we agitate about the
match between SEMCOR and big-corpus text type.  
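
As a rough illustration of the small-numbers point (a sketch added for illustration, not taken from any of the work cited here), the following puts approximate 95% Wilson score intervals around the per-sense proportions for [crack] quoted in the original message below. The function name wilson_interval and the choice of z = 1.96 are assumptions, as is treating the 17 tokens as independent draws. Overlapping intervals are only an informal indication, not a formal test.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# SemCor token counts for [crack] as a verb, by WordNet sense (from the message below).
crack_counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}
total = sum(crack_counts.values())

intervals = {sense: wilson_interval(count, total)
             for sense, count in crack_counts.items()}

for sense, (low, high) in intervals.items():
    print(f"sense {sense}: {crack_counts[sense]}/{total}  [{low:.3f}, {high:.3f}]")
```

With only 17 tokens, the interval for sense 1 (5 tokens) overlaps the interval for sense 6 (1 token), so the observed ranking of senses cannot be taken as settled.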

And we should assume everything is Zipfian.  I've been puzzling over the
implications of this for years and have done some modeling: see "How
dominant is the commonest sense of a word" at
http://lexmasterclass.com/people/Publications/2004-K-TSD-CommonestSense.pdf

(In: Text, Speech, Dialogue 2004. Lecture Notes in Artificial Intelligence
Vol. 3206. Sojka, Kopecek and Pala, Eds. Springer Verlag: 103-112.)
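
To see what a Zipfian assumption implies for a sample this small, here is a minimal simulation sketch. It is not the model from the paper above: the 1/rank weighting, the seven senses, the 17-token sample size, and the names zipf_probs and top_sense_hit_rate are all assumptions chosen for illustration. It estimates how often the truly commonest sense is also the unique commonest sense in the sample.

```python
import random
from collections import Counter


def zipf_probs(n_senses, exponent=1.0):
    """Sense probabilities proportional to 1 / rank**exponent (assumed form)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_senses + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_sense_hit_rate(n_senses=7, n_tokens=17, trials=10000, seed=0):
    """Fraction of simulated samples in which sense 1 (the true commonest
    sense) is the unique most frequent sense observed."""
    rng = random.Random(seed)
    probs = zipf_probs(n_senses)
    senses = list(range(1, n_senses + 1))
    hits = 0
    for _ in range(trials):
        sample = rng.choices(senses, weights=probs, k=n_tokens)
        counts = Counter(sample)
        best = max(counts.values())
        winners = [s for s, c in counts.items() if c == best]
        # Ties are counted as misses: the sample does not single out sense 1.
        if winners == [1]:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("17 tokens:  ", top_sense_hit_rate(n_tokens=17))
    print("1000 tokens:", top_sense_hit_rate(n_tokens=1000, trials=1000))
```

Comparing the two runs gives a feel for how much a SemCor-sized sample can mislead relative to a larger one, under these assumptions only.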

Diana McCarthy and colleagues explore the issue in their ACL paper (best
paper award, ACL 2004 Barcelona).  The premise for their work is that you're
better off establishing what domain you are in, and assigning all instances
of a word to the sense associated with that domain, than trying to do
local-context-based WSD.

Of course, everything depends on how similar the two corpora are.  Let's
make that the big research question for the new half-decade!

 Regards,

  Adam

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Mark Davies
Sent: 13 February 2006 23:06
To: corpora at hd.uib.no
Subject: [Corpora-List] SemCor: extrapolating Brown data to larger corpora

A graduate student here is working with SemCor
(http://multisemcor.itc.it/semcor.php), and she's looking at how well
the data from the Brown-based SemCor corpus might potentially compare
with that of a larger corpus, like the BNC.

For example, [crack] as a verb has 17 tokens in SemCor, distributed
among the seven different WordNet senses as follows (if I'm reading the
cntlst files from SemCor 1.6 correctly):

WordNet sense   Tokens
-------------   ------
1               5
2               4
3               2
4               2
5               2
6               1
7               1
-------------   ------
TOTAL           17

The question is whether in a 100 million word corpus, we would get more
or less the same distribution.  For example, might Senses 6-7
(hypothetically) be the most common, even though they each only occur
once in the Brown/SemCor corpus?

Has anyone attempted to compare the results of SemCor with a
randomly-selected subset of tokens from a much larger corpus, such as
the BNC -- even for just a small subset of words (particularly verbs)?
Also, are there any statistical tests that might be used to see whether
we have a sufficiently robust sample for a given word for WSD with SemCor?
(It's obviously a function of frequency - you'd probably get more
reliable results with a high-frequency word like [break] than a lower
frequency word like [smear]).

Also, we're not really looking for basic articles on WSD (or literature
on Senseval, etc), but rather just the issue at hand -- the
extrapolatability (??) of SemCor to a larger corpus.

Sorry if this is an FAQ-like question.  If so, simple references to
existing literature would be appreciated.

Thanks,

Mark Davies

=================================================

Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

================================================= 


