[Corpora-List] SemCor: extrapolating Brown data to larger corpora
Ramesh Krishnamurthy
r.krishnamurthy at aston.ac.uk
Tue Feb 14 11:25:46 UTC 2006
Hi Mark,
There's also the problem of comparing
USA-1962 (Brown) written data with UK-1994 (BNC) written and spoken data,
collected according to different design criteria...
Best
Ramesh
At 06:25 14/02/2006, you wrote:
>Mark,
>
>I'd be very skeptical of any such extrapolation. The senses that happen to
>come up when the numbers are so small (usually single figures) are just
>arbitrary, and don't sustain extrapolation, even before we agitate about the
>match between SEMCOR and big-corpus text type.
>
>And we should assume everything is Zipfian. I've been puzzling over the
>implications of this for years and have done some modeling: see "How
>dominant is the commonest sense of a word" at
>http://lexmasterclass.com/people/Publications/2004-K-TSD-CommonestSense.pdf
>
>(In: Text, Speech, Dialogue 2004. Lecture Notes in Artificial Intelligence
>Vol. 3206. Sojka, Kopecek and Pala, Eds. Springer Verlag: 103-112.)
>
>Diana McCarthy and colleagues explore the issue in their ACL paper (best
>paper award, ACL 2004 Barcelona). The premise for their work is that you're
>better off establishing what domain you are in, and assigning all instances
>of a word to the sense associated with that domain, than trying to do
>local-context-based WSD.
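>
>A toy illustration of that idea (not McCarthy et al.'s actual method, and
>with invented counts and sense labels): pick the predominant sense once per
>domain, then label every instance in that domain with it, ignoring local
>context.
>
>    from collections import Counter
>
>    # hypothetical sense-tagged occurrences of "crack" per domain
>    domain_observations = {
>        "crime_reporting": ["drug", "drug", "fissure", "drug", "drug"],
>        "engineering":     ["fissure", "fissure", "attempt", "fissure"],
>    }
>
>    # establish the predominant sense for each domain once...
>    predominant = {dom: Counter(obs).most_common(1)[0][0]
>                   for dom, obs in domain_observations.items()}
>
>    # ...then assign it to every new instance, ignoring local context
>    def tag(instances, domain):
>        return [(inst, predominant[domain]) for inst in instances]
>
>    print(tag(["crack cocaine", "a crack in the wall"], "crime_reporting"))
>
>The second instance comes out mistagged, which is exactly the trade-off the
>approach accepts in exchange for robustness when senses are skewed by domain.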
>
>Of course, everything depends on how similar the two corpora are. Let's
>make that the big research question for the new half-decade!
>
> Regards,
>
> Adam
>
>-----Original Message-----
>From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
>Behalf Of Mark Davies
>Sent: 13 February 2006 23:06
>To: corpora at hd.uib.no
>Subject: [Corpora-List] SemCor: extrapolating Brown data to larger corpora
>
>A graduate student here is working with SemCor
>(http://multisemcor.itc.it/semcor.php), and she's looking at how well
>the data from the Brown-based SemCor corpus might compare
>with that of a larger corpus, like the BNC.
>
>For example, [crack] as a verb has 17 tokens in SemCor, distributed
>among the seven different WordNet senses as follows (if I'm reading the
>cntlist files from SemCor 1.6 correctly):
>
>WordNet sense   Tokens
>-------------   ------
>      1              5
>      2              4
>      3              2
>      4              2
>      5              2
>      6              1
>      7              1
>-------------   ------
>  TOTAL             17
>
>The question is whether in a 100 million word corpus, we would get more
>or less the same distribution. For example, might Senses 6-7
>(hypothetically) be the most common, even though they each only occur
>once in the Brown/SemCor corpus?
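>
>One back-of-the-envelope way to see how little those 17 tokens pin down:
>bootstrap-resample the observed counts and look at how widely the estimated
>share of each sense swings. (Counts are the ones in the table above; this is
>only a sketch, not a recommendation of the bootstrap as the right tool.)
>
>    import random
>    from collections import Counter
>
>    counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}   # [crack] (v) in SemCor
>    tokens = [sense for sense, c in counts.items() for _ in range(c)]
>    n = len(tokens)                                         # 17
>
>    def bootstrap_interval(sense, reps=10_000):
>        # 95% percentile interval for the sense's share under resampling
>        shares = sorted(Counter(random.choices(tokens, k=n))[sense] / n
>                        for _ in range(reps))
>        return shares[int(0.025 * reps)], shares[int(0.975 * reps)]
>
>    for sense in sorted(counts):
>        lo, hi = bootstrap_interval(sense)
>        print(f"sense {sense}: {counts[sense]/n:.2f} observed, "
>              f"95% CI ~ [{lo:.2f}, {hi:.2f}]")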
>
>Has anyone attempted to compare the results of SemCor with a
>randomly-selected subset of tokens from a much larger corpus, such as
>the BNC -- even for just a small subset of words (particularly verbs)?
>Also, are there any statistical tests that might be used to see whether
>we have a sufficiently robust sample for a given word to do WSD with
>SemCor? (It's obviously a function of frequency - you'd probably get
>more reliable results with a high-frequency word like [break] than with
>a lower-frequency word like [smear].)
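>
>One candidate test, sketched here with made-up BNC figures: treat the SemCor
>counts and a hand-tagged random BNC sample as a 2 x K contingency table and
>run a chi-square test of homogeneity. With cells this small the usual
>expected-count caveat applies, so an exact or Monte Carlo variant would be
>safer in practice.
>
>    from scipy.stats import chi2_contingency
>
>    semcor_counts = [5, 4, 2, 2, 2, 1, 1]       # [crack] (v), senses 1-7, SemCor
>    bnc_counts    = [30, 10, 25, 8, 12, 10, 5]  # hypothetical hand-tagged BNC sample
>
>    chi2, p, dof, expected = chi2_contingency([semcor_counts, bnc_counts])
>    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")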
>
>Also, we're not really looking for basic articles on WSD (or literature
>on Senseval, etc.), but rather for work on just the issue at hand -- the
>extrapolatability (??) of SemCor to a larger corpus.
>
>Sorry if this is an FAQ-like question. If so, simple references to
>existing literature would be appreciated.
>
>Thanks,
>
>Mark Davies
>
>=================================================
>
>Mark Davies
>Assoc. Prof., Linguistics
>Brigham Young University
>(phone) 801-422-9168 / (fax) 801-422-0906
>
>http://davies-linguistics.byu.edu
>
>** Corpus design and use // Linguistic databases **
>** Historical linguistics // Language variation **
>** English, Spanish, and Portuguese **
>
>=================================================
Ramesh Krishnamurthy
Lecturer in English Studies
School of Languages and Social Sciences
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812
Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/english/