[Corpora-List] SemCor: extrapolating Brown data to larger corpora

Mark Davies Mark_Davies at byu.edu
Mon Feb 13 23:06:15 UTC 2006


A graduate student here is working with SemCor
(http://multisemcor.itc.it/semcor.php), and she's looking at how well
the data from the Brown-based SemCor corpus compare with data from a
larger corpus, like the BNC.

For example, [crack] as a verb has 17 tokens in SemCor, distributed
among the seven different WordNet senses as follows (if I'm reading the
cntlst files from SemCor 1.6 correctly):

WordNet sense   Tokens
-------------   ------
1               5
2               4
3               2
4               2
5               2
6               1
7               1
-------------   ------
TOTAL           17

The question is whether in a 100 million word corpus, we would get more
or less the same distribution.  For example, might Senses 6-7
(hypothetically) be the most common, even though they each only occur
once in the Brown/SemCor corpus?
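
To give a rough feel for how little 17 tokens pin down, here is a
minimal Python sketch that puts approximate 95% Wilson score intervals
around each sense's share. The counts are the ones in the table above;
the function name wilson_interval and the choice of the Wilson interval
are assumptions made for illustration, not anything taken from SemCor.
Each interval is computed on its own, so the sketch ignores the fact
that the seven shares must sum to one.

```python
import math

# Sense counts for the verb [crack], copied from the table above.
counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


n = sum(counts.values())  # 17 tokens in total
intervals = {sense: wilson_interval(k, n) for sense, k in counts.items()}

for sense, (lo, hi) in intervals.items():
    print(f"sense {sense}: {counts[sense]}/{n}, interval {lo:.2f} to {hi:.2f}")
```

With these counts the interval for Sense 6 overlaps the one for Sense 1,
so the sample by itself does not settle which of the two would be more
common in a larger corpus.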

Has anyone attempted to compare the results of SemCor with a
randomly-selected subset of tokens from a much larger corpus, such as
the BNC -- even for just a small subset of words (particularly verbs)?
Also, are there any statistical tests that might be used to see whether
we have a sufficiently robust sample for a given word for WSD with
SemCor?  (It's obviously a function of frequency - you'd probably get
more reliable results with a high-frequency word like [break] than with
a lower-frequency word like [smear].)
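
One possible way to make "sufficiently robust" concrete is a bootstrap
check on the sense ranking; this is a sketch under assumptions, not an
established SemCor procedure. It resamples the 17 [crack] tokens with
replacement and reports how often the observed top sense is still
strictly the most frequent one. The function name top_sense_stability,
the number of resamples, and the seed are all hypothetical choices. Note
that this only measures sampling noise within Brown-like text; it cannot
detect genre differences between Brown and a corpus like the BNC.

```python
import random
from collections import Counter

# Sense counts for the verb [crack], copied from the table above.
counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}

# Expand the counts into one sense label per token (17 labels).
tokens = [sense for sense, k in counts.items() for _ in range(k)]


def top_sense_stability(tokens, n_resamples=10_000, seed=0):
    """Share of bootstrap resamples in which the observed top sense is
    still the single most frequent sense (ties count as failures)."""
    rng = random.Random(seed)
    observed_top = Counter(tokens).most_common(1)[0][0]
    hits = 0
    for _ in range(n_resamples):
        resample = rng.choices(tokens, k=len(tokens))
        ranked = Counter(resample).most_common(2)
        is_unique_top = len(ranked) == 1 or ranked[0][1] > ranked[1][1]
        if ranked[0][0] == observed_top and is_unique_top:
            hits += 1
    return hits / n_resamples


stability = top_sense_stability(tokens)
print(f"top sense stays strictly on top in {stability:.0%} of resamples")
```

Running the same function on a higher-frequency verb's tokens would show
whether its ranking is more stable, which is one way to compare words
like [break] and [smear].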

Also, we're not really looking for basic articles on WSD (or literature
on Senseval, etc), but rather just the issue at hand -- the
extrapolatability (??) of SemCor to a larger corpus.

Sorry if this is an FAQ-like question.  If so, simple references to
existing literature would be appreciated.

Thanks,

Mark Davies

=================================================

Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

================================================= 


