[Corpora-List] Testing how representative a particular corpus is

Angus Grieve-Smith grvsmth at panix.com
Thu Jan 23 18:57:03 UTC 2014


Reading the first paper, I think the questions raised by Marc and his 
colleagues and the data they collect are very valuable, particularly 
regarding over-reliance on Kučera and Francis (1967). I wrote just last 
week about the importance of sampling:

http://grieve-smith.com/blog/2014/01/estimating-universals-averages-and-percentages/

But their judgment of corpora as "bad" based on how well their word 
frequencies correlate with reaction times seems circular. Why take one 
particular set of reaction times as the gold standard, and conclude 
that any deviation from that reaction time data is evidence of "poor 
quality"?
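
To make concrete what is being correlated here, a rough sketch of that 
kind of validation might look like the following. This is purely 
illustrative; the file names and column labels are placeholders of 
mine, not anything from Marc's setup, and Python with a Spearman 
correlation is just one reasonable way to do it:

# Sketch of a frequency-vs-RT validation; file names and columns are
# placeholders, not Marc's actual pipeline.
import csv
import math
from scipy.stats import spearmanr

def read_table(path, key_col, val_col):
    """Read a word -> number mapping from a CSV file."""
    table = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                table[row[key_col].lower()] = float(row[val_col])
            except (KeyError, ValueError):
                continue  # skip malformed rows
    return table

# Hypothetical inputs: a corpus frequency list and mean lexical
# decision RTs (e.g. exported from the English Lexicon Project).
freqs = read_table("corpus_frequencies.csv", "word", "count")
rts = read_table("lexical_decision_rts.csv", "word", "mean_rt_ms")

shared = sorted(set(freqs) & set(rts))
log_freq = [math.log10(freqs[w] + 1) for w in shared]  # log-transform counts
reaction = [rts[w] for w in shared]

rho, p = spearmanr(log_freq, reaction)
# Higher-frequency words are usually recognized faster, so rho should
# come out negative; it is the size of |rho| that the validation
# compares across corpora.
print(f"{len(shared)} shared words, Spearman rho = {rho:.3f} (p = {p:.2g})")

My objection is not to that arithmetic, but to which set of reaction 
times gets to play the benchmark.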

Brysbaert et al. chose one particular set of reaction times from the 
Elexicon project (Balota et al. 2007) "because the effect of word 
frequency is particularly strong in this task," but the effect of word 
frequency in which corpus? How representative is the Elexicon data? 
Specifically, how representative is it of the reaction times of people 
who are most likely to be reading or hearing the texts in each corpus?

I agree with Ramesh that the problem of representativeness is a 
difficult one, and that a truly representative corpus has not yet been 
made, but not that it is impossible. Corpora are tools, and tools are 
never simply "good" or "bad"; they are good or bad for whatever task 
you have in mind for them. Experiments are tools in the same way. We 
just need to decide what "population" we want to model, and that 
depends on what we're studying. A corpus that is representative of the 
language encountered by the average Wall Street Journal reader, for 
example, will not match up well with the reaction times of the average 
American college sophomore.

Marc's strategies cast an interesting light on corpus design, but it is 
a stretch to interpret the results as an overall judgment of the 
"quality" of one corpus or another.

On 1/22/2014 4:19 PM, Marc Brysbaert wrote:
> We use lexical decision times (is this a word or not?) to validate 
> word frequency measures from different types of corpora. Usually 
> spoken corpora are not doing extremely well, although this could be 
> due to their small size. Subtitles seem to come closest to spoken 
> language. You find two pointers here:
>
> http://crr.ugent.be/papers/Brysbaert%20&%20New%20BRM%202009%20Subtlexus.pdf 
>

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


More information about the Corpora mailing list