[Corpora-List] Testing how representative a particular corpus is

Thu Jan 23 19:34:08 UTC 2014

Dear Angus,

I fully agree with you that validation on the basis of lexical  
decision data is only one criterion and that this criterion is more  
important for psycholinguists like me, interested in word recognition,  
than for other researchers. For instance, I can imagine that if one is  
interested in language variation, film subtitles might be a nightmare  
(though this in itself may be an interesting research question).

As for the reliance on the English Lexicon Project, I agree with you.  
This is why we have invested heavily in the British Lexicon Project,  
the French Lexicon Project, and the Dutch Lexicon Projects I and II. I  
can reassure you: for each language we tested, the subtitle  
frequencies did best. For English they also do much better than the  
Google frequencies, as I have shown in a Frontiers article. Probably  
the most decisive evidence for me was the finding that British  
subtitle frequencies did better than the BNC frequencies.
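
To make the criterion concrete, here is a minimal sketch of the kind of
comparison involved: for each set of frequency norms, correlate log word
frequency with mean lexical decision time and see which norms explain more
variance. This is a simplification, not the analysis reported in the papers.
The function names, the log10(count + 1) transform, and all numbers in the
toy data are assumptions made up for illustration; they are not values from
any lexicon project or corpus.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(counts, reaction_times):
    """R^2 between log10(count + 1) and mean RT, over words found in both."""
    words = sorted(set(counts) & set(reaction_times))
    log_freq = [math.log10(counts[w] + 1) for w in words]
    rts = [reaction_times[w] for w in words]
    r = pearson(log_freq, rts)
    return r * r


# Invented mean lexical decision times in milliseconds (illustration only).
toy_rts = {"the": 520, "house": 560, "walk": 575, "ledger": 640, "quaint": 690}

# Invented word counts from two hypothetical corpora (illustration only).
toy_corpus_a = {"the": 50000, "house": 900, "walk": 700, "ledger": 12, "quaint": 5}
toy_corpus_b = {"the": 60000, "house": 300, "walk": 40, "ledger": 250, "quaint": 9}

for name, counts in [("corpus A", toy_corpus_a), ("corpus B", toy_corpus_b)]:
    print(name, round(variance_explained(counts, toy_rts), 3))
```

On this criterion, the norms with the higher value would count as the better
predictor of these particular reaction times, and nothing more than that.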

Again, however, this is only one criterion (albeit an important one  
for me). It will be interesting to see how well subtitle corpora do  
for other criteria.

Returning to the original question, I would be happy to see other  
criteria brought forward for judging how representative a corpus is  
for a particular task/register.

All the best, marc

Quoting Angus Grieve-Smith <grvsmth at panix.com>:

> Reading the first paper, I think the questions raised by Marc and  
> his colleagues and the data they collect are very valuable,  
> particularly about over-reliance on Kučera and Francis (1967). I  
> just wrote last week about the importance of sampling:
>
> http://grieve-smith.com/blog/2014/01/estimating-universals-averages-and-percentages/
>
> But their judgment of corpora as "bad" based on the correlation of  
> their word frequencies with reaction times seems circular. Why take  
> reaction times as the gold standard, and conclude that  
> any deviation from reaction time data is evidence of "poor quality"?
>
> Brysbaert et al. chose one particular set of reaction times from the  
> Elexicon project (Balota et al. 2007) "because the effect of word  
> frequency is particularly strong in this task," but the effect of  
> word frequency in which corpus? How representative is the Elexicon  
> data? Specifically, how representative is it of the reaction times  
> of people who are most likely to be reading or hearing the texts in  
> each corpus?
>
> I agree with Ramesh that the problem of representativeness is a  
> difficult one, and that a truly representative corpus has not been  
> made, but not that it is impossible. Corpora are tools, and tools  
> are never "good" or "bad"; they're just good or bad for whatever  
> task you have in mind for them. Experiments are tools in the same  
> way. We just need to decide what "population" it is that we want to  
> model, and that depends on what we're studying. A corpus that is  
> representative of the language encountered by the average Wall  
> Street Journal reader (for example) will not match up well with the  
> reaction times of the average American college sophomore (for  
> example).
>
> Marc's strategies cast an interesting light on corpus design, but it  
> is reaching to interpret that as an overall judgment of the  
> "quality" of one corpus or another.
>
> On 1/22/2014 4:19 PM, Marc Brysbaert wrote:
>> We use lexical decision times (is this a word or not?) to validate  
>> word frequency measures from different types of corpora. Spoken  
>> corpora usually do not do very well, although this could be  
>> due to their small size. Subtitles seem to come closest to spoken  
>> language. You find two pointers here:
>>
>> http://crr.ugent.be/papers/Brysbaert%20&%20New%20BRM%202009%20Subtlexus.pdf
>
> -- 
> 				-Angus B. Grieve-Smith
> 				grvsmth at panix.com
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
