[Corpora-List] Testing how representative a particular corpus is

Angus Grieve-Smith grvsmth at panix.com
Sun Jan 26 14:53:44 UTC 2014


On 1/26/2014 7:50 AM, Adam Kilgarriff wrote:
> I'd say it's wildly optimistic to talk about representativeness (of 
> "general English/French/Chinese/ ... /Swahili/Luo/Welsh/ ...").  It 
> assumes you know the population to be represented.  What corners and 
> niches of the language in question do we want to include, and which 
> exclude?  We don't even know what they are yet, let alone how to 
> collect them, and how much of each we want.  The best we can do is to 
> stay open to all the varieties of a language that there are, gather 
> data for them where we can, and explore how they relate to each other. 
> That's my research agenda.

     Another thing we can do is to put off the problem of finding a 
representative sample of Language X and focus on a particular genre or 
register, where there will be less variability.  For my dissertation 
study on French negation, I focused on the language of Parisian theater, 
so it was actually the spread of change in negation *in Parisian 
theatrical French*:

https://www.academia.edu/203559/The_Spread_of_Change_in_French_Negation

     As it turned out, there was still quite a bit of variation within 
that genre, and there is still a large question as to how representative 
the plays I studied were of the overall language of the stage, but it 
was more manageable.  If I had used a corpus that aspired to be more 
general or canonical, I would have wound up with a balance of 
conventions and fads that swung widely from one century to the next.

     Restricting your corpus to a single genre can be bad for 
psycholinguistic studies, since we know for a fact that people are 
exposed to more than just the language of the stage in the course of a 
day.  But in that case, at least, the fact that only one genre is 
represented is salient, which keeps us from forgetting that we are not 
dealing with a representative sample.

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com

