[Corpora-List] Testing how representative a particular corpus is
Angus Grieve-Smith
grvsmth at panix.com
Sun Jan 26 14:53:44 UTC 2014
On 1/26/2014 7:50 AM, Adam Kilgarriff wrote:
> I'd say it's wildly optimistic to talk about representativeness (of
> "general English/French/Chinese/ .. /Swahili/Luo/Welsh/ ...") . It
> assumes you know the population to be represented. What corners and
> niches of the language in question do we want to include, and which
> exclude? We don't even know what they are yet, let alone how to
> collect them, and how much of each we want. The best we can do is to
> stay open to all the varieties of a language that there are, gather
> data for them where we can, and explore how they relate to each other.
> That's my research agenda
Another thing we can do is to put off the problem of finding a
representative sample of Language X and focus on a particular genre or
register, where there will be less variability. For my dissertation
study on French negation, I focused on the language of Parisian theater,
so what I actually studied was the spread of change in negation *in
Parisian theatrical French*:
https://www.academia.edu/203559/The_Spread_of_Change_in_French_Negation
As it turned out, there was still quite a bit of variation within
that genre, and there is still a large question as to how representative
the plays I studied were of the overall language of the stage, but it
was more manageable. If I had used a corpus that aspired to be more
general or canonical, I would have wound up with a balance of
conventions and fads that swung widely from one century to the next.
Restricting your corpus to a single genre can be bad for
psycholinguistic studies, since we know for a fact that people are
exposed to more than just the language of the stage in the course of a
day. But in that case, at least, the fact that only one genre is
represented is salient, and it keeps us from forgetting that we are not
dealing with a representative sample.
--
-Angus B. Grieve-Smith
grvsmth at panix.com