[Corpora-List] Testing how representative a particular corpus is

Chris Brew christopher.brew at gmail.com
Sun Jan 26 23:18:46 UTC 2014


I agree with much of what has been said so far. But I particularly agree
with those who are saying (or only suggesting) that 'representative' makes
more sense when it is understood as 'representative enough for the purpose
at hand', and the purpose is clearly specified. Nobody seriously claims
that the British National Corpus is much use for studying modern American
internet slang. It was not designed for that purpose, and cannot serve it,
because the sample frame is wrong. In the same way, you have to be careful
if you want to argue that traditional studies of memory (conducted on
undergraduates) have direct relevance to the treatment of cognitive
problems that manifest themselves mainly in the over-60s. I'm not saying it
can't be done, but you need to think clearly about whether the sample you
have is adequate for your purpose.


The problem, if there is a problem, is that the designers of corpora,
suitably impressed by the time, effort and expense that they have devoted
to corpus collection, frequently want to argue that they have created
something that is suitable for a very wide range of purposes, including
ones not dreamt of when the corpus was collected. And users of corpora are
frequently drawn to the hope that some existing corpus will spare them the
effort of collecting their own. That's OK if you can make a reasoned case
that the existing corpus really is suitable, but you do have to make that
case, and if the case doesn't convince, you may have to either abandon the
question or find a different corpus.


On Sun, Jan 26, 2014 at 1:51 PM, Matías Guzmán Naranjo
<mortem.dei at gmail.com> wrote:

>
> Another thing we can do is to put off the problem of finding a
>> representative sample of Language X and focus on a particular genre or
>> register, where there will be less variability.
>
>
> The problem is that we want to be able to generalize. It offers little
> insight to say that construction X is more frequent than construction Y in
> <<semi-guided interviews conducted by professional linguists, where the test
> subjects know they are being recorded>> for the 100 people you picked. We
> would like to be able to say that those results are representative of, say,
> spoken language in a particular city, or at least a formal spoken register.
> Not being able to generalize would mean that things like collocational or
> collostructional studies are meaningless for spoken corpora because they
> would only apply to that particular set of texts.
>
> Best,
>
> Matías
>
>
> 2014-01-26 Angus Grieve-Smith <grvsmth at panix.com>
>
> On 1/26/2014 7:50 AM, Adam Kilgarriff wrote:
>>
>>> I'd say it's wildly optimistic to talk about representativeness (of
>>> "general English/French/Chinese/ .. /Swahili/Luo/Welsh/ ...") .  It assumes
>>> you know the population to be represented.  What corners and niches of the
>>> language in question do we want to include, and which exclude?  We don't
>>> even know what they are yet, let alone how to collect them, and how much of
>>> each we want.  The best we can do is to stay open to all the varieties of a
>>> language that there are, gather data for them where we can, and explore how
>>> they relate to each other.  That's my research agenda.
>>>
>>
>>     Another thing we can do is to put off the problem of finding a
>> representative sample of Language X and focus on a particular genre or
>> register, where there will be less variability.  For my dissertation study
>> on French negation, I focused on the language of Parisian theater, so it
>> was actually the spread of change in negation *in Parisian theatrical
>> French*:
>>
>> https://www.academia.edu/203559/The_Spread_of_Change_in_French_Negation
>>
>>     As it turned out, there was still quite a bit of variation within
>> that genre, and there is still a large question as to how representative
>> the plays I studied were of the overall language of the stage, but it was
>> more manageable.  If I had used a corpus that aspired to be more general or
>> canonical, I would have wound up with a balance of conventions and fads
>> that swung widely from one century to the next.
>>
>>     Restricting your corpus to a single genre can be bad for
>> psycholinguistic studies, since we know for a fact that people are exposed
>> to more than just the language of the stage in the course of a day.  But
>> there at least the fact that one genre is represented is salient and keeps
>> us from forgetting that we are not dealing with a representative sample.
>>
>> --
>>                                 -Angus B. Grieve-Smith
>>                                 grvsmth at panix.com
>>
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>