[Corpora-List] Testing how representative a particular corpus is

Matías Guzmán Naranjo mortem.dei at gmail.com
Sun Jan 26 21:51:29 UTC 2014


> Another thing we can do is to put off the problem of finding a
> representative sample of Language X and focus on a particular genre or
> register, where there will be less variability.


The problem is that we want to be able to generalize. It offers little
insight to say that construction X is more frequent than construction Y in
<<semi-guided interviews conducted by professional linguists, where the test
subjects know they are being recorded>> for the 100 people you picked. We
would like to be able to say that those results are representative of, say,
the spoken language of a particular city, or at least of a formal spoken
register. Not being able to generalize would mean that collocational or
collostructional studies are meaningless for spoken corpora, because their
results would apply only to that particular set of texts.

Best,

Matías


2014-01-26 Angus Grieve-Smith <grvsmth at panix.com>

> On 1/26/2014 7:50 AM, Adam Kilgarriff wrote:
>
>> I'd say it's wildly optimistic to talk about representativeness (of
>> "general English/French/Chinese/ .. /Swahili/Luo/Welsh/ ...") .  It assumes
>> you know the population to be represented.  What corners and niches of the
>> language in question do we want to include, and which exclude?  We don't
>> even know what they are yet, let alone how to collect them, and how much of
>> each we want.  The best we can do is to stay open to all the varieties of a
>> language that there are, gather data for them where we can, and explore how
>> they relate to each other.  That's my research agenda
>>
>
>     Another thing we can do is to put off the problem of finding a
> representative sample of Language X and focus on a particular genre or
> register, where there will be less variability.  For my dissertation study
> on French negation, I focused on the language of Parisian theater, so it
> was actually the spread of change in negation *in Parisian theatrical
> French*:
>
> https://www.academia.edu/203559/The_Spread_of_Change_in_French_Negation
>
>     As it turned out, there was still quite a bit of variation within that
> genre, and there is still a large question as to how representative the
> plays I studied were of the overall language of the stage, but it was more
> manageable.  If I had used a corpus that aspired to be more general or
> canonical, I would have wound up with a balance of conventions and fads
> that swung widely from one century to the next.
>
>     Restricting your corpus to a single genre can be bad for
> psycholinguistic studies, since we know for a fact that people are exposed
> to more than just the language of the stage in the course of a day.  But
> there at least the fact that one genre is represented is salient and keeps
> us from forgetting that we are not dealing with a representative sample.
>
> --
>                                 -Angus B. Grieve-Smith
>                                 grvsmth at panix.com
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

