[Corpora-List] Testing how representative a particular corpus is

Xu Jiajin ustcxujj at gmail.com
Sun Jan 26 07:15:34 UTC 2014


Hi Kevin,



Lexical Closure is a nice idea. But representativeness is meant to be
defined by criteria at two levels: one external and the other internal.
The external criterion depends on how good a taxonomy of text
categories/genres we have, which has proved extremely difficult,
if not impossible, to formulate. Lexical Closure or Saturation (Belica
1996) concerns only the internal criterion: the breadth or coverage of
linguistic (i.e. lexical) features. The genre criterion aims at textual
heterogeneity, and the closure measure at linguistic homogeneity. This
reminds me of stratified random sampling in general statistical
sampling. Likewise, a text collection based on a genre taxonomy, plus a
snowballing lexical closure test, might lead to a more balanced corpus.
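
To make the internal criterion concrete, here is a minimal Python sketch of
one way a lexical closure check could be run: count the distinct word types
seen so far at regular token intervals, then look at how fast new types are
still arriving at the end of the corpus. The function names, the simple
letter-based tokenizer and the checkpoint size are my own assumptions for
illustration, not the procedure of any of the works cited here.

```python
import re
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: lowercase runs of letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def type_growth_curve(tokens: Iterable[str], step: int) -> List[Tuple[int, int]]:
    # Record (tokens read, distinct types seen) after every `step` tokens.
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


def new_type_rate(curve: List[Tuple[int, int]]) -> float:
    # New types per token between the last two checkpoints.
    # A rate near zero suggests the vocabulary is approaching closure.
    if len(curve) < 2:
        raise ValueError("need at least two checkpoints")
    (n0, t0), (n1, t1) = curve[-2], curve[-1]
    return (t1 - t0) / (n1 - n0)
```

In a snowballing setup one would add texts genre by genre, recompute the
curve, and stop sampling a genre once the rate falls below some chosen
threshold; what threshold counts as "closed" is a judgment call that this
sketch does not settle.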



Cf. Lexical closure (McEnery and Wilson, 2001: 173-176); Part-of-speech
closure (ibid.: 176-180); Parsing closure (ibid.: 180-183).



Jiajin



--

Jiajin XU

Ph.D., Professor

National Research Centre for Foreign Language Education

Beijing Foreign Studies University

Beijing 100089

China


On Sat, Jan 25, 2014 at 10:37 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:

> Part of the problem with determining how representative a corpus is is
> that we don't have good definitions in corpus linguistics of either
> representativeness or balance--I think that we all sort of think that we
> know them when we see them, but in looking recently at a number of
> textbooks on corpus linguistics trying to find definitions of either of
> these terms, I didn't come up with much.  This is a big difference from
> some other quantitative sciences, where the notion of representativeness
> has a reasonably clear statistical definition.
>
> My colleague Irina Temnikova and I have tried recently to come at the
> question of representativeness from the opposite angle.  We used the work
> of McEnery and Wilson on closure properties of language to build a tool
> that looks at the extent to which a corpus represents a sublanguage; if it
> doesn't look like a sublanguage at all, then we suggest that it looks like
> it's representative.  The paper will appear at LREC this year.
>
> Kev
>
>
>
> On Wed, Jan 22, 2014 at 12:51 PM, Matías Guzmán Naranjo <
> mortem.dei at gmail.com> wrote:
>
>> Dear all,
>>
>> A colleague (not involved in corpus linguistics) expressed his concerns to
>> me about corpus linguistics, mainly that he thinks oral corpora are not
>> really representative of spoken language, and that results of
>> investigations that use oral corpora are therefore not reliable reflections
>> of the wider picture of how people speak and use language. My question is
>> whether there have been studies of how representative, say, phone
>> recordings or semi-guided interviews are of actual spoken language.
>>
>> I use oral corpora for my work but just assume that semi-guided
>> interviews are somewhat representative of spoken language outside
>> semi-guided interviews, and that the results do generalize to some degree
>> to other situations, but I had never really thought about testing this
>> assumption.
>>
>> Best,
>>
>> Matías Guzmán Naranjo
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
>
> --
> Kevin Bretonnel Cohen, PhD
> Biomedical Text Mining Group Lead, Computational Bioscience Program,
> U. Colorado School of Medicine
> 303-916-2417 (cell) 303-377-9194 (home)
> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>