[Corpora-List] Testing how representative a particular corpus is

Alexander Osherenko osherenko at gmx.de
Sun Jan 26 13:54:22 UTC 2014


Sure, it is difficult to talk about concrete varieties of a language to
include or exclude, because they can differ greatly. To reduce
complexity, it would be more appropriate to talk about particular
categories of varieties, for example grammatical ones.

--
Alexander Osherenko, Dr. rer. nat.
Humboldt Innovation <http://www.humboldt-innovation.de/>
Humboldt-Universität zu Berlin
Senior HCI architect

Socioware Development <http://www.socioware.de/>
Founder and R&D


2014-01-26 Adam Kilgarriff <adam at lexmasterclass.com>

> I'd say it's wildly optimistic to talk about representativeness (of
> "general English/French/Chinese/.../Swahili/Luo/Welsh/...").  It assumes
> you know the population to be represented.  What corners and niches of the
> language in question do we want to include, and which exclude?  We don't
> even know what they are yet, let alone how to collect them, and how much of
> each we want.  The best we can do is to stay open to all the varieties of a
> language that there are, gather data for them where we can, and explore how
> they relate to each other.  That's my research agenda.
>
> Adam
>
>
> On 26 January 2014 07:15, Xu Jiajin <ustcxujj at gmail.com> wrote:
>
>> Hi Kevin,
>>
>>
>>
>> Lexical Closure is a nice idea. But representativeness is meant to be
>> defined by criteria at two levels: one external and the other internal.
>> The external criterion depends on how good a taxonomy of text
>> categories/genres we have, which has proved extremely difficult,
>> if not impossible, to formulate. Lexical Closure or Saturation (Belica
>> 1996) only concerns the internal criterion of the breadth or coverage of
>> linguistic (i.e. lexical) features. The genre criterion aims at textual
>> heterogeneity, and the closure measure at linguistic homogeneity. At this
>> point, I'm reminded of stratified random sampling in general statistical
>> sampling. Likewise, a text collection based on a genre taxonomy, plus a
>> snowballing lexical closure test, might lead to a more balanced corpus.
>>
>>
>>
>> Cf. Lexical closure (McEnery and Wilson, 2001: 173-176); Part-of-speech
>> closure (ibid.: 176-180); Parsing closure (ibid.: 180-183).
>>
>>
>>
>> Jiajin
>>
>>
>>
>> --
>>
>> Jiajin XU
>>
>> Ph.D., Professor
>>
>> National Research Centre for Foreign Language Education
>>
>> Beijing Foreign Studies University
>>
>> Beijing 100089
>>
>> China
>>
>>
>> On Sat, Jan 25, 2014 at 10:37 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:
>>
>>> Part of the problem with determining how representative a corpus is is
>>> that we don't have good definitions in corpus linguistics of either
>>> representativeness or balance--I think that we all sort of think that we
>>> know them when we see them, but in looking recently at a number of
>>> textbooks on corpus linguistics trying to find definitions of either of
>>> these terms, I didn't come up with much.  This is a big difference from
>>> some other quantitative sciences, where the notion of representativeness
>>> has a reasonably clear statistical definition.
>>>
>>> My colleague Irina Temnikova and I have tried recently to come at the
>>> question of representativeness from the opposite angle.  We used the work
>>> of McEnery and Wilson on closure properties of language to build a tool
>>> that looks at the extent to which a corpus represents a sublanguage; if it
>>> doesn't look like a sublanguage at all, then we suggest that it looks like
>>> it's representative.  The paper will appear at LREC this year.
>>>
>>> Kev
>>>
>>>
>>>
>>> On Wed, Jan 22, 2014 at 12:51 PM, Matías Guzmán Naranjo <
>>> mortem.dei at gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> A colleague (not involved in corpus linguistics) expressed his concerns
>>>> to me about corpus linguistics, mainly that he thought oral
>>>> corpora are not really representative of spoken language, and that
>>>> results of investigations that use oral corpora are thus not really
>>>> reliable as reflecting the wider picture of how people speak and use
>>>> language. My question is whether there have been studies of how
>>>> representative, say, phone recordings or semi-guided interviews are of
>>>> actual spoken language.
>>>>
>>>> I use oral corpora for my work but just assume that semi-guided
>>>> interviews are somewhat representative of spoken language outside
>>>> semi-guided interviews, and that the results do generalize to some degree
>>>> to other situations, but I had never really thought about testing this
>>>> assumption.
>>>>
>>>> Best,
>>>>
>>>> Matías Guzmán Naranjo
>>>>
>>>> _______________________________________________
>>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>>
>>>
>>>
>>> --
>>> Kevin Bretonnel Cohen, PhD
>>> Biomedical Text Mining Group Lead, Computational Bioscience Program,
>>> U. Colorado School of Medicine
>>> 303-916-2417 (cell) 303-377-9194 (home)
>>> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>>>
>>>
>>
>>
>>
>
>
> --
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
>
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
> *DANTE: a lexical database for English* <http://www.webdante.com>
> ========================================
>
>
>
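For readers who want to try the lexical closure idea mentioned in the
thread (McEnery and Wilson, 2001), here is a rough sketch in Python. It
is not the tool Kevin describes and not the procedure from the book; it
only tracks how the number of distinct word types grows as more tokens
are read. The function names, the lowercasing step, and the sampling
interval `step` are all assumptions made for illustration. A curve that
flattens suggests the vocabulary is tending towards closure (more
sublanguage-like); a curve that keeps climbing suggests it is not.

```python
from typing import Iterable, List, Tuple


def type_growth_curve(tokens: Iterable[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens.

    Hypothetical helper: tokens are lowercased before counting, which is
    an assumption and not a requirement of the closure idea itself.
    """
    seen = set()
    curve = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record a final point when the corpus length is not a multiple of step.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def new_type_rate(curve: List[Tuple[int, int]]) -> float:
    """New types per token over the last sampled segment of the curve.

    Values near 0 mean almost no new vocabulary is appearing; values
    near 1 mean nearly every token is a new type.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points on the curve")
    (t0, v0), (t1, v1) = curve[-2], curve[-1]
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Toy data only: a tiny repetitive "corpus" versus one with no repeats.
    closed = "a b c a b c a b c a b c".split()
    open_ended = ["w%d" % i for i in range(12)]
    print(new_type_rate(type_growth_curve(closed, step=3)))
    print(new_type_rate(type_growth_curve(open_ended, step=3)))
```

What rate counts as "closed" is a judgment call that this sketch does
not settle, and the same curve could be computed over part-of-speech
tags or parse rules instead of word forms.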