[Corpora-List] Testing how representative a particular corpus is

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Thu Jan 23 15:20:51 UTC 2014


Hi Matías
#1 whenever we have no clue about how large the population is (eg 'spoken language'),
or how it is composed, it is impossible to discuss how representative a sample (corpus) is
#2 on the whole, it is best to describe the corpus as accurately as possible,
and initially make statements only about the corpus data; we can then suggest
that some of the findings might be more generally valid, with specific indicators
about the limitations of one's own corpus, known before the research, and others
that may have come to light during the research
#3 we can compare our findings with other samples/corpora, and see which features
are confirmed in similar samples, and which are found in different/larger/broader samples, and try
again to see if a statement can be at all extended
#4 all spoken data corpora are likely to contain some elements/categories of elements
common to all spoken language (in order to use the term 'spoken' for both),
but this is unlikely to be true the other way round...
#5 I know www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf
compares corpora in general... but don't know of any test criteria or procedures,
and would revert to impossibility in terms of 'representativeness' (#1)
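
As a rough illustration of the comparison described in #3 and #5, the sketch below checks how far two corpora agree on the relative ranking of their most frequent words, using a Spearman rank correlation. This is only one simple option and is not claimed to be the procedure in the paper linked above; the function names, the `top_n` cutoff and the toy token lists are assumptions made for the example. A high score says the two samples resemble each other, not that either is representative of the population (#1).

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks for values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns None if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def frequency_rank_similarity(tokens_a, tokens_b, top_n=500):
    """Spearman correlation of relative frequencies over the top_n joint words.

    top_n is an arbitrary cutoff chosen for this sketch.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    words = [w for w, _ in (freq_a + freq_b).most_common(top_n)]
    rel_a = [freq_a[w] / len(tokens_a) for w in words]
    rel_b = [freq_b[w] / len(tokens_b) for w in words]
    return pearson(average_ranks(rel_a), average_ranks(rel_b))


# Toy token lists, invented for illustration only.
interviews = "well you know I mean you know it was well fine".split()
phone_calls = "you know well I was fine you know I mean it".split()
print(frequency_rank_similarity(interviews, phone_calls))
```
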

hope this helps
ramesh
---------------------------------------------
Message: 7
Date: Wed, 22 Jan 2014 20:51:46 +0100
From: Matías Guzmán Naranjo <mortem.dei at gmail.com>
Subject: [Corpora-List] Testing how representative a particular corpus
        is
To: "corpora at uib.no" <corpora at uib.no>
Dear all,
A colleague (not involved in corpus linguistics) expressed his concerns to me
about corpus linguistics, mainly that he thought oral corpora are
not really representative of spoken language, and that results of
investigations that use oral corpora are therefore not really reliable as
reflecting the wider picture of how people speak and use language. My question
is whether there have been studies of how representative, say,
phone recordings or semi-guided interviews are of actual spoken language.

I use oral corpora for my work but just assume that semi-guided interviews
are somewhat representative of spoken language outside semi-guided
interviews, and that the results do generalize to some degree to the rest
of situations, but I had never really thought about testing this assumption.
Best,
Matías Guzmán Naranjo
----------------------------------------
Message: 8
Date: Wed, 22 Jan 2014 22:19:15 +0100
From: Marc Brysbaert <Marc.Brysbaert at UGent.be>
Subject: Re: [Corpora-List] Testing how representative a particular
        corpus is
To: corpora at uib.no

We use lexical decision times (is this a word or not?) to validate
word frequency measures from different types of corpora. Usually
spoken corpora do not do particularly well, although this could be
due to their small size. Subtitles seem to come closest to spoken
language. You find two pointers here:
http://crr.ugent.be/papers/Brysbaert%20&%20New%20BRM%202009%20Subtlexus.pdf
http://crr.ugent.be/archives/1423
or on our website:
http://crr.ugent.be
Best, mb
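
A minimal sketch of the kind of validation described above, assuming it is summarised as the share of variance in lexical decision times explained by log word frequency. The log10(count + 1) transform, the function name and all numbers below are assumptions invented for the example; they are not data or code from the linked papers.

```python
import math


def variance_explained(freq_counts, reaction_times):
    """R^2 between log10(count + 1) and mean lexical decision time per word.

    Words absent from the corpus are given a count of 0.
    Returns None if either variable has no variance.
    """
    words = sorted(reaction_times)
    x = [math.log10(freq_counts.get(w, 0) + 1) for w in words]
    y = [reaction_times[w] for w in words]
    n = len(words)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov * cov / (var_x * var_y)


# Hypothetical mean reaction times in ms and hypothetical corpus counts.
rts = {"house": 540, "table": 565, "gravel": 640, "plinth": 720}
subtitle_counts = {"house": 5200, "table": 2100, "gravel": 90, "plinth": 4}
spoken_counts = {"house": 310, "table": 45, "gravel": 60}

for name, counts in [("subtitles", subtitle_counts), ("spoken", spoken_counts)]:
    print(name, variance_explained(counts, rts))
```

Under this reading, the corpus whose frequencies explain more of the variance would be taken as the better frequency norm for those words, with the caveat about corpus size mentioned above still applying.
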
----------------------------------------------