[Corpora-List] Testing how representative a particular corpus is

Mon Jan 27 10:18:30 UTC 2014

--------------------------
Message: 1
Date: Thu, 23 Jan 2014 15:20:51 +0000
From: "Krishnamurthy, Ramesh" <r.krishnamurthy at aston.ac.uk>
Subject: [Corpora-List] Testing how representative a particular corpus
        is
To: "mortem.dei at gmail.com" <mortem.dei at gmail.com>
Cc: "corpora at uib.no" <corpora at uib.no>
Hi Matías
#1 whenever we have no clue about how large the population is (eg 'spoken language'),
or how it is composed, it is impossible to discuss how representative a sample (corpus) is
#2 on the whole, it is best to describe the corpus as accurately as possible,
and initially make statements only about the corpus data; we can then suggest
that some of the findings might be more generally valid, with specific indicators
about the limitations of one's own corpus, known before the research, and others
that may have come to light during the research
#3 we can compare our findings with other samples/corpora, and see which features
are confirmed in similar samples, and which are found in different/larger/broader samples, and try
again to see if a statement can be at all extended
#4 all spoken data corpora are likely to contain some elements/categories of elements
common to all spoken language (in order to use the term 'spoken' for both),
but this is unlikely to be true the other way round...
#5 I know http://www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf
compares corpora in general... but don't know of any test criteria or procedures,
and would revert to impossibility in terms of 'representativeness' (#1)
hope this helps
ramesh
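
A minimal sketch of the kind of comparison suggested in #3: rank-correlating the relative frequencies of the most frequent words in two samples. This is only an illustration, not a test of representativeness and not necessarily the procedure used in the paper cited in #5. The function names, the `top_n` cutoff, and the choice of Spearman's rank correlation are all assumptions made for this example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("ranks are constant; correlation is undefined")
    return cov / (vx * vy) ** 0.5


def corpus_similarity(tokens_a, tokens_b, top_n=500):
    """Rank-correlate relative frequencies of the top_n words of the joint corpus.

    tokens_a and tokens_b are lists of word tokens (hypothetical input format).
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    joint = counts_a + counts_b
    words = [w for w, _ in joint.most_common(top_n)]
    freq_a = [counts_a[w] / len(tokens_a) for w in words]
    freq_b = [counts_b[w] / len(tokens_b) for w in words]
    return spearman(freq_a, freq_b)
```

A value near 1 says only that the two samples rank their common vocabulary similarly; it says nothing about how either relates to an unknown population (#1).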
----------------------------

Message: 7
Date: Wed, 22 Jan 2014 20:51:46 +0100
From: Matías Guzmán Naranjo <mortem.dei at gmail.com>
Subject: [Corpora-List] Testing how representative a particular corpus
        is
To: "corpora at uib.no" <corpora at uib.no>
Dear all,
A colleague (not involved in corpus linguistics) expressed his concerns to me
about corpus linguistics, mainly that he thinks oral corpora are not really
representative of spoken language, and that results of investigations that
use oral corpora are therefore not reliable as reflections of the wider
picture of how people speak and use language. My question is whether there
have been studies of how representative, say, phone recordings or
semi-guided interviews are of actual spoken language.

I use oral corpora for my work but just assume that semi-guided interviews
are somewhat representative of spoken language outside semi-guided
interviews, and that the results do generalize to some degree to the rest
of situations, but I had never really thought about testing this assumption.
Best,
Matías Guzmán Naranjo
----------------------------------------
Message: 8
Date: Wed, 22 Jan 2014 22:19:15 +0100
From: Marc Brysbaert <Marc.Brysbaert at UGent.be>
Subject: Re: [Corpora-List] Testing how representative a particular
        corpus is
To: corpora at uib.no
We use lexical decision times (is this a word or not?) to validate
word frequency measures from different types of corpora. Usually
spoken corpora are not doing extremely well, although this could be
due to their small size. Subtitles seem to come closest to spoken
language. You find two pointers here:
http://crr.ugent.be/papers/Brysbaert%20&%20New%20BRM%202009%20Subtlexus.pdf
http://crr.ugent.be/archives/1423
or on our website:
http://crr.ugent.be
Best, mb
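
A minimal sketch of the validation idea described above: for each corpus, correlate log word frequency with mean lexical decision time and compare the variance explained. The function names, the log10(frequency per million + 1) transform, and the toy numbers are assumptions made for illustration; they are invented and are not taken from the papers linked above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("one sequence is constant; correlation is undefined")
    return cov / math.sqrt(vx * vy)


def variance_explained(freq_counts, corpus_size, reaction_times):
    """R^2 between log frequency and mean reaction time.

    freq_counts: word -> raw count in the corpus (hypothetical input format)
    corpus_size: total number of tokens in that corpus
    reaction_times: word -> mean lexical decision time in ms
    Words missing from the corpus are treated as having count 0.
    """
    words = sorted(reaction_times)
    log_freq = [
        math.log10(freq_counts.get(w, 0) * 1_000_000 / corpus_size + 1)
        for w in words
    ]
    rts = [reaction_times[w] for w in words]
    r = pearson(log_freq, rts)
    return r * r


# Invented toy data, for illustration only.
toy_rts = {"the": 520, "house": 560, "corpus": 640, "zygote": 720}
toy_corpus_a = {"the": 60000, "house": 400, "corpus": 10, "zygote": 1}
toy_corpus_b = {"the": 5000, "house": 5, "corpus": 300, "zygote": 50}

r2_a = variance_explained(toy_corpus_a, 1_000_000, toy_rts)
r2_b = variance_explained(toy_corpus_b, 100_000, toy_rts)
print(f"corpus A: R^2 = {r2_a:.2f}; corpus B: R^2 = {r2_b:.2f}")
```

On this criterion, the corpus whose frequencies explain more of the variance in reaction times would be preferred. It is one criterion among several, and it depends on whose reaction times are measured.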
-----------------------------------
Message: 4
Date: Thu, 23 Jan 2014 13:57:03 -0500
From: Angus Grieve-Smith <grvsmth at panix.com>
Subject: Re: [Corpora-List] Testing how representative a particular
        corpus is
To: corpora at uib.no

Reading the first paper, I think the questions raised by Marc and his
colleagues and the data they collect are very valuable, particularly
about over-reliance on Kučera and Francis (1967). I just wrote last week
about the importance of sampling:

http://grieve-smith.com/blog/2014/01/estimating-universals-averages-and-percentages/

But their judgment of corpora as "bad" based on the correlation of their
word frequencies with reaction times seems circular. Why take Thorndike
and Lorge (1944) as the gold standard, and conclude that any deviance
from reaction time data is evidence of "poor quality"?

Brysbaert et al. chose one particular set of reaction times from the
Elexicon project (Balota et al. 2007) "because the effect of word
frequency is particularly strong in this task," but the effect of word
frequency in which corpus? How representative is the Elexicon data?
Specifically, how representative is it of the reaction times of people
who are most likely to be reading or hearing the texts in each corpus?

I agree with Ramesh that the problem of representativeness is a
difficult one, and that a truly representative corpus has not been made,
but not that it is impossible. Corpora are tools, and tools are never
"good" or "bad," they're just good or bad for whatever task you have in
mind for them. Experiments are tools in the same way. We just need to
decide what "population" it is that we want to model, and that depends
on what we're studying. A corpus that is representative of the language
encountered by the average Wall Street Journal reader (for example) will
not match up well with the reaction times of the average American
college sophomore (for example).

Marc's strategies cast an interesting light on corpus design, but it is
reaching to interpret that as an overall judgment of the "quality" of
one corpus or another.

  -Angus B. Grieve-Smith
                                grvsmth at panix.com
-------------

Message: 5
Date: Thu, 23 Jan 2014 20:34:08 +0100
From: Marc Brysbaert <Marc.Brysbaert at UGent.be>
Subject: Re: [Corpora-List] Testing how representative a particular
        corpus is
To: corpora at uib.no

Dear Angus,

I fully agree with you that validation on the basis of lexical
decision data is only one criterion and that this criterion is more
important for psycholinguists like me, interested in word recognition,
than for other researchers. For instance, I can imagine that if one is
interested in language variation, film subtitles might be a nightmare
(though this in itself may be an interesting research question).

As for the reliance on the English Lexicon Project, I agree with you.
This is why we have invested heavily in the British Lexicon Project,
the French Lexicon Project, and the Dutch Lexicon Projects I and II. I
can reassure you: for each language we tested, the subtitle
frequencies did best. For English they also do way better than the
Google frequencies, as I have shown in a Frontiers article. Probably
the most decisive evidence for me was the finding that British
subtitle frequencies did better than the BNC frequencies.

Again, however, this is only one criterion (albeit an important one
for me). It will be interesting to see how well subtitle corpora do
for other criteria.

Returning to the original question, I would be happy with other
criteria brought forward about how representative a corpus is for a
particular task/register.

All the best, marc
