Corpora: Corpus representativeness: A "summary" of the query

Sampo Nevalainen samponev at cc.joensuu.fi
Thu Aug 30 13:07:03 UTC 2001


First, I want to apologize for not writing this summary sooner. I sent my 
query to the corpus list in November 2000, that is, almost a year ago! I 
have been busy with other things, but I admit that the longish delay is 
partly due to my sloppiness. I would like to thank all those nice people 
who used (wasted?) their valuable time to answer my questions. Although I 
did not get many answers, they were all very interesting and important for 
me. I am grateful to the following people who kindly assisted me (in 
alphabetical order; no preference ;-)):

Eric Atwell
Eleanor Batchelder
Pascual Cantos
Florence Duclaye
Bill Fisher
Shlomo Izre'el
Ramesh Krishnamurthy
Petek Kurtboke
Uta Lausberg de Morales
Geoffrey Williams

I apologize for my unintentional negligence if I have failed to mention
someone I should have. Since some of the respondents wished to remain
anonymous, I shall generally not refer to the author in the following
compilation of e-mails, even though I make use of direct quotations.
(Consequently, it is (un)fortunately pretty easy for the people involved to
deduce "who said what"...) However, if you are interested in particular
citations, you may ask me for the author to be contacted for further
information, but only if (s)he did not wish to remain anonymous.

I stress that the ideas presented below are not my personal thoughts
(although I mostly agree with them). In general, the respondents seem
to have a fairly good consensus on what representativeness is or SHOULD be
in corpus linguistics, but as we all know, practice often differs from
theory. Unsurprisingly, we'll see that there are several approaches to this
issue, depending on the field of interest. While summing up the answers I
have received so far, I am still willing to hear from people who have (any
kind of) ideas about representativeness in corpus linguistics. (hint hint ;-))
Feel free to contact me.

The "summary" (read: a messy compilation of citations) is divided into 
three parts:
1) Towards the concept of representativeness
	- short citations about representativeness as a concept
2) Considerations and methods in the pursuit of representativeness
	- some general questions arising from the material
	- longer citations, for those who want more context :-)
3) References and links

Clarifying additions are presented in [square brackets], while (...) indicates
that some text fragments were left out. Note that some of the citations in
the first part are also presented in the more extensive citations of the
second part to ensure readability.

----------------------------------------------------------------------------------------------------------------

1. TOWARDS THE CONCEPT OF REPRESENTATIVENESS


" (
) "representativeness" depends on the application, there can be no such 
thing as a generically representative corpus."

"We don't tackle the issue of representativeness directly but via 
predictability."

"What is the corpus to be "representative" of?"

"Representativeness depends on the purpose of the corpus."

"For me, representativeness is without compromise: it includes sampling of 
both demographic varieties and contextual varieties."

" "Representativeness" to me in that arena [speech recognition evaluation] 
means "How well is the test set represented by the training set?" "

"The Brown corpus (1960s, Kucera and Francis) seems to be generally 
considered to be a "representative" corpus (...)"

"A lot depends on your corpus, if you are building a reference corpus then 
you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of 
different genre. If (...) you are concerned with special languages then you
must change your criteria. (...) --- This is still not really representative,
personally I don't believe that really exists. We replace this by
justification."

"Representativeness of a corpus implies that you are working on a 
particular theme, and you are trying to give people a general overview of 
it. (...) the keywords behind representativeness are: main subjects of a
theme, brief information on these subjects, and links to know more if
desired. (...) a representative corpus must remain as neutral as possible, so
that the readers get an objective point of view of the subject. Or, if the
theme requires giving an opinion, then it should give all the opinions
existing on the same subject."


2. CONSIDERATIONS AND METHODS IN THE PURSUIT OF REPRESENTATIVENESS


general questions:
- what is the corpus to be "representative" of?
- how to measure representativeness?
- how to define the structure of the corpus (categories of texts)?
- what about variety? should we use language "production" or "consumption" 
as a criterion? how to judge "correctness" and "incorrectness"? is 
"vintage" a matter of date of production or date of consumption? what is 
the relationship between "ideal" and "actual"?
- how to ensure comparability?


" (
) "representativeness" depends on the application, there can be no such 
thing as a generically representative corpus. (
) for this [grammatical 
analysis and part-of-speech tagging], the genre of the text is less 
important than for, say, dialog-act modelling, since grammar varies less 
between genres (
). On the other hand, if every researcher is free to 
select their own "representative" text-set for their own application, how 
can we comparatively evaluate across research grounded on different 
corpora? --- (
) The original taggers for LOB, UPenn, ICE etc 
corpus-annotation schemes started from different "representative" corpora, 
so accuracy rates reported by these projects, in terms of their own 
"representative" corpora, may not be directly comparable."

" "Representativeness" to me in that arena [speech recognition evaluation] 
means "How well is the test set represented by the training set?".  (The 
usual paradigm is for a large sample of transcribed speech to be made 
available to sites being evaluated, for their use in automatically training 
their recognizers; then a smaller sample of similar material is presented 
to their recognizers for a test and the output hypothesized by the 
recognizers is scored against human-derived reference 
transcriptions.)  It's widely regarded as an unfair test if the test data 
is not represented well by the training data. --- When the training set is 
explicitly defined, the representativeness of the test set can be estimated
pretty well by the perplexity of the test set relative to a
statistical language model derived solely from the training set.  (...)"
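
To make the measure in the citation above concrete, here is a minimal Python
sketch of test-set perplexity relative to a model built only from the
training set. It assumes the simplest possible language model, a unigram
model with add-one smoothing, which the respondent did not specify; the
function names train_unigram and perplexity are hypothetical. A lower value
suggests that the test set is better represented by the training set.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count word frequencies in the training set."""
    return Counter(tokens)


def perplexity(model, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    total = sum(model.values())
    vocab = len(model) + 1  # one extra slot shared by all unseen words
    log_prob = 0.0
    for tok in test_tokens:
        p = (model[tok] + 1) / (total + vocab)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))


# Toy data, invented purely for illustration.
training = "the cat sat on the mat and the dog sat on the rug".split()
similar = "the dog sat on the mat".split()
different = "perplexity measures statistical surprise".split()

model = train_unigram(training)
print("similar test set:  ", perplexity(model, similar))
print("different test set:", perplexity(model, different))
```

The test set that shares its vocabulary with the training data gets the lower
perplexity of the two.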

"Last year I worked on the question of whether two test sets drawn from 
telephone speech recorded at different times were equally difficult for 
recognizers to recognize.  Since the training data was not a specific set, 
I tried to get at it by assuming that one factor of difficulty was the 
homogeneity of the test set; that is, a set of utterances that are more 
alike is inherently easier to recognize. This follows, I think, if you 
assume that the training data is drawn from a sample space typified by the 
test set.  I then estimated the homogeneity of each test set by averaging 
the results of a number of randomized experiments, in each of which I 
measured the representativeness of a randomly-chosen tenth of the utterance 
relative to the rest, computing representativeness as the perplexity of the 
chosen utterances using an ngram language model trained up solely on the 
other nine-tenths of the utterances.  In other words, homogeneity = average 
representativeness of one fraction of the set relative to the other.  I 
made scripts and programs to do these calculations, but the project kind of 
bogged down at that point because the actual test results, which I would 
have used to validate my method, were in fact produced by sites all using 
the same arbitrary language model rather than ones trained up on different 
training sets. Also, I discovered that my work had been foreshadowed by 
Adam Kilgarriff and Tony Rose: check out their paper "Measures for Corpus 
Similarity and Homogeneity". "

"The Brown corpus (1960s, Kucera and Francis) seems to be generally 
considered to be a "representative" corpus, and LOB, SEU and ICE corpora 
are designed in a very similar way: the corpus consists of 500 texts of 
2000 words each (to make a 1 million word corpus). 300 spoken and 200 
written texts. Spoken consists of 180 Dialogue texts and 120 Monologues. 
Written consists of 150 Printed and 50 Non-printed texts. Each of these 
categories is then subdivided, and so on. My objections to this "a priori" 
design are: a) some categories of texts are very difficult to obtain (e.g. 
business transactions, because of commercial confidentiality) b) many 
categories of texts are omitted (e.g. email) c) there is no justification 
for the proportions: I do not know of any sociolinguistic research which 
says that the average person consumes/produces 3/5 spoken language and 2/5 
written language (just to take the first main categorial division). The 
proportions for sub-categories are even more questionable."
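
The arithmetic behind this "a priori" design can be checked with a short
Python sketch. The counts are those quoted above; the dictionary layout and
the names DESIGN, WORDS_PER_TEXT and texts_in are only illustrative
assumptions.

```python
WORDS_PER_TEXT = 2000

# Top-level sampling frame with the text counts quoted above.
DESIGN = {
    "spoken": {"dialogue": 180, "monologue": 120},
    "written": {"printed": 150, "non-printed": 50},
}


def texts_in(category):
    """Number of texts in one top-level category."""
    return sum(DESIGN[category].values())


total_texts = sum(texts_in(c) for c in DESIGN)
total_words = total_texts * WORDS_PER_TEXT

for category in DESIGN:
    share = texts_in(category) / total_texts
    print(f"{category}: {texts_in(category)} texts, {share:.0%} of the corpus")
print(f"total: {total_texts} texts, {total_words} words")
```

The spoken share comes out at 3/5 and the written share at 2/5, which are the
proportions the respondent calls unjustified.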

- "What is the corpus to be "representative" of? Current estimates 
(Crystal, British Council, etc) suggest there are 1500 million speakers of 
English, 750m EFL speakers/users, 350m ESL, and 350m "native-speakers". 
Should a corpus of "contemporary English" include all of these? 
Representativeness depends on the purpose of the corpus. If we want to know 
what "modern English" is like, we should certainly include all types of 
speakers/users."
- "What about "variety"? Some Thai users of English may favour American 
English, others British English, others Australian English. Most probably 
use a mixture."
- "Should language "production" or "consumption" be the criterion? Most of 
us consume more than we produce in an average day, I suspect."
- " "Correct" and "Incorrect": how are we to judge? Should this be a 
criterion? (Certainly it is for EFL dictionary compilers: what models of 
English should we be a) describing and b) recommending?)"
- " "Vintage": if we are collecting a corpus of "modern English", when does 
"modern" begin? Some texts written a long time ago are still popular (on 
reading lists, or e.g. Agatha Christie crime thrillers, P.G. Wodehouse, 
etc) - again, is it a matter of date of production or date of consumption?"
- " "Ideal" vs "Actual": 50% of humans are men, 50% women. But what is the 
ratio of published books, newspaper articles, broadcast items, etc? Are men 
and women equally disseminated? I suspect not. So should the corpus reflect 
the actual reality/inequality, or the ideal? The former may reinforce 
stereotypes, the latter may just create new ones."

"(
) If 1500 million people are using English every day, how can we ever 
capture more than an infinitesimal sample? Cobuild's Bank of English corpus 
is now 418 million words, and various people (Stubbs, Church and Lieberman, 
Gottlieb) have tried to estimate the amount of language an average human 
experiences in a lifetime, and end up with figures around the 500 million 
word mark. --- These are just a few of the problems relating to 
"representativeness" (
). But I have only been thinking of "modern 
English", not diachronic, not other languages, and the corpus only as 
written (
), not as audio or even video data (
) - because as linguists we 
ought to deal with pronunciation, intonation, etc and also with 
extra-linguistic aspects such as gesture (
) and who or what we are looking 
at when we're speaking, etc."

"A lot depends on your corpus, if you are building a reference corpus then 
you have to follow Atkins & Clear, Biber etc to have 'balanced' samples of 
different genre. If, like me, you are concerned with special languages then 
you must change your criteria. I have always thrown out the idea of 
sublanguages as defined by Harris, and used in much NLP and AI research.
This is a generative approach, and like all generative approaches tends to
ignore reality. The classical sublanguage approach views science languages
as realisations of bibliographical systems, such as Dewey. They go deeper
into the Dewey system and then try to justify boundaries that delimit one
group from another. This is not very useful (...) in that they ignore
multidisciplinarity which is the basis of all research, for instance in
medicine you call upon biology, chemistry, statistics, if you remove all of
these you have nothing left. (...) Outside of humans, language does not
exist, there is no Platonic cave of reality out there. --- If language is
essentially human, it would seem more intelligent to approach
representativeness from the point of view of the language users, anathema
to a generative linguist. To do this rather than think in terms of
disciplines we think in terms of discourse communities and define
representative in terms of the language they produce. This is still not
really representative, personally I don't believe that really exists. We
replace this by justification."


3. REFERENCES AND LINKS


"Check the archive of corpora-list, as I'm sure, as you yourself state, 
that this topic has been discussed before. Biber, Biber and Finegan, Leech, 
Sinclair, Stubbs, Atkins and Clear and Ostler, and many others have 
certainly written about this topic."

"for representativeness of oral corpora you can read introduction books to 
quantitative sociology, as well as literature about Latin American language 
atlases. Next year I [Dr. Uta Lausberg de Morales] will publish an article 
in the journal "neue romania" (Berlin) about an oral corpus of Guatemalan 
Spanish, and there I will allude to the problem of representativeness as 
well (the article will be in German)."

Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. 
2000. A comparative evaluation of modern English corpus grammatical 
annotation schemes. ICAME Journal, volume 24, pages 7-23, International 
Computer Archive of Modern and Medieval English, HIT Centre, Bergen 
University. ISSN: 0801-5775

Bowker, L. Towards a methodology for exploiting specialised target language 
corpora as translation resources. International Journal of Corpus 
Linguistics. Vol.5/1: 17-52.

Aquilino Sánchez and Pascual Cantos (1997) "Predictability of Word Forms 
(Types) and Lemmas in Linguistic Corpora. A Case Study Based on the 
Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary 
Spanish". International Journal of Corpus Linguistics 2/2: 259-280. (See 
abstract http://solaris3.ids-mannheim.de/~ijcl/ijcl-2-2.html).

Sánchez, A. and P. Cantos (1998) "El ritmo incremental de palabras nuevas 
en los repertorios de textos. Estudio experimental y comparativo basado en 
dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las 
lenguas inglesa y española y en cinco autores de ambas lenguas". ATLANTIS, 
19/2: 205-223.

Meyer, I., Mackintosh, K., The Corpus from a Terminographer's viewpoint. 
International Journal of Corpus Linguistics. Vol.1/2: 257-285.

Williams, G. 1998. Collocational Networks: Interlocking Patterns of Lexis 
in a Corpus of Plant Biology Research Articles. International Journal of 
Corpus Linguistics. Vol.3/1: 151-171.

Williams, G. 1999. Looking in before looking out: Internal selection 
criteria in a corpus of plant biology. Papers in Computational 
Lexicography. Complex '99. Hungary: Budapest.: 195-204.

Yang, Dan-Hee, Cantos, P. and Song, Mansuk (2000) "An Algorithm for 
Predicting the Relationship between Lemmas and Corpus Size", ETRI Journal, 
22/2: 20-31 (http://etlars.etri.re.kr/etrij/index.html)

The Corpus of Spoken Israeli Hebrew:
http://spinoza.tau.ac.il/hci/dep/semitic/maamad.html (Hebrew text)
http://spinoza.tau.ac.il/hci/dep/semitic/cosih.html (English text)

Have a look at
http://www.vicnet.net.au/~petek/thesis/

Try the archives at http://www.hit.uib.no/corpora/


( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel 	+358-15-511 70	    (operator)
	+358-15-511 7704
fax	+358-15-515 096
email	samponev at cc.joensuu.fi


