Corpora: when does a subcorpus become a corpus
    P. Kaszubski 
    przemka at amu.edu.pl
       
    Sat Jan  5 00:47:24 UTC 2002
    
    
  
On 4 Jan 2002, at 11:58, Sampo Nevalainen wrote:
> height of human beings on the basis of a basketball team. The problem with
> language is that exceptions are often not evident and not easily detected
> since there is no clear "reference set" for language. In principle, if
> your findings are truly generalizable you should get similar results from
> any corpus, although there is obviously more "noise" in more "general"
> corpora. Am I right? Or am I pedant? Or both. ( About the "Terms in
I think the similarity of results is deeply affected by corpus size and
Zipfian distribution. Some interesting features will only show up
when the (sub)corpora compared are large enough, and this is in
turn dependent on the composition of the "general corpus" from
which you may have retrieved them (if you have done so). Now
we're back to the issue of how large a corpus, or subcorpus, or
special corpus, should be in order to be representative not just of a
given genre/variety etc. but also of the linguistic feature(s)
investigated. Are 5 occurrences (in a million or fewer running words)
enough? This is yet another reason to conclude that, in order to
study something in a corpus-based (or corpus-driven) manner, you
first need to define that "something" clearly and lay down your
purpose.
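
To put rough numbers on the "5 occurrences in a million" question, here is a minimal Python sketch. It assumes that occurrences of a feature are independent and follow a Poisson distribution, which is a simplification (real words and constructions tend to cluster within texts), and the function names and the example rate of 5 per million are purely illustrative. It estimates how likely a feature is to show up at all, or at least 5 times, in (sub)corpora of different sizes.

```python
import math


def expected_hits(rate_per_million: float, corpus_words: int) -> float:
    """Expected number of occurrences in a corpus of the given size."""
    return rate_per_million * corpus_words / 1_000_000


def prob_at_least(k: int, rate_per_million: float, corpus_words: int) -> float:
    """P(at least k occurrences), ASSUMING a Poisson model of occurrence."""
    lam = expected_hits(rate_per_million, corpus_words)
    # Sum the probabilities of 0..k-1 occurrences and take the complement.
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical feature occurring about 5 times per million running words.
    for size in (100_000, 1_000_000, 10_000_000):
        p_seen = prob_at_least(1, 5, size)
        p_five = prob_at_least(5, 5, size)
        print(f"{size:>10,} words: P(>=1 hit) = {p_seen:.3f}, "
              f"P(>=5 hits) = {p_five:.3f}")
```

Under these assumptions a feature with that rate is more likely than not to be absent from a 100,000-word subcorpus, so two subcorpora can easily "disagree" about it for reasons of size alone.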
(Slightly tardy) Season's greetings to you and all "corporeans",
Przemek
=======================================
Dr Przemyslaw Kaszubski
t: +48 61 8293515
e: przemka at amu.edu.pl
w: http://elex.amu.edu.pl/ifa/staff/kaszubski.html
(ENGLISH) LEARNER CORPORA PAGE:
http://main.amu.edu.pl/~przemka
COMPREHENSIVE CORPORA BIBLIOGRAPHY:
http://main.amu.edu.pl/~przemka/welcome.html#Corpbibl
School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan
t: +48 61 8293506
f: +48 61 8523103
w: http://elex.amu.edu.pl/ifa
=======================================
    
    