Corpora: when does a subcorpus become a corpus

Fri Jan 4 13:05:44 UTC 2002

  I would like to give a small example. There was a report here about the distribution of the meanings of the verb "moch'". This Russian verb has two main meanings: "can" and "may".

  Would my distributions, based on corpora such as the Corpus of Russian Proverbs, Political Metaphors, or Russian newspapers, have any value? In other words, would they tell us something about the Russian language as a whole? I think the proof can be given by the texts of a general, carefully compiled, balanced, representative corpus of the Russian language.
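
  To make the comparison concrete, here is a minimal sketch of how the sense distributions from several corpora could be checked against each other. It is only an illustration under my own assumptions: the function names are hypothetical, the counts in the example are invented placeholders and not real findings, and a chi-square test of homogeneity is just one possible choice of measure.

```python
def sense_proportions(counts):
    """Turn raw sense counts for one corpus into relative frequencies."""
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}


def chi_square_homogeneity(table):
    """Chi-square statistic and degrees of freedom for a corpus-by-sense table.

    `table` maps a corpus name to a dict of sense -> count.
    A statistic near 0 means the corpora show similar sense distributions.
    """
    senses = sorted({s for counts in table.values() for s in counts})
    row_totals = {c: sum(counts.get(s, 0) for s in senses)
                  for c, counts in table.items()}
    col_totals = {s: sum(counts.get(s, 0) for counts in table.values())
                  for s in senses}
    grand = sum(row_totals.values())

    stat = 0.0
    for corpus, counts in table.items():
        for s in senses:
            expected = row_totals[corpus] * col_totals[s] / grand
            if expected == 0:
                continue  # a sense never observed anywhere contributes nothing
            stat += (counts.get(s, 0) - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(senses) - 1)
    return stat, dof


if __name__ == "__main__":
    # Invented placeholder counts, NOT real data for "moch'".
    hypothetical = {
        "proverbs":   {"can": 30, "may": 10},
        "newspapers": {"can": 10, "may": 30},
    }
    for name, counts in hypothetical.items():
        print(name, sense_proportions(counts))
    print(chi_square_homogeneity(hypothetical))
```

  The statistic would still have to be compared with a chi-square critical value for the given degrees of freedom, and a large value only says that the corpora differ, not which of them is closer to the language as a whole.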

>Well I guess I tried to focus on the issue of representativeness rather 
>than the proper nomination for the set of texts, but, yes, probably the 
>proper term might be 'special purpose corpus'. This, however, raises 
>another interesting question. I personally would hope that every single 
>corpus had been compiled for a particular purpose. Indeed, I wonder if 
>there really IS such a thing as a 'general corpus'? I have a feeling that 
>so-called 'general corpora' - if they exist - are pretty useless in general, 
>unless they're modified for a particular purpose or task. I suppose that in 
>empirical research you always have to choose your "object" (material) 
>according to your subject, and not to use "just something", i.e. you have 
>to know your material: I guess no one would try to determine the average 
>height of human beings on the basis of a basketball team. The problem with 
>language is that exceptions are often not evident and not easily detected 
>since there is no clear "reference set" for language. In principle, if your 
>findings are truly generalizable you should get similar results from any 
>corpus, although there is obviously more "noise" in more "general" corpora. 
>Am I right? Or am I a pedant? Or both? (About "Terms in Context", which 
>I have read well past p. 45 :-), I liked the book, and I think I 
>could make use of some chapters in my course on corpora as translation tools.)
>
>sincerely,
>Sampo
>
>At 09:54 4.1.2002 +0100, Pearson, Jennifer wrote:
>>If you look at the same publication, p.48, you will find that I argue that,
>>given Sinclair's definitions, neither the term subcorpus nor the term
>>component is appropriate for the sets of texts I was working with (and
>>probably not for the EAP texts referred to in previous e-mails either). I
>>chose therefore to use the term special purpose corpus, "a corpus whose
>>composition is determined by the precise purpose for which it is to be used.
>>While a special purpose corpus may be derived from a general reference
>>corpus or from a monitor corpus it will not constitute a subcorpus in the
>>sense defined by Sinclair because it will not have all of the properties of
>>a larger corpus." I coined this particular term for two reasons, a) because
>>the language of the texts I was working with could be classified as
>>'language for special purposes' or 'LSP', two terms that already existed in
>>applied linguistics to designate, for example, the language of business, the
>>language of medicine, the language of economics, and b) because the term
>>'special purpose corpus' implies that the corpus has been compiled for a
>>particular purpose.
>>Wishing you all a happy new year
>>Jennifer
>>
>>Dr Jennifer Pearson
>>Chief of Translation
>>UNESCO
>>7 Place de Fontenoy
>>75352 Paris 07
>>Tel.: 00 33 1 456 80 780
>>e-mail: j.pearson at unesco.org
>>http://www.unesco.org
>

-- 
Vladimir Rykov, PhD in Comp Linguistics, 
 MOSCOW
http://rykov.narod.ru/
Engl. http://www.blkbox.com/~gigawatt/rykov.html
Tel +7-903-749-19-99