Corpora: Reference

Tue Feb 13 08:19:20 UTC 2001

"Melamed, Dan" wrote:
>
> I don't know of any rigorous study on this topic, but the claim would follow
> from two observations:
>
> 1. Any text corpus is but a sample of some (sub)language.  As the sample
> grows, it comes closer and closer to representing the whole population.  The
> WSJ has been around for quite a while, so it's likely to have used all of
> the words in its (sub)language by now.
>
> 2. New words keep entering the (sub)language.  20 new words per month would
> not be surprising, even if you exclude proper nouns and technospeak.
>
> IDM
>

I think these observations presuppose that at any given moment a
language or sub-language has a well-defined finite set of words in it.
I am not sure I would agree with this, even if you consider an
individual idiolect, given the productivity of certain morphological
rules (e.g. writing, re-writing, re-rewriting, and so on) and other
word formation processes.
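
As a rough illustration of the productivity point, here is a minimal
sketch in Python. The function name prefixed_forms and the hyphenated
spelling of the prefix are my own assumptions for the example, not a
claim about how any real morphological analyser works. It only shows
that a single recursive prefixation rule already yields an unbounded
set of distinct word forms:

```python
from itertools import count, islice


def prefixed_forms(stem, prefix="re-"):
    """Yield stem, prefix+stem, prefix+prefix+stem, ... without end.

    Hypothetical helper: models one productive rule that can apply
    to its own output any number of times.
    """
    for n in count():
        yield prefix * n + stem


if __name__ == "__main__":
    # Take only the first few forms; the generator itself never stops.
    for form in islice(prefixed_forms("writing"), 4):
        print(form)
```

However many forms you take from the generator, they are all distinct,
so no finite word list can contain every output of the rule.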

More generally, this relates to the observations about Zipfian
distributions in the lexicon made by Baayen, Gazdar, and others.
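
To make the Zipfian point concrete, here is a small simulation sketch
in Python. Everything in it is an assumption made for illustration:
the function name type_growth, the inventory of 50,000 types, the
exact 1/rank weighting, and the fixed seed. The finite inventory is
only a stand-in needed to run the simulation, not a claim that the
lexicon is finite. It tracks how many distinct types have been seen as
tokens are sampled one by one:

```python
import random
from itertools import accumulate


def type_growth(num_tokens, num_types=50_000, seed=0):
    """Return the number of distinct types seen after each token.

    Tokens are drawn independently with probability proportional to
    1/rank (an idealised Zipfian distribution; assumed, not fitted
    to any real corpus).
    """
    rng = random.Random(seed)
    cum_weights = list(accumulate(1.0 / r for r in range(1, num_types + 1)))
    tokens = rng.choices(range(num_types), cum_weights=cum_weights, k=num_tokens)

    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    growth = type_growth(20_000)
    # Compare the type count halfway through with the count at the end.
    print(growth[len(growth) // 2 - 1], growth[-1])
```

Under these assumptions the type count is still rising in the second
half of the sample, because so much of the probability mass sits in
the long tail of rare types.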

--
Alexander Clark  asc at aclark.demon.co.uk
Alex.Clark at issco.unige.ch
ISSCO / TIM, Ecole de Traduction et d'Interprétation,
University of Geneva, Boulevard du Pont-d'Arve, CH-1211 Genève 4
Tel: (+41) 022 7058682 Fax: 7058689