Corpora: Reference

Melamed, Dan Dan.Melamed at westgroup.com
Mon Feb 12 17:28:01 UTC 2001


I don't know of any rigorous study on this topic, but the claim would follow
from two observations:

1. Any text corpus is but a sample of some (sub)language.  As the sample
grows, it comes closer and closer to representing the whole population.  The
WSJ has been around for quite a while, so it's likely to have used all of
the words in its (sub)language by now.

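2. New words keep entering the (sub)language.  20 new words per month would
not be surprising, even if you exclude proper nouns and technospeak.
Together, the two observations would give a new-word rate that drops off as
the corpus grows but levels out at a small non-zero value.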
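
Either observation could be checked against an actual corpus by counting, for
each time slice, the word types not seen in any earlier slice.  What follows
is only a minimal sketch in Python, assuming the corpus has already been split
into one string per month.  The function name, the crude letters-only
tokenizer, and the toy data are illustrative assumptions, not taken from any
study.

```python
import re
from typing import Iterable, List


def new_types_per_slice(slices: Iterable[str]) -> List[int]:
    """For each text slice (e.g. one month of articles), count the
    word types that did not appear in any earlier slice."""
    seen = set()
    counts = []
    for text in slices:
        # Crude tokenizer (an assumption): lowercase runs of ASCII letters.
        types = set(re.findall(r"[a-z]+", text.lower()))
        fresh = types - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Toy data, invented purely for illustration (not WSJ text).
months = [
    "stocks rose as bonds fell",
    "bonds rose as stocks fell again",
    "dotcom stocks fell",
]
print(new_types_per_slice(months))  # prints [5, 1, 1]
```

On real data the interesting question is whether the later counts keep
shrinking toward zero or settle at some small positive level.  Excluding
proper nouns, as suggested above, would need a filter that this sketch does
not attempt.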

IDM

> -----Original Message-----
> From: Mari Olsen [mailto:molsen at microsoft.com]
> Sent: Monday, February 12, 2001 10:51 AM
> To: corpora at hd.uib.no
> Cc: John Nave
> Subject: Corpora: Reference
>
>
> Can anyone provide a reference for a purported study, in which someone
> analyzed the Wall Street Journal for new words, the number of which tailed
> off to 20 words per (month? week?) after a certain point? Or is this an NLP
> urban legend? A colleague recalls Mitch Marcus pointing out that the rate of
> new word occurrences does not asymptote but rather continues at some small
> but non-trivial rate, but not whether this is Marcus' own study, an
> observation, or a reference to another work.
>
> Thanks,
>
> Mari Olsen
> Microsoft-Natural Language Group
>


