[Corpora-List] Corpus size & frequency counts

Adam Kilgarriff adam.kilgarriff at itri.brighton.ac.uk
Wed Oct 9 15:02:30 UTC 2002


Brett at staff.sakuragaoka.ac.jp writes:
 > I'm doing a frequency count of Japanese vocabulary in post-war Japanese
 > novels. Is there any rough guide to how many times a given word should
 > appear before you can be reasonably confident of its rank? Or
 > alternatively, at a given frequency, any way to calculate the likely range
 > of ranks?
 >

It's always a good idea to look at distribution as well as
frequency.  Where a word's occurrences are spread across a large
number of documents (say, 50 or more), and the documents cover the
genre you want to talk about (so, eg, they do not all come from the
same author), you can talk with some confidence about its frequency
in the text type.

Where the occurrences mostly come from a small number of documents,
the issue is more complex.  A word like goalkeeper probably feels
pretty common to any English speaker who is interested in soccer, and
much less common to anyone who is not. Correspondingly, most novels
won't mention goalkeepers, but those that do may well mention them
lots of times.  This implies, minimally, that frequency should be
seen as having two dimensions: one which is simply the count, the
other which is the spread.
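
To make the two dimensions concrete, here is a minimal sketch in
Python. It assumes each document is already tokenised into a list of
words; the function name and the toy documents are invented for
illustration and do not come from any real corpus.

```python
from collections import Counter

def count_and_spread(documents):
    """Return two Counters: total occurrences of each word, and the
    number of documents in which each word occurs at least once."""
    total = Counter()
    spread = Counter()
    for tokens in documents:
        total.update(tokens)        # every occurrence counts
        spread.update(set(tokens))  # each document counts once per word
    return total, spread

# Toy data, invented for illustration only.
docs = [
    ["the", "goalkeeper", "saved", "the", "goalkeeper"],
    ["the", "novel", "ended"],
    ["the", "rain", "fell"],
]

total, spread = count_and_spread(docs)
for word in ("the", "goalkeeper"):
    print(word, total[word], spread[word])
```

A word whose count is high but whose spread is low (like goalkeeper
here) is the kind where the rank should be treated with caution.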

Issues include what counts as "the same document" (two articles from
the same magazine?? two chapters from the same book??), and what to do
about specialist subcorpora within the text type you are interested in
(eg multiple articles from the same journal/by the same author - see
my mailing re words like 'colitis' in the BNC from a few weeks ago).

See also the corpora mailing by Ken Church a couple of months back,
and his paper on "two Noriegas".  Another very good paper is by Slava
Katz:

@Article{katz:96,
  author =       "Slava Katz",
  title =        "Distribution of content words and phrases in text
                  and language modelling",
  journal =      "Natural Language Engineering",
  year =         1996,
  volume =       2,
  number =       1,
  pages =        "15--60"
}

Regards,

	Adam Kilgarriff

--
NEW!! MSc and Short Courses in Lexical Computing and Lexicography
Info at

http://www.itri.brighton.ac.uk/lexicom

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


