Corpora: corpora variety/summary

Bill Fisher william.fisher at
Fri Sep 1 13:03:23 UTC 2000

Vladimir -

   This is not exactly to the point of your query,
but as a part of a long-standing effort here at NIST to
understand the factors that affect computer speech recognition
accuracy, I've done some preliminary work in calculating
what I call the "diversity" of a test-set
corpus, which is how varied the corpus is when seen
through the eyes of an n-gram language model of the type
almost universally used in speech recognition.  It's supposed
to be like test-set perplexity, except you don't use an external
language model.  I repeat a sort of jack-knifing experiment
a number of times, averaging the perplexity result: randomly
choose x% of the utterances and build a language model from
them, then compute the test-set perplexity of the other (100-x)%
of them.  Ceteris paribus, the test-set corpus with lower
diversity should be easier to recognize.  If you or anyone
else knows of a publication by someone already doing this,
I'd appreciate being told about it.
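
A minimal sketch of the procedure described above, for illustration only.
The message does not say which n-gram order, smoothing method, or toolkit
was used, so this sketch assumes a bigram model with add-one smoothing in
which all unseen words share a single unknown-word class. The function
names (train_bigram, perplexity, diversity) and the defaults for x and the
number of trials are hypothetical, not taken from the original work.

```python
import math
import random
from collections import Counter


def train_bigram(utterances):
    """Collect bigram and context counts from tokenized utterances."""
    context, bigram, vocab = Counter(), Counter(), set()
    for utt in utterances:
        toks = ["<s>"] + list(utt) + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            context[a] += 1
            bigram[(a, b)] += 1
    return context, bigram, vocab


def perplexity(model, utterances):
    """Perplexity of held-out utterances under an add-one bigram model.

    Assumption: add-one smoothing, with one extra vocabulary slot that
    stands in for every word not seen in training.
    """
    context, bigram, vocab = model
    v = len(vocab) + 1  # +1 slot for the unknown-word class
    log_prob, n = 0.0, 0
    for utt in utterances:
        toks = ["<s>"] + list(utt) + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            p = (bigram[(a, b)] + 1) / (context[a] + v)
            log_prob += math.log2(p)
            n += 1
    return 2 ** (-log_prob / n)


def diversity(utterances, x=0.5, trials=20, seed=0):
    """Average held-out perplexity over repeated random splits.

    Each trial trains on a random fraction x of the utterances and
    scores the remaining ones; the mean over trials is returned.
    """
    if len(utterances) < 2:
        raise ValueError("need at least two utterances to split")
    rng = random.Random(seed)
    # Keep at least one utterance on each side of the split.
    k = max(1, min(len(utterances) - 1, round(x * len(utterances))))
    scores = []
    for _ in range(trials):
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        model = train_bigram(shuffled[:k])
        scores.append(perplexity(model, shuffled[k:]))
    return sum(scores) / len(scores)
```

Under these assumptions, a corpus that repeats the same utterance should
score lower than one in which every utterance uses different words, which
matches the intuition that lower diversity means an easier test set.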

 - Bill F.
