[Corpora-List] [Corpora List] Absolute Frequencies

Robert A. Amsler amsler at cs.utexas.edu
Wed Feb 26 16:48:16 UTC 2014


The problem seems to be that normalization depends on assuming the
separate sub-corpora are themselves 'normal', that is, that they don't
individually differ from each other or from 'average text' of the
language in significant ways. So maybe to 'correctly' do normalization
one needs to know how 'normal' each sub-corpus in the collection is and
compensate for its degree of abnormality?

It could be that if one examined some particular set of vocabulary in the
separate corpora, such as the ratio of the frequency of the commonest
function words to the overall size of the corpus, or the frequencies of
the commonest content words, one could get individualized normalization
factors.
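
A minimal sketch of how such a factor might be computed, assuming each
sub-corpus is already tokenized into a list of words. The function-word
list, the function names, and the choice of the pooled ratio as the
'normal' baseline are all illustrative assumptions, not an established
method:

```python
from collections import Counter

# Hypothetical, illustrative function-word list; a real study would choose its own.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to"}

def function_word_ratio(tokens):
    """Share of tokens that are function words (0.0 for an empty corpus)."""
    if not tokens:
        return 0.0
    counts = Counter(t.lower() for t in tokens)
    return sum(counts[w] for w in FUNCTION_WORDS) / len(tokens)

def normalization_factors(sub_corpora):
    """Map each sub-corpus name to its ratio divided by the pooled ratio.

    Assumption: the pooled collection stands in for 'average text'.
    A factor near 1.0 suggests the sub-corpus looks 'normal' on this measure.
    """
    pooled = [t for tokens in sub_corpora.values() for t in tokens]
    pooled_ratio = function_word_ratio(pooled)
    if pooled_ratio == 0.0:
        # No function words anywhere: no basis for comparison.
        return {name: 1.0 for name in sub_corpora}
    return {name: function_word_ratio(tokens) / pooled_ratio
            for name, tokens in sub_corpora.items()}
```

A sub-corpus whose factor is well away from 1.0 would be a candidate for
closer inspection before its counts are combined with the others.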

At its worst, one might have to normalize the frequencies of the words in
each sub-corpus that were 'abnormal'.

For example, imagine a corpus consisting of sub-corpora, each of which is
a collection of encyclopedia articles about the same subject (e.g., all
the encyclopedia articles about "Australia", all the encyclopedia articles
about "London", etc., as sub-corpora). Each sub-corpus would have its key
concepts, and the words directly tied to the subject of its articles
would have significantly higher frequencies than elsewhere. To
'normalize' these texts (i.e., to make them each act like 'average' text),
those content words would have to have their frequencies 'normalized' to
more average frequencies before they were combined together?
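
One way this could look in practice, purely as a sketch: flag a word as
'abnormal' in a sub-corpus when its relative frequency there exceeds some
multiple of its mean relative frequency across all the sub-corpora, and
scale its count back to that mean. The function name, the threshold value,
and the use of the cross-corpus mean as the 'average' are assumptions made
for illustration:

```python
from collections import Counter

def dampen_topic_words(sub_corpora, threshold=2.0):
    """Return adjusted (float) word counts for each sub-corpus.

    A word is treated as 'abnormal' in a sub-corpus if its relative
    frequency there is more than `threshold` times its mean relative
    frequency over all non-empty sub-corpora; its count is then scaled
    back to that mean. Note that the ratio can never exceed the number
    of sub-corpora, so `threshold` must be smaller than that to have
    any effect.
    """
    counts = {name: Counter(tokens) for name, tokens in sub_corpora.items()}
    sizes = {name: len(tokens) for name, tokens in sub_corpora.items()}
    non_empty = [name for name in counts if sizes[name] > 0]
    vocab = set().union(*counts.values())

    # Mean relative frequency of each word across the non-empty sub-corpora.
    mean_rel = {
        w: sum(counts[name][w] / sizes[name] for name in non_empty) / len(non_empty)
        for w in vocab
    }

    adjusted = {}
    for name in counts:
        adjusted[name] = {}
        for w, c in counts[name].items():
            rel = c / sizes[name]
            if rel > threshold * mean_rel[w]:
                # Scale the count back to the cross-corpus average rate.
                adjusted[name][w] = mean_rel[w] * sizes[name]
            else:
                adjusted[name][w] = float(c)
    return adjusted
```

With an "Australia" sub-corpus in which "sydney" is far more frequent than
in the others, this would pull the "sydney" count down toward the average
while leaving evenly distributed words such as "the" untouched.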

