[Corpora-List] Which test is suitable
Christer Johansson
christer.johansson at uib.no
Mon Jul 18 12:55:24 UTC 2011
Before we get carried away, let's just do the exercise of running chisq.test and calculating Cramér's phi for effect size.
I will use R for the analysis. The package for the effect size (cramer.r) was written by Gard Jenset.
The data provided by Krishnamurthy (from Google) gives the following matrix (with reservations for typos):
> x <- matrix(c(223000000,1290000000,738000,10800000), ncol = 2, dimnames = list(c("ceiling", "piece"), c("correct", "reversed")))
> x
         correct reversed
ceiling 2.23e+08   738000
piece   1.29e+09 10800000
% analyze this:
> chisq.test(x)
Pearson's Chi-squared test with Yates' continuity correction
data: x
X-squared = 636454.7, df = 1, p-value < 2.2e-16
% i.e. Highly significant (because we have an incredible number of observations)
% Is it worth bothering about?
> source(file.choose())  # this is where to load "cramer.r"
[1] Usage: cv.test(matrix)
> cv.test(x)
- - - - - - - - - - - - - - - - - - - - -
Effect size for contingency tables
Data: x
Phi: 0.020432
T-shirt effect size: tiny
- - - - - - - - - - - - - - - - - - - - -
Warning: effect sizes are only guidelines!
% The effect size is tiny, so it is not really worth bothering about. Significance comes from having a lot of examples, and deviation from a random distribution is more or less expected for language data.
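For readers without "cramer.r" at hand, here is a minimal sketch of the same calculation in base R. It assumes that phi for a 2x2 table is taken as sqrt(X-squared / N), using the statistic that chisq.test reports; whether cv.test does exactly this internally is an assumption on my part, although the result should land very close to the Phi value printed above.

```r
# Counts as given in the post (rows: word, columns: spelling).
x <- matrix(c(223000000, 1290000000, 738000, 10800000), ncol = 2,
            dimnames = list(c("ceiling", "piece"), c("correct", "reversed")))

res <- chisq.test(x)                    # Yates' correction is the default for 2x2 tables
n   <- sum(x)                           # total number of observations
phi <- sqrt(unname(res$statistic) / n)  # assumed definition: phi = sqrt(X-squared / N)

print(round(phi, 6))
```

Dividing by N is what removes the influence of sheer sample size from the statistic.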
When it comes to the lower number of errors in the COCA and BYU-BNC corpora: this is likely explained by the fact that the data is published text which has gone through editing and spell-checking. However, these kinds of errors do occur in real checked text, and probably more often the more "immediate" or "unchecked" the text is, so we as computational linguists will have to deal with them in some way (luckily for us the example words were in the non-word category, which is fairly easy to detect; misspelling "piece" as "peas" or "peace" will be harder).
The correct tests to use seem to me to be effect size tests rather than significance tests.
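To illustrate the point, here is a small sketch that runs the same table twice: once as is, and once with every cell divided by 100000 so that the proportions stay identical. The helper phi_and_p is a hypothetical name of my own, and I switch off Yates' correction (correct = FALSE) so that phi is exactly scale-invariant; both choices are assumptions for the sake of the demonstration.

```r
# Counts as given in the post (rows: word, columns: spelling).
x <- matrix(c(223000000, 1290000000, 738000, 10800000), ncol = 2,
            dimnames = list(c("ceiling", "piece"), c("correct", "reversed")))

# Hypothetical helper: returns phi and the p-value for a 2x2 table.
phi_and_p <- function(tab) {
  res <- suppressWarnings(chisq.test(tab, correct = FALSE))
  c(phi = sqrt(unname(res$statistic) / sum(tab)), p.value = res$p.value)
}

full  <- phi_and_p(x)        # the original counts
small <- phi_and_p(x / 1e5)  # same proportions, far fewer "observations"

print(rbind(full, small))
```

The phi column should be the same in both rows, while the p-value changes with the number of observations: this is why the significance test alone says little here.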
Interesting hypotheses: 1) whether the spelling errors are lexically driven or not, i.e. is it the variation as such (e.g. "ie"/"ei"), or does it depend on the word (its frequency, its "meaning", its POS, its lexical status (loan word, etc.))?
The differences between "peice" and "cieling" seem to depend on lexical frequency: the more a word is used, the more variants we will surely find. Frequency smoothing will help some.
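One simple way to take raw frequency out of the comparison is to look at the share of reversed spellings per word rather than at the raw counts. This is only a sketch of plain normalisation by each word's total, which is my assumption here and not necessarily the kind of smoothing one would settle on.

```r
# Counts as given in the post (rows: word, columns: spelling).
x <- matrix(c(223000000, 1290000000, 738000, 10800000), ncol = 2,
            dimnames = list(c("ceiling", "piece"), c("correct", "reversed")))

# Proportion of reversed spellings out of all occurrences of each word.
reversed_rate <- x[, "reversed"] / rowSums(x)

# As percentages: roughly 0.33 for "ceiling" and 0.83 for "piece".
print(round(100 * reversed_rate, 2))
```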
just my two bits,
\Christer