[Corpora-List] Which Statistical Test is Suitable

Benjamin Allison ballison at staffmail.ed.ac.uk
Thu Jul 14 11:16:08 UTC 2011


I think the problem you're looking at is far more complex than you're  
going to be able to uncover by looking at simple unigram frequencies,  
if I understand what you're trying to demonstrate properly.

I think what you said is that there are cases where some rule applies
which should dictate which letter is used (alif or hay, let's say). In
those cases, you would expect the frequency of the wrong letter to be
no more than is predicted by a model of spelling error. What you in
fact see is that it is far higher than this, suggesting people confuse
the two letters even where there is a rule.

Using standard statistical terminology, you might say:

H0 (null hypothesis) - In cases where the rule requires using
character c_1, the occurrence of the erroneous character c_2 is no
more frequent than chance spelling error would predict.
H1 (alternative hypothesis) - In cases where the rule requires using
c_1, both c_1 and c_2 are equally probable.

Depending on which school of statistics you subscribe to, you might
evaluate the probability of your observations under H0 alone, or under
both H0 and H1 and compare. Note that in either case, you need a decent
model of spelling error. A simple one would be that each character has
some small probability epsilon of being rendered incorrectly, and if
it is, then the choice of which letter replaces it is uniform. This is
almost certainly too simple to give you anything useful, but it has the
virtue of being easy to work with, so you might try it as a first step!
If you want to evaluate the probability of your observations under H1,
then you'd also need to consider whether you want to assume the two
characters to be equiprobable or not.
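
As a rough illustration of the simple error model described above, here
is a minimal Python sketch. It treats the count of "wrong letter" tokens
for one word as binomial. Under H0 the per-token probability is epsilon
divided by the number of other letters; under H1 it is 0.5. The values of
epsilon and alphabet_size are made up for illustration, and treating the
alif spelling as the rule-required one is purely hypothetical. The counts
are one pair from the quoted table.

```python
import math

def log_binom_pmf(k, n, p):
    """Log probability of exactly k wrong-letter tokens out of n (binomial)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def log_upper_tail(k, n, p):
    """Log of P(X >= k) for X ~ Binomial(n, p), summed in log space
    so that very small probabilities do not underflow to zero."""
    terms = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def error_rate_under_h0(epsilon, alphabet_size):
    """Chance of writing one specific wrong letter: an error happens with
    probability epsilon, then the replacement is uniform over the others."""
    return epsilon / (alphabet_size - 1)

# Assumed, illustrative values (not from the original post).
epsilon = 0.01
alphabet_size = 38

# One word pair from the quoted table; which spelling is "correct"
# is a hypothetical choice made only for this example.
n_required, n_other = 587, 508
n = n_required + n_other

p_h0 = error_rate_under_h0(epsilon, alphabet_size)
p_h1 = 0.5  # H1 with the two letters assumed equiprobable

# Evaluate the observations under H0 alone (a tail probability) ...
log_p_value_h0 = log_upper_tail(n_other, n, p_h0)
# ... or under both hypotheses and compare (a log likelihood ratio).
log_lr = log_binom_pmf(n_other, n, p_h1) - log_binom_pmf(n_other, n, p_h0)

print("log P(X >= observed | H0):", log_p_value_h0)
print("log likelihood ratio H1 vs H0:", log_lr)
```

A large positive log likelihood ratio favours H1 over H0 under these
assumptions; a more realistic error model would change p_h0 and could
change the conclusion.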

In any case, I'm with Chris - there's no test in a book I can think of
that will answer the question you want to ask. Note that all the above
is for a single confusion in a single context; you'd need to think
about how to combine all these predictions too!
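
One naive way to combine the per-word evidence, sketched below in Python,
is to sum log likelihood ratios across words. This is only valid if the
words are treated as independent, which is an assumption and may well be
wrong. The binomial model, the values p_h0 and p_h1, and the choice of
which spelling counts as rule-required are all illustrative assumptions;
the counts are the three pairs from the quoted table.

```python
import math

def log_binom_pmf(k, n, p):
    """Log probability of exactly k wrong-letter tokens out of n (binomial)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def combined_log_lr(counts, p_h0, p_h1):
    """Sum of per-word log likelihood ratios, log P(data | H1) - log P(data | H0).
    Summing assumes the words are independent of one another."""
    total = 0.0
    for n_required, n_other in counts:
        n = n_required + n_other
        total += log_binom_pmf(n_other, n, p_h1) - log_binom_pmf(n_other, n, p_h0)
    return total

# (rule-required spelling count, other spelling count) per word.
# Treating the alif variant as rule-required is hypothetical.
counts = [(587, 508), (97, 116), (586, 725)]

p_h0 = 0.01 / 37  # assumed: epsilon = 0.01 spread uniformly over 37 other letters
p_h1 = 0.5        # assumed: the two letters are equiprobable under H1

total_log_lr = combined_log_lr(counts, p_h0, p_h1)
print("combined log likelihood ratio H1 vs H0:", total_log_lr)
```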

B

Quoting True Friend <true.friend2004 at gmail.com> on Thu, 14 Jul 2011  
12:36:58 +0500:

> Dear Corpora Members
> Thanks for your responses. I am actually doing research on the spelling
> alternation of ? alif  and ? hay (two Urdu letters). There has been a long
> debate among scholars about which word should be written with which letter.
> For example the word Ghonsa (English: Punch) can be written as ??????
> (ending at alif) or as ?????? (ending at hay) with no change in meaning. In
> most cases the frequencies are clearly different. There is a clear choice
> for Alif or Hay variant, but in some cases the frequencies correlate very
> closely. I've selected the words which have very close frequencies in each
> variant (with no change in meaning of the word of course), now I wanted to
> summarize the group behaviour by applying a correlation formula etc. An
> example of such variant spellings is as follows:
>  Alif Variant   Freq   Hay Variant   Freq
>  ????           587    ????          508
>  ???            97     ???           116
>  ??????         586    ??????        725
> As you can see, the frequencies are close to each other; my aim was to
> summarize the group behaviour. The point here is to show the general
> public's usage: despite the rules available, people are confused about
> the spelling of these words.
> Hopefully this explains why I asked.
> --
> *Muhammad Shakir Aziz* *???? ???? ????*
> *Masters in Applied Linguistics
> Translator, Course Developer, Linguist for Urdu, Punjabi and English*
> Urdu:- http://awaz-e-dost.blogspot.com/
> English:- http://linguisticslearner.blogspot.com/
> Facebook:- http://www.facebook.com/truefriend2004
> Skype:- true_friend2004
>


