Corpora: corpus of IT

Kim Tan kimmy1003 at hotmail.com
Sat Dec 23 04:00:41 UTC 2000


Hi all,

I know everyone's in the festive mood but I just had to thank the following
people for responding personally to my query before 2001...

Adam Kilgarriff
Paul Rayson
Michael Oakes
John Sinclair
Jilani Warsi
Khairul

Among the possible ways of identifying words that are characteristic of a
text (i.e. given 2 corpora, one specialized and the other general) are the
non-parametric Mann-Whitney test and the log-likelihood (LL) or G-square
statistic. Adam suggested chopping both corpora into same-size chunks,
producing a word freq list for each chunk, and then using the Mann-Whitney
test to find words with the most consistently different frequencies. The
log-likelihood or G-square can be computed automatically using Mike Scott's
WordSmith package (I use Excel with the formulas provided in the article
"Comparing Corpora using Frequency Profiling" by Rayson and Garside). They
suggested producing a freq. list for each corpus and calculating the
log-likelihood statistic for each word in the 2 freq. lists. The word with
the largest LL has the most significant relative freq. difference and is
therefore the most indicative (or characteristic) of one corpus as compared
to the other corpus.

I'm still surveying the statistical methods, and one of the main problems I
encountered was matching the words in the two corpora (the IT specific
corpus and the general ME corpus) before actually applying the statistical
measures (so far I've tried log-likelihood). I did the matching manually,
but there must be an automatic way of going about it. Someone suggested
using Dbase 3...
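
One possible way to automate both the matching and the LL step is sketched
below in Python (my choice of language; none of the tools mentioned above is
involved). It assumes each corpus has already been tokenized into a list of
words, and it uses the log-likelihood formula as I understand it from Rayson
and Garside: for a word occurring a times in corpus 1 (c tokens in total) and
b times in corpus 2 (d tokens in total), the expected values are
E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d), and LL = 2(a ln(a/E1) + b ln(b/E2)).
The function names and the toy sentences are hypothetical, for illustration
only.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """LL for a word seen a times in corpus 1 (c tokens in total)
    and b times in corpus 2 (d tokens in total)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    # A zero count contributes nothing (0 * ln 0 is taken as 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def compare_corpora(freq_spec, freq_gen):
    """Match the two frequency lists by word and rank words by LL.

    Returns rows of (word, freq_spec, freq_gen, LL, overused_in_spec),
    with the largest LL first.
    """
    c = sum(freq_spec.values())
    d = sum(freq_gen.values())
    rows = []
    # The union of the two vocabularies does the "matching":
    # a word missing from one list simply gets frequency 0 there.
    for word in set(freq_spec) | set(freq_gen):
        a = freq_spec.get(word, 0)
        b = freq_gen.get(word, 0)
        overused = a / c > b / d  # relatively more frequent in corpus 1?
        rows.append((word, a, b, log_likelihood(a, b, c, d), overused))
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows


if __name__ == "__main__":
    # Toy data only; real corpora would be read from files.
    it_tokens = "the server sends the packet to the router".split()
    general_tokens = "the cat sat on the mat by the door".split()
    table = compare_corpora(Counter(it_tokens), Counter(general_tokens))
    for word, a, b, ll, overused in table[:5]:
        print(f"{word:10s} {a:3d} {b:3d} {ll:8.3f} {'+' if overused else '-'}")
```

LL on its own does not say which corpus a word is characteristic of, so the
last column records whether the word is relatively more frequent in the
specialized corpus.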

Can I also draw your attention to the work by Yang Hui-Zhong, whose article
was published in the Journal of Literary and Linguistic Computing (I'm still
trying to locate the article myself). I was told that Yang compared the
frequency of words across a range of texts and established 2 measures, the
"peak ratio" (PR) and the "range ratio" (RR). A word with a high PR in
certain texts and a low RR is almost certain to be a technical term. A high
RR and low PR indicates a word of general utility, etc. It's worth looking
into... I also find Adam's downloadable technical report "Comparing corpora"
(Report ITRI-96-08) most useful; in it he touches on a survey of statistical
approaches...

Happy Christmas and New Year

KIM
Nat. Univ. of Malaysia

