[Corpora-List] Keywords Generator

Ben Allison ben at dcs.shef.ac.uk
Mon Feb 18 11:57:29 UTC 2008


Hi,

If you're happy with *nix, this can be done in a couple of lines -- 
first produce a frequency-counted word list for each of the two 
corpora, then grep both lists for the word you're interested in. 
Depending on how reliable you want word detection to be, the first part 
might be a little more complicated (especially if you have odd 
formatting/encoding issues), but the second is just a single line.

A very simple script for determining the count of a word SOME_WORD in 
the corpus myfile.txt (assuming the script words.pl I include below) 
would be:

./words.pl < myfile.txt | sort | uniq -c | grep -w 'SOME_WORD'

(replace myfile.txt and SOME_WORD with appropriate strings; the -w 
makes grep match whole words only, so that 'law' does not also pick up 
'lawyer'). There are no doubt better ways. Also, you may wish to 
consider normalised frequency, since raw counts are not going to be 
great for comparison if the corpora are of different lengths.
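
As a rough sketch of the normalised-frequency idea: the count of a word 
can be divided by the total number of tokens in the corpus and scaled to 
occurrences per million. The function name norm_freq is made up for this 
illustration, and the two tr calls are only an approximate stand-in for 
words.pl (they keep leading and trailing apostrophes, which words.pl 
drops).

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the original post.
# Usage: norm_freq WORD FILE
# Prints: WORD raw_count total_tokens occurrences_per_million
norm_freq() {
  word=$1
  file=$2
  # Lower-case the text, then turn every run of characters that are not
  # ASCII letters or apostrophes into a single newline (one token per line).
  tr 'A-Z' 'a-z' < "$file" | tr -cs "a-z'" '\n' |
    awk -v w="$word" '
      $0 != "" { total++; if ($0 == w) hits++ }   # skip empty lines
      END {
        if (total == 0) { print w, 0, 0, "0.0"; exit }
        printf "%s %d %d %.1f\n", w, hits, total, hits * 1000000 / total
      }'
}
```

Running it once per corpus, for example

  norm_freq law general.txt
  norm_freq law law.txt

(file names are placeholders) gives per-million figures that can be 
compared even when one corpus is much larger than the other.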

Ben

---------------------------------

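#!/usr/bin/perl
# words.pl: read text from stdin (or files named as arguments) and
# print one lower-cased word per line.
use strict;
use warnings;

while (<>) {
  tr/A-Z/a-z/;                  # fold ASCII upper case to lower case
  while (/\b([a-z']+)\b/g) {    # runs of ASCII letters and apostrophes
    print "$1\n";
  }
}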
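
For the parallel list asked about in the quoted message, one possible 
sketch is to count every word in each corpus once and then look up each 
keyword in both tables. The names word_counts and parallel_counts are 
invented for this illustration, the keyword file is assumed to hold one 
lower-case word per line, and the tr calls are again only a rough 
stand-in for words.pl.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the original post.

# word_counts FILE: print "word count" for every distinct word in FILE.
word_counts() {
  tr 'A-Z' 'a-z' < "$1" | tr -cs "a-z'" '\n' |
    awk '$0 != "" { n[$0]++ } END { for (w in n) print w, n[w] }'
}

# parallel_counts KEYWORDS CORPUS_A CORPUS_B
# Prints "word count_in_A count_in_B" for each keyword, in keyword order.
# A keyword that never occurs in a corpus is reported with a count of 0.
parallel_counts() {
  a=$(mktemp) && b=$(mktemp) || return 1
  word_counts "$2" > "$a"
  word_counts "$3" > "$b"
  # awk reads the two count tables first, then the keyword list.
  awk -v fa="$a" -v fb="$b" '
    FILENAME == fa { ca[$1] = $2; next }
    FILENAME == fb { cb[$1] = $2; next }
    NF { printf "%s %d %d\n", $1, ca[$1], cb[$1] }
  ' "$a" "$b" "$1"
  rm -f "$a" "$b"
}
```

A call such as

  parallel_counts keywords.txt general.txt law.txt

(file names are placeholders) prints raw counts only, so for corpora of 
different sizes the columns would still need normalising before they 
are compared.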

True Friend wrote:
> Hi Folks
> I need a program/script (even for *nix) that can provide the 
> frequencies of a wordlist in two corpora. Actually I have made this 
> list by comparing two word lists, one from general English 
> (specifically of Pakistani origin) and one from law English (also of 
> Pakistani origin). I now want to present these keywords with their 
> frequencies in both corpora as proof that these words are more 
> frequent in law. The keywords were generated by AntConc.
> Is there any script/tool that can generate a parallel list of 
> frequencies of each word in both corpora?
> Regards
> M Shakir Aziz
> A Corpus Linguistics Student
> Pakistan
>
> -- 
> محمد شاکر عزیز
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   




