[Corpora-List] Keywords Generator

Ben Allison ben at dcs.shef.ac.uk
Mon Feb 18 18:39:14 UTC 2008


I'd have to second this -- the Unix pipe is a little inefficient because 
sort is run over the full list of tokens rather than the much shorter 
list of types. However, simple hash-backed frequency counting will work 
on corpora with *billions* of words of text, so sheer size is unlikely 
to be the problem.
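
Just to make the hash idea concrete, here is a minimal sketch: it reads 
a single corpus on standard input and prints each type with its count, 
most frequent first. The full script further down does essentially the 
same thing for several files at once.

#!/usr/bin/perl
use strict;
use warnings;

# One hash entry per word *type*, so memory grows with the vocabulary
# size rather than with the number of tokens in the corpus.
my %freq;
while (my $line = <STDIN>) {
  $line = lc $line;
  while ($line =~ /\b([a-z']+)\b/g) {
    $freq{$1}++;
  }
}
print "$_\t$freq{$_}\n" for sort { $freq{$b} <=> $freq{$a} } keys %freq;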

I have included a script below which will do what you want for any 
number of files, and will not run out of memory even on corpora 
containing orders of magnitude more words than you are using. If it 
produces zeroes, either the words really are not there (use 'grep' to 
check) or else you may have encoding issues (Ubuntu is natively UTF-8, 
but your files may not be, and unless you tell it otherwise Perl will 
not know which encoding to expect and may read the bytes wrongly).
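
If encoding does turn out to be the culprit, you can tell Perl exactly 
what to expect when opening a file instead of leaving it to chance. 
Something along these lines (the file name here is made up; swap UTF-8 
for whatever encoding your files actually use):

# Open with an explicit encoding layer so Perl decodes the bytes
# correctly rather than treating them as raw octets.
open(my $fh, '<:encoding(UTF-8)', 'corpus1.txt')
    or die "Can't open corpus1.txt: $!";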

Paste the contents of this mail, starting from the "#!/usr/bin/perl" 
line, into a text file and save it somewhere as, say, "word_finder.pl". 
Open a command prompt in that directory, ensure the script is executable 
("chmod 755 word_finder.pl"), and then run:

./word_finder.pl WORD file1.txt file2.txt ... filen.txt

Obviously, replace WORD and the file names with whatever is appropriate. 
I know this is Perl again, but if you're using Ubuntu it's included with 
the distribution...

Ben

-------------------------------

#!/usr/bin/perl
use strict;
use warnings;

# Usage: ./word_finder.pl WORD file1.txt file2.txt ... filen.txt
# Builds a frequency hash (one entry per word type) for each file,
# then reports how often WORD occurs in each one.

my $word = lc $ARGV[0];   # lower-case the search word to match the text
my %hash;

foreach my $i (1 .. $#ARGV) {
  open(my $in, '<', $ARGV[$i]) or die "Can't open $ARGV[$i]: $!";
  while (my $line = <$in>) {
    $line = lc $line;                     # fold the text to lower case
    while ($line =~ /\b([a-z']+)\b/g) {   # step through each word in turn
      $hash{$1}[$i]++;                    # count it for corpus $i
    }
  }
  close $in;
}

foreach my $i (1 .. $#ARGV) {
  my $count = defined $hash{$word}[$i] ? $hash{$word}[$i] : 0;
  print "Frequency of $ARGV[0] in Corpus $i: $count\n";
}
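
Run over two corpora, the output is one line per file, along these 
lines (the counts here are of course invented):

Frequency of house in Corpus 1: 213
Frequency of house in Corpus 2: 0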



Trevor Jenkins wrote:
> On Mon, 18 Feb 2008, True Friend <true.friend2004 at gmail.com> wrote:
>
>   
>> Trevor Jenkins: Sorry, I forgot to mention the size: it was 1.9 million
>> words. I also thought that the large amount of data was the reason.
>>     
>
> Oh okay. So roughly 8 MB to 12 MB, based on an average (English) word
> length of, say, 6 characters. I ran my pipe of filters across the Jane
> Austen texts including the juvenilia (which came to about 11 MB); no
> problem at all, other than that all the words were stuffed into one
> result file. On a MacBook Pro with an Intel dual-core processor it took
> a matter of seconds to create the (2.5 MB) result file.
>
> Personally I don't consider 1.9 million words to be large. I once had a
> junior programmer who managed to stuff an 8 MB sentence into one record.
>
> Regards, Trevor
>
> <>< Re: deemed!
>
>


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


