Corpora: a program needed - a kinda summary
Sampo Nevalainen
samponev at cc.joensuu.fi
Fri May 31 07:08:53 UTC 2002
Hi,
I was asked for a summary of the responses I got to my request for a
simple program that would calculate the cumulative number of types in a
text file. So here it comes (although you'll see that I was not given the
gift of summarising things!).
First, a couple of links kindly supplied by Paul Clough:
Dan Melamed has a number of Perl scripts which are very useful for
linguistic tasks:
http://www.cs.nyu.edu/~melamed/software.html
Another good source of Perl modules is CPAN:
http://www.cpan.org/
And now to the solutions I got. Not surprisingly, all the scripts were
written in Perl, and this summary shows pretty well the abilities of this
language as we proceed from a dozen lines down to a single command line. I
have edited the mails a little, but the scripts, of course, are intact. I
personally do not know Perl very well (I have some programming experience
in Basic, Turbo Pascal and C++), and I have not tested all of the following
scripts, so I WILL NOT be responsible for any nasty things they may do on
your puter... for example, format your hard disk ;-)
Sebastian Hoffmann:
--------------------------------------------------------------------------------------------------
#!/usr/bin/perl
$countDifferent = 0;
open (IN, "</path/to/file") || die "can't open the file!";
while (<IN>) {
    $line = $_;
    @words = split(/\s/, $line);
    foreach $word (@words) {
        if (!$words{$word}) {       # first occurrence of this type (%words is the hash of seen types)
            $countDifferent++;
            $words{$word} = 1;
        }
        print "$countDifferent\n";  # cumulative type count after each token
    }
}
close (IN);
exit(0);
---------------
The script "assumes that you are interested in orthographic words and that
there is always one whitespace between words". As a response to Sebastian
Hoffmann, Klas Prytz suggests that couldn't it be a good idea to 'chomp'
the lines before splitting them so that not words at the end of lines are
counted as separate words just because they have a end of line character at
the end? Sebastian encounters a couple of other problems with the script:
- It doesn't distinguish between lower and upper case (which could easily
be remedied by adding "$line=lc($line);")
- What happens to punctuation? If you add "$line=~s/[,.;:-!?]//g;" this
would be taken care of - but no difference is being made between sentence
boundaries and abbreviations.
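Putting those suggestions together, here is a sketch of my own (untested,
so the usual disclaimer applies; the file path is a placeholder):
----------
#!/usr/bin/perl
# Sebastian's script with chomp, lower-casing and punctuation
# stripping folded in.
$countDifferent = 0;
open (IN, "</path/to/file") || die "can't open the file!";
while ($line = <IN>) {
    chomp($line);               # drop the end-of-line character
    $line = lc($line);          # fold everything to lower case
    $line =~ s/[,.;:!?-]//g;    # strip (some) punctuation
    foreach $word (split(' ', $line)) {
        if (!$seen{$word}) {    # first occurrence of this type
            $countDifferent++;
            $seen{$word} = 1;
        }
        print "$countDifferent\n";
    }
}
close (IN);
----------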
Alexander Clark has another approach to the problem:
--------------------------------------------------------------------------------
The tokenisation is obviously very poor. But if you run a tokenisation tool
to put it in one word per line format, it would work correctly.
----------
#!/usr/bin/perl -w
$numberTypes = 0;
%dict = ();
#$/ = " ";
while ($line = <>)
{
    @words = split(' ', $line);
    foreach $word (@words) {
        if (!exists($dict{$word})) {
            $dict{$word} = $numberTypes++;   # remember the type with its ID
        }
        print("$numberTypes\n");             # cumulative count after each token
    }
}
----------
And a pretty similar solution from Kaarel Kaljurand:
--------------------------------------------------------------------------------
This is a Perl program which expects its input from STDIN, and expects
each token (word) to be on a separate line. Each type is stored in a hash
(%wordlist), so you might run out of memory when the input file is really
huge. (See the usage sketch after the script.)
--cut--
#!/usr/bin/perl -w
use strict;
my %wordlist = ();
my $i = 0;
while (<>) {
    if (!defined($wordlist{$_})) {   # first occurrence of this type
        $i++;
        $wordlist{$_} = 1;
    }
    print "$i\n";                    # cumulative count after each token
}
--cut--
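Since each input line counts as one token, raw text has to be broken up
first. A crude sketch of such a pipeline (types.pl is just my placeholder
name for the script above):
----------
cat data.file | \
perl -ne 'print map { "$_\n" } split' | \
perl types.pl
----------
This only splits on whitespace, so punctuation stays glued to the words;
for real data one of the tokenisation recipes below would do a better job.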
Dave Graff also points out the problem of tokenization:
------------------------------------------------------------------------------------------------------------------------
The harder part of the problem is tokenization -- deciding what patterns
constitute actual "types" (excluding all sorts of punctuation, normalizing
case, deciding whether to treat hyphen-connected forms as if they were
"space separated" or "not-space-separated", etc).
Assume you have a suitable tokenizer for your data that simply puts out one
word per line:
tokenize data.file | \
perl -pe 's/(\S+)/if(exists($t{$1})){ $t{$1} } else { $t{$1}=++$tc }/ge'
Or more briefly, again, granting that the data is already tokenized to one
word token per line:
cat token.stream | \
perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'
As for tokenization, a separate perl command line could do that:
cat data.file | \
perl -ne '@t=split /[_\d\W]+/;print join($/,map{lc}@t,"")'
Substitute this bit for the "tokenize data.file" above, and you have your
program -- if this is the correct method of tokenization for your data.
(The output will include some blank lines, which you can ignore.) To handle
a full ISO accented character set in the tokenizer command, change the
split pattern from "/[_\d\W]+/" to "/[^a-z\xa1-\xff]+/i".
And finally, Daniel Walker gives another elegant one-line solution to the
problem (I am impressed!):
-----------------------------------------------------------------------------------------------------------
Actually, I believe the numbers are supposed to be incremented when a new
type is encountered and otherwise stay the same: the numbers change less
frequently towards the end of the file, and the last one printed is the
number of different types. So, an even terser one-liner (got to love perl)...
$ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
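To illustrate with a made-up six-token input:
$ printf 'the\ncat\nsat\non\nthe\nmat\n' | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
1
2
3
4
4
5
The repeated "the" leaves the count at 4, and the last number printed (5)
is the total number of types.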
Hopefully I did not miss anything. Thank you all again for your responses!
sampo
( : ============================ : )
Sampo Nevalainen, FM
planning officer
University of Joensuu
Department of International Communication
P.O. Box 48
57101 Savonlinna
tel +358-15-511 70 (switchboard)
+358-15-511 7704
fax +358-15-515 096
email samponev at cc.joensuu.fi
http://www.joensuu.fi/slnkvl/