Corpora: a program needed - a kinda summary
Sampo Nevalainen
samponev at cc.joensuu.fi
Fri May 31 07:08:53 UTC 2002
Hi,
I was asked for a summary of the responses I got to my request for a
simple program that would calculate the cumulative number of types in a
text file. So here it comes (although you'll see that I was not given the
gift of summarising things!).
First, a couple of links kindly supplied by Paul Clough:
Dan Melamed has a number of Perl scripts which are very useful for
linguistic tasks:
http://www.cs.nyu.edu/~melamed/software.html
Another good source of Perl modules is CPAN:
http://www.cpan.org/
And now to the solutions I got. Not surprisingly, all the scripts were
written in Perl, and this summary shows pretty well the abilities of this
language as we proceed from a dozen lines down to a single command line. I
have edited the mails a little, but the scripts, of course, are intact. I
personally do not know Perl very well (I have some programming experience
in Basic, Turbo Pascal and C++), and I have not tested all of the following
scripts, so I WILL NOT be responsible for any nasty things they may do on
your puter... for example, format your hard disk ;-)
Sebastian Hoffmann:
--------------------------------------------------------------------------------------------------
#!/usr/bin/perl
$countDifferent = 0;
open (IN, "</path/to/file") || die "can't open the file!";
while (<IN>) {
    $line = $_;
    @words = split(/\s/, $line);
    foreach $word (@words) {
        if (!$words{$word}) {       # first occurrence of this type (%words is the hash of seen types)
            $countDifferent++;
            $words{$word} = 1;
        }
        print "$countDifferent\n";  # cumulative type count after each token
    }
}
close (IN);
exit(0);
---------------
The script "assumes that you are interested in orthographic words and that
there is always one whitespace between words". As a response to Sebastian
Hoffmann, Klas Prytz suggests that couldn't it be a good idea to 'chomp'
the lines before splitting them so that not words at the end of lines are
counted as separate words just because they have a end of line character at
the end? Sebastian encounters a couple of other problems with the script:
- It doesn't distinguish between lower and upper case (which could easily
be remedied by adding "$line=lc($line);")
- What happens to punctuation? If you add "$line=~s/[,.;:-!?]//g;" this
would be taken care of - but no difference is being made between sentence
boundaries and abbreviations.
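Putting those suggestions together, here is a sketch of my own (untested,
so the usual disclaimer applies; the file path is a placeholder):
----------
#!/usr/bin/perl
# Sebastian's script with chomp, lower-casing and punctuation
# stripping folded in.
$countDifferent = 0;
open (IN, "</path/to/file") || die "can't open the file!";
while ($line = <IN>) {
    chomp($line);               # drop the end-of-line character
    $line = lc($line);          # fold everything to lower case
    $line =~ s/[,.;:!?-]//g;    # strip (some) punctuation
    foreach $word (split(' ', $line)) {
        if (!$seen{$word}) {    # first occurrence of this type
            $countDifferent++;
            $seen{$word} = 1;
        }
        print "$countDifferent\n";
    }
}
close (IN);
----------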
Alexander Clark has another approach to the problem:
--------------------------------------------------------------------------------
The tokenisation is obviously very poor. But if you run a tokenisation tool
to put it in one word per line format, it would work correctly.
----------
#!/usr/bin/perl -w
$numberTypes = 0;
%dict = ();
#$/ = " ";
while ($line = <>)
{
    @words = split(' ', $line);
    foreach $word (@words) {
        if (!exists($dict{$word})) {
            $dict{$word} = $numberTypes++;   # remember the type with its ID
        }
        print("$numberTypes\n");             # cumulative count after each token
    }
}
----------
And a pretty similar solution from Kaarel Kaljurand:
--------------------------------------------------------------------------------
This is a Perl program which expects its input from STDIN, and expects
each token (word) to be on a separate line. Each type is stored in a hash
(%wordlist), so you might run out of memory when the input file is really
huge. (See the usage sketch after the script.)
--cut--
#!/usr/bin/perl -w
use strict;
my %wordlist = ();
my $i = 0;
while (<>) {
    if (!defined($wordlist{$_})) {   # first occurrence of this type
        $i++;
        $wordlist{$_} = 1;
    }
    print "$i\n";                    # cumulative count after each token
}
--cut--
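Since each input line counts as one token, raw text has to be broken up
first. A crude sketch of such a pipeline (types.pl is just my placeholder
name for the script above):
----------
cat data.file | \
perl -ne 'print map { "$_\n" } split' | \
perl types.pl
----------
This only splits on whitespace, so punctuation stays glued to the words;
for real data one of the tokenisation recipes below would do a better job.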
Dave Graff also points out the problem of tokenization:
------------------------------------------------------------------------------------------------------------------------
The harder part of the problem is tokenization -- deciding what patterns
constitute actual "types" (excluding all sorts of punctuation, normalizing
case, deciding whether to treat hyphen-connected forms as if they were
"space separated" or "not-space-separated", etc).
Assume you have a suitable tokenizer for your data that simply puts out one
word per line:
tokenize data.file | \
perl -pe 's/(\S+)/if(exists($t{$1})){ $t{$1} } else { $t{$1}=++$tc }/ge'
Or more briefly, again, granting that the data is already tokenized to one
word token per line:
cat token.stream | \
perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'
As for tokenization, a separate perl command line could do that:
cat data.file | \
perl -ne '@t=split /[_\d\W]+/;print join($/,map{lc}@t,"")'
Substitute this bit for the "tokenize data.file" above, and you have your
program -- if this is the correct method of tokenization for your data.
(The output will include some blank lines, which you can ignore.) To handle
a full ISO accented character set in the tokenizer command, change the
split pattern from "/[_\d\W]+/" to "/[^a-z\xa1-\xff]+/i".
And finally, Daniel Walker gives another elegant one-line solution to the
problem (I am impressed!):
-----------------------------------------------------------------------------------------------------------
Actually, I believe the numbers are supposed to be incremented when a new
type is encountered and otherwise stay the same: the numbers change less
frequently towards the end of the file, and the last one printed is the
number of different types. So, an even terser one-liner (got to love perl)...
$ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
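To illustrate with a made-up six-token input:
$ printf 'the\ncat\nsat\non\nthe\nmat\n' | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
1
2
3
4
4
5
The repeated "the" leaves the count at 4, and the last number printed (5)
is the total number of types.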
Hopefully I did not miss anything. Thank you all again for your responses!
sampo
( : ============================ : )
Sampo Nevalainen, FM
planning officer
University of Joensuu
Department of International Communication
P.O. Box 48
57101 Savonlinna
tel +358-15-511 70 (switchboard)
+358-15-511 7704
fax +358-15-515 096
email samponev at cc.joensuu.fi
http://www.joensuu.fi/slnkvl/