Corpora: a program needed
Alexander Clark
asc at aclark.demon.co.uk
Thu May 30 05:27:48 UTC 2002
Something like this?
The tokenisation is obviously very poor: it just splits on whitespace, so
punctuation stays attached to the words. But if you first run a tokenisation
tool to put the text in one-word-per-line format, it will work correctly.
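#!/usr/bin/perl -w
use strict;

my $numberTypes = 0;   # number of distinct word types seen so far
my %dict;              # word => index at which the type was first seen
#$/ = " ";             # (disabled) would read space-delimited records instead of lines

while (my $line = <>)
{
    # split(' ', ...) splits on runs of whitespace and ignores leading whitespace
    my @words = split(' ', $line);
    foreach my $word (@words) {
        # count the word only the first time it is encountered
        if (!exists($dict{$word})) {
            $dict{$word} = $numberTypes++;
        }
        # print the cumulative number of types after every token
        print("$numberTypes\n");
    }
}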
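
If you do not have a tokeniser to hand, a rough variant along the following
lines might be good enough. This is only a sketch under my own assumptions:
the sub name cumulative_types is made up for illustration, and the
"tokenisation" is simply lowercasing the text and taking maximal runs of word
characters and apostrophes, which is ASCII-oriented and will not suit every
language or corpus.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the running count of distinct word types,
# one entry per token of the input text.
# Assumed tokenisation: lowercase the text, then treat each maximal run of
# word characters (letters, digits, underscore) or apostrophes as a token.
sub cumulative_types {
    my ($text) = @_;
    my %seen;          # token => number of times seen so far
    my $types = 0;     # number of distinct types so far
    my @counts;
    for my $token (lc($text) =~ /[\w']+/g) {
        # increase the count only when the token has not been seen before
        $types++ unless $seen{$token}++;
        push @counts, $types;
    }
    return @counts;
}

# Small in-memory example; prints one number per token.
print "$_\n" for cumulative_types("The cat saw the dog. The dog saw the cat!");
```

With this, "The" and "the" count as one type and the trailing punctuation is
ignored, so the sample sentence ends at 4 types rather than more.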
Sampo Nevalainen wrote:
> Dear corporal mates,
>
> I am in an acute need for a simple program (dos, Windows, Unix) that
> would provide me with cumulative numbers of different words (types) as
> it skims through a text word by word. In other words, the program should
> print out a number for each word but increase the number only when a new
> type is encountered. The output would be something like that:
> 1
> 2
> 3
> 4
> 4
> 5
> 6
> 6
> 6
> ...
> Probably I could write this kind of program myself, but I do not have
> time or ardour to reinvent the wheel. Maybe a simple Perl script would
> do the trick? Thank you in advance for your support.
>
> yours,
> sampo
>
>
> ( : ============================================= : )
>
> Sampo Nevalainen, M.A.
> Researcher
> University of Joensuu
> Savonlinna School of Translation Studies
> P.O.Box 48
> FIN-57101 Savonlinna
> FINLAND
>
> tel +358-15-511 70 (operator)
> +358-15-511 7704
> fax +358-15-515 096
> email samponev at cc.joensuu.fi
> http://www.joensuu.fi/slnkvl/
>
>
>
--
Alexander Clark
asc at aclark.demon.co.uk
http://www.issco.unige.ch/staff/clark/index.html
ISSCO/ETI, University of Geneva