Corpora: a program needed

Thu May 30 05:27:48 UTC 2002

Something like this?

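The tokenisation is obviously very poor: it only splits on whitespace,
so punctuation stays attached to words and "The" and "the" count as
different types. But if you first run a tokenisation tool to put the
text in one-word-per-line format, it will work correctly.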

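#!/usr/bin/perl
use strict;
use warnings;

# Print the cumulative number of distinct words (types) after each token.
my $numberTypes = 0;   # number of types seen so far
my %dict;              # word => order of first occurrence (0-based)

while (my $line = <>) {
    # Crude tokenisation: split on runs of whitespace.
    my @words = split(' ', $line);
    foreach my $word (@words) {
        if (!exists $dict{$word}) {
            $dict{$word} = $numberTypes++;
        }
        print "$numberTypes\n";
    }
}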
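
If no separate tokeniser is at hand, a rough version can be folded into
the script itself. The following is only a sketch under my own
assumptions, not a proper tokeniser: it lowercases the text and treats
any run of letters, digits, apostrophes and hyphens as a word. The
helper names tokenise and cumulative_types and the sample sentence are
made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lowercase a line and return its tokens, where a
# token is a run of letters, digits, apostrophes or hyphens (assumption).
sub tokenise {
    my ($line) = @_;
    my $lower = lc $line;
    return $lower =~ /[[:alnum:]'-]+/g;
}

# Hypothetical helper: record each token in the hash of seen types and
# return the running number of types after every token.
sub cumulative_types {
    my ($seen, @tokens) = @_;
    my @counts;
    for my $token (@tokens) {
        $seen->{$token} = 1;
        push @counts, scalar keys %$seen;
    }
    return @counts;
}

# Made-up sample text; prints 1 2 3 4 4 5 6 6 6, one number per line.
my %seen;
my $sample = "The quick brown fox. The lazy dog; the quick";
print "$_\n" for cumulative_types(\%seen, tokenise($sample));
```

To run it over a real text, the last three lines would be replaced by a
"while (my $line = <>)" loop that passes each line through the same two
helpers, keeping %seen outside the loop so the count carries across
lines.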

Sampo Nevalainen wrote:

> Dear corporal mates,
>
> I am in acute need of a simple program (DOS, Windows, Unix) that
> would provide me with cumulative numbers of different words (types) as
> it skims through a text word by word. In other words, the program should
> print out a number for each word but increase the number only when a new
> type is encountered. The output would be something like this:
> 1
> 2
> 3
> 4
> 4
> 5
> 6
> 6
> 6
> ...
> Probably I could write this kind of program myself, but I do not have
> time or ardour to reinvent the wheel. Maybe a simple Perl script would
> do the trick? Thank you in advance for your support.
>
> yours,
> sampo
>
>
> ( : ============================================= : )
>
> Sampo Nevalainen, M.A.
> Researcher
> University of Joensuu
> Savonlinna School of Translation Studies
> P.O.Box 48
> FIN-57101 Savonlinna
> FINLAND
>
> tel     +358-15-511 70      (operator)
>         +358-15-511 7704
> fax     +358-15-515 096
> email   samponev at cc.joensuu.fi
> http://www.joensuu.fi/slnkvl/
>
>
>

--
Alexander Clark
asc at aclark.demon.co.uk
http://www.issco.unige.ch/staff/clark/index.html
ISSCO/ETI, University of Geneva