Corpora: a program needed

Walker, Daniel Daniel.Walker at bowneglobal.com
Thu May 30 22:05:08 UTC 2002


Actually, I believe the number is supposed to be incremented only when a
new type is encountered and otherwise stay the same: the numbers change less
frequently towards the end of the file, and the last one printed is the
number of distinct types. So, an even terser one-liner (got to love Perl):

$ cat file
this
is
a
test
this
really
is
a
test

$ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
1
2
3
4
4
5
5
5
5

Cordially,
Daniel Walker

-----Original Message-----
From: David Graff [mailto:graff at unagi.cis.upenn.edu]
Sent: Thursday, May 30, 2002 7:35 AM
To: Sampo Nevalainen
Cc: corpora at hd.uib.no
Subject: Re: Corpora: a program needed



Sampo,

The command-line Perl script I sent you earlier (which I failed to copy
to the list) could actually be expressed more briefly.  Again, granting
that the data is already tokenized to one word token per line:

cat token.stream | \
 perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'


    Best regards,

	Dave Graff


