Corpora: a program needed
Walker, Daniel
Daniel.Walker at bowneglobal.com
Thu May 30 22:05:08 UTC 2002
Actually, I believe the numbers are supposed to be incremented when a new
type is encountered and otherwise stay the same: the numbers change less
frequently towards the end of the file, and the last one printed is the
number of distinct types. So, an even terser one-liner (got to love Perl)
...
$ cat file
this
is
a
test
this
really
is
a
test
$ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
1
2
3
4
4
5
5
5
5
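
For readers who find the one-liner too terse, here is a minimal sketch of the same running type count written out in Python. It assumes, as above, that the input is already tokenized to one token per line; the function name running_type_counts is made up for illustration and is not part of the original script.

```python
def running_type_counts(tokens):
    """For each token, return the number of distinct types seen so far."""
    seen = set()      # types encountered up to this point
    counts = []
    for token in tokens:
        seen.add(token)            # no effect if the type was already seen
        counts.append(len(seen))   # grows only when a new type appears
    return counts


# The nine tokens from the sample file above.
tokens = ["this", "is", "a", "test", "this", "really", "is", "a", "test"]
print(running_type_counts(tokens))  # [1, 2, 3, 4, 4, 5, 5, 5, 5]
```

The last value in the list is the number of distinct types in the file, which matches the final line printed by the one-liner.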
Cordially,
Daniel Walker
-----Original Message-----
From: David Graff [mailto:graff at unagi.cis.upenn.edu]
Sent: Thursday, May 30, 2002 7:35 AM
To: Sampo Nevalainen
Cc: corpora at hd.uib.no
Subject: Re: Corpora: a program needed
Sampo,
The command-line Perl script I sent you earlier (which I failed to copy
to the list) could actually be expressed more briefly. Again, granting
that the data is already tokenized to one word token per line:
cat token.stream | \
perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'
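
As a rough illustration of what this command does, the following Python sketch assigns each type an ID in order of first appearance and prints that ID for every token. It assumes one token per line, as stated; the name type_ids and the sample token list are hypothetical and used only for the example.

```python
def type_ids(tokens):
    """Replace each token with the ID of its type (1, 2, 3, ... by first appearance)."""
    ids = {}          # type -> ID assigned on first sight
    out = []
    for token in tokens:
        if token not in ids:
            ids[token] = len(ids) + 1   # next unused ID
        out.append(ids[token])          # a repeated token reuses its earlier ID
    return out


# Hypothetical sample input, one token per element.
tokens = ["this", "is", "a", "test", "this", "really", "is", "a", "test"]
print(type_ids(tokens))  # [1, 2, 3, 4, 1, 5, 2, 3, 4]
```

Note that a repeated token gets back the ID of its type, so the numbers can go down as well as up; the largest ID assigned is the number of distinct types.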
Best regards,
Dave Graff