Corpora: Relative text length ...

Wed May 1 16:14:36 UTC 2002

The original article is at

http://ojps.aip.org/journal_cgi/dbt?KEY=PRLTAO&Volume=88&Issue=4&jsessionid=1026771020268694431

It has been known for quite a long time that language is anything but
random. Take any two texts or corpora and you will find huge deviations in
frequencies from what would be expected if words (or letters, or any other
unit) were drawn at random.

There is therefore no surprise in the "discovery" that zippers, which encode
frequent sequences in few bytes and spend more bytes only on rare
sequences, achieve different compression rates on different texts, and
that this fact can be used as a (rough) measure of distance between texts.
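
To make the idea concrete, here is a minimal sketch in Python. It is not
the method of the article itself: it uses the normalized compression
distance, a common formulation of the same idea, with zlib standing in for
the zipper. The function names and the two sample texts are my own, chosen
only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs for `data` at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two byte strings.

    If `a` and `b` share many sequences, compressing them together costs
    little more than compressing the larger one alone, so the value is
    close to 0. Unrelated inputs give values close to 1.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


# Hypothetical toy samples; real use would need much longer texts.
ENGLISH = b"the quick brown fox jumps over the lazy dog " * 20
FRENCH = b"portez ce vieux whisky au juge blond qui fume " * 20

if __name__ == "__main__":
    print("english vs english:", ncd(ENGLISH, ENGLISH))
    print("english vs french: ", ncd(ENGLISH, FRENCH))
```

On such toy input a text should come out closer to itself than to a text
in another language, which is all that "distance" means here. The measure
remains rough: it depends on the compressor, its window size, and the
length of the texts.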

What is really surprising is not so much that some scientists reinvent
the wheel (badly), but that so much publicity is given to these
rediscoveries (I have seen this discovery announced on several lists,
letters, web sites, etc.) and that such a prestigious journal
(Physical Review Letters) would publish them.

And why in a physics journal, of all places? Will the next issue of
Computational Linguistics include our latest papers on Positron Annihilation
in Molecules or Magnetic-Field Generation in Plasmas? I suppose that we
would say stupid things.

--jv