Corpora: Relative text length ...

Paul Clough p.clough at dcs.shef.ac.uk
Wed May 1 11:05:10 UTC 2002


Dear all,

I was interested in the discussion regarding relative text length
and wondered whether this article about text compression
was related in any way:

http://www.wired.com/news/technology/0,1282,50192,00.html

"In the Jan. 28 issue of the journal Physical Review Letters, three Italian
scientists used the Unix compression program gzip on text files to address
such pattern-matching issues as language of composition and authorship."

"Since data compression entails recognizing and tagging repeated strings,
the more repeated internal patterns that a file or collection of files has,
the more it can be compressed. Thus, if one wants to know the language in
which file X was written, just compress it with files whose language is
known and then compare how efficiently each operation is carried out."

"If, by comparing raw and compressed file sizes, one finds that X plus an
Italian text file zips tighter than X plus a French text or X plus an
English text or X plus one's other linguistic reference texts, then
congratulazioni! You've likely just found the language of X without even
opening it."

"The scientists -- Dario Benedetto, Emanuele Caglioti and Vittorio Loreto
of Rome's La Sapienza University -- used this technique to discern the
language of mystery texts as small as 20 characters. Furthermore, using a
database of 90 texts from 11 different authors, they found their method
could even pick out individual authors with a success rate of 93 percent."

It might be worth testing whether a simple technique like this
(compression at the byte level) could work here.
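
To make the idea concrete, here is a minimal sketch in Python. It rests on
several assumptions of mine: it uses the zlib module (DEFLATE, the same
algorithm gzip uses) rather than the gzip program itself, it appends the
unknown text to each reference text, and it scores by the number of extra
compressed bytes. The function names are hypothetical, and the scoring is
not necessarily the measure used in the paper.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, unknown: bytes) -> int:
    """Growth of the compressed reference when `unknown` is appended to it."""
    return compressed_size(reference + unknown) - compressed_size(reference)


def guess_language(unknown: str, references: Mapping[str, str]) -> str:
    """Return the label whose reference text absorbs `unknown` most cheaply.

    Assumption: fewer extra compressed bytes means more shared patterns.
    """
    data = unknown.encode("utf-8")
    return min(
        references,
        key=lambda label: extra_bytes(references[label].encode("utf-8"), data),
    )


if __name__ == "__main__":
    # Toy reference texts for illustration only; real ones should be far longer.
    references = {
        "english": "the quick brown fox jumps over the lazy dog and the cat "
                   "sat on the mat while the rain fell on the town ",
        "italian": "nel mezzo del cammin di nostra vita mi ritrovai per una "
                   "selva oscura che la diritta via era smarrita ",
    }
    print(guess_language("the dog sat on the mat in the rain", references))
```

With reference texts this short the result is not reliable; the sketch only
shows the mechanics. Note also that DEFLATE only looks back 32 KB, so for
long reference texts only the tail of the reference can influence the score.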

Paul.

----------------------------------------------------------------------------
Paul Clough

Natural Language Processing Group,
Department of Computer Science,
University of Sheffield,
G35 Regent Court,
211 Portobello Street,
SHEFFIELD,
S1 4DP.

http://www.dcs.shef.ac.uk/~cloughie/index.html
----------------------------------------------------------------------------


