[Corpora-List] Re: ANC, FROWN, Fuzzy Logic
FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE
jfidel at siu.buap.mx
Mon Jul 24 15:42:28 UTC 2006
Daoud Clarke wrote:
> ...
> I think what the reference to Greg Chaitin's work was getting at was
> perhaps related to the following. In practice we are always faced with
> a finite corpus, whereas the theoretical corpora generated by rules are
> infinite. We can view our finite corpus as a sample from some hypothetical
> infinite corpus. The question is, what theory gives us the best estimate
> of this infinite corpus, given the finite sample? Using our finite corpus
> we can form theories about the infinite corpus, which may or may not
> incorporate our linguistic knowledge of the language in question. From an
> information theoretic perspective, the best theory would be the one that
> enabled us to express the finite corpus using the least amount of
> information -- the one that best compressed the information in the corpus.
>
> Of course theories become large and unwieldy, so we may prefer the minimum
> description length principle: the best theory for a sequence of data is
> the one that minimises the size of the theory plus the size of the data
> described using the theory.
What? Goodness = L(theory) [pick a metric] + L(data handled)?
I could go along with the first term, but for the 'best' theory (one that
handles all the aleph-zero [countably infinite] sentences of the language
[maybe more (!), depending on what we include in the language]), the second
term would be infinite, which seems in some way incommensurate with the
results for less good theories, which are presumably finite. Another flaw:
this countably infinite number (sorry for my abbreviated expression, math
geeks!) is mathematically the same for a theory which handles all and only
sentences with exactly one embedded sentence in them, ie a limited, partial
theory of the language. Such a theory might be slightly longer than the
first, due to the restrictions we would have to place on the first one to
get the second, but the data it handles would of course have the same
'size' (ie, the two sets could be placed into 1-to-1 correspondence). In
any case, though I am unfamiliar with this theory, it would seem more
useful to place everything on a scale from 0 to 1, say. Thinking about it,
that would bring in other complications, such as adding in a higher-order
infinity of points.
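For what it's worth, the two-part score under discussion is only ever
computed on a *finite* sample, where both terms stay finite. A toy sketch
(my own illustration, not anyone's actual method): zlib's preset-dictionary
feature stands in for "encoding the data using the theory", and the two
byte strings are made-up 'theories'.

```python
import zlib

def mdl_score(theory: bytes, corpus: bytes) -> int:
    """Two-part MDL score in bits: length of the theory itself, plus the
    length of the corpus compressed with the theory as a preset dictionary
    (a crude stand-in for 'data described using the theory')."""
    comp = zlib.compressobj(zdict=theory)
    coded = comp.compress(corpus) + comp.flush()
    return 8 * (len(theory) + len(coded))

# A finite sample from our hypothetical infinite corpus.
corpus = b"the cat sat on the mat " * 50

# Two invented 'theories': one that captures the sample's regularities,
# and a longer one that captures none of them.
good_theory = b"the cat sat on the mat "
bad_theory = b"colourless green ideas sleep furiously "

print(mdl_score(good_theory, corpus), mdl_score(bad_theory, corpus))
```

On this sample the relevant codebook wins on both terms at once: it is
shorter, and it compresses the data better — which is the trade-off the
MDL principle is meant to arbitrate.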
> Some of this has been put into practice by Bill Teahan, who applies text
> compression techniques to NLP applications. It would, however, be
> extremely interesting to see whether the use of linguistic theories can
> help provide better text compression. To my knowledge this has not been
> looked into.
>
> Daoud Clarke
>
>
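To give a concrete flavour of the compression-for-NLP idea mentioned above:
Teahan's work uses PPM models, but the same principle can be sketched with
zlib as a crude stand-in. The sample texts and function names below are
invented for illustration.

```python
import zlib

def compressed_len(data: bytes) -> int:
    """Compressed size in bytes, used as a rough description length."""
    return len(zlib.compress(data, 9))

def classify(text: bytes, samples: dict) -> str:
    """Pick the class whose training text best 'explains' the new text:
    the one where appending the text adds the fewest compressed bytes
    (a stand-in for PPM-style cross-entropy classification)."""
    def cost(train: bytes) -> int:
        return compressed_len(train + text) - compressed_len(train)
    return min(samples, key=lambda label: cost(samples[label]))

# Invented training samples, one per 'language'.
samples = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "spanish": b"el rapido zorro marron salta sobre el perro perezoso " * 20,
}

print(classify(b"the lazy dog jumps over the quick brown fox", samples))
```

The English test sentence is assembled almost entirely from substrings of
the English training text, so the compressor encodes it cheaply there and
expensively against the Spanish sample.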
Jim
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje, ICSyH
Benemérita Universidad Autónoma de Puebla MÉXICO