[Corpora-List] Re: ANC, FROWN, Fuzzy Logic
FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE
jfidel at siu.buap.mx
Mon Jul 24 15:42:28 UTC 2006
Daoud Clarke wrote:
> ...
> I think what the reference to Greg Chaitin's work was getting at was
> perhaps related to the following. In practice we are always faced with
> a finite corpus, whereas the theoretical corpora generated by rules are
> infinite. We can view our finite corpus as a sample from some hypothetical
> infinite corpus. The question is, what theory gives us the best estimate
> of this infinite corpus, given the finite sample? Using our finite corpus
> we can form theories about the infinite corpus, which may or may not
> incorporate our linguistic knowledge of the language in question. From an
> information theoretic perspective, the best theory would be the one that
> enabled us to express the finite corpus using the least amount of
> information -- the one that best compressed the information in the corpus.
>
> Of course theories become large and unwieldy, so we may prefer the minimum
> description length principle: the best theory for a sequence of data is
> the one that minimises the size of the theory plus the size of the data
> described using the theory.
What? Goodness = L(theory) [pick a metric] + L(data handled)?
I could go along with the first term, but for the 'best' theory (one that
handles all the aleph-zero [countably infinite] sentences of the language
[maybe more (!), depending on what we include in the language]), the second
term would be infinite, which seems in some way incommensurate with the
results for less good theories, which are presumably finite. Another flaw:
this countably infinite number (sorry for my abbreviated expression, math
geeks!) is mathematically the same for a theory which handles all and only
sentences with exactly one embedded sentence in them, ie a limited, partial
theory of the language. Such a theory might be slightly longer than the
first, due to the restrictions we would have to place on the first one to
get the second, but the data it handles would of course have the same
'size' (ie, the two sets could be placed into 1-to-1 correspondence). In
any case, though I am unfamiliar with this theory, it would seem more
useful to place everything on a scale from 0 to 1, say. Thinking about it,
that would bring in other complications, such as adding in a higher-order
infinity of points.
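For what it's worth, the two-part score under discussion is only ever
computed on a *finite* sample, where both terms stay finite. A toy sketch
(my own illustration, not anyone's actual method): zlib's preset-dictionary
feature stands in for "encoding the data using the theory", and the two
byte strings are made-up 'theories'.

```python
import zlib

def mdl_score(theory: bytes, corpus: bytes) -> int:
    """Two-part MDL score in bits: length of the theory itself, plus the
    length of the corpus compressed with the theory as a preset dictionary
    (a crude stand-in for 'data described using the theory')."""
    comp = zlib.compressobj(zdict=theory)
    coded = comp.compress(corpus) + comp.flush()
    return 8 * (len(theory) + len(coded))

# A finite sample from our hypothetical infinite corpus.
corpus = b"the cat sat on the mat " * 50

# Two invented 'theories': one that captures the sample's regularities,
# and a longer one that captures none of them.
good_theory = b"the cat sat on the mat "
bad_theory = b"colourless green ideas sleep furiously "

print(mdl_score(good_theory, corpus), mdl_score(bad_theory, corpus))
```

On this sample the relevant codebook wins on both terms at once: it is
shorter, and it compresses the data better — which is the trade-off the
MDL principle is meant to arbitrate.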
> Some of this has been put into practice by Bill Teahan, who applies text
> compression techniques to NLP applications. It would, however, be
> extremely interesting to see whether the use of linguistic theories can
> help provide better text compression. To my knowledge this has not been
> looked into.
>
> Daoud Clarke
>
>
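To give a concrete flavour of the compression-for-NLP idea mentioned above:
Teahan's work uses PPM models, but the same principle can be sketched with
zlib as a crude stand-in. The sample texts and function names below are
invented for illustration.

```python
import zlib

def compressed_len(data: bytes) -> int:
    """Compressed size in bytes, used as a rough description length."""
    return len(zlib.compress(data, 9))

def classify(text: bytes, samples: dict) -> str:
    """Pick the class whose training text best 'explains' the new text:
    the one where appending the text adds the fewest compressed bytes
    (a stand-in for PPM-style cross-entropy classification)."""
    def cost(train: bytes) -> int:
        return compressed_len(train + text) - compressed_len(train)
    return min(samples, key=lambda label: cost(samples[label]))

# Invented training samples, one per 'language'.
samples = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "spanish": b"el rapido zorro marron salta sobre el perro perezoso " * 20,
}

print(classify(b"the lazy dog jumps over the quick brown fox", samples))
```

The English test sentence is assembled almost entirely from substrings of
the English training text, so the compressor encodes it cheaply there and
expensively against the Spanish sample.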
Jim
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje, ICSyH
Benemérita Universidad Autónoma de Puebla MÉXICO