[Corpora-List] Is a complete grammar possible (beyond the corpus itself)?

John F. Sowa sowa at bestweb.net
Tue Sep 11 06:48:05 UTC 2007


Rob,

I'll admit that something along those lines could be done:

 > When I say "generalize grammar ad-hoc from examples as you go"
 > I don't mean "as you develop your grammar". I mean "from
 > sentence-to-sentence."

But I don't believe that every sentence has a unique pattern.

I used the term "nonce grammar" for a pattern that is unique
to a particular document (a much smaller corpus than a genre).
Following is an example:

     For this process the following transaction codes are used:
     32 — loss on unbilled, 72 — gain on uncollected, and
     85 — loss on uncollected.  Any of these records that are
     actually taxes are bypassed.  Only client types 01 — Mar,
     05 — Internal Non/Billable, 06 — Internal Billable, and
     08 — BAS are selected. This is determined by a GETBDATA
     call to the client file. The unit that the gain or loss
     is assigned to is supplied at the time of its creation in EBT.

This text came from a description of the data formats that were
used by a certain program.  It's unlikely that any "broad coverage"
parser ever had a grammar rule of the following kind:

    S -> Integer "—" Phrase

But this short paragraph had seven occurrences of that peculiar
sentence type.  A single occurrence of that pattern is extremely
unlikely, but if that pattern occurs once, then it is highly
likely that it will occur again.  The same principle applies
to a large number of occurrences of what I call "nonce grammar":

    A syntactic pattern that is highly unlikely, but if it occurs
    once in any document, it is likely to occur many times in that
    document.
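
As a rough illustration, the rule S -> Integer "—" Phrase can be
approximated with a regular expression and counted over the sample
paragraph.  This is only a sketch: the names TEXT, CODE_ITEM, and
find_code_items are made up for the example, and the assumption that
a Phrase runs up to the next comma or period is a crude stand-in for
real phrase boundaries (it lets the last item swallow "are selected").

```python
import re

DASH = "\u2014"  # the dash that separates each code from its label

# The listing sentences of the sample paragraph quoted in the message.
TEXT = (
    "For this process the following transaction codes are used: "
    f"32 {DASH} loss on unbilled, 72 {DASH} gain on uncollected, and "
    f"85 {DASH} loss on uncollected.  Any of these records that are "
    f"actually taxes are bypassed.  Only client types 01 {DASH} Mar, "
    f"05 {DASH} Internal Non/Billable, 06 {DASH} Internal Billable, and "
    f"08 {DASH} BAS are selected."
)

# Integer, dash, then a "phrase" that runs to the next comma or period
# (an assumed boundary, chosen only to keep the sketch short).
CODE_ITEM = re.compile(r"(\d+)\s*" + DASH + r"\s*([^,.]+)")


def find_code_items(text):
    """Return (integer, phrase) pairs for every match of the nonce rule."""
    return [(m.group(1), m.group(2).strip()) for m in CODE_ITEM.finditer(text)]


items = find_code_items(TEXT)
for number, phrase in items:
    print(number, "->", phrase)
print(len(items), "occurrences")
```

A hand-written pattern like this only helps once the rule is known;
the harder problem is noticing the rule in the first place.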

This is an example of the kind of short-term extension to a
natural language that people invent many, many times.  And
this kind of rare but repeated pattern is something for
which suitable machine-learning programs could be written.
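
One very simple way such a program might start, offered only as a
sketch under stated assumptions: reduce each token to a coarse shape
(INT, WORD, or the punctuation mark itself), count shape trigrams in
the document, and keep those that repeat in the document but never
appear in a background sample.  The tokenizer, the shape classes, the
threshold min_repeats, and the tiny BACKGROUND text are all
hypothetical choices made for the example, not an established method.

```python
import re
from collections import Counter

DASH = "\u2014"

# Digits, then other word characters, then any single punctuation mark.
TOKEN = re.compile(r"\d+|\w+|[^\w\s]")


def shape(token):
    """Map a token to a coarse class; punctuation stands for itself."""
    if token.isdigit():
        return "INT"
    if token[0].isalnum() or token[0] == "_":
        return "WORD"
    return token


def shape_ngrams(text, n=3):
    """Count every n-gram of token shapes in the text."""
    shapes = [shape(t) for t in TOKEN.findall(text)]
    return Counter(tuple(shapes[i:i + n]) for i in range(len(shapes) - n + 1))


def nonce_patterns(document, background, n=3, min_repeats=3):
    """Shape n-grams repeated in the document but absent from the background."""
    doc_counts = shape_ngrams(document, n)
    bg_counts = shape_ngrams(background, n)
    return {p: c for p, c in doc_counts.items()
            if c >= min_repeats and bg_counts[p] == 0}


# Hypothetical background sample of ordinary prose.
BACKGROUND = (
    "The parser reads each sentence and builds a tree. "
    "Most rules cover ordinary clauses, phrases, and words."
)

DOCUMENT = (
    "For this process the following transaction codes are used: "
    f"32 {DASH} loss on unbilled, 72 {DASH} gain on uncollected, and "
    f"85 {DASH} loss on uncollected. Only client types 01 {DASH} Mar, "
    f"05 {DASH} Internal Non/Billable, 06 {DASH} Internal Billable, and "
    f"08 {DASH} BAS are selected."
)

found = nonce_patterns(DOCUMENT, BACKGROUND)
for pattern, count in sorted(found.items(), key=lambda kv: -kv[1]):
    print(count, pattern)
```

The trigram (INT, dash, WORD) is among the patterns reported, along
with overlapping fragments of the same construction.  A realistic
learner would need a much larger background corpus and some way to
merge those overlapping fragments into a single rule.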

John




