[Corpora-List] Is a complete grammar possible (beyond the corpus itself)?
John F. Sowa
sowa at bestweb.net
Tue Sep 11 06:48:05 UTC 2007
Rob,
I'll admit that something along those lines could be done:
> When I say "generalize grammar ad-hoc from examples as you go"
> I don't mean "as you develop your grammar". I mean "from
> sentence-to-sentence."
But I don't believe that every sentence has a unique pattern.
I used the term "nonce grammar" for a pattern that is unique
to a particular document (a much smaller corpus than a genre).
Following is an example:
For this process the following transaction codes are used:
32 — loss on unbilled, 72 — gain on uncollected, and
85 — loss on uncollected. Any of these records that are
actually taxes are bypassed. Only client types 01 — Mar,
05 — Internal Non/Billable, 06 — Internal Billable, and
08 — BAS are selected. This is determined by a GETBDATA
call to the client file. The unit that the gain or loss
is assigned to is supplied at the time of its creation in EBT.
This text came from a description of the data formats that were
used by a certain program. It's unlikely that any "broad coverage"
parser ever had a grammar rule of the following kind:
S -> Integer "—" Phrase
But this short paragraph had seven occurrences of that peculiar
sentence type. A single occurrence of that pattern is extremely
unlikely, but if that pattern occurs once, then it is highly
likely that it will occur again. The same principle applies
to a large number of occurrences of what I call "nonce grammar":
A syntactic pattern that is highly unlikely, but if it occurs
once in any document, it is likely to occur many times in that
document.
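As a rough illustration of the count above, the rule S -> Integer "—" Phrase can be approximated with a regular expression and run over the quoted paragraph. This is only a sketch: the names SAMPLE, RULE and find_code_phrases are made up for the illustration, and treating a Phrase as "everything up to the next comma or period" is a simplifying assumption, not part of the rule as stated.

```python
import re

# The paragraph quoted above, reproduced as sample input.
SAMPLE = """For this process the following transaction codes are used:
32 — loss on unbilled, 72 — gain on uncollected, and
85 — loss on uncollected. Any of these records that are
actually taxes are bypassed. Only client types 01 — Mar,
05 — Internal Non/Billable, 06 — Internal Billable, and
08 — BAS are selected. This is determined by a GETBDATA
call to the client file. The unit that the gain or loss
is assigned to is supplied at the time of its creation in EBT."""

# Hypothetical regex rendering of: S -> Integer "—" Phrase
# Assumption: a Phrase runs up to the next comma or period.
RULE = re.compile(r"(\d+) [\u2014-] ([^,.]+)")


def find_code_phrases(text):
    """Return (integer, phrase) pairs matching the nonce pattern."""
    flat = " ".join(text.split())  # undo the hard line wrapping
    return [(m.group(1), m.group(2).strip()) for m in RULE.finditer(flat)]


matches = find_code_phrases(SAMPLE)
for code, phrase in matches:
    print(code, "->", phrase)
print(len(matches), "occurrences")
```

The last match comes out as "BAS are selected", since the crude Phrase boundary cannot tell where the label ends and the rest of the sentence resumes. A real parser would need something better than punctuation to close the phrase.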
This is an example of the kind of short-term extension to a
natural language that people invent many, many times. And
this kind of rare but repeated pattern is something for
which suitable machine-learning programs could be written.
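One very simple way such a program might start (a sketch under stated assumptions, not a claim about how it should be done) is to reduce each token to a coarse shape, count shape trigrams within a single document, and flag those that repeat there but never appear in a background sample. The shape classes, the threshold min_repeats, the tiny BACKGROUND text and all function names below are assumptions made for the illustration.

```python
import re
from collections import Counter

# Tokens: runs of digits, runs of letters (with "/"), or one other symbol.
TOKEN = re.compile(r"\d+|[A-Za-z/]+|[^\sA-Za-z0-9]")


def shape(token):
    """Map a token to a coarse, hypothetical shape class."""
    if token.isdigit():
        return "INT"
    if token in {"\u2014", "-"}:
        return "DASH"
    if token[0].isalpha():
        return "WORD"
    return "PUNCT"


def shape_ngrams(text, n=3):
    """Count n-grams of token shapes in a text."""
    shapes = [shape(t) for t in TOKEN.findall(text)]
    return Counter(tuple(shapes[i:i + n]) for i in range(len(shapes) - n + 1))


def nonce_candidates(document, background, n=3, min_repeats=3):
    """Shape n-grams repeated in the document but unseen in the background."""
    doc, bg = shape_ngrams(document, n), shape_ngrams(background, n)
    return {p: c for p, c in doc.items() if c >= min_repeats and bg[p] == 0}


DOCUMENT = """For this process the following transaction codes are used:
32 — loss on unbilled, 72 — gain on uncollected, and
85 — loss on uncollected. Any of these records that are
actually taxes are bypassed. Only client types 01 — Mar,
05 — Internal Non/Billable, 06 — Internal Billable, and
08 — BAS are selected. This is determined by a GETBDATA
call to the client file. The unit that the gain or loss
is assigned to is supplied at the time of its creation in EBT."""

# A tiny stand-in for a background corpus; a real one would be far larger.
BACKGROUND = """This text came from a description of the data formats
that were used by a certain program. It's unlikely that any broad
coverage parser ever had a grammar rule of the following kind."""

candidates = nonce_candidates(DOCUMENT, BACKGROUND)
for pattern, count in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(count, " ".join(pattern))
```

With a background this small, overlapping fragments of the same construction are flagged as well, so the output is a list of candidates for a rule rather than a rule. Merging the fragments and generalizing WORD sequences into a Phrase is the part that would need actual learning.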
John
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora