[Corpora-List] Is a complete grammar possible (beyond the corpus itself)?

Rob Freeman lists at chaoticlanguage.com
Tue Sep 11 11:11:27 UTC 2007


John,

On 9/11/07, John F. Sowa <sowa at bestweb.net> wrote:
>
> Rob,
>
> I'll admit that something along those lines could be done:
>
> > When I say "generalize grammar ad-hoc from examples as you go"
> > I don't mean "as you develop your grammar". I mean "from
> > sentence-to-sentence."
>
> But I don't believe that every sentence has a unique pattern.


Thank you. This is a relevant objection I can argue against.

It is indeed hard to imagine that you need to make grammatical
generalizations at the level of each use of each word. Nevertheless, I think
it is so.

There is nothing like an example.

Look at these "errors" made by ESL students (Peter Howarth, Phraseology and
Second Language Acquisition, 1998):

"*Those learners usually _pay_ more _efforts_ in adopting a new language..."

"*_attempts_ and researches have been _done_ by psychologist..."

"*appropriate _policy_ to be _taken_ with regard to inspections"

What is wrong with these if not generalization inappropriate to the context?

Note that we are used to this level of selection in the lexicon. But here we
see it in syntax, in new combinations. We can see how the language gradually
generalizes, but in a context-specific way. In another context "done" and
"made" would have been in the same class, but not in the context of
"attempt" (i.e. "done" and "made" are in the same class in the context of
"study": "do/make a study", but they are not in the same class in the context
of "attempt").
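
To make the point concrete, here is a minimal sketch in Python of word classes
that hold only relative to a context. The verb-noun pairs are hand-written toy
data taken from the "study"/"attempt" example above, not counts from any
corpus, and the names ATTESTED, verbs_by_noun and same_class are hypothetical.

```python
from collections import defaultdict

# Toy verb-noun pairs, hand-written for illustration (not corpus data).
ATTESTED = [("do", "study"), ("make", "study"), ("make", "attempt")]

def verbs_by_noun(pairs):
    """Group the attested verbs under each noun they combine with."""
    table = defaultdict(set)
    for verb, noun in pairs:
        table[noun].add(verb)
    return dict(table)

def same_class(table, verb_a, verb_b, noun):
    """True if both verbs are attested with this particular noun."""
    return {verb_a, verb_b} <= table.get(noun, set())

table = verbs_by_noun(ATTESTED)
print(same_class(table, "do", "make", "study"))    # True
print(same_class(table, "do", "make", "attempt"))  # False
```

The class membership is computed per noun, so "do" and "make" fall together
for "study" and apart for "attempt", with no global verb class assumed.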

> I used the term "nonce grammar" for a pattern that is unique
> to a particular document (a much smaller corpus than a genre).
> Following is an example:
>
>      For this process the following transaction codes are used:
>      32 — loss on unbilled, 72 — gain on uncollected, and
>      85 — loss on uncollected.  Any of these records that are
>      actually taxes are bypassed.  Only client types 01 — Mar,
>      05 — Internal Non/Billable, 06 — Internal Billable, and
>      08 — BAS are selected. This is determined by a GETBDATA
>      call to the client file. The unit that the gain or loss
>      is assigned to is supplied at the time of its creation in EBT.
>
> This text came from a description of the data formats that were
> used by a certain program.  It's unlikely that any "broad coverage"
> parser ever had a grammar rule of the following kind:
>
>     S -> Integer "—" Phrase
>
> But this short paragraph had seven occurrences of that peculiar
> sentence type.  A single occurrence of that pattern is extremely
> unlikely, but if that pattern occurs once, then it is highly
> likely that it will occur again.  The same principle applies
> to a large number of occurrences of what I call "nonce grammar":
>
>     A syntactic pattern that is highly unlikely, but if it occurs
>     once in any document, it is likely to occur many times in that
>     document.
>
> This is an example of the kind of short-term extension to a
> natural language that people invent many, many times.  And
> this kind of rare, but repeated pattern is something for
> which suitable machine-learning programs could be written.
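
As a rough illustration of the detection step only (not a learner, and not
anything John describes implementing), the following Python sketch counts the
Integer-dash-Phrase pattern in his sample paragraph. The regular expression,
the function name nonce_items and the min_repeats threshold are assumptions
made for this example.

```python
import re

SAMPLE = """For this process the following transaction codes are used:
32 — loss on unbilled, 72 — gain on uncollected, and
85 — loss on uncollected.  Any of these records that are
actually taxes are bypassed.  Only client types 01 — Mar,
05 — Internal Non/Billable, 06 — Internal Billable, and
08 — BAS are selected."""

# Assumed surface form of the rule  S -> Integer dash Phrase:
# an integer, a dash (U+2014), then text up to the next comma or period.
ITEM = re.compile(r"\b(\d+)\s+\u2014\s+([^,.]+)")

def nonce_items(text, min_repeats=2):
    """Return (code, phrase) pairs if the pattern repeats in this text."""
    found = [(code, " ".join(phrase.split()))
             for code, phrase in ITEM.findall(text)]
    return found if len(found) >= min_repeats else []

items = nonce_items(SAMPLE)
print(len(items))   # 7
print(items[0])     # ('32', 'loss on unbilled')
```

The phrase boundary here is crude: the last item is captured as "BAS are
selected", which shows why a fixed rule is not enough and something learned
from the document itself would be needed.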


If a usage repeats, you don't want to generalize grammar for it each time, I
agree. I think in this case it does become fixed in the grammar. I think
that is what we call lexicon.

-Rob