[Corpora-List] Is a complete grammar possible (beyond the corpus itself)?
chris brew
cbrew at acm.org
Mon Sep 10 14:00:58 UTC 2007
> What does it mean when we label a tree-bank, or tag a corpus? What theory
> is behind the idea of "parts-of-speech"?
>
It depends. For example, Hockenmaier and Steedman's CCGbank comes with a
long technical report explaining which aspects of the dataset are motivated
by theory, and which are due to the fact that the dataset is derived from
the original Penn Treebank. The Penn Treebank in turn comes with a long
technical report explaining the basis for the decisions made. Some of these
are motivated by theory, others by practicalities. The decisions made are
bound to affect both the scientific future of the data and its usefulness
for engineering. Sampson's treebanks are based on a different set of
choices, but are again documented in detail and argued for.
Hockenmaier and Steedman say that
"The point of this exercise was to deliver more accurate wide-coverage
parsers capable of building interpretable structure (which standard
context-free Treebank grammars do not in general do), which could then be
used in applications such as question answering or summarization".
This is an engineering goal (building interpretable structure) with
scientific implications (potentially showing that CCG has a level
of rigour and flexibility that allows effective parsing).
This whole enterprise (not just Hockenmaier and Steedman, but ling-banking
in general) strikes me as exactly "doing syntax", with rigour, on corpora.
> What has changed is that we have stopped doing syntax. Sure, we've gained a
> lot of insight about the importance of lexicon and phraseology. That is not
> to be sniffed at. But when we try to do syntax what comes out is still
> mostly generativism, without the rigour.
>
There may be a disconnect between the live issues in current formal syntax
research and the concerns that are foregrounded in recent ACL papers, and
there may be scope for deeper thinking about what it is that the learning
systems are trying to learn, but I see plenty of rigour and care in the
machine learning work, and some deep thinking on the bigger issues. I don't
think things are that bad.