<br><div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div>What does it mean when we label a tree-bank, or tag a corpus? What theory is behind the idea of "parts-of-speech"?

</div></div></blockquote><div><br>It depends. For example, Hockenmaier and Steedman's CCGbank comes with a long technical report explaining which aspects of the dataset are motivated by theory, and which are due to the fact that the dataset is derived from the original Penn Treebank.
The Penn Treebank in turn comes with its own long technical report explaining the basis for the decisions made. Some of these are motivated by theory, others by practicalities. The decisions made are bound to affect both the scientific future of the data and its usefulness for engineering.
Sampson's treebanks are based on a different set of choices, but are again documented in detail and argued for.<br><br>Hockenmaier and Steedman say that<br><br>"The point of this exercise was to deliver more accurate wide-coverage parsers capable of building interpretable structure (which standard context-free Treebank grammars do not in general do), which could then be used in applications such as question answering or summarization"

<br><br>This is an engineering goal (building interpretable structure) with scientific implications (potentially showing that CCG has a level of rigour and flexibility that allows effective parsing).<br><br>This whole enterprise (not just Hockenmaier and Steedman, but ling-banking in general) strikes me as exactly "doing syntax", with rigour, on corpora.

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div>What has changed is that we have stopped doing syntax. Sure, we've gained a lot of insight about the importance of lexicon and phraseology. That is not to be sniffed at. But when we try to do syntax what comes out is still mostly generativism, without the rigour.

</div></div></blockquote><div><br>There may be a disconnect between the live issues in current formal syntax research and the concerns that are foregrounded in recent ACL papers, and there may be scope for deeper thinking about what it is that the learning systems are trying to learn, but I see plenty of rigour and care in the machine learning work, and some deep thinking on the bigger issues. I don't think things are that bad.<br></div></div><br>