[Corpora-List] RE : Annotation layers: missing reference

Yorick Wilks Y.Wilks at dcs.shef.ac.uk
Sun Nov 21 12:55:03 UTC 2010


I don't think the dating of "corpora and grammar" so early is right. I recall a very small and self-serving paper:

A. Krotov, M. Hepple, R. Gaizauskas and Y. Wilks. 1998. Compacting the Penn Treebank Grammar. Proceedings of the COLING-ACL'98 Joint Conference (The 17th International Conference on Computational Linguistics, and 36th Annual Meeting of the Association for Computational Linguistics). pp 699-703. Montreal, Canada. August 1998.

In this work my student Alex Krotov found that if you induced the PS grammar rules from the PTB, in a pretty straightforward way from the trees, then the number of rules was enormous and, most significantly, I thought, still rising linearly at the end of the PTB corpus. That didn't prove anything, but it made one wonder about all the claims of finite grammar and infinite language that we had all been indoctrinated with. Hepple showed in that paper that the huge set could be compressed to a smaller (but still very large) rule set without loss of coverage of the data, but I don't think this kind of analysis was being done by corpus people much earlier than this date.
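
To make the rule-counting idea concrete, here is a minimal sketch, not the code used in the paper. It assumes trees arrive as bracketed strings with a label on every bracket, reads one phrase-structure rule off each internal node, and records how many distinct rules have been seen after each tree. The function names and the three-tree toy corpus are invented for illustration; a curve that keeps climbing over a real treebank is the effect described above.

```python
def parse_tree(text):
    """Parse a bracketed tree, e.g. "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    into nested (label, children) tuples; words are plain strings.
    Assumption: every bracket carries a label (no unlabeled wrapper bracket)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def rules(tree):
    """Yield one (lhs, rhs) rule per internal node. Preterminal -> word
    rules are skipped so that only phrase-structure rules are counted."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return
    yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)


def growth_curve(bracketed_trees):
    """Return the number of distinct rules seen after each successive tree."""
    seen = set()
    curve = []
    for text in bracketed_trees:
        seen.update(rules(parse_tree(text)))
        curve.append(len(seen))
    return curve


# Hypothetical toy corpus, not taken from the PTB.
trees = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT a) (NN cat)) (VP (VBD slept)))",
    "(S (NP (NNP Kim)) (VP (VBD saw) (NP (DT the) (NN dog))))",
]
print(growth_curve(trees))
```

On a real corpus one would plot this curve against the number of trees read and look at whether it flattens out or keeps rising.
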
YW

On 21 Nov 2010, at 12:20, amsler at cs.utexas.edu wrote:

> However, corpora were well established as the basis for lexicography in the US by the 1970s with books such as the American Heritage Word Frequency Book serving as the basis for the "American Heritage Dictionary of the English Language" (Houghton Mifflin Co, 1969) (see foreword essay of the dictionary by Henry Kucera on "Computers in Language Analysis and in Lexicography"). This of course followed his significant "Computational Analysis of Present-Day American English" (Kucera & Francis, Brown U. Press, 1967).
> 
> Just out of curiosity, what were the discoveries about grammar and linguistics that have come from corpora that were not marketed in the US before 1970? Or is this just a philosophical attitude?  Note: I'm not taking sides here, I just don't know what grammatical/linguistic rules came from corpora studies that linguists were ignoring in the US before 1970.
> 
> 
> 
>> On 11/20/2010 10:36 AM, chris brew wrote:
>>> it's safe to assume that most things about corpora were discovered and
>>> carefully documented (but not necessarily marketed in the US) before 1970
>> 
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
