[Corpora-List] RE : Annotation layers: missing reference

Eric Atwell csc6ea at leeds.ac.uk
Mon Nov 22 22:53:44 UTC 2010


Not the 1970s, but grammars were being extracted from corpus treebanks in
the 1980s, e.g. from the British English LOB Corpus treebank:

Atwell, Eric. 1988. Transforming a Parsed Corpus into a Corpus Parser.
In Kytö, M., Ihalainen, O., Rissanen, M. (eds), Corpus Linguistics, Hard and
Soft: Proceedings of the ICAME 8th International Conference on English
Language Research on Computerised Corpora, pp. 61-70. Amsterdam: Rodopi.
http://books.google.com/books?id=RPvEAeJm2BMC&pg=PA61&vq=Atwell&dq=Transforming+a+Parsed+Corpus+into+a&source=gbs_search_s&sig=ACfU3U2FNcyXlWScxqAhpiBwj1bqIJ3-cg#v=onepage&q=Atwell&f=false

  - Over 8,500 context-free rules were extracted from the treebank, far too
    many for the CFG parsers available at the time.

Also, from the British English Polytechnic of Wales (PoW) corpus, parsed
with Systemic Functional Grammar trees:

Souter, Clive. 1989. The COMMUNAL Project: Extracting a grammar from
the Polytechnic of Wales Corpus. ICAME Journal 13, pp. 20-27.
http://icame.uib.no/archives/No_13_ICAME_Journal_index.pdf

  - Over 18,000 distinct CF rules were produced, again far too large a
    grammar for existing CFG parsers.

So we gave up and tried alternatives to CFG-rule parsers...
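
For readers who have not tried this, the kind of rule extraction described
above can be sketched roughly as follows. This is only an illustration, not
the procedure used in either paper: it assumes trees in a simple
Penn-style bracketed notation (neither the LOB nor the PoW treebank used
this exact format), and the function names parse_tree, extract_rules and
grammar_from_treebank are made up for the example.

```python
from collections import Counter


def parse_tree(text):
    """Parse one bracketed tree, e.g. "(S (NP (DT the)) ...)", into nested
    (label, children) tuples. Words (leaves) are kept as plain strings.
    The bracketed input format is an assumption made for this sketch."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def extract_rules(tree, counts):
    """Record one context-free rule per node that has constituent children.
    Lexical rules (tag -> word) are skipped, so only phrase-structure
    rules are counted."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        rhs = tuple(child[0] for child in subtrees)
        counts[(label, rhs)] += 1
        for child in subtrees:
            extract_rules(child, counts)


def grammar_from_treebank(trees):
    """Return a Counter mapping (lhs, rhs) rules to their frequencies."""
    counts = Counter()
    for text in trees:
        extract_rules(parse_tree(text), counts)
    return counts


if __name__ == "__main__":
    toy_treebank = [
        "(S (NP (DT the) (NN cat)) (VP (VBD sat)))",
        "(S (NP (DT a) (NN dog)) (VP (VBD ran) (ADVP (RB away))))",
    ]
    for (lhs, rhs), n in grammar_from_treebank(toy_treebank).items():
        print(f"{lhs} -> {' '.join(rhs)}   [{n}]")
```

Counting distinct (lhs, rhs) pairs over a whole treebank in this way is what
yields rule sets of the sizes quoted above: flat trees with many optional
constituents produce a long tail of rules that each occur only once or twice.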


Eric Atwell, Leeds University


On Sun, 21 Nov 2010, Yorick Wilks wrote:

> I don't think the dating of "corpora and grammar" so early is right. I recall a very small and self-serving paper:
> A. Krotov, M. Hepple, R. Gaizauskas and Y. Wilks. 1998. Compacting the Penn Treebank Grammar. Proceedings of the
> COLING-ACL'98 Joint Conference (The 17th International Conference on Computational Linguistics, and 36th Annual Meeting of
> the Association for Computational Linguistics). pp 699-703. Montreal, Canada. August 1998.
> 
> In this work my student Alex Krotov found that if you induced the PS grammar rules from the PTB, in a pretty
> straightforward way from the trees, then the number of rules was enormous and, most significantly, I thought, still
> rising linearly at the end of the PTB corpus, which didn't prove anything but made one wonder about all the claims of
> finite grammar and infinite language that we had all been indoctrinated with. Hepple showed in that paper that the huge
> set could be compressed to a smaller (but still very large) rule set without loss of coverage of data---but I don't think
> this kind of analysis was being done by corpus people much earlier than this date.
> YW
>
> On 21 Nov 2010, at 12:20, amsler at cs.utexas.edu wrote:
>
>       However, corpora were well established as the basis for lexicography in the US by the 1970s with books such as
>       the American Heritage Word Frequency Book serving as the basis for the "American Heritage Dictionary of the
>       English Language" (Houghton Mifflin Co, 1969) (see foreword essay of the dictionary by Henry Kucera on
>       "Computers in Language Analysis and in Lexicography"). This of course followed his significant "Computational
>       Analysis of Present-Day American English" (Kucera & Francis, Brown U. Press, 1967).
>
>       Just out of curiosity, what were the discoveries about grammar and linguistics that have come from corpora
>       that were not marketed in the US before 1970? Or is this just a philosophical attitude?  Note: I'm not taking
>       sides here, I just don't know what grammatical/linguistic rules came from corpora studies that linguists were
>       ignoring in the US before 1970.
> 
> 
>
>             On 11/20/2010 10:36 AM, chris brew wrote:
>
>                   it's safe to assume that most things about corpora were discovered and
>                   carefully documented (but not necessarily marketed in the US) before 1970
> 
> 
> 
>
>       _______________________________________________
>       Corpora mailing list
>       Corpora at uib.no
>       http://mailman.uib.no/listinfo/corpora
> 
> 
> 
>

-- 
Eric Atwell, Senior Lecturer, Language research group,
  I-AIBS Institute for Artificial Intelligence and Biological Systems
  School of Computing, Faculty of Engineering, UNIVERSITY OF LEEDS
  Leeds LS2 9JT, England.        TEL: 0113-3435430  FAX: 0113-3435468