[Corpora-List] RE : Annotation layers: missing reference

Sun Nov 21 19:31:22 UTC 2010

On Sun, Nov 21, 2010 at 7:20 AM, <amsler at cs.utexas.edu> wrote:

> However, corpora were well established as the basis for lexicography in the
> US by the 1970s with books such as the American Heritage Word Frequency Book
> serving as the basis for the "American Heritage Dictionary of the English
> Language" (Houghton Mifflin Co, 1969) (see foreword essay of the dictionary
> by Henry Kucera on "Computers in Language Analysis and in Lexicography").
> This of course followed his significant "Computational Analysis of
> Present-Day American English" (Kucera & Francis, Brown U. Press, 1967).
>
> Just out of curiosity, what were the discoveries about grammar and
> linguistics that have come from corpora that were not marketed in the US
> before 1970? Or is this just a philosophical attitude?  Note: I'm not taking
> sides here, I just don't know what grammatical/linguistic rules came from
> corpora studies that linguists were ignoring in the US before 1970.
>
>
>
I didn't and wouldn't make the claim that there were grammatical/linguistic
rules that came from corpora before 1970. Corpus builders produced reusable
knowledge about how to collect controlled samples of language, how to assess
and study variability and how to begin to answer questions about register
and usage. For me, these are linguistic questions, even though the
mainstream of generative linguistics has only recently begun to re-address
them, after decades of (arguably benign) neglect.
But what was learnt was primarily about corpora and what they are good for,
and did not particularly correspond to the concerns of the theoreticians.

Yorick is right to point to his work with Krotov et al. Richard Sharman
found similar things in (if I recall correctly) the early 90s, where the
accession rate of rules in a GPSG-ish grammar did not seem to stabilize as
the number of sentences in the sample grew. These studies really do bear
directly, and negatively, on the claims that you can build a finite grammar
for realistic language samples: the best way to attack them would be to
demonstrate a more expressive grammar formalism that somehow allows things
to have the expectedly graceful asymptotic properties. I am
surprised that few (if any) theoretical linguists have been prepared to
undertake the mental retooling necessary in order to take on this challenge:
success would be a very compelling demonstration of their claims for the
powers of good representations.
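
For concreteness, the kind of accession-rate measurement at issue can be
sketched roughly as below. This is only a minimal Python illustration, not
the procedure used in the studies mentioned above: the nested-tuple tree
encoding and the names productions and accession_curve are assumptions
made up for the example, and lexical (tag -> word) rules are left out.

```python
def productions(tree):
    """Yield (lhs, rhs) phrase-structure rules from a nested-tuple tree.

    Assumed encoding (hypothetical, for illustration only): a node is
    (label, child, ...); a preterminal is (tag, word) with string children.
    """
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return  # preterminal: lexical rule, not counted here
    yield (label, tuple(c[0] for c in children))
    for child in children:
        yield from productions(child)


def accession_curve(trees):
    """Return the cumulative number of distinct rules after each sentence."""
    seen = set()
    curve = []
    for tree in trees:
        seen.update(productions(tree))
        curve.append(len(seen))
    return curve


# Toy sample: only the third sentence contributes a new rule (NP -> D N).
toy = [
    ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark"))),
    ("S", ("NP", ("N", "cats")), ("VP", ("V", "sleep"))),
    ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks"))),
]
print(accession_curve(toy))
```

If the grammar were finite in the relevant sense, a curve like this would be
expected to flatten as the sample grows; the observation reported above is
that on real parsed corpora it does not seem to.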

>
>
>  On 11/20/2010 10:36 AM, chris brew wrote:
>>
>>> it's safe to assume that most things about corpora were discovered and
>>> carefully documented (but not necessarily marketed in the US) before 1970
>>>
>>
>>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Chris Brew, Ohio State University