<div dir="ltr">Hi, All,<div><br></div><div style>A couple of years ago, my email account (this same one) got hijacked (¿by some Russians, maybe?) for a day or two, which prompted CORPORA to give me the boot (no grudges held: I continue to recommend the list to pertinent questioners on Ask-a-Linguist), and I am just now getting around to resubscribing, with no good reason for the delay.</div>
<div style><br></div><div style>To put a PS first: Trevor, I think you mean *wrest* control of English ..., although 'rest', upon reflection, is not nonsensical.</div><div style><br></div><div style>Anyway, to join the discussion of which words should go into a dictionary, I think one point has escaped the ken of at least some contributors: there *can never be such a thing as an all-inclusive dictionary*, probably not of any language. This perhaps seemingly outlandish claim rests on two threads of argument. The theoretical thread was first sketched by George Bedell in his MIT thesis of about 1969 or 1970, and is based on the fact that word-forming suffixes can be nested, as in "-ize-ation-al". It follows that there is theoretically no longest *word* (never mind that there is no longest sentence): words like 'nationalizationalize-...', while becoming ever more specific semantically and thus ever more practically useless, have in theory no upper bound on the number of suffixes that can be added.</div>
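To make the nesting point concrete, here is a minimal sketch (mine, not Bedell's; the base word, the particular -ize/-ation/-al cycle, and the crude drop-the-e spelling rule are all illustrative assumptions) showing that each pass through the suffix cycle yields a longer word that is itself a licit input to the next pass:

```python
def suffix_cycle(base: str, cycles: int) -> str:
    """Apply the derivational cycle -ize, -ation, -al `cycles` times."""
    word = base
    for _ in range(cycles):
        word += "ize"               # e.g. national -> nationalize
        word = word[:-1] + "ation"  # drop the -e:  -> nationalization
        word += "al"                #               -> nationalizational
    return word

print(suffix_cycle("national", 1))  # nationalizational
print(suffix_cycle("national", 2))  # nationalizationalizational
```

Each cycle adds nine letters, so the output grows without bound as `cycles` grows; no dictionary with finitely many entries can list every output.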
<div style><br></div><div style>The practical thread of the argument is based on Harald Baayen's book _Word Frequency Distributions_, from about 2001, which I have mentioned here in the past in the same context. Based on a sophisticated statistical analysis of _Alice Through the Looking-Glass_ by Lewis Carroll, treated as a single text, he shows (and here I'm somewhat interpreting his results) that if you divide the book into two equal parts, analyze each independently, and then combine the results, the resulting curve of the total number of vocabulary items does *not* have an asymptote. That is, there does not seem to be an upper limit to the *total* number of vocabulary items, at least in English. An experiment which needs to be done here would be to take some very large corpus, say of English (e.g. the Sketch Engine corpus, the BNC, or the ANC, all of which have over 100M words, and at least the Sketch Engine corpus and the ANC are continually growing), divide it into subcorpora of 1M words each (several hundred subcorpora in some cases), apply an analysis similar to Baayen's to *each*, and see how the resulting logarithmic curve of their union behaves. I haven't done this, but one ought to find results similar to Baayen's, i.e. that the number of words in English is non-finite, i.e., that there is no possible dictionary of "all of English", not even of "all of British English" or "all of American English" or ....</div>
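A toy version of that experiment can be sketched in a few lines. This is my illustration, not Baayen's method: in place of a real corpus I sample tokens from a Zipf-like distribution over an invented vocabulary, then record the vocabulary-growth curve (types seen vs. tokens read) at regular intervals; with a real 100M-word corpus one would do the same per 1M-word subcorpus:

```python
import random

random.seed(42)

# Hypothetical token source: Zipf-like weights over word IDs,
# a stand-in for a real corpus such as the BNC (purely illustrative).
N_TYPES = 200_000
WEIGHTS = [1.0 / r for r in range(1, N_TYPES + 1)]

def growth_curve(n_tokens: int, step: int):
    """Sample n_tokens and record vocabulary size every `step` tokens."""
    seen, curve = set(), []
    tokens = random.choices(range(N_TYPES), weights=WEIGHTS, k=n_tokens)
    for i, t in enumerate(tokens, 1):
        seen.add(t)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

for n, v in growth_curve(100_000, 20_000):
    print(f"{n:>7} tokens -> {v:>6} types")
```

Under such an LNRE-style distribution the curve keeps climbing throughout the observed range rather than flattening toward an asymptote, which is the qualitative behavior at issue.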
<div style><br></div><div style>One further mathematical point: I can't remember who has worked on this, but it has been pointed out in print that, when we examine even very large corpora, and as probability theory would lead us to expect, a few words which occur in frequency counts of smaller corpora, and which would seem 'common' (or at least not hopelessly recherché) to most observers, simply do not occur in the specific corpus, however large it is. Of course, we would never expect the most common words to fall into this category, but we need only examine the different lists in, for example, Thorndike & Lorge (especially the 'less common' ones: words that occur only 4 times in their 18-million-word corpus) to see some surprises.</div>
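The probabilistic point can be made precise with a standard Poisson approximation (my sketch, not the work I half-remember): a word occurring at rate λ per million tokens is expected λN times in an N-million-token corpus, and is entirely absent with probability about exp(-λN):

```python
import math

def p_absent(rate_per_million: float, corpus_millions: float) -> float:
    """Poisson approximation: probability a word with the given
    occurrence rate never appears in a corpus of the given size."""
    return math.exp(-rate_per_million * corpus_millions)

# A word averaging 1 occurrence per 10M tokens (rate 0.1 per million)
# is missing from an 18M-token corpus about 17% of the time:
print(round(p_absent(0.1, 18), 3))  # 0.165
```

So a noticeable fraction of moderately rare but recognizable words will be missing from any particular corpus purely by chance, however carefully it was assembled.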
<div style><br></div><div style>Well, this is long enough for now, but I hope I don't have to belabor further the fact that dictionary makers *must* perforce make a selection, since no physical dictionary could contain an infinite number of words (not even in cyberspace!), although English (eg) apparently does have an infinite number of words!</div>
<div style><br></div><div style>I have a few ideas about how to do research on the nature of very rare words, having to do with hapax legomena in very large corpora, which might also shed some light on how to address the problem of the untraceable nature of the very small number of errors (much less than 1% in the best statistical taggers), but I'll leave that for another post.</div>
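(The hapax count itself is of course trivial to extract from a corpus; a minimal sketch, with a toy sentence of my own in place of a real corpus:)

```python
from collections import Counter

def hapaxes(tokens):
    """Return, sorted, the word types occurring exactly once."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c == 1)

text = "the cat sat on the mat and the dog saw the cat sit".split()
print(hapaxes(text))  # ['and', 'dog', 'mat', 'on', 'sat', 'saw', 'sit']
```

The interesting research questions begin after this step, with how the hapax set behaves as the corpus grows.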
<div style><br></div><div style>Jim</div><div style><br></div><div style>"I can never remember how to spell 'hapax legomena', because I've only ever seen it written once."</div><div class="gmail_extra">
<br><br><div class="gmail_quote">On Tue, Mar 5, 2013 at 12:14 PM, Trevor Jenkins <span dir="ltr"><<a href="mailto:trevor.jenkins@suneidesis.com" target="_blank">trevor.jenkins@suneidesis.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 5 Mar 2013, at 18:08, Trevor Jenkins <<a href="mailto:trevor.jenkins@suneidesis.com">trevor.jenkins@suneidesis.com</a>> wrote:<br>
<br>
> ... The British author, TV and radio presenter Melvyn Bragg (aka the Labour supporting Lord Baron Bragg of Wigton) argued that in that very point in the TV programmes of his "Adventure of English" that Webster created his first dictionary was conceived as a political act to rest control of English from the English. ...<br>
<br>
That don't make much sense do it. What comes of spending my working day translating into a signed language.<br>
<br>
Putting my English head back on what I intended to say (or at least closer to what I intended):<br>
<br>
>> ... The British author, TV and radio presenter Melvyn Bragg (aka the Labour supporting member of the House of Lords, Baron Bragg of Wigton) argued that very point in the TV programmes of his "Adventure of English" [series by claiming] that Webster conceived his first dictionary as a [deliberate] political act to rest control of English from the English. …<br>
<br>
Better? I hope so.<br>
<div class="HOEnZb"><div class="h5"><br>
Regards, Trevor.<br>
<br>
<>< Re: deemed!<br>
<br>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>James L. Fidelholtz<br>Posgrado en Ciencias del Lenguaje<br>Instituto de Ciencias Sociales y Humanidades<br>Benemérita Universidad Autónoma de Puebla, MÉXICO
</div></div>