[Corpora-List] "Tajweed" in English dictionaries and corpora

Jim Fidelholtz fidelholtz at gmail.com
Tue Mar 5 23:30:25 UTC 2013


Hi, All,

A couple of years ago, my email account (this same one) got hijacked (¿by
some Russians, maybe?) for a day or two, which prompted CORPORA to give me
the boot (no grudges held: I continue to recommend the list to pertinent
questioners on Ask-a-Linguist), and I am just now getting around to
resubscribing, with no good reason for the delay.

To put a PS first: Trevor, I think you mean *wrest* control of English ...,
although 'rest', upon reflection, is not nonsensical.

Anyway, to join in the discussion of what words should go into a
dictionary, I think one point has escaped the ken of at least some of the
contributors to the discussion: there *can never be such a thing as an
all-inclusive dictionary*, probably not of any language. This perhaps
seemingly outlandish claim rests on two threads. The theoretical one was
first sketched by George Bedell in his MIT thesis of about 1969 or 1970.
It is based on the fact that some strings of word-forming suffixes can be
nested, such as "Iz-At+ion-al". It follows that there is theoretically no
longest *word* (never mind that there is no longest sentence): words like
'nationalizationalize-...' become ever more specific semantically, and thus
ever more practically useless, but in theory there is no upper bound on the
number of these suffixes that could be added.
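
To make the suffix cycle concrete, here is a minimal Python sketch. It is
only an illustration, not Bedell's formalism: the function name
`suffix_cycle`, the stem 'national', and the spelling rule (drop the final
'e' of '-ize' before '-ation') are assumptions made for this example.

```python
def suffix_cycle(stem="national", rounds=2):
    """Yield ever-longer forms by cycling the suffixes -ize, -ation, -al.

    Hypothetical helper for illustration; 'rounds' is how many times the
    full cycle is applied, and nothing in the procedure forces it to stop.
    """
    word = stem
    yield word
    for _ in range(rounds):
        word += "ize"                # national -> nationalize
        yield word
        word = word[:-1] + "ation"   # drop the final 'e': nationalization
        yield word
        word += "al"                 # nationalizational
        yield word


if __name__ == "__main__":
    for form in suffix_cycle(rounds=2):
        print(len(form), form)
```

Each pass through the loop adds nine letters, so raising `rounds` always
produces a longer form than any seen before.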

The practical thread of the argument is based on Harald Baayen's book
_Word Frequency Distributions_, from 2001, which I have mentioned here in
the past in the same context. Based on a sophisticated statistical analysis
of _Alice through the looking glass_ by Lewis Carroll, treated as a text to
be analyzed, he shows the following (and here I'm somewhat interpreting his
results). Divide the book into two equal parts, analyze each independently,
and combine the results into a curve of the total number of vocabulary
items: that curve does *not* have an asymptote. That is, there does not
seem to be an upper limit
to the *total* number of vocabulary items, at least in English. An
experiment which needs to be done here would be to take some very large
corpus, say of English (eg the Sketch Engine corpus, or the BNC, or the
ANC, all of which have over 100M words, and at least the Sketch Engine
corpus and the ANC are continually growing), divide it up into subcorpora
of 1M words (several hundred subcorpora in some cases), do an analysis of
*each* similar to Baayen's, and see how the resulting
logarithmic curve of their union behaves. I haven't done this, but one
ought to find similar results to Baayen's, ie that the number of words in
English is non-finite, ie, that there is no possible dictionary of "all of
English", not even "all of British English" or "all of American English" or
....
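
The raw bookkeeping for such an experiment could be sketched as follows, in
Python. This is not Baayen's analysis (his book fits statistical models to
the frequency spectrum); it only records the empirical curve of running
words against distinct word forms, plus the hapax count, as successive
subcorpora are added. The names `tokenize` and `vocabulary_growth` and the
crude regular-expression tokenizer are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_growth(tokens, chunk_size):
    """Return [(tokens_seen, types_seen, hapaxes_seen), ...], one per chunk.

    Each chunk plays the role of one subcorpus; the counts are cumulative,
    i.e. they describe the union of all chunks read so far.
    """
    counts = Counter()
    curve = []
    for start in range(0, len(tokens), chunk_size):
        counts.update(tokens[start:start + chunk_size])
        hapaxes = sum(1 for c in counts.values() if c == 1)
        curve.append((sum(counts.values()), len(counts), hapaxes))
    return curve


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log " * 3
    for seen, types, hapaxes in vocabulary_growth(tokenize(sample), 10):
        print(seen, types, hapaxes)
```

For the experiment described above one would pass the tokens of the whole
corpus with `chunk_size=1_000_000` and check whether the number of types
keeps climbing or flattens out.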

One further mathematical point: I can't remember who has worked on this,
but it has been pointed out in print that when we examine even very large
corpora, and as probability theory would inform us, a few words which occur
in some frequency counts of lesser corpora and which would seem 'common'
(or at least not hopelessly recherché) to most observers, do not occur in
the specific corpus, however large it is. Of course, we would never expect
the most common words to fall in this category, but we only have to examine
the different lists in, for example, Thorndike & Lorge (especially the
'less common' ones: words that occur only 4 times in the 18 million word
corpus they used) to see some surprises.
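
The probability-theory point can be made concrete with a back-of-the-envelope
sketch in Python. It rests on a loud assumption: every token is an
independent draw, so a word with relative frequency p is absent from N
running words with probability (1 - p)^N. Real text is not like that (words
cluster by topic and genre), so treat the numbers as rough guides only. The
function names are hypothetical.

```python
import math


def prob_absent(rel_freq, corpus_size):
    """P(word never occurs in corpus_size tokens), assuming independent draws.

    Computes (1 - rel_freq) ** corpus_size via log1p for numerical
    stability when rel_freq is tiny.
    """
    if not 0.0 <= rel_freq < 1.0:
        raise ValueError("rel_freq must be in [0, 1)")
    return math.exp(corpus_size * math.log1p(-rel_freq))


def expected_missing(rel_freqs, corpus_size):
    """Expected number of the given words that fail to show up at all."""
    return sum(prob_absent(p, corpus_size) for p in rel_freqs)


if __name__ == "__main__":
    # A word seen 4 times in 18 million running words.
    p = 4 / 18_000_000
    for size in (1_000_000, 10_000_000, 100_000_000):
        print(size, prob_absent(p, size))
```

Under this assumption a word at the "4 in 18 million" level is missing from
a fresh 1M-word sample with probability of about 0.80, and
`expected_missing` shows how, summed over the many words at that frequency
level, some absences become likely even in much larger corpora.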

Well, this is long enough for now, but I hope I don't have to belabor
further the fact that dictionary makers *must* perforce make a selection,
since no physical dictionary could contain an infinite number of words (not
even in cyberspace!), although English (eg) apparently does have an
infinite number of words!

I have a few ideas about how to do research on the nature of very rare
words, having to do with hapax legomena in very large corpora, which might
also shed some light on how to address the problem of the untraceable
nature of the very small number of errors (much less than 1% in the best
statistical taggers), but I'll leave that for another post.

Jim

"I can never remember how to spell 'hapax legomena', because I've only ever
seen it written once."


On Tue, Mar 5, 2013 at 12:14 PM, Trevor Jenkins <
trevor.jenkins at suneidesis.com> wrote:

> On 5 Mar 2013, at 18:08, Trevor Jenkins <trevor.jenkins at suneidesis.com>
> wrote:
>
> > ... The British author, TV and radio presenter Melvyn Bragg (aka the
> Labour supporting Lord Baron Bragg of Wigton) argued that in that very
> point in the TV programmes of his "Adventure of English" that Webster
> created his first dictionary was conceived as a political act to rest
> control of English from the English. ...
>
> That don't make much sense do it. What comes of spending my working day
> translating into a signed language.
>
> Putting my English head back on what I intended to say (or at least closer
> to what I intended):
>
> >> ... The British author, TV and radio presenter Melvyn Bragg (aka the
> Labour supporting member of the House of Lords, Baron Bragg of Wigton)
> argued that very point in the TV programmes of his "Adventure of English"
> [series by claiming] that Webster conceived his first dictionary as a
> [deliberate] political act to rest control of English from the English. …
>
> Better? I hope so.
>
> Regards, Trevor.
>
> <>< Re: deemed!
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO