[Corpora-List] Re: Passing the Turing Test: My Holy Grail

FIDELHOLTZ DOOCHIN JAMES LAWRENCE jfidel at siu.buap.mx
Sun May 18 14:28:26 UTC 2003


[note to CorporaList: here's my answer to a query I was sent.  If anyone
thinks I'm totally off the wall, please send him (& me) your comments--Jim]

Hi Michael:

My answers (such as they are) are below, interleaved with your questions.

Michael Bramante wrote:

> I ask your assistance in seeking information on how to acquire the three
> things listed below (in order of importance).  Please let me know where
> and how I can acquire these things.  Also, please include all relevant
> names, addresses, phone numbers, email addresses and URL's.
>
> I am seeking:
>
> 1) The top 10,000 most commonly used words in the English language.

This one is basically impossible as stated, since, apart from 'the' and a
couple of other words, the frequency of a word depends *very much* on the
particular corpus you are using.  Probably what you want, and what would be
most useful for you, is the top 10,000 words of a very large corpus
that tries to be representative, such as the BNC (British National Corpus).
Use Google to find URLs; the BNC is available, and you can probably even get
the info you need for free, but of that I'm not 100% sure.  Check the
Corpora List archives for more info on the BNC.
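
To make the corpus-dependence concrete, here is a minimal Python sketch of
what "the top N words of a corpus" amounts to.  The function name top_words,
the crude tokenizer, and the two toy samples are all assumptions of mine for
illustration; a real list would be built from the corpus files themselves.

```python
import re
from collections import Counter

def top_words(text, n=10000):
    """Return the n most frequent word forms in text as (word, count) pairs."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

# Two toy "corpora": only 'the' is shared at the top of both lists.
sample_a = "the cow jumped over the moon and the cow slept"
sample_b = "the snake slithered under the car and the car moved"

print(top_words(sample_a, 2))
print(top_words(sample_b, 2))
```

Even in these toy samples the second-ranked word differs, which is the same
effect, in miniature, that makes a single "top 10,000 of English" list
impossible without first choosing a corpus.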
>
> 2) I am seeking exhaustive, empirical data containing the correct and
> complete assignment of all possible  parts of speech to each and every
> word in the above list.

Again, see my previous comment, but the BNC, e.g., is tagged for POS, so you
could get most of the info you might need.  Of course, 'exhaustive' is
tough, since recent results indicate:

1) As you add new text to a corpus, however large, you will always add new
words, although proportionally fewer per million words; nevertheless, this
'addition curve' does not seem to flatten out toward a finite limit.  I.e.,
every language seems to have literally an *infinite number of words* (my
interpretation of Baayen's recent book).  BTW, an increasing percentage of
the 'new words' encountered is proper nouns (personal observation, and no
doubt also observed by others).  PS to the BTW: this does not invalidate the
observation, since, in my view, proper nouns are just as much part of the
language as anything else, contrary to the practice of, e.g., most
dictionary makers.

2) POS tagging is at least partly an art, as even linguists are only about
99% or so in agreement on tagging specific texts (in the best of cases).
Taking that figure, at least 100 of your 10,000 entries would
be questionable.  Also note that the more frequent a word, the more
different meanings it tends to have (and thus the more parts of speech as
well).  Another factor is that for each language there is a very large
number of possible sets of parts of speech, depending on the deviser's
particular ideas about some structures, on the one hand, and their desire to
be very specific or very general in their approach, on the other hand.
*Sometimes* more specific divisions are easily translatable into more
general ones (usually only if they have been devised by the same person or
team).
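
The two points above can be sketched in Python as follows.  Everything here
is an assumption for illustration: the function names, the toy tagged
sample, and the coarse tag labels (which are not the BNC tagset).  The first
function tracks how many distinct words have been seen as text is added;
the second collects every tag each word was *observed* with.

```python
from collections import defaultdict

def vocabulary_growth(tokens, step):
    """Number of distinct word types seen after every `step` tokens."""
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth

def tags_per_word(tagged_tokens):
    """Map each (lowercased) word to the set of POS tags it occurred with."""
    tags = defaultdict(set)
    for word, tag in tagged_tokens:
        tags[word.lower()].add(tag)
    return dict(tags)

# Hypothetical hand-tagged sample; the tag labels are made up for the sketch.
tagged = [("The", "DET"), ("dog", "NOUN"), ("can", "AUX"), ("run", "VERB"),
          ("the", "DET"), ("can", "NOUN"), ("had", "VERB"), ("a", "DET"),
          ("run", "NOUN"), ("in", "ADP"), ("it", "PRON"), ("today", "ADV")]

words = [w.lower() for w, _ in tagged]
print(vocabulary_growth(words, 4))   # (tokens read, distinct types so far)
print(tags_per_word(tagged)["can"])  # tags observed for 'can'
```

Note that tags_per_word can only report tags that actually occur in the
data, so by itself it can never show that a word's list of parts of speech
is exhaustive.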
>
> 3) This one is a tall order and is the least important item for me:
>
>     I am interested in locating the exhaustive and finite list of all
> possible grammatically correct English sentence structures that contain
> at most ten words.  By sentence structure I mean a sentence composed of
> 'part of speech markers' instead of actual words.

This might be possible, and conceivably could have been done by someone
already, but I don't know of it.  Still, even if it has been done, you
would need to be careful: you would need to include hierarchical
information as well as the POS of each word, since, e.g., you would have to
distinguish between
    (Bill and Sally) and (John and Suzie) came home. (two couples)
and
    Bill and Sally and John and Suzie came home. (our 4 kids).
This is not to mention restrictive vs. nonrestrictive modifiers, etc.  I
suspect that, practically speaking, this list would be astronomically large.
Also, of course, its size would depend on the size of the tagset used.  From
memory, smallish tagsets used in real corpora might have about 30-35
different tags.  Using 30, we might set an upper limit at 30^10 for ten-word
sequences (this without considering hierarchical structures), or about
6 x 10^14, i.e., on the order of 10^15.  We could obviously bring this down a
lot by discarding ungrammatical sequences (maybe to 10^9?), but that's still
a billion structures to contend with.
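
For what it's worth, the back-of-the-envelope bound can be checked with a
few lines of Python.  This is only a sketch under the assumptions stated
above: a 30-tag tagset, flat tag strings with no hierarchical structure, and
no grammaticality filter at all.  The function name max_tag_sequences is
mine.

```python
def max_tag_sequences(tagset_size, max_len):
    """Count all flat tag strings of length 1..max_len over a tagset."""
    return sum(tagset_size ** k for k in range(1, max_len + 1))

exact_ten = 30 ** 10                   # strings of exactly ten tags
up_to_ten = max_tag_sequences(30, 10)  # strings of one to ten tags

print(f"exactly ten tags: {exact_ten:.2e}")
print(f"at most ten tags: {up_to_ten:.2e}")
```

Allowing shorter sentences adds only a few percent to the total, so the
ten-word term dominates the bound.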
>
>     I am not seeking an exhaustive and finite list of all grammatically
> correct English sentences... that would be crazy.
>
>     For instance the following two sentences are identical in structure
> yet are two completely different sentences.  "The cow jumped over the
> moon." is identical in structure to "The snake slithered under the car."
>
>     The structure could be represented as:
>
>     "[Article Definite] ([Object Direct] [Noun Common] and [Noun
> Concrete] and [Noun Countable]) [Verb Active Past] [Preposition]
> [Article Definite] ([Noun Common] and [Noun Concrete] and [Noun
> Countable])"
>
>     I am enormously curious as to whether or not something like this
> exists.
>
>
> Thank you very much for your time and consideration.
>
>
> Sincerely,
>
>
> Michael J. Bramante
> Cell      (206) 227-1111
> Email   Bramante at attbi.com
>
Well, I hope this is some help.  I'm sending this along to the Corpora List,
in case anybody else might have more enlightening comments than mine.

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Benemérita Universidad Autónoma de Puebla     MÉXICO


