Corpora: Subsets and "partially-tagged" corpora

James L. Fidelholtz jfidel at siu.buap.mx
Thu May 11 16:01:36 UTC 2000


On Wed, 10 May 2000, Mark Davies wrote:
[snip]
>I am considering an alternative scheme [to full tagging] in which I
>tag just the most common
>words/forms for a given syntactic or verbal category, such as the 100 most
>common nouns and infinitives, the 25 most common adjectives, the 35 most
>common preterites, etc.  The "tagged" elements would be identified by a
>prefix, such as:
>
>	VI-estar (= verb/infinitive-"to be")
>	N-hombre (= noun-"man")
>	VPT-supo (= verb/preterite-"knew")
[snip]
>So my question deals with what percentage of all of the occurrences of a
>particular category would be included in this subset of most frequent
>forms.  For example, if there are 100,000 occurrences of infinitives in a
>particular block of text (representing 2000 different forms) and I tag just
>the 100 most common forms, what percentage of all of the occurrences will
>get marked -- 25%, 50%, etc.?  I'm going to be carrying out some tests
>myself, but would like to be able to compare the results to other studies
>that might have already been done.

Mark:
	I can't give you studies offhand, although I'm sure they
exist.  You could scrounge out the data for English from Thorndike &
Lorge's book, just grabbing, e.g., all nouns/verbs/whatever marked AA or
A (if you want to limit yourself more, just those with AA), and then
adding up their total number of occurrences per million (this figure for
each common word is available somewhere in the book, or perhaps
elsewhere, at least for the very most common words).  From this, you
could figure out what percentage of all the occurrences in each category
those words account for.  If you use the 100 most common nouns, 100 most
common verbs, say 50 most common adjectives and 50 most common adverbs,
that ought to give you, at a guess, well over 80%, and probably well
over 90%, coverage of each category.
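
The adding-up step could be sketched roughly as follows, assuming you
already have a table of per-form counts for one category.  The function
name and the toy counts here are made up for illustration; they are not
taken from T&L or from any real corpus.

```python
from collections import Counter


def top_n_coverage(counts, n):
    """Fraction of all occurrences covered by the n most frequent forms.

    counts: mapping from word form to its number of occurrences
            within a single category (e.g. infinitives).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the counts of the n most frequent forms only.
    covered = sum(c for _, c in Counter(counts).most_common(n))
    return covered / total


# Toy, invented counts for four Spanish infinitives (illustration only).
toy_infinitives = {"estar": 50, "ser": 30, "tener": 15, "saber": 5}
print(top_n_coverage(toy_infinitives, 2))  # 0.8
```

With real data you would run this once per category (nouns,
infinitives, preterites, and so on), with whatever cutoff you plan to
tag for that category.
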
[snip]
	I hope this helps some.  Someone else can probably give you
the better skinny, and maybe more recent stuff than T&L.
		Jim

James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje	tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades	fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO


