Corpora: Re: Subsets and "partially-tagged" corpora (some actual statistics)

Fri May 12 14:06:09 UTC 2000

On Thu, 11 May 2000, Mark Davies wrote:
[snip]

>I did some tests yesterday that suggest what % of all occurrences of
>a particular lexical category can be found by using just the 25 or 50 most
>common forms.  For example, in an 800,000 word corpus of Spanish short
>stories there are 19,484 occurrences of infinitives, involving 1739
>different verbs. The 25 most common forms (ser, ver, hacer, decir, etc)
>provide a total of 7192 occurrences, or 37% of the total. The 50 most
>common forms give 49% and the 100 most common forms give 62%.
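
Coverage figures of this kind can be recomputed for any list of tagged
tokens. What follows is only a minimal sketch in Python, not the procedure
actually used for the Spanish corpus: the function name top_n_coverage and
the toy token list are hypothetical, and the only figures taken from the
message above are the 7192 and 19,484 used in the final check.

```python
from collections import Counter


def top_n_coverage(tokens, cutoffs=(25, 50, 100)):
    """Return {n: share of all tokens covered by the n most frequent forms}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # No tokens at all: nothing can be covered.
        return {n: 0.0 for n in cutoffs}
    # Token counts sorted from the most to the least frequent form.
    ranked = [count for _, count in counts.most_common()]
    return {n: sum(ranked[:n]) / total for n in cutoffs}


# Toy data invented for illustration only (NOT the Spanish short stories).
toy_tokens = ["ser"] * 5 + ["ver"] * 3 + ["hacer"] * 2 + ["decir"]
coverage = top_n_coverage(toy_tokens, cutoffs=(1, 2))

# Arithmetic check on the figures quoted above: 7192 of 19,484 occurrences.
quoted_share = 7192 / 19484
```

Running the same function over the real list of infinitive tokens, with the
default cutoffs, would give the 25/50/100 shares directly.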

Mark et al.:
	As per my reply the other day, I'm a little surprised to find
the percentages as low as that, but facts are facts (even when we take
a little fudging into account--read: different possible data sets).
[snip]

>So for me, at least, the question remains whether or not the syntax of the
>subset involving the most common forms (which can be easily identified and
>tagged) will be representative of the entire list of unique forms.  As a
>concrete example, would the syntax of the 50 most common imperfect
>subjunctives differ markedly from the least common forms (e.g. #200-459 on
>the frequency list)?  If not, then there might be some value in using
>partially tagged corpora, at least as an intermediate tool where a corpus
>has not been completely tagged yet (or where it may never be).
[snip]
	OK, here, I would be especially careful (speaking as someone
with an interest in analyzing *less* frequent phenomena).  I know
that, at least phonologically for English, less frequent words behave
in some instances differently from more frequent words of identical or
similar shape.  Likewise, for example in Mexican Spanish, a few
particles of pretty high frequency evince some vowel reduction or
elimination, something that Spanish is famous for not having.
Therefore, I see no reason not to expect at least a few syntactic
weirdnesses from the uncommon verbs, although since we are talking
about syntax this is likely to be more limited, I would guess, than
any phonological weirdnesses.  The only real way to be a little surer
would be to check a few construction types for a representative sample
of the most common verbs, as well as for a few of the least common.
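
The check just described could be set up roughly as follows. This is a
sketch under stated assumptions, not an established method: it assumes each
verb token has already been hand-labelled with a construction type, the
function name construction_profiles is hypothetical, and the labels "A" and
"B" and the toy observations are invented purely for illustration.

```python
from collections import Counter


def construction_profiles(observations, top_n):
    """Compare construction-type proportions for common vs. rare verbs.

    observations: iterable of (verb, construction_label) pairs.
    Returns (common_profile, rare_profile), each a dict mapping a
    construction label to its proportion within that frequency band.
    """
    observations = list(observations)
    verb_counts = Counter(verb for verb, _ in observations)
    # The top_n most frequent verbs form the "common" band; the rest are "rare".
    common_verbs = {verb for verb, _ in verb_counts.most_common(top_n)}

    common_counts, rare_counts = Counter(), Counter()
    for verb, label in observations:
        if verb in common_verbs:
            common_counts[label] += 1
        else:
            rare_counts[label] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {k: c / total for k, c in counter.items()} if total else {}

    return normalise(common_counts), normalise(rare_counts)


# Toy observations, invented for illustration only.
toy = [("ser", "A"), ("ser", "A"), ("ser", "B"),
       ("ver", "A"), ("ver", "B"), ("otear", "B")]
common, rare = construction_profiles(toy, top_n=1)
```

If the two profiles come out close for several construction types, that
would be some evidence that the common forms are representative; a large
gap would argue for caution.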
	Well, just a couple of thoughts.
		Jim

James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje	tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades	fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO