Corpora: Re: Subsets and "partially-tagged" corpora (some actual statistics)
James L. Fidelholtz
jfidel at siu.buap.mx
Fri May 12 14:06:09 UTC 2000
On Thu, 11 May 2000, Mark Davies wrote:
[snip]
>I did some tests yesterday that suggest what % of the entire occurrences of
>a particular lexical category can be found by using just the 25 or 50 most
>common forms. For example, in an 800,000 word corpus of Spanish short
>stories there are 19,484 occurrences of infinitives, involving 1739
>different verbs. The 25 most common forms (ser, ver, hacer, decir, etc)
>provide a total of 7192 occurrences, or 37% of the total. The 50 most
>common forms give 49% and the 100 most common forms give 62%.
Mark et al.:
As per my reply the other day, I'm a little surprised to find
the percentages as low as that, but facts are facts (even when we take
a little fudging into account--read: different possible data sets).
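
For anyone who wants to run the same check on another data set, here is a
rough Python sketch of the coverage calculation. It assumes the forms of
the category in question have already been pulled out into a plain list;
the function name `coverage` and the toy list are invented for
illustration and do not come from Mark's corpus.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty input: nothing to cover
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total

# Toy data, made up purely to show the call (not real corpus counts).
toy = ["ser"] * 5 + ["ver"] * 3 + ["hacer"] * 2 + ["decir", "cantar"]
print(coverage(toy, 2))

# The one figure that can be rechecked from the numbers quoted above:
# 7192 of 19,484 infinitive tokens for the 25 most common forms.
print(round(100 * 7192 / 19484))
```

The last line only redoes the arithmetic for the 25-form figure; the 49%
and 62% figures cannot be rechecked here because their raw counts were not
given.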
[snip]
>So for me, at least, the question remains whether or not the syntax of the
>subset involving the most common forms (which can be easily identified and
>tagged) will be representative of the entire list of unique forms. As a
>concrete example, would the syntax of the 50 most common imperfect
>subjunctives differ markedly from the least common forms (e.g. #200-459 on
>the frequency list)? If not, then there might be some value in using
>partially tagged corpora, at least as an intermediate tool where a corpus
>has not been completely tagged yet (or where it may never be).
[snip]
OK, here, I would be especially careful (speaking as someone
with an interest in analyzing *less* frequent phenomena). I know
that, at least phonologically for English, less frequent words behave
in some instances differently from more frequent words of identical or
similar shape. Likewise, for example in Mexican Spanish, a few
particles of pretty high frequency evince some vowel reduction or
elimination, something that Spanish is famous for not having.
Therefore, I see no reason not to expect at least a few syntactic
weirdnesses from the uncommon verbs, although since we are talking
about syntax this is likely to be more limited, I would guess, than
any phonological weirdnesses. The only real way to be a little surer
would be to check a few construction types for a representative sample
of the most common verbs, as well as for a few of the least common.
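
One possible way to set up such a check, sketched in Python under some
loud assumptions: that each verb token has already been hand-labelled with
a construction type, and that "most common" simply means the top_n lemmas
by token count, with every other lemma lumped together as "rare" (a
cruder split than sampling only the very least common verbs). The function
name, the construction labels, and the toy observations are all invented
for illustration.

```python
from collections import Counter

def construction_profiles(observations, top_n):
    """Compare construction use between frequent and rare lemmas.

    observations: iterable of (lemma, construction) pairs.
    Returns (frequent_profile, rare_profile); each maps a construction
    label to its relative frequency within that group.
    """
    observations = list(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    frequent = {lemma for lemma, _ in lemma_counts.most_common(top_n)}

    groups = {"frequent": Counter(), "rare": Counter()}
    for lemma, construction in observations:
        key = "frequent" if lemma in frequent else "rare"
        groups[key][construction] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {c: n / total for c, n in counter.items()} if total else {}

    return normalise(groups["frequent"]), normalise(groups["rare"])

# Toy observations, made up to show the call (not real data).
obs = (
    [("ser", "copular")] * 4
    + [("ver", "transitive")] * 3
    + [("yacer", "intransitive"), ("asir", "transitive")]
)
frequent_profile, rare_profile = construction_profiles(obs, top_n=2)
print(frequent_profile)
print(rare_profile)
```

If the two profiles look much alike for the constructions of interest,
that would be some evidence in favor of the partial-tagging idea; large
differences would argue for caution. With small counts for the rare verbs,
any difference would of course need a proper significance test before one
read much into it.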
Well, just a couple of thoughts.
Jim
James L. Fidelholtz e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO