Corpora: Re: Subsets and "partially-tagged" corpora (some actual statistics)

Thu May 11 19:24:45 UTC 2000

Thanks for all of the replies, both public and private, re. the use of
"partially-tagged" corpora.

In particular, following up on the question:

> >So my question deals with what percentage of all of the occurrences of a
> >particular category would be included in this subset of most frequent
> >forms.  For example, if there are 100,000 occurrences of infinitives in a
> >particular block of text (representing 2000 different forms) and I tag just
> >the 100 most common forms, what percentage of all of the occurrences will
> >get marked -- 25%, 50%, etc.?

I did some tests yesterday that suggest what percentage of all occurrences
of a particular lexical category can be found by using just the 25 or 50
most common forms.  For example, in an 800,000 word corpus of Spanish short
stories there are 19,484 occurrences of infinitives, involving 1739
different verbs. The 25 most common forms (ser, ver, hacer, decir, etc.)
provide a total of 7192 occurrences, or 37% of the total. The 50 most
common forms give 49% and the 100 most common forms give 62%.
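
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python using only the two counts reported above. The variable names are my
own and are not part of the original procedure.

```python
# Counts reported above for the 800,000 word Spanish short story corpus.
infinitive_tokens = 19484   # all occurrences of infinitives
top25_tokens = 7192         # occurrences covered by the 25 most common forms

# Share of all infinitive occurrences covered by the top 25 forms.
share = 100 * top25_tokens / infinitive_tokens
print(f"Top 25 forms cover {share:.1f}% of infinitive occurrences")
```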

Not surprisingly, the more limited the number of unique forms for a
particular category, the higher the percentage of all occurrences that one
gets by using the subset of the most common forms.  For example, there
are 459 unique forms for the 3SG -ra imperfect subjunctive, giving a total
of 2346 occurrences. The 25 most common forms (pudiera, quisiera,
estuviera, etc.) account for 50% of all occurrences, and the 50 most common
forms give 60%.

So for me, at least, the question remains whether or not the syntax of the
subset involving the most common forms (which can be easily identified and
tagged) will be representative of the entire list of unique forms.  As a
concrete example, would the syntax of the 50 most common imperfect
subjunctives differ markedly from that of the least common forms (e.g.
#200-459 on the frequency list)?  If not, then there might be some value in
using partially tagged corpora, at least as an intermediate tool where a
corpus has not been completely tagged yet (or where it may never be).

Mark D.

P.S. For those who are interested in how the data given above was
extracted, here is the procedure. First, create a word frequency list with
a concordance program (I used WordSmith). Save it as a CSV file and then
import this into a database program (I used Access). Then run a query that
matches that list against a list of all of the unique forms for a
particular category (I've created a table with all of the conjugations for
7000+ verbs in Spanish). Then (for ease in calculations) export the results
to a spreadsheet program (I used Excel), sort by frequency, and then see
the totals for the 25/50/100 most common forms, as a percentage of the
total for all forms.  Using these three programs one can calculate the %
for any given verb form in a moderately-sized corpus (1,000,000-3,000,000
words) in just a couple of minutes.
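
For those who would rather script it, the same steps can be roughly
sketched in Python. This is only an illustration, not the Access/Excel
procedure itself: it assumes the frequency list is a two-column CSV
(word, count), that the list of forms for a category is available as a
plain set of strings, and the function names and the toy data are
hypothetical.

```python
import csv
import io


def read_frequency_csv(text):
    """Parse a two-column CSV (word, count) into (word, count) pairs.

    Assumption: rows whose second column is not a whole number
    (e.g. a header row) are skipped.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2 and row[1].strip().isdigit():
            rows.append((row[0].strip().lower(), int(row[1])))
    return rows


def top_n_coverage(freq_rows, category_forms, cutoffs=(25, 50, 100)):
    """Percentage of a category's occurrences covered by its N most
    frequent forms, for each N in cutoffs."""
    # Keep only words that belong to the category (the "query" step),
    # then sort their counts from most to least frequent.
    counts = sorted(
        (count for word, count in freq_rows if word in category_forms),
        reverse=True,
    )
    total = sum(counts)
    if total == 0:
        return {n: 0.0 for n in cutoffs}
    return {n: 100.0 * sum(counts[:n]) / total for n in cutoffs}


# Toy data, invented purely to show the call; not corpus results.
toy_csv = "ser,40\nver,30\ncasa,100\nhacer,20\ndecir,10\n"
toy_infinitives = {"ser", "ver", "hacer", "decir"}

coverage = top_n_coverage(
    read_frequency_csv(toy_csv), toy_infinitives, cutoffs=(1, 2)
)
print(coverage)
```

With a real frequency list one would read the CSV from disk and pass in
the full set of forms for the category in question; note that this simple
match on word form does nothing about homographs.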

=======================================
Mark Davies, Associate Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975      email:mdavies at ilstu.edu
Fax:309/438-8038          http://mdavies.for.ilstu.edu/personal/
=======================================