Corpora: Subsets and "partially-tagged" corpora

Wed May 10 13:43:32 UTC 2000

I am in the process of creating a 100,000,000 word corpus of historical
Spanish texts (1200s-1900s) and have a question regarding possible
alternatives to POS tagging for the entire corpus.

It would be extremely difficult (if not impossible) to tag the entire
corpus, because of the large amount of variation in forms (e.g. hubiese,
hubiesse, ouiese, ouisse, ouyese, ouyesse, + + + for the past subjunctive of
haber "to have") as well as because of the sheer size and lexical
complexity of the corpus.

I am considering an alternative scheme in which I tag just the most common
words/forms for a given syntactic or verbal category, such as the 100 most
common nouns and infinitives, the 25 most common adjectives, the 35 most
common preterites, etc.  The "tagged" elements would be identified by a
prefix, such as:

	VI-estar (= verb/infinitive-"to be")
	N-hombre (= noun-"man")
	VPT-supo (= verb/preterite-"knew")
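
As a rough illustration of the prefix scheme, here is a minimal Python
sketch. The lookup table TAGS and the function tag_tokens are hypothetical
names, and the three entries are only the examples given above; the real
lists would hold the most common forms for each category.

```python
# Hypothetical lookup table: form -> tag prefix. Only the three example
# forms from the message are listed; real lists would be much longer.
TAGS = {
    "estar": "VI",    # verb/infinitive
    "hombre": "N",    # noun
    "supo": "VPT",    # verb/preterite
}

def tag_tokens(tokens, tags=TAGS):
    """Prefix each listed form with its tag; leave all other tokens unchanged."""
    tagged = []
    for token in tokens:
        prefix = tags.get(token.lower())
        tagged.append(f"{prefix}-{token}" if prefix else token)
    return tagged

print(" ".join(tag_tokens("el hombre supo estar".split())))
```

A plain dictionary lookup like this does nothing about ambiguous forms; it
simply tags every occurrence of a listed form the same way.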

Users could then search for a construction like

	parece*  VI-*		[parecer "to seem" + infinitive]
	deb* CL-* VI-*		[deber "should" + clitic + infinitive]

which would give cases of "parecer" followed by (just) one of the 100 most
common infinitives, or "deber" followed by a clitic and one of the 100
most common infinitives (these are just two of many possible examples).
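
One way such wildcard searches could work over prefix-tagged text is to
translate each query into a regular expression. The sketch below is only
an assumption about how a search tool might do this: query_to_regex is a
hypothetical helper, the sample sentence is invented, and infinitives are
assumed to carry the VI- prefix from the tagging examples.

```python
import re

def query_to_regex(query):
    """Turn a query such as 'parece* VI-*' into a regex over space-separated tokens."""
    parts = []
    for term in query.split():
        # Escape the term, then let each '*' match any run of non-space characters.
        parts.append(re.escape(term).replace(r"\*", r"\S*"))
    # The lookarounds keep a match from starting or ending inside a token.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")

# Invented sample of tagged text, for illustration only.
text = "parece VI-estar triste y parecen VI-ser buenos ; parece que no"

print(query_to_regex("parece* VI-*").findall(text))
```

Because "*" matches any run of non-space characters, "parece*" also picks
up forms such as "parecen", while "parece que" is skipped because "que"
carries no VI- prefix.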

(I'm aware of problems of polysemy, such as ser = "to be / a (human) being
(N)", habla = "speak-3SG / speech (N)", and these will have to be dealt
with as best as possible.  But a POS tagger will have similar (if not
worse) problems identifying the correct POS for each form, considering the
incredible range in forms in a corpus this size, covering a period of 800
years).

So my question deals with what percentage of all of the occurrences of a
particular category would be included in this subset of most frequent
forms.  For example, if there are 100,000 occurrences of infinitives in a
particular block of text (representing 2000 different forms) and I tag just
the 100 most common forms, what percentage of all of the occurrences will
get marked -- 25%, 50%, etc.?  I'm going to be carrying out some tests
myself, but would like to be able to compare the results to other studies
that might have already been done.
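
The percentage in question can be measured directly once the occurrences
of a category have been collected from a sample. A minimal Python sketch
follows; the function name coverage and the toy list of infinitives are
invented for illustration and say nothing about the real figures.

```python
from collections import Counter

def coverage(forms, n):
    """Return the fraction of all occurrences accounted for by the n most frequent forms."""
    counts = Counter(forms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# Toy sample: 10 occurrences of 4 different infinitives (invented data).
sample = ["ser"] * 5 + ["estar"] * 3 + ["haber"] + ["yacer"]

print(f"{coverage(sample, 2):.0%} of occurrences covered by the top 2 forms")
```

Running the same function with n = 100 over the infinitives of a
hand-checked block of text would give the figure asked about above.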

The main question, then, is whether anyone might be aware of statistical
studies that have been done along these lines, especially for one of the
Western European languages.  I realize that grammatical categories are
divided differently in different languages (e.g. infinitives in German
might not compare directly to infinitives in Spanish, and the same for
clitics in French and Spanish), but what I'm looking for here are just very
approximate figures.

Again, I realize that a "partially-tagged" corpus such as this has very
real shortcomings, both in terms of theory [representativeness] and
practice [only having access to a limited number of forms for a given
category, and missing interesting occurrences with less common forms].  But
if the alternative is a corpus that is not tagged at all, it is probably
still worth doing.

Thanks in advance for any comments that you might have.

Mark D.

=======================================
Mark Davies, Associate Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975      email:mdavies at ilstu.edu
Fax:309/438-8038          http://mdavies.for.ilstu.edu/personal/
=======================================