Corpora: Re: Subsets and "partially-tagged" corpora

Thu May 11 11:17:38 UTC 2000

Dear Mark,

>I am considering an alternative scheme in which I tag just the most common 
>words/forms for a given syntactic or verbal category, such as the 100 most 
>common nouns and infinitives, the 25 most common adjectives, the 35 most 
>common preterites, etc.  The "tagged" elements would be identified by a 

It is not clear what you mean by "if I tag just the 100 most common
infinitive forms", since later in your mail you ask for a quantitative
idea of how much of the whole infinitive population these would cover.

In order to know which are the 100 most common infinitive forms in your
corpus, I suppose you would have to count (and rank) all infinitives -- or
at least all your candidate infinitives -- by their number of
occurrences, and that would already give you the estimate you are looking
for. Or are you relying on an external reference source for the list
of most frequent items?
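
To make the point concrete, here is a minimal sketch in Python of the
counting step, assuming you already have a list of candidate infinitive
tokens. The function name top_n_coverage and the toy data are invented for
illustration only. Ranking the forms by frequency yields both the top-N
list and the share of all infinitive tokens that the list covers.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Return the n most frequent forms and the share of tokens they cover."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(n)
    covered = sum(freq for _, freq in top)
    # Guard against an empty candidate list.
    return top, (covered / total if total else 0.0)


# Toy data standing in for candidate infinitives extracted from a corpus.
candidates = ["ser", "ter", "ser", "fazer", "ser", "ter", "dizer", "poder"]

top, coverage = top_n_coverage(candidates, 2)
print(top)       # [('ser', 3), ('ter', 2)]
print(coverage)  # 0.625
```

With a real corpus, n would be 100 and the second value would be the
estimate asked about.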

As for categorial ambiguity, of which you say you are aware, that is
precisely why you would need to tag the occurrences in your corpus (and
not only classify them with a morphological analyser). If every form could
belong to only one category (an infinitive or a noun, but never both), it
would be enough to have a lexicon plus a morphological analyser. It is
precisely because most FORMS can belong to different categories that you
need to tag a corpus.
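
A small illustration of why a lexicon lookup alone does not settle the
category, sketched in Python. The lexicon entries and the tag labels (N,
V_INF, V_FIN) are assumptions chosen for the example and do not come from
any particular analyser. A form with more than one entry can only be
resolved by looking at each occurrence in context.

```python
# Toy lexicon: every category a form can take (illustrative entries only).
LEXICON = {
    "jantar": {"N", "V_INF"},  # 'dinner' / 'to dine'
    "poder": {"N", "V_INF"},   # 'power' / 'to be able to'
    "casa": {"N", "V_FIN"},    # 'house' / '(he/she) marries'
    "mesa": {"N"},             # 'table'
}


def categories(form):
    """Return all categories the lexicon lists for a form."""
    return LEXICON.get(form.lower(), set())


def is_ambiguous(form):
    """A form is ambiguous when the lexicon offers more than one category."""
    return len(categories(form)) > 1


ambiguous = sorted(form for form in LEXICON if is_ambiguous(form))
print(ambiguous)  # ['casa', 'jantar', 'poder']
```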

You ask for similar studies done on related languages. I'm not sure how
similar -- or interesting to you -- the following is. Some years ago we
conducted a study on Portuguese on "partial tagging" understood in a
different way: we used very broad categories (only six), multitagged a
small corpus with them (i.e., assigned all possible tags to each wordform,
with our configurable morphological analyser), and then studied the amount
of manual revision required to achieve one category per wordform (that is,
a fully disambiguated corpus). This is reported in Medeiros et al. (1993)
and Santos (1996b) [both in Portuguese], available from
http://www.portugues.mct.pt/Diana/public.html
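
For readers who want a feel for what "amount of manual revision" can mean
in numbers, here is a rough Python sketch. It is not the procedure of the
cited reports: the function revision_workload, the tag labels and the
four-token corpus are all hypothetical. It simply counts how many
multitagged wordforms carry more than one tag and how many surplus tags a
human reviser would have to discard.

```python
def revision_workload(multitagged):
    """Summarise the manual disambiguation left in a multitagged corpus.

    multitagged: list of (wordform, set_of_tags) pairs.
    """
    total = len(multitagged)
    ambiguous = sum(1 for _, tags in multitagged if len(tags) > 1)
    # Every tag beyond the first on a token must be removed by hand.
    surplus = sum(len(tags) - 1 for _, tags in multitagged if tags)
    return {
        "tokens": total,
        "ambiguous_tokens": ambiguous,
        "ambiguous_share": ambiguous / total if total else 0.0,
        "tags_to_discard": surplus,
    }


# Hypothetical multitagged fragment; the tag labels are placeholders.
corpus = [
    ("o", {"DET", "PRON"}),
    ("jantar", {"N", "V"}),
    ("foi", {"V"}),
    ("bom", {"ADJ"}),
]

report = revision_workload(corpus)
print(report["ambiguous_tokens"], report["ambiguous_share"])  # 2 0.5
```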

For a service of more immediate interest, I suggest you consult our AC/DC
service, which serves modern Portuguese corpora, at
http://cgi.portugues.mct.pt/acesso/. Two of the corpora are already parsed
with Eckhard Bick's CG parser for Portuguese, and we hope to have the
parsed version of the remaining corpora ready soon.
Look up the distribution of infinitives or adjectives (example:
[pos="ADJ"]; ), select the option "Distribuição de lemas" ('Lemmata
distribution') in the field "Resultado", and see whether the result can be
of use to you.

Greetings,
Diana

**************************************************************************
Diana Santos				Computational processing of Portuguese

SINTEF Telecom and Informatics	Tel. (direct line) +47 22 06 73 12
Forskningsveien 1			Tel. +47 22 06 73 00
Box 124 Blindern			Fax. +47 22 06 73 50
N-0314 Oslo				Email: Diana.Santos at informatics.sintef.no
Norway					http://www.portugues.mct.pt/
**************************************************************************