[Corpora-List] Frequency of masc./fem/neut. in German
Christian Chiarcos
christian.chiarcos at web.de
Sat Apr 18 21:00:45 UTC 2009
Although the idea of extrapolating gender frequency from definite
determiners is promising, the proposal to just count "der, die, das" has
serious flaws: these forms are highly ambiguous, and their distributions
differ for reasons independent of gender:
(i) demonstrative pronouns and definite articles are almost identical in
form, but only the latter are relevant for gender frequency among common
nouns (the demonstrative "das" does not require a nominal antecedent)
(ii) "der" is not necessarily masculine, but could also be fem.gen.
(iii) "die" is not necessarily feminine, but could also be plural (hence,
any gender)
(iv) the distribution of determiners is not balanced for case: "der" is
nom. only (or fem.), "das" is nom. or acc., and "die" is nom. or acc. (or
pl.)
(v) the search is biased against neut.: definite determiners often fuse
with prepositions, but this usually affects only masc. (den, dem) and
neuter (das, dem) forms (= articles with short vowels ending in -s or a
nasal)
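
To make the ambiguity in (ii)-(iv) concrete, here is a minimal sketch that lists the readings of each form as a definite article. The table follows the standard German article paradigm rather than anything computed from a corpus, it deliberately leaves out demonstrative and relative-pronoun uses, and the names ARTICLE_READINGS and genders are just illustrative.

```python
# Possible (gender, case, number) readings of each form as a definite article.
# "any" marks plural forms, which do not distinguish gender.
ARTICLE_READINGS = {
    "der": [("masc", "nom", "sg"), ("fem", "gen", "sg"),
            ("fem", "dat", "sg"), ("any", "gen", "pl")],
    "die": [("fem", "nom", "sg"), ("fem", "acc", "sg"),
            ("any", "nom", "pl"), ("any", "acc", "pl")],
    "das": [("neut", "nom", "sg"), ("neut", "acc", "sg")],
}


def genders(form):
    """Return the sorted set of gender labels a form is compatible with."""
    return sorted({gender for gender, _case, _number in ARTICLE_READINGS[form]})


for form in ("der", "die", "das"):
    print(form, genders(form))
```

Only "das" maps to a single gender, so raw counts of "der" and "die" mix several genders together.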
In order to extrapolate gender frequency from the distribution of
determiners, we thus have to (a) balance for number, (b) balance for case,
and (c) exclude non-determiners while including all variants of a specific
determiner.
For (a), use the indefinite articles "ein, eine, ein" in place of the
definite article, thereby avoiding interference with plural forms and
problems with prepositions (indefinites don't combine with prepositions).
For (b), one can restrict the search to determiners followed by an
adjective, thus using the adjectival morphology for case disambiguation.
By combining adjectival inflection and determiner form, all genders can be
properly distinguished in nominative and accusative case, and the forms
are specific to nominative and accusative (m. ein -er/einen -en; f. eine
-e; n. ein -es). (For untagged text, a fallback strategy to identify
adjectives would be to check the capitalization of the following word:
since German nouns are capitalized, a lowercase word whose ending matches
one of the patterns can be counted as an adjective.)
This also solves problem (c), as "ein, eine, ein" and a following
adjective are quite likely part of the same NP.
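
For untagged text, the fallback strategy could be sketched roughly as follows. This is not the DWDS query used for the counts, only an illustration under stated assumptions: the regular expression, the PATTERNS table and the function count_genders are made up for this example, and a lowercase word with a matching ending is simply assumed to be an adjective (so adverbs such as "immer" after "ein" would be miscounted).

```python
import re
from collections import Counter

# (determiner, adjective ending) -> gender, following the patterns above.
PATTERNS = {
    ("ein", "er"): "masc",    # nominative
    ("einen", "en"): "masc",  # accusative
    ("eine", "e"): "fem",     # nominative or accusative
    ("ein", "es"): "neut",    # nominative or accusative
}

# "ein", "eine" or "einen" followed by a lowercase word. Capitalized words
# are skipped because they are most likely nouns; "einem", "einer" and
# "kein..." do not match at all.
PAIR = re.compile(r"\b([Ee]in(?:en|e)?)\s+([a-zäöüß]+)\b")


def count_genders(text):
    """Count determiner + adjective pairs per gender (heuristic, noisy)."""
    counts = Counter()
    for det, word in PAIR.findall(text):
        for (form, ending), gender in PATTERNS.items():
            if det.lower() == form and word.endswith(ending):
                counts[gender] += 1
                break
    return counts


sample = ("Ein alter Mann kauft eine rote Blume und ein kleines Haus. "
          "Er sieht einen großen Hund und ein Auto.")
print(count_genders(sample))
```

On the sample sentence, "ein Auto" is ignored because no adjective follows the determiner, which is the intended restriction.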
The counts for the DWDS core corpus (open + restricted subcorpora,
http://www.dwds.de/?corpus=1&qu="%40eine+%230+*e+with+%24p%3DADJA"&sort=1&res=-1&cp=1):
"ein"
masc. 126794 (33%)
fem. 169588 (44%)
neut. 88198 (23%)
similarly "kein"
masc. 7235 (33%)
fem. 9752 (45%)
neut. 4742 (22%)
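
The percentages can be re-derived from the raw counts with a few lines of Python. The counts are the ones given above; the names COUNTS and shares are only illustrative.

```python
# Raw counts as reported above for the DWDS core corpus.
COUNTS = {
    "ein": {"masc": 126794, "fem": 169588, "neut": 88198},
    "kein": {"masc": 7235, "fem": 9752, "neut": 4742},
}


def shares(counts):
    """Return each gender's share of the total, rounded to whole percent."""
    total = sum(counts.values())
    return {gender: round(100 * n / total) for gender, n in counts.items()}


for determiner, counts in COUNTS.items():
    print(determiner, shares(counts))
# ein {'masc': 33, 'fem': 44, 'neut': 23}
# kein {'masc': 33, 'fem': 45, 'neut': 22}
```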
This context-sensitive way of counting certainly gives a substantially
more reliable approximation of relative frequency than just counting
definite articles, and the relative dictionary frequencies reported before
are almost *exactly* matched (less than 3% deviation as compared to
celex+hagenlex and duden), thereby confirming these counts also in terms
of token frequency. The comparatively higher frequency of masc. and the
comparatively lower frequency of neut. in Andreas' counts are most likely
artifacts.
Still, these numbers, too, are only rough approximations (token frequency,
not type frequency; gender frequency is confounded here with semantic and
pragmatic factors, e.g., indefiniteness and NP complexity are probably not
independent of animacy, and inanimacy favours neuter gender).
Best,
Christian
--
Christian Chiarcos
University of Potsdam/Germany
Collaborative Research Center 632
Project D1 "Linguistic Data Base for Information Structure"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925