[Corpora-List] Frequency of masc./fem/neut. in German

Christian Chiarcos christian.chiarcos at web.de
Sat Apr 18 21:00:45 UTC 2009


Although the idea of extrapolating gender frequency from definite
determiners is promising, the proposal to just count "der, die, das" has
serious flaws: these forms are highly ambiguous, and their distributions
differ for reasons that are independent of gender:

(i) the demonstrative pronouns and the definite article are almost
identical in form, but only the latter is relevant for gender frequency
among common nouns (the demonstrative "das" does not require a nominal
antecedent)
(ii) "der" is not necessarily masculine, but could also be fem. gen. or
dat.
(iii) "die" is not necessarily feminine, but could also be plural (hence,
any gender)
(iv) the distribution of determiners is not balanced across cases: "der"
is nom. only (or fem.), "das" is nom. or acc., and "die" is nom. or acc.
(or pl.)
(v) the search is biased against neut.: definite determiners often fuse
with prepositions, but this usually affects only masc. (den, dem) and
neuter (das, dem), i.e., articles with a short vowel ending in -s or a
nasal
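
To make the ambiguity in (ii)-(iv) concrete, here is a minimal Python
sketch that inverts the definite article paradigm and lists every
gender/case reading of a surface form. The table is the standard textbook
paradigm, not anything taken from the corpus, and the names PARADIGM and
readings are purely illustrative.

```python
# Standard paradigm of the German definite article (textbook grammar,
# not corpus data). "pl" stands for plural, which is the same for all genders.
CASES = ("nom", "acc", "dat", "gen")
PARADIGM = {
    "masc": ("der", "den", "dem", "des"),
    "fem": ("die", "die", "der", "der"),
    "neut": ("das", "das", "dem", "des"),
    "pl": ("die", "die", "den", "der"),
}

def readings(form):
    """Return every (gender/number, case) cell that surfaces as `form`."""
    return [
        (gender, case)
        for gender, forms in PARADIGM.items()
        for case, candidate in zip(CASES, forms)
        if candidate == form
    ]

for form in ("der", "die", "das"):
    print(form, readings(form))
```

Only "das" comes out as gender-specific, and even it covers two cases;
"der" and "die" each have four readings, including plural ones.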

In order to extrapolate gender frequency from the distribution of
determiners, we thus have to (a) balance for number, (b) balance for case,
and (c) exclude non-determiners, but include all variants of a specific
determiner.

For (a), use the indefinite articles "ein, eine, ein" in place of the
definite article, thereby avoiding interference with plural forms and
problems with prepositions (indefinites don't fuse with prepositions).
For (b), one can restrict the search to determiners followed by an
adjective, thus using the adjectival morphology for case disambiguation.
By combining adjectival inflection and determiner form, all genders can be
properly distinguished in the nominative and accusative, and the resulting
forms are specific to these two cases (m. ein -er/einen -en; f. eine
-e; n. ein -es). (For non-tagged text, a fallback strategy to identify
adjectives would be to check the capitalization of the following word,
i.e., that it is not capitalized as a noun would be, and to count it as an
adjective if its ending matches one of the patterns.)
This also solves problem (c), as "ein, eine, ein" and a following  
adjective are quite likely part of the same NP.
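
The fallback for non-tagged text could be sketched roughly as follows in
Python. This is not the DWDS query used for the counts below, only an
illustration of the determiner + ending patterns; the function names, the
simple \w+ tokenisation, and the reading of the capitalization check as
"the following word is lower-case" are my assumptions.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")  # crude tokeniser, assumed for illustration

def patterns_for(stem="ein"):
    """(determiner form, adjective ending) -> gender, as in the patterns above."""
    return {
        (stem, "er"): "masc",         # nom.: ein guter Wein
        (stem + "en", "en"): "masc",  # acc.: einen guten Wein
        (stem + "e", "e"): "fem",     # nom./acc.: eine gute Suppe
        (stem, "es"): "neut",         # nom./acc.: ein gutes Bier
    }

def count_genders(text, stem="ein"):
    """Count determiner + lower-case word pairs whose ending fits a pattern."""
    patterns = patterns_for(stem)
    tokens = TOKEN.findall(text)
    counts = Counter()
    for det, adj in zip(tokens, tokens[1:]):
        if not adj.islower():  # capitalized word: treated as a noun, skipped
            continue
        for (form, ending), gender in patterns.items():
            if det.lower() == form and len(adj) > len(ending) and adj.endswith(ending):
                counts[gender] += 1
                break
    return counts

sample = ("Ein guter Wein, eine gute Suppe und ein gutes Bier. "
          "Er trinkt einen guten Wein. Das ist eine Rolle.")
print(count_genders(sample))
```

Passing stem="kein" gives the corresponding counts for "kein". The
heuristic does not check that a noun actually follows, so any lower-case
word with a matching ending is counted; a tagged corpus remains
preferable.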

The counts for the DWDS core corpus (open + restricted subcorpora,  
http://www.dwds.de/?corpus=1&qu="%40eine+%230+*e+with+%24p%3DADJA"&sort=1&res=-1&cp=1):

"ein"
masc. 126794 (33%)
fem. 169588 (44%)
neut. 88198 (23%)

similarly "kein"
masc. 7235 (33%)
fem. 9752 (45%)
neut. 4742 (22%)
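
The percentages can be re-derived from the raw counts; a small Python
sketch (the variable and function names are mine, the numbers are the ones
listed above):

```python
# Raw DWDS counts as listed above
COUNTS = {
    "ein": {"masc": 126794, "fem": 169588, "neut": 88198},
    "kein": {"masc": 7235, "fem": 9752, "neut": 4742},
}

def shares(counts):
    """Relative frequency of each gender in percent, rounded to integers."""
    total = sum(counts.values())
    return {gender: round(100 * n / total) for gender, n in counts.items()}

for determiner, counts in COUNTS.items():
    print(determiner, sum(counts.values()), shares(counts))
```

The totals are 384580 tokens for "ein" and 21729 for "kein"; the two
distributions differ by at most one percentage point after rounding.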

This context-sensitive way of counting certainly gives a substantially
more reliable approximation of relative frequency than just counting
definite articles, and the relative dictionary frequencies reported before
are almost *exactly* matched (less than 3% deviation compared with
CELEX+HaGenLex and Duden), thereby confirming these counts in terms of
token frequency as well. The comparatively higher frequency of masc. and
the comparatively lower frequency of neut. in Andreas' counts are most
likely artifacts.

Still, these numbers, too, are only rough approximations (token frequency,
not type frequency; gender frequency is confounded here with semantic and
pragmatic factors, e.g., indefiniteness and NP complexity are probably not
independent of animacy, and inanimacy favours neuter gender).

Best,
Christian
-- 
Christian Chiarcos
University of Potsdam/Germany
Collaborative Research Center 632
Project D1 "Linguistic Data Base for Information Structure"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925



