Dear Sebastian

On your question about the distribution of number values: all the available evidence points to the singular being considerably more frequent than the plural overall (though there are plural-dominant nouns, like English lips, which show the contrary pattern). The evidence is mainly from Indo-European, but where there is data on non-IE languages, the (limited) evidence so far is for an even greater preponderance of the singular. Here’s a current summary:

"Evidence for this general pattern is found in Greenberg (1966: 31–32), who gives data on French, Russian, Latin and Sanskrit, and the list is expanded in Corbett (2000: 281–282) to include Slovene and Upper Sorbian. For those (Indo-European) languages the singular is used typically in over 70% of the instances in running text (somewhat lower in Upper Sorbian); 70% is also the approximate figure reported for English (Haskell et al. 2003: 125), while 79% is reported for Finnish texts (Räsänen 1979: 24), and 91% for the Dagestanian language Hinuq (Forker 2016b: 93)."

It’s from
Greville G. Corbett 2019. Pluralia tantum nouns and the theory of features: a typology of nouns with non-canonical number properties. Morphology 29(1), 51–108. DOI: 10.1007/s11525-018-9336-0 (https://link.springer.com/article/10.1007/s11525-018-9336-0)
It’s open access, so the references can be checked there. As that paper suggests, the fun really starts when other number values are involved.

So why would there be apparent counter-evidence?
1. Some linguists fail to gloss bare stems, so they’d gloss dogs as plural but would not gloss the number of dog.
2. Similarly, in languages where part of the noun inventory shows “general number”, some might (reasonably) not gloss general when it equals singular, but would gloss plural.
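
To make the effect of these glossing habits concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy text (7 singular and 3 plural tokens of one noun), the gloss strings, and the helper name count_number_glosses are all made up, and real gloss tiers will be messier.

```python
from collections import Counter
import re


def count_number_glosses(glosses):
    """Count SG and PL labels in a list of gloss strings (hypothetical helper)."""
    counts = Counter()
    for gloss in glosses:
        # Split on the usual morpheme separators: hyphen, dot, equals sign.
        for label in re.split(r"[-.=]", gloss):
            if label in ("SG", "PL"):
                counts[label] += 1
    return counts


# Invented toy text: 7 singular tokens and 3 plural tokens.
fully_glossed = ["dog.SG"] * 7 + ["dog-PL"] * 3
# The same text, but bare stems are left unglossed for number.
bare_stems_unglossed = ["dog"] * 7 + ["dog-PL"] * 3

full = count_number_glosses(fully_glossed)
partial = count_number_glosses(bare_stems_unglossed)

print("fully glossed:", dict(full))
print("bare stems unglossed:", dict(partial))
```

In the second count the singulars are simply invisible, so a count of gloss labels would show PL ahead of SG even though the singular is more frequent in the text itself.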

Very best

On 7 Apr 2020, at 11:22, Sebastian Nordhoff <sebastian.nordhoff at glottotopia.de> wrote:

Dear list members,
I have done some analyses of endangered language archives (SOAS/ELAR, Nijmegen, Paradisec, AILLA). I retrieved 20k ELAN files with transcriptions and annotations and ran some statistics on the transcription tiers and the translation tiers.

Altogether, transcriptions have 2.5 million words; translations have 400,000 words.

Here are a number of findings I would like to share:

== Graphemes ==

- the most frequent grapheme in transcriptions is <a>
- the next most frequent graphemes are <e>, <i>, <n>, but the order is different between archives.

The fact that <a> is the most frequent grapheme is certainly plausible. But I am interested in explanations for the differences between <e>, <i>, <n>. Would we have expected these three, and which order would we have predicted?
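
For reference, the kind of count behind these rankings can be sketched as follows. This is a simplified illustration rather than the actual pipeline: it assumes one grapheme equals one letter (so digraphs such as <ng> are not treated as units), the sample lines are invented, and grapheme_counts is a hypothetical helper name.

```python
from collections import Counter
import unicodedata


def grapheme_counts(transcriptions):
    """Count letters in transcription lines (simplification: one letter = one grapheme)."""
    counts = Counter()
    for line in transcriptions:
        # Normalise to NFC so precomposed and decomposed accented letters
        # are counted as the same character where a precomposed form exists.
        for ch in unicodedata.normalize("NFC", line.lower()):
            if ch.isalpha():
                counts[ch] += 1
    return counts


# Invented sample lines standing in for one archive's transcription tier.
sample = ["ana e nina", "mana"]

counts = grapheme_counts(sample)
ranking = [grapheme for grapheme, _ in counts.most_common(2)]
print(ranking)
```

Running the same count per archive, and ideally per language, would show whether the differing order of <e>, <i> and <n> comes from the language sample or from orthographic conventions.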

== Categories ==

- the most frequent glossed categories are SG and PL. ELAR and Paradisec have more SG than PL, but TLA/Dobes has more PL than SG.

What do members of this list think are the most plausible explanations for the difference here? Does this meet your intuitions about the distribution of number categories?

The most popular non-number categories are DEM and PST by the way.

== Lexical glosses ==

'go' is the most frequent lexical gloss. 'come' ranks #5, #7, #8 and #12+ in four different archives. 'one' is also a popular gloss. 'say' is popular in ELAR and TLA, but not in PARADISEC. 'then' is found in AILLA, PARADISEC and TLA, but not so much in ELAR.

The dominance of 'go' is plausible, but I was wondering what could explain the relatively reduced frequency of 'say' and 'then' in certain archives.

For the time being, all analyses are based on tokens, and there is no control for area or language. I plan to refine the analyses in the future, but for now I would welcome some input from this list regarding possible explanations or hypotheses to investigate.

Best wishes

