[Lingtyp] Frequencies in language archives

Tue Apr 7 09:22:28 UTC 2020

Dear list members,
I have done some analyses of endangered language archives (SOAS/ELAR, 
TLA/DoBeS in Nijmegen, PARADISEC, AILLA). I retrieved 20k ELAN files with 
transcriptions and annotations and ran some statistics on the 
transcription tiers and the translation tiers.

Altogether, the transcriptions contain 2.5 million words; the 
translations contain 400,000 words.
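
For readers who want to run similar counts on their own files, here is a minimal sketch of how word counts per tier might be obtained from an ELAN file. It is not necessarily the procedure behind the figures above. The element and attribute names (TIER, TIER_ID, ANNOTATION_VALUE) follow the EAF XML format; the tier IDs "tx" and "ft", the sample content, and the decision to treat whitespace-separated strings as words are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def tier_values(eaf_xml):
    """Map each TIER_ID to the list of its non-empty annotation values."""
    root = ET.fromstring(eaf_xml)
    tiers = {}
    for tier in root.iter("TIER"):
        values = [
            v.text.strip()
            for v in tier.iter("ANNOTATION_VALUE")
            if v.text and v.text.strip()
        ]
        tiers[tier.get("TIER_ID")] = values
    return tiers


def word_count(values):
    """Count whitespace-separated words (a simplifying assumption)."""
    return sum(len(v.split()) for v in values)


# Invented two-tier sample; "tx" and "ft" are hypothetical tier IDs.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>ana nani e</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft" LINGUISTIC_TYPE_REF="default" PARENT_REF="tx">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>he went</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

tiers = tier_values(SAMPLE)
words_per_tier = {tier_id: word_count(vals) for tier_id, vals in tiers.items()}
```

In practice, deciding which tiers count as transcription and which as translation is the hard part, since tier naming differs between deposits; the sketch leaves that step out.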

Here are a number of findings I would like to share:

== Graphemes ==

- the most frequent grapheme in transcriptions is <a>
- the next most frequent graphemes are <e>, <i>, <n>, but their order 
differs between archives.

The fact that <a> is the most frequent grapheme is certainly plausible. 
But I am interested in explanations for the differences in the ranking 
of <e>, <i> and <n>. Would we have expected these three, and which order 
would we have predicted?
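
To make the comparison concrete, a grapheme ranking per archive could be computed roughly as follows. This is only a sketch under stated assumptions: it treats every Unicode letter code point as one grapheme (so digraphs such as <ng> are split and unattached combining marks are ignored), and the archive names and strings in the toy data are invented.

```python
from collections import Counter
import unicodedata


def grapheme_counts(lines):
    """Count letters, treating one code point as one grapheme (a simplification)."""
    counts = Counter()
    for line in lines:
        # NFC so that precomposed and decomposed spellings are counted alike.
        for ch in unicodedata.normalize("NFC", line.lower()):
            if ch.isalpha():
                counts[ch] += 1
    return counts


def top_graphemes(lines, n=4):
    """Return the n most frequent graphemes, most frequent first."""
    return [g for g, _ in grapheme_counts(lines).most_common(n)]


# Toy data, invented for illustration only.
toy = {
    "archive_A": ["aaaa eee ii n"],
    "archive_B": ["aaaa nnn ii e"],
}
ranking = {name: top_graphemes(lines) for name, lines in toy.items()}
```

Comparing the lists in `ranking` across archives shows where the order after <a> diverges.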

== Categories ==

- the most frequent glossed categories are SG and PL. ELAR and PARADISEC 
have more SG than PL, but TLA/DoBeS has more PL than SG.

What do members of this list think are the most plausible explanations 
for the difference here? Does this match your intuitions about the 
distribution of number categories?

By the way, the most frequent non-number categories are DEM and PST.
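
As an illustration of how category labels such as SG and PL might be tallied, here is a minimal sketch. It assumes Leipzig-style glosses in which grammatical labels are written in capitals and separated by characters such as "-", "=" or "."; the separator set, the stripping of person digits ("3SG" counted as SG), and the example glosses are all assumptions, not a description of how the archives are actually annotated.

```python
import re
from collections import Counter

# Assumed morpheme and word separators in gloss lines.
SEPARATORS = re.compile(r"[-=.:;~<>\s]+")


def category_counts(gloss_lines):
    """Count all-caps labels of two or more letters in gloss lines."""
    counts = Counter()
    for line in gloss_lines:
        for piece in SEPARATORS.split(line):
            label = piece.lstrip("0123456789")  # "3SG" is counted as "SG"
            if len(label) >= 2 and label.isalpha() and label.isupper():
                counts[label] += 1
    return counts


# Invented example glosses.
counts = category_counts(["dog-PL go-PST.3SG", "DEM man.SG"])
```

Single-letter labels such as A or S are skipped here to avoid counting the English gloss 'I'; a real analysis would need a more careful rule.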

== Lexical glosses ==

'go' is the most frequent lexical gloss. 'come' ranks #5, #7, #8 and 
#12+ in the four archives. 'one' is also a frequent gloss. 'say' is 
frequent in ELAR and TLA, but not in PARADISEC. 'then' is found in 
AILLA, PARADISEC and TLA, but not so much in ELAR.

The dominance of 'go' is plausible, but I was wondering what could 
explain the relatively low frequency of 'say' and 'then' in certain 
archives.
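
For anyone wishing to reproduce this kind of rank comparison, a small sketch follows. The archive names and counts are invented toy values, not the real figures, and giving tied glosses the same (better) rank is an arbitrary choice.

```python
from collections import Counter


def rank_of(counts, item):
    """1-based frequency rank of item; None if it does not occur."""
    freq = counts.get(item, 0)
    if freq == 0:
        return None
    # Rank = 1 + number of glosses that are strictly more frequent.
    return 1 + sum(1 for c in counts.values() if c > freq)


# Invented toy counts, for illustration only.
toy = {
    "archive_X": Counter({"go": 50, "one": 30, "come": 20, "say": 5}),
    "archive_Y": Counter({"go": 40, "then": 35, "say": 30, "come": 10}),
}
ranks = {
    gloss: {name: rank_of(counts, gloss) for name, counts in toy.items()}
    for gloss in ("go", "come", "say", "then")
}
```

A `None` entry in `ranks` marks a gloss that is absent from an archive, which is worth keeping apart from a gloss that is merely rare.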

For the time being, all analyses are based on tokens, and there is no 
control for area or language. I have plans for refining the analyses in 
the future, but for now, I would appreciate some input from this list 
regarding possible explanations or hypotheses to investigate.
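
To illustrate why the lack of control for language matters, the sketch below contrasts a pooled token share with a share averaged over languages, so that one large corpus cannot dominate. This is just one possible refinement, not the planned one; the language names and counts are invented.

```python
from collections import Counter


def pooled_share(corpora, item):
    """Share of item among all tokens, pooling every language together."""
    total = sum(sum(c.values()) for c in corpora.values())
    hits = sum(c.get(item, 0) for c in corpora.values())
    return hits / total if total else 0.0


def per_language_share(corpora, item):
    """Mean of the per-language shares, giving each language equal weight."""
    shares = [
        c.get(item, 0) / sum(c.values())
        for c in corpora.values()
        if sum(c.values())
    ]
    return sum(shares) / len(shares) if shares else 0.0


# Invented toy counts: one large and one small corpus.
toy = {
    "lang_A": Counter({"go": 90, "say": 10}),
    "lang_B": Counter({"go": 1, "say": 9}),
}
pooled = pooled_share(toy, "go")          # 91 of 110 tokens
balanced = per_language_share(toy, "go")  # mean of 0.9 and 0.1
```

In the toy data, 'go' looks dominant when tokens are pooled but accounts for only half the tokens on average per language, which is the kind of effect a token-based analysis cannot rule out.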

Best wishes
Sebastian