[Lingtyp] Frequencies in language archives
sebastian.nordhoff at glottotopia.de
Tue Apr 7 09:22:28 UTC 2020
Dear list members,
I have done some analyses of endangered language archives (SOAS/ELAR,
Nijmegen, PARADISEC, AILLA). I retrieved 20k ELAN files with
transcriptions and annotations and ran some statistics on the
transcription tiers and the translation tiers.
Altogether, transcriptions have 2.5 million words; translations have
Here are a number of findings I would like to share:
== Graphemes ==
- the most frequent grapheme in transcriptions is <a>
- the next most frequent graphemes are <e>, <i>, <n>, but the order is
different between archives.
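For readers who want to reproduce a count of this kind, here is a minimal sketch, not the script actually used for the figures above. It relies on the ELAN (.eaf) XML layout, where each TIER element holds ANNOTATION_VALUE elements. The tier name "transcription" and the sample document are invented for illustration, since tier names differ from corpus to corpus. It also assumes that one alphabetic code point equals one grapheme, so a digraph such as <ng> is counted as two letters.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter


def grapheme_counts(eaf_xml: str, tier_id: str) -> Counter:
    """Count letters in the annotation values of one tier of an ELAN document."""
    root = ET.fromstring(eaf_xml)
    counts = Counter()
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue  # skip translation and other tiers
        for value in tier.iter("ANNOTATION_VALUE"):
            # NFC-normalize and lowercase so that "A", "a" and decomposed
            # variants of the same letter are counted together.
            text = unicodedata.normalize("NFC", value.text or "").lower()
            counts.update(ch for ch in text if ch.isalpha())
    return counts


# Invented sample document; real .eaf files also carry a header and time slots.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="transcription">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a1">
      <ANNOTATION_VALUE>ana kana</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a2">
      <ANNOTATION_VALUE>the child</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

counts = grapheme_counts(SAMPLE, "transcription")
print(counts.most_common(3))
```

Running the same function per archive and comparing the top ranks would give the kind of between-archive comparison described here.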
The fact that <a> is the most frequent grapheme is certainly plausible.
But I am interested in explanations for the differences between <e>,
<i>, <n>. Would we have expected these three, and which order would we
expect?
== Categories ==
- the most frequent glossed categories are SG and PL. ELAR and PARADISEC
have more SG than PL, but TLA/Dobes has more PL than SG.
What do members of this list think are the most plausible explanations
for the difference here? Does this meet your intuitions about the
distribution of number categories?
By the way, the most frequent non-number categories are DEM and PST.
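A per-archive tally of glossed categories could be sketched roughly as below. This is an illustration under stated assumptions, not the procedure behind the numbers above: it assumes Leipzig-style glosses in which grammatical categories are written in capitals, it strips leading person digits so that "3SG" counts as SG, and it ignores one-letter labels. The archive names and gloss strings are invented.

```python
import re
from collections import Counter, defaultdict

# Assumed Leipzig-style separators between morphemes and stacked categories.
SEPARATORS = re.compile(r"[-=.:;]")


def category_counts(glosses_by_archive):
    """Count upper-case gloss labels per archive."""
    counts = defaultdict(Counter)
    for archive, gloss_words in glosses_by_archive.items():
        for word in gloss_words:
            for piece in SEPARATORS.split(word):
                # Assumption: drop person digits, so "3SG" is counted as "SG".
                label = piece.lstrip("0123456789")
                # Upper-case labels of two or more letters count as categories;
                # lower-case pieces such as "dog" are lexical glosses.
                if len(label) > 1 and label.isupper():
                    counts[archive][label] += 1
    return counts


# Invented toy data standing in for the gloss tiers of two archives.
TOY = {
    "archive_A": ["dog-PL", "see-PST", "3SG-go", "DEM"],
    "archive_B": ["child-PL", "1PL-come-PST", "house.SG"],
}

counts = category_counts(TOY)
for archive, counter in counts.items():
    print(archive, dict(counter))
```

Whether person-number bundles such as 1PL should feed into the PL count is exactly the kind of decision that can shift the SG/PL ratio between archives, so it is worth making explicit.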
== Lexical glosses ==
'go' is the most frequent lexical gloss. 'come' ranks #5, #7, #8
and #12+ in four different archives. 'one' is also a popular gloss.
'say' is popular in ELAR and TLA, but not in PARADISEC. 'then' is found
in AILLA, PARADISEC and TLA, but not so much in ELAR.
The dominance of 'go' is plausible, but I was wondering what could
explain the relatively reduced frequency of 'say' and 'then' in certain
archives.
For the time being, all analyses are based on tokens, and there is no
control for area or language. I have plans for refining the analyses in
the future, but for right now, I would need some input from this list
regarding possible explanations or hypotheses to investigate.
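To illustrate what controlling for language could change, the sketch below contrasts a pooled token share with a share that is first computed per language and then averaged, so that each language counts once however large its corpus is. This is only one possible refinement and not necessarily the one planned. The function names, language names and token lists are invented.

```python
def pooled_share(tokens_by_language, item):
    """Share of `item` among all tokens, pooled across languages."""
    total = sum(len(tokens) for tokens in tokens_by_language.values())
    hits = sum(tokens.count(item) for tokens in tokens_by_language.values())
    return hits / total


def balanced_share(tokens_by_language, item):
    """Mean of the per-language shares of `item` (each language weighs the same)."""
    shares = [
        tokens.count(item) / len(tokens)
        for tokens in tokens_by_language.values()
        if tokens  # skip languages without any tokens
    ]
    return sum(shares) / len(shares)


# Invented toy data: one large and one small corpus of lexical glosses.
TOY = {
    "lang_big": ["go"] * 8 + ["say"] * 2,
    "lang_small": ["say", "then"],
}

pooled = pooled_share(TOY, "say")      # 3 of 12 tokens
balanced = balanced_share(TOY, "say")  # mean of 2/10 and 1/2
print(pooled, balanced)
```

In the toy data the pooled share of 'say' is 0.25 and the balanced share is 0.35, which shows how a single well-documented language can pull token-based figures towards its own profile.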