[Lingtyp] Examplology: imtvault.org
sebastian.nordhoff at glottotopia.de
Thu Mar 24 07:14:38 UTC 2022
Dear list members,
there has been some discussion about "hit", "kill", "John", "Mary", and
other usual suspects. Over the past months, we have worked on a corpus
of all examples found in Language Science Press books. This corpus is
now available in a beta version at imtvault.org. It contains 40648
interlinear examples from 124 different languages and can be filtered
along various criteria. For instance, we can search for John, Mary, or
Peter:
https://imtvault.org/?q=John: 266 hits
https://imtvault.org/?q=Mary: 223 hits
https://imtvault.org/?q=Peter: 232 hits
We can look into the popularity of certain verbs:
https://imtvault.org/?q=hit: 399 hits
https://imtvault.org/?q=kill: 440 hits
https://imtvault.org/?q=love: 181 hits
https://imtvault.org/?q=kiss: 26 hits
https://imtvault.org/?q=carry: 235 hits
We have also retrieved semantic categories, so you get
which gives you examples about tobacco, rice, barley, wheat and so on.
Other categories which might be interesting:
https://imtvault.org/?parententities=Weapon: 89 hits
https://imtvault.org/?parententities=Hazard: 205 hits
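The search links above all follow the same pattern, so they can be
generated rather than typed by hand. The following is a minimal Python
sketch. It assumes only the two query parameters visible in the links
above (q and parententities); the helper name search_url is made up for
illustration and is not part of imtvault.org. It builds the URLs and
does not fetch anything.

```python
from urllib.parse import urlencode

# Base address as it appears in the links above.
BASE_URL = "https://imtvault.org/"


def search_url(**params):
    """Build a search link from query parameters (hypothetical helper).

    Only 'q' and 'parententities' are attested in the links above;
    any other parameter name would be a guess.
    """
    return BASE_URL + "?" + urlencode(params)


# Free-text searches for the usual suspects.
word_urls = [search_url(q=word) for word in ["John", "Mary", "Peter"]]

# Searches by semantic category.
category_urls = [search_url(parententities=cat) for cat in ["Weapon", "Hazard"]]

for url in word_urls + category_urls:
    print(url)
```

Using urlencode rather than plain string concatenation means that search
terms containing spaces or non-ASCII characters are escaped properly.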
You can also filter for grammatical categories. Of the examples in the
corpus, 2808 contain a plural morpheme, while 2116 contain a singular
morpheme. Accusative (1937) is more frequent than genitive (1601),
dative (1309) or nominative (1232).
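To put these counts in proportion, they can be set against the corpus
size of 40648 examples given above. The short Python sketch below does
that arithmetic. It assumes that each figure counts examples (not
individual morpheme tokens) and that all figures refer to the same beta
snapshot of the corpus; the function name share is made up for
illustration.

```python
# Figures as quoted in this message; assumed to be counts of examples.
TOTAL_EXAMPLES = 40648

counts = {
    "plural": 2808,
    "singular": 2116,
    "accusative": 1937,
    "genitive": 1601,
    "dative": 1309,
    "nominative": 1232,
}


def share(count, total=TOTAL_EXAMPLES):
    """Return count as a percentage of total."""
    return 100 * count / total


# List the categories from most to least frequent with their share.
for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<11}{count:>6}{share(count):7.1f}%")
```

Since a single example can contain several of these morphemes, the
shares are not expected to add up to anything meaningful.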
The content of the corpus is obviously skewed by the following criteria:
1) The coverage of the input books. Australia for instance is severely
underrepresented.
2) The length of the input books. "A grammar of Japhug" is 1600 pages,
so you are likely to get a lot of Japhug grammatical categories.
3) The source code of the books. We extract the examples from the tex
files used to generate the pdf, and assume certain conventions. If a
book author does not follow these conventions, we are not able to
retrieve the examples.
All this means that the corpus, despite its size, is still
opportunistic. But it may trigger some interesting ideas, which can
then be pursued with a more systematic approach. We are also
working on making the data available for machine queries so that you can
import the corpus into R or similar and run your own statistics.
There are still some rough edges here and there, but we will be working
on ironing them out. If you have any suggestions or feature requests,
feel free to contact me.
Sebastian (also on behalf of Thomas Krämer)
PS: If you are wondering about the high frequency of Greek philosophers,
they are all from our translation of Wackernagel's "On a law of
Indo-European word order".