[Lingtyp] Examplology: imtvault.org

Thu Mar 24 07:14:38 UTC 2022

Dear list members,
there has been some discussion about "hit", "kill", "John", "Mary", and 
other usual suspects. Over the past months, we have worked on a corpus 
of all examples found in Language Science Press books. This corpus is 
now available in a beta version at imtvault.org. It contains 40648 
interlinear examples from 124 different languages and can be filtered 
by various criteria. For instance, we can search for John, Mary, or 
Peter.

https://imtvault.org/?q=John: 266 hits
https://imtvault.org/?q=Mary: 223 hits
https://imtvault.org/?q=Peter: 232 hits

We can look into the popularity of certain verbs:

https://imtvault.org/?q=hit: 399 hits
https://imtvault.org/?q=kill: 440 hits
https://imtvault.org/?q=love: 181 hits
https://imtvault.org/?q=kiss: 26 hits
https://imtvault.org/?q=carry: 235 hits

We have also retrieved semantic categories. For example,
https://imtvault.org/?parententities[0]=Crop
gives you examples about tobacco, rice, barley, wheat, and so on.

Other categories which might be interesting:
https://imtvault.org/?parententities[0]=Weapon: 89 hits
https://imtvault.org/?parententities[0]=Hazard: 205 hits
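
The links above follow two simple patterns, so they can be generated 
for any search term or category. The following is a minimal sketch 
that assumes only the two query parameters visible in the links (`q` 
and `parententities[0]`); the function names are made up for 
illustration, and other parameters of the site may exist.

```python
from urllib.parse import urlencode

# Base address as given in the announcement.
BASE_URL = "https://imtvault.org/"


def search_url(term: str) -> str:
    """Build a free-text search link using the `q` parameter."""
    return BASE_URL + "?" + urlencode({"q": term})


def category_url(category: str) -> str:
    """Build a semantic-category link using `parententities[0]`."""
    # safe="[]" keeps the square brackets readable instead of
    # percent-encoding them as %5B and %5D.
    return BASE_URL + "?" + urlencode({"parententities[0]": category}, safe="[]")


if __name__ == "__main__":
    for name in ("John", "Mary", "Peter"):
        print(search_url(name))
    print(category_url("Crop"))
```

Terms containing spaces or non-ASCII characters are escaped by 
`urlencode`, so the helpers are not limited to single English words.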

You can also filter for grammatical categories. Of the examples in the 
corpus, 2808 contain a plural morpheme, while 2116 contain a singular 
morpheme. Accusative (1937) is more popular than genitive (1601), dative 
(1309), or nominative (1232).
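
Counts of this kind can be approximated by tallying, per example, the 
category labels that occur in the gloss line. The sketch below is an 
illustration only: it assumes Leipzig-style glosses in which category 
labels are written in capitals and separated by hyphens, periods, or 
equals signs, and it runs on a few made-up gloss lines, not on corpus 
data. It is not necessarily how the counts above were produced.

```python
import re
from collections import Counter

# Made-up gloss lines for illustration only; NOT taken from the corpus.
TOY_GLOSSES = [
    "dog-PL see-PST man-ACC",
    "woman-NOM.SG book-ACC.SG read-PRS",
    "child-PL house-GEN",
]

# Characters assumed to separate words and morpheme glosses.
SEPARATORS = re.compile(r"[\s\-.=]+")


def count_examples_with_label(glosses):
    """Count, for each category label, how many examples contain it."""
    counts = Counter()
    for line in glosses:
        # A set ensures each label is counted once per example,
        # even if it occurs several times in the same gloss line.
        labels = {tok for tok in SEPARATORS.split(line) if tok.isupper()}
        counts.update(labels)
    return counts


if __name__ == "__main__":
    for label, n in count_examples_with_label(TOY_GLOSSES).most_common():
        print(label, n)
```

Counting per example rather than per token matches the phrasing "2808 
contain a plural morpheme": an example with two plural markers still 
counts once.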

The content of the corpus is obviously skewed by the following factors:
1) The coverage of the input books. Australia for instance is severely 
underrepresented.
2) The length of the input books. "A grammar of Japhug" is 1600 pages, 
so you are likely to get a lot of Japhug grammatical categories.
3) The source code of the books. We extract the examples from the TeX 
files used to generate the PDF, and assume certain conventions. If a 
book author does not follow these conventions, we are not able to 
retrieve the examples.
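
To make point 3 concrete, here is a minimal sketch of convention-based 
extraction. It assumes gb4e-style markup (`\gll` for the source and 
gloss lines, each ended by `\\`, then `\glt` for the translation, 
closed by `\z`); this markup, the regular expression, and the sample 
are assumptions for illustration and not the actual imtvault 
extraction code.

```python
import re

# Assumed convention: \gll <source>\\ <gloss>\\ \glt <translation> \z
EXAMPLE = re.compile(
    r"\\gll\s+(.*?)\\\\\s*"   # source line, terminated by \\
    r"(.*?)\\\\\s*"           # gloss line, terminated by \\
    r"\\glt\s+(.*?)\s*\\z",   # translation, up to the closing \z
    re.DOTALL,
)

# Hand-written sample for illustration; not taken from any book.
SAMPLE_TEX = r"""
\ea
\gll Maria canem videt\\
     Maria dog.ACC see.3SG\\
\glt `Maria sees the dog.'
\z
"""


def extract_examples(tex: str):
    """Return a list of dicts with source, gloss, and translation."""
    return [
        {
            "source": m.group(1).strip(),
            "gloss": m.group(2).strip(),
            # Drop the TeX-style quotation marks around the translation.
            "translation": m.group(3).strip("`'"),
        }
        for m in EXAMPLE.finditer(tex)
    ]


if __name__ == "__main__":
    for example in extract_examples(SAMPLE_TEX):
        print(example)
```

An example typeset with different macros simply produces no match 
here, which is the failure mode described in point 3.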

All this means that the corpus, despite its size, is still 
opportunistic. But it may spark some interesting ideas, which can then 
be pursued with a more systematic approach. We are also working on 
making the data available for machine queries so that you can import 
the corpus into R or similar and run your own statistics.

There are still some rough edges here and there, but we will be working 
on ironing them out. If you have any suggestions or feature requests, 
feel free to contact me.

Best wishes
Sebastian (also on behalf of Thomas Krämer)

PS: If you are wondering about the high frequency of Greek philosophers, 
they all come from our translation of Wackernagel's "On a law of 
Indo-European word order".