[Ura-list] FYI: INEL Selkup and Kamas corpora
alexandre.arkhipov at uni-hamburg.de
Fri Jan 25 12:33:44 EST 2019
The first versions of two digital corpora developed as part of the INEL
project (https://inel.corpora.uni-hamburg.de/), Selkup and Kamas, are
Texts are provided with interlinear glossing (with lexical glosses in
English and Russian) and translations into English, Russian and German.
Some texts also have (partial) annotations for syntactic functions,
semantic roles and information status, lexical borrowings and
The corpora are published in open access under Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International Public License
(CC BY-NC-SA 4.0). See below for details on using the corpora.
The corpora are primarily intended for typologically aware corpus-based
grammatical research but may also be of interest to linguists in other
subfields as well as to specialists in folklore, anthropology and history.
1. INEL Selkup Corpus (v0.1)
Selkup is an endangered Samoyedic language (Uralic family), which used
to be spoken in many small settlements dispersed over a large territory
in Western Siberia.
The INEL Selkup corpus is composed of texts from the archive of Angelina
Ivanovna Kuzmina (1924–2002), who between 1962 and 1977 gathered a large
amount of Selkup material in almost all regions where the Selkup people
lived.
Most texts in the corpus originate from the handwritten part of the
archive that she transferred to Hamburg in 2001; the others come from
her sound recordings digitized in 2001, which have been transcribed and
translated within the INEL project.
The present version of the corpus comprises 78 texts (18 673 words),
mostly representing Northern varieties of Selkup.
2. INEL Kamas Corpus (v0.1)
Kamas belongs to the Samoyedic branch of the Uralic language family. The
language became extinct in the late 20th century, with the death of its
last known speaker, Klavdiya Plotnikova (1895–1989). All the surviving
Kamas texts document Forest Kamas varieties spoken in the settlement of
Abalakovo, in the present Krasnoyarsk Krai in Southern Siberia.
The INEL Kamas corpus is the first publicly available digital resource
with annotated Kamas texts. It consists of two parts:
folklore texts collected by Kai Donner in 1912–1914, and transcribed
audio recordings of Klavdiya Plotnikova made between 1964 and 1970 in
Abalakovo, Tartu and Tallinn. Most of these recordings were transcribed
within the INEL project (including re-transcription of some tapes,
fragments of which were published by Ago Künnap in 1976–1992).
The present version of the corpus comprises 137 texts (48 293 words);
this includes 16 texts collected by Kai Donner and 121 texts from the
recordings of Klavdiya Plotnikova (ca. 10.5 hours).
Working with the corpora
The data in the corpora (annotated texts as well as corresponding
metadata) are represented in XML formats of the freely distributed
EXMARaLDA suite (http://exmaralda.org/en/).
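As a rough illustration of what working with these XML files can look
like outside the EXMARaLDA tools, here is a minimal Python sketch that
reads the tiers of a basic transcription (EXB). It assumes the usual EXB
layout of "tier" elements containing "event" elements with "start" and
"end" attributes. The sample document, the tier names ("tx", "ge") and
their contents are invented for the example and are not taken from the
INEL corpora.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an EXMARaLDA basic transcription.
# Tier ids, categories and event texts are hypothetical placeholders.
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0"/>
      <tli id="T1"/>
      <tli id="T2"/>
    </common-timeline>
    <tier id="tx" category="tx" type="t">
      <event start="T0" end="T1">word1 </event>
      <event start="T1" end="T2">word2</event>
    </tier>
    <tier id="ge" category="ge" type="a">
      <event start="T0" end="T1">gloss1</event>
      <event start="T1" end="T2">gloss2</event>
    </tier>
  </basic-body>
</basic-transcription>
"""


def read_tiers(root):
    """Map each tier id to a list of (start, end, text) tuples."""
    tiers = {}
    for tier in root.iter("tier"):
        tiers[tier.get("id")] = [
            # itertext() also collects text nested in child elements
            (ev.get("start"), ev.get("end"), "".join(ev.itertext()).strip())
            for ev in tier.iter("event")
        ]
    return tiers


root = ET.fromstring(SAMPLE_EXB)
for tier_id, events in read_tiers(root).items():
    print(tier_id, events)
```

For a downloaded file, ET.parse("some_text.exb").getroot() can be passed
to read_tiers in place of the sample (the file name is a placeholder).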
User guides (in English) are available here:
For browsing (and playback) of individual texts, use the «Sessions» tab on
the main corpus page. Each text can be viewed in one of three online
formats (e.g. Visualizations: Score) and downloaded in EXB (an EXMARaLDA
format). The sources of texts, i.e. scanned pages (PDF) or sound files
(WAV, MP3), can also be viewed or downloaded.
For searching across the whole corpus, the complete archive of the
corpus files can be downloaded and searched with the EXAKT program of
the EXMARaLDA suite.
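For quick scripted checks over the downloaded archive, a simple regular
expression search can serve as a rough stand-in. The Python sketch below
is not EXAKT and does not reproduce its features; it only scans the event
texts of every .exb file under a directory. The directory layout, the
tier category names and the function names are assumptions made for the
example.

```python
import re
from pathlib import Path
import xml.etree.ElementTree as ET


def search_events(root, pattern, category=None):
    """Yield (tier_id, start, end, text) for events whose text matches."""
    regex = re.compile(pattern)
    for tier in root.iter("tier"):
        # Optionally restrict the search to tiers of one category
        if category is not None and tier.get("category") != category:
            continue
        for ev in tier.iter("event"):
            text = "".join(ev.itertext()).strip()
            if regex.search(text):
                yield tier.get("id"), ev.get("start"), ev.get("end"), text


def search_corpus(corpus_dir, pattern, category=None):
    """Search every .exb file below corpus_dir; yield (path, hit) pairs."""
    for path in sorted(Path(corpus_dir).rglob("*.exb")):
        root = ET.parse(path).getroot()
        for hit in search_events(root, pattern, category):
            yield path, hit
```

A call such as search_corpus("unpacked_corpus", r"-PST$", category="ge")
would list events in tiers of the (hypothetical) category "ge" whose
text ends in "-PST"; the actual tier categories should be checked in the
corpus files or user guides.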
Furthermore, in the next few weeks, an online search interface will be
opened for both corpora, based on the Tsakonian Corpus Platform
(Tsakorpus, https://bitbucket.org/tsakorpus/). A test search across a
fragment of the Selkup corpus is currently available at
Please send your comments and suggestions to: inel at uni-hamburg.de.
Dr. Alexandre Arkhipov
Institut für Finnougristik/Uralistik - Akademieprojekt INEL
+49 40 42838 6890