[Lingtyp] FYI: INEL Selkup and Kamas corpora

Alexandre Arkhipov sarkipo at yandex.ru
Fri Jan 25 17:31:15 UTC 2019

Dear colleagues,

The first versions of two digital corpora developed as part of the INEL 
project (https://inel.corpora.uni-hamburg.de/?page_id=920), Selkup and 
Kamas, have been published online.

Texts are provided with interlinear glossing (with lexical glosses in 
English and Russian) and translations into English, Russian and German. 
Some texts also have (partial) annotations for syntactic functions, 
semantic roles and information status, lexical borrowings and 

The corpora are published in open access under the Creative Commons 
Attribution-NonCommercial-ShareAlike 4.0 International Public License 
(CC BY-NC-SA 4.0). See below for details on using the corpora.

The corpora are primarily intended for typologically aware corpus-based 
grammatical research but may also be of interest to linguists of other 
branches as well as to specialists in folklore, anthropology and history.

1. INEL Selkup Corpus (v0.1)

Selkup is an endangered Samoyedic language (Uralic family), which used 
to be spoken in many small settlements dispersed over a large territory 
in Western Siberia.
The INEL Selkup corpus is composed of texts from the archive of Angelina 
Ivanovna Kuzmina (1924–2002), who gathered a large amount of material on 
Selkup in almost all regions where the Selkup people lived in 1962–1977. 
Most texts in the corpus originate from the handwritten part of the 
archive that she transferred to Hamburg in 2001; the others come from 
her sound recordings digitized in 2001, which have been transcribed and 
translated within the INEL project.
The present version of the corpus comprises 78 texts (18 673 words), 
mostly representing Northern varieties of Selkup.

2. INEL Kamas Corpus (v0.1)

Kamas belongs to the Samoyedic branch of the Uralic language family. The 
language became extinct in the late 20th century, with the death of its 
last known speaker, Klavdiya Plotnikova (1895–1989). All the surviving 
Kamas texts document Forest Kamas varieties spoken in the settlement of 
Abalakovo, in the present Krasnoyarsk Krai in Southern Siberia.
The INEL Kamas corpus is the first publicly available digital resource 
with annotated Kamas texts. The INEL Kamas corpus consists of two parts: 
folklore texts collected by Kai Donner in 1912–1914, and transcribed 
audio recordings of Klavdiya Plotnikova made between 1964 and 1970 in 
Abalakovo, Tartu and Tallinn. Most of these recordings were transcribed 
within the INEL project (including re-transcription of some tapes, 
fragments of which were published by Ago Künnap in 1976–1992).
The present version of the corpus comprises 137 texts (48 293 words); 
this includes 16 texts collected by Kai Donner and 121 texts from the 
recordings of Klavdiya Plotnikova (ca. 10.5 hours).

Working with the corpora

The data in the corpora (annotated texts as well as corresponding 
metadata) are represented in XML formats of the freely distributed 
EXMARaLDA suite (http://exmaralda.org/en/).
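
As a rough illustration of what working with these XML files might look 
like outside the EXMARaLDA tools, here is a minimal Python sketch that 
reads the tiers of a basic transcription (EXB) and pairs up events that 
span the same timeline points. The sample document is hand-written for 
the example; the tier categories ("tx", "ge") and all contents are 
hypothetical placeholders, not taken from the INEL corpora, so please 
check the actual tier names in the downloaded files.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of an EXMARaLDA basic
# transcription. Tier categories and event texts are hypothetical.
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0"/>
      <tli id="T1"/>
      <tli id="T2"/>
    </common-timeline>
    <tier id="tx" category="tx" type="t">
      <event start="T0" end="T1">word1 </event>
      <event start="T1" end="T2">word2</event>
    </tier>
    <tier id="ge" category="ge" type="a">
      <event start="T0" end="T1">gloss1</event>
      <event start="T1" end="T2">gloss2</event>
    </tier>
  </basic-body>
</basic-transcription>
"""


def read_tiers(root):
    """Return {tier category: [(start, end, text), ...]} for every tier."""
    tiers = {}
    for tier in root.iter("tier"):
        events = [
            (ev.get("start"), ev.get("end"), (ev.text or "").strip())
            for ev in tier.iter("event")
        ]
        tiers[tier.get("category")] = events
    return tiers


def align(tiers, base, other):
    """Pair events of two tiers that share the same start and end points."""
    lookup = {(s, e): text for s, e, text in tiers.get(other, [])}
    return [(text, lookup.get((s, e))) for s, e, text in tiers.get(base, [])]


root = ET.fromstring(SAMPLE_EXB)
tiers = read_tiers(root)
pairs = align(tiers, "tx", "ge")
print(pairs)
```

For a file on disk, ET.parse(path).getroot() can be used in place of 
ET.fromstring. Events that have no exact counterpart in the other tier 
are paired with None in this sketch.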

User guides (in English) are available here:

For browsing (and playback) of individual texts, use the «Sessions» tab on 
the main corpus page. Each text can be viewed in one of three online 
formats (e.g. Visualizations: Score) and downloaded in EXB (an EXMARaLDA 
format). The sources of texts, i.e. scanned pages (PDF) or sound files 
(WAV, MP3), can also be viewed/downloaded.

For searching across the whole corpus, the complete archive of the 
corpus files can be downloaded and searched with the EXAKT program of 
the EXMARaLDA suite.
Furthermore, in the next few weeks, an online search interface will be 
open for both corpora, based on the Tsakonian Corpus Platform 
(Tsakorpus, https://bitbucket.org/tsakorpus/). A test search across a 
fragment of the Selkup corpus is currently available at 

Please send your comments and suggestions to: inel at uni-hamburg.de.

Best regards,
Alexandre Arkhipov
