[Lingtyp] INEL Dolgan corpus released

Alexandre Arkhipov sarkipo at yandex.ru
Fri Sep 13 12:14:39 UTC 2019


We are glad to announce that the first version of the Dolgan corpus 
developed in the INEL project (https://inel.corpora.uni-hamburg.de/) 
is now published online.

http://hdl.handle.net/11022/0000-0007-CAE7-1

Dolgan is an endangered Turkic language of Northern Siberia. It is 
spoken by approximately 1,000 people on the Taymyr peninsula and in 
adjacent areas. Dolgan is closely related to Yakut (Sakha) but 
nevertheless differs from it in many respects. It is in close contact 
with the neighboring languages Nganasan, Enets and Evenki, as well as 
with Russian.

The corpus contains folklore and narrative texts as well as 
spontaneous conversations. All material is interlinearly glossed; part 
of it is additionally annotated for Semantic Roles, Syntactic Functions, 
Information Status and Structure, and Borrowing and Code-Switching. 
Roughly half of the material is aligned with the corresponding sound 
files, amounting to ca. 10 hours of Dolgan speech in total.

The INEL Dolgan corpus is composed of texts from different sources:
1. Published folklore texts from an edited volume ("Fol'klor Dolgan", 
P.E. Efremov 2000),
2. Transcripts of recordings provided by the Taymyr House of Folk Art 
(TDNT) in Dudinka (1970s-2000s),
3. Transcripts from the collection of Dr. Eugénie Stapert recorded on 
several fieldwork trips in 2007-2010,
4. Transcripts of recordings made on a fieldwork trip in 2017.

*Accessing the corpus*

An online search interface, similar to those for the Selkup 
<https://inel.corpora.uni-hamburg.de/SelkupCorpus/search> and Kamas 
<https://inel.corpora.uni-hamburg.de/KamasCorpus/search> corpora, will 
be made available in the near future.

The data in the corpora (annotated texts as well as corresponding 
metadata) are represented in XML formats of the freely distributed 
EXMARaLDA suite (http://exmaralda.org/en/).
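
For readers who prefer to inspect the downloaded files programmatically, 
the following is a minimal sketch in Python (standard library only) of 
reading the tiers of one EXB file. It assumes the usual layout of an 
EXMARaLDA basic transcription, i.e. <tier> elements containing <event> 
elements with "start" and "end" attributes that point into a common 
timeline. The function names read_tiers and find_events are hypothetical 
helpers, and the sample document, its tier names and its placeholder 
words are invented for illustration; they are not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET

# Invented miniature EXB-like document (placeholder content, not corpus data).
SAMPLE_EXB = """<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.5"/>
      <tli id="T2" time="2.0"/>
    </common-timeline>
    <tier id="tx" category="tx" type="t">
      <event start="T0" end="T1">word1 word2 </event>
      <event start="T1" end="T2">word3 </event>
    </tier>
    <tier id="fe" category="fe" type="a">
      <event start="T0" end="T2">placeholder translation</event>
    </tier>
  </basic-body>
</basic-transcription>"""


def read_tiers(exb_source):
    """Return {tier_id: [(start, end, text), ...]} for one EXB file.

    exb_source may be a file path or an open file-like object.
    """
    root = ET.parse(exb_source).getroot()
    tiers = {}
    for tier in root.iter("tier"):
        events = []
        for event in tier.iter("event"):
            # itertext() also collects text nested inside child elements.
            text = "".join(event.itertext()).strip()
            events.append((event.get("start"), event.get("end"), text))
        tiers[tier.get("id")] = events
    return tiers


def find_events(tiers, needle):
    """Yield (tier_id, start, end, text) for events whose text contains needle."""
    for tier_id, events in tiers.items():
        for start, end, text in events:
            if needle in text:
                yield (tier_id, start, end, text)


if __name__ == "__main__":
    tiers = read_tiers(io.StringIO(SAMPLE_EXB))
    for hit in find_events(tiers, "word3"):
        print(hit)
```

To try this on real data, pass the path of a downloaded EXB file to 
read_tiers instead of the sample string. The tier names actually used 
in the corpus are described in the user documentation.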

User documentation (in English) is available here: 
INEL_Dolgan_Corpus.pdf 
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/file:dolgan-1.0_INEL_Dolgan_Corpus_1.0_User_Documentation/datastream/PDF/INEL_Dolgan_Corpus.pdf>

To browse (and play back) individual texts, use the «Sessions 
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/spoken-corpus:dolgan-1.0#corpus-content>» 
tab on the main corpus page. Each text can be viewed in one of three 
online formats (e.g. Visualizations: Score 
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/transcript:dolgan-1.0_AnIM_2009_Argish_nar/datastream/SCORE/AnIM_2009_Argish_nar-score.html>) 
and downloaded in EXB (an EXMARaLDA format). The sources of the texts, 
i.e. scanned pages (PDF) or sound files (WAV, MP3), can also be viewed 
or downloaded.

For searching across the whole corpus, the complete archive of the 
corpus files 
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/spoken-corpus:dolgan-1.0#additional-files> 
can be downloaded and searched with the EXAKT program of the EXMARaLDA 
suite.
Furthermore, in the next few weeks, an online search interface will be 
launched, based on the Tsakonian Corpus Platform (Tsakorpus 
<https://bitbucket.org/tsakorpus/>).

Please send your comments and suggestions to: inel at uni-hamburg.de.


Best regards,
Alexandre Arkhipov
