[Lingtyp] INEL Dolgan corpus released
Alexandre Arkhipov
sarkipo at yandex.ru
Fri Sep 13 12:14:39 UTC 2019
We are glad to announce that the first version of the Dolgan corpus
developed in the INEL project (https://inel.corpora.uni-hamburg.de/)
is now published online.
http://hdl.handle.net/11022/0000-0007-CAE7-1
Dolgan is an endangered Turkic language of Northern Siberia. It is
spoken by approximately 1,000 people on the Taymyr peninsula and in
adjacent areas. Dolgan is closely related to Yakut (Sakha), but differs
nevertheless in many aspects. Dolgan is in close contact with the
neighboring languages Nganasan, Enets and Evenki as well as with Russian.
The corpus at hand contains folklore and narrative texts as well as
spontaneous conversations. All material is interlinearly glossed;
part of it is additionally annotated for Semantic Roles, Syntactic
Functions, Information Status and Structure as well as Borrowing and
Code-Switching. Roughly half of the material is aligned with the
corresponding sound files, which amount to ca. 10 hours of Dolgan
speech in total.
The INEL Dolgan corpus is composed of texts from different sources:
1. Published folklore texts from an edited volume ("Fol'klor Dolgan",
P.E. Efremov 2000),
2. Transcripts of recordings provided by the Taymyr House of Folk Art
(TDNT) in Dudinka (1970s-2000s),
3. Transcripts from the collection of Dr. Eugénie Stapert recorded on
several fieldwork trips in 2007-2010,
4. Transcripts of recordings made on a fieldwork trip in 2017.
*Accessing the corpus*
An online search interface, similar to those for the Selkup
<https://inel.corpora.uni-hamburg.de/SelkupCorpus/search> and Kamas
<https://inel.corpora.uni-hamburg.de/KamasCorpus/search> corpora, will
be made available in the near future.
The data in the corpora (annotated texts as well as corresponding
metadata) are represented in XML formats of the freely distributed
EXMARaLDA suite (http://exmaralda.org/en/).
User documentation (in English) is available here:
INEL_Dolgan_Corpus.pdf
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/file:dolgan-1.0_INEL_Dolgan_Corpus_1.0_User_Documentation/datastream/PDF/INEL_Dolgan_Corpus.pdf>
For browsing (and playback) of individual texts, use the «Sessions
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/spoken-corpus:dolgan-1.0#corpus-content>»
tab on the main corpus page. Each text can be viewed in one of three
online formats (e.g. Visualizations: Score
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/transcript:dolgan-1.0_AnIM_2009_Argish_nar/datastream/SCORE/AnIM_2009_Argish_nar-score.html>)
and downloaded in EXB (an EXMARaLDA format). The sources of the texts,
i.e. scanned pages (PDF) or sound files (WAV, MP3), can also be viewed
or downloaded.
For searching across the whole corpus, the complete archive of the
corpus files
<https://corpora.uni-hamburg.de/hzsk/de/islandora/object/spoken-corpus:dolgan-1.0#additional-files>
can be downloaded and searched with the EXAKT program of the EXMARaLDA
suite.
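For those who prefer scripting to EXAKT, the downloaded EXB files can also be read with any XML library. The following is only a rough sketch in Python, assuming the usual layout of an EXMARaLDA basic transcription (tier elements holding event elements that point to timeline items). The sample text, the tier categories "tx" and "ge", and the helper names read_tiers and search are made up for illustration and are not taken from the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an EXMARaLDA basic transcription.
# Tier categories and event texts are invented for illustration only.
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.5"/>
      <tli id="T2" time="3.0"/>
    </common-timeline>
    <tier id="tx" speaker="SPK0" category="tx" type="t">
      <event start="T0" end="T1">first word</event>
      <event start="T1" end="T2">second word</event>
    </tier>
    <tier id="ge" speaker="SPK0" category="ge" type="a">
      <event start="T0" end="T1">gloss one</event>
    </tier>
  </basic-body>
</basic-transcription>
"""


def read_tiers(root):
    """Map each tier category to a list of (start, end, text) events."""
    tiers = {}
    for tier in root.iter("tier"):
        events = [
            (ev.get("start"), ev.get("end"), (ev.text or "").strip())
            for ev in tier.iter("event")
        ]
        tiers.setdefault(tier.get("category"), []).extend(events)
    return tiers


def search(tiers, category, needle):
    """Return the events of one tier category whose text contains needle."""
    return [ev for ev in tiers.get(category, []) if needle in ev[2]]


root = ET.fromstring(SAMPLE_EXB)
tiers = read_tiers(root)
hits = search(tiers, "tx", "second")
print(hits)
```

To try this on a real file, replace ET.fromstring(SAMPLE_EXB) with ET.parse(path).getroot(), where path points to one of the downloaded EXB files, and check the user documentation for the tier categories actually used in the corpus.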
Furthermore, in the next few weeks, an online search interface will be
launched, based on the Tsakonian Corpus Platform (Tsakorpus
<https://bitbucket.org/tsakorpus/>).
Please send your comments and suggestions to: inel at uni-hamburg.de.
Best regards,
Alexandre Arkhipov