[Algonquiana] Algonquian datasets for DoReCo ?

MONICA MACAULAY mmacaula at wisc.edu
Sun Mar 31 20:12:09 UTC 2019


Hello Algonquiana-ites. I'm forwarding this for a colleague. Please respond directly to him, not to me.


________________________________
From: Ludger Paschen <paschen at leibniz-zas.de>
Sent: Sunday, March 31, 2019 3:08 PM
Subject: AW: Algonquin datasets for DoReCo ?


The DoReCo project is looking to connect with field workers and linguists who have created speech corpora of Algonquian languages, or who could point us to relevant datasets that have already been made available for scientific use.

DoReCo (Language Documentation Reference Corpus) aims at creating a cross-linguistic dataset consisting of speech corpora from at least 50 typologically diverse languages that will be used for research on phonetics and information rate, and is intended to become a powerful resource for further research. A short description of the project can be found at http://www.zas.gwz-berlin.de/index.php?id=3192&L=1 .

Below is a summary of what we are looking for and what we offer in return. Please do not hesitate to get in touch if you have any questions.

Many thanks,
Frank and Ludger
(frank.seifart at berlin.de, paschen at leibniz-zas.de)

#######

We are looking for datasets that meet the following criteria:

1) a minimum of 10,000 transcribed words (typically distributed over various recording sessions/annotation files)

2) translation into a major language

3) primarily monological texts (e.g., personal or traditional narratives)

4) time-alignment of transcription and translation with audio files at the level of sentences, paragraphs, utterances, or intonation units (i.e., "annotation units" in ELAN, time stamps in Toolbox records)

5) audio is of reasonable quality (not too much overlapping speech or background noise)

6) transcription/translation/annotation files (not audio/video files) can be made accessible within three years on the DoReCo platform under a Creative Commons Attribution 4.0 (CC BY 4.0) license, with strict rules for fair scientific use (see below)

Please indicate if your data also include at least 10,000 words that are additionally morphologically annotated (using Toolbox/Shoebox/ELAN/...) with (i) morpheme segmentation, (ii) morpheme glosses, and (optionally) (iii) part-of-speech tags.
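As a rough way to check the 10,000-word minimum in criterion 1 against ELAN files, the sketch below counts whitespace-separated tokens on one transcription tier of an .eaf document. The tier name "tx", the sample content, and the function names are illustrative assumptions, not DoReCo conventions; what counts as a "word" will depend on the corpus.

```python
import xml.etree.ElementTree as ET

MIN_WORDS = 10_000  # threshold named in criterion 1


def count_words(eaf_xml: str, tier_id: str) -> int:
    """Count whitespace-separated tokens on one tier of an ELAN .eaf document."""
    root = ET.fromstring(eaf_xml)
    total = 0
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            total += len((value.text or "").split())
    return total


def meets_minimum(word_counts: list[int]) -> bool:
    """Sum per-file counts and compare the total with the threshold."""
    return sum(word_counts) >= MIN_WORDS


# Tiny hypothetical .eaf fragment: "tx" = transcription, "ft" = translation.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>one two three</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>four five</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>a free translation</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

if __name__ == "__main__":
    n = count_words(SAMPLE, "tx")
    print(n, meets_minimum([n]))
```

For a real corpus, the per-file counts from all recording sessions would be collected and passed to meets_minimum together.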

We additionally ask you for the following in the course of the project:

1) A chart specifying correspondences between the orthographic characters used in the transcription and IPA symbols

2) Answering our questions regarding e.g. inconsistencies between the audio and the transcription (e.g. transcribing/glossing elements that are not transcribed)

3) Providing basic metadata per recording session if not already available (e.g. anonymized speaker codes, speaker sex and approximate age)
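The correspondence chart requested in item 1 can be thought of as a simple lookup table. The sketch below applies such a chart to a transcribed word with greedy longest-match, so that digraphs are tried before single letters. The chart entries and the sample word are invented placeholders, not taken from any particular orthography.

```python
# Hypothetical orthography-to-IPA chart; real charts are language-specific.
CHART = {
    "sh": "ʃ",
    "ch": "tʃ",
    "aa": "aː",
    "a": "a",
    "s": "s",
    "k": "k",
    "h": "h",
}


def to_ipa(word: str, chart: dict[str, str]) -> list[str]:
    """Convert an orthographic word to a list of IPA symbols.

    Longer orthographic units are tried first, so "sh" wins over "s" + "h".
    Raises ValueError when a character has no entry in the chart.
    """
    keys = sorted(chart, key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                symbols.append(chart[key])
                i += len(key)
                break
        else:
            raise ValueError(f"no chart entry for {word[i]!r} in {word!r}")
    return symbols


if __name__ == "__main__":
    print(to_ipa("shaak", CHART))
```

Raising an error on unmapped characters is a deliberate choice here: it surfaces gaps in the chart early, before any phoneme-level alignment is attempted.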

We offer in return:

1) refined annotation of your data regarding:
     (i) time-alignment at the phoneme level (i.e. start and end times for each phoneme and word), using the MAUS aligner of our project partners in Munich (https://www.bas.uni-muenchen.de/Bas/BasMAUS.html), with subsequent manual correction by DoReCo assistants
     (ii) refined annotation regarding consistency of audio-transcription correspondence by DoReCo assistants including annotation of missing elements (e.g. filled pauses, false starts, repetitions)
     (iii) if necessary, consistent labeling of tiers for annotation, transcription, etc.

3) if necessary, (limited) funding to reimburse your (or your assistant's) work for resolving inconsistencies between the audio and the transcription that cannot be resolved by our assistants (in case these are extensive)

4) if necessary, (limited) funding to attend DoReCo workshops in Berlin or Lyon or adjacent to conferences

5) The presentation of your data on the DoReCo platform, together with that of ca. 50 other languages, on the model of other existing resources in the CLLD format (e.g., http://wals.info/languoid, http://glottolog.org/glottolog/language). Columns in this table will include (i) language name, (ii) Glottocode, (iii) corpus creator names, (iv) the number of transcribed words, (v) +/- morphologically annotated, and (vi) a link to the node in the repository (e.g., TLA) from which the data are taken and where the audio files are archived. In this table, files can be selected for download in various formats (TEI, EAF, CSV, NXT). Before data can actually be downloaded, the website will automatically create a ready-to-use citation of the dataset selected for download. This citation can be copied from the website and will also be included in a generated read-me file inside the zip file containing the downloaded corpus. The citation will be given in various common formats (Unified style sheet for linguistics journals, BibTeX, etc.). The author names and their order will be determined by the amount of data downloaded for the respective languages.




