[Lingtyp] Empirical standards in typology: incentives
Robert Forkel
forkel at shh.mpg.de
Fri Mar 23 11:10:50 UTC 2018
I have just joined the list, so I cannot reply directly within the thread
this message belongs to.
The CLDF specification we've been working on over the last year
(see http://cldf.clld.org) proposes a standard for the exchange of
typological datasets (among other types of data), with the explicit
goal of decoupling software tools (for analysis or visualization) from
datasets. I see this as a superset of (at least the more technical
aspects of) reproducibility, because it will allow datasets to be
investigated with a broader range of tools.
For the case in point, CLDF provides a StructureDataset module [1], which
may contain a CodeTable [2], which I'd see as the machine-readable version
of the code-book. As an example, here's what a WALS feature would look like
as a CLDF StructureDataset (the whole WALS database is available as CLDF
dataset [3]). After unzipping the WALS data, you'll see a couple of CSV files
(which can be created with any spreadsheet software). We can look at two of
these (e.g. using off-the-shelf software like csvkit [4]):
values.csv
$ csvgrep -c Parameter_ID -r "^20A$" values.csv | csvformat -T | head -n 5
ID       Language_ID  Parameter_ID  Value                      Code_ID  Comment  Source                  Contribution_ID
20A-cho  cho          20A           Exclusively concatenative  20A-1             Turner-and-Turner-1971  20
20A-jel  jel          20A           Exclusively isolating      20A-2             Trobs-1998              20
20A-nah  nah          20A           Exclusively concatenative  20A-1             Kuiper-1962             20
20A-wrm  wrm          20A           Exclusively concatenative  20A-1             Donohue-1999b           20
...
codes.csv
$ csvgrep -c Parameter_ID -r "^20A$" codes.csv | csvformat -T
ID Parameter_ID Name Description Number
20A-1 20A Exclusively concatenative Exclusively concatenative 1
20A-2 20A Exclusively isolating Exclusively isolating 2
20A-3 20A Exclusively tonal Exclusively tonal 3
20A-4 20A Tonal/isolating Tonal/isolating 4
20A-5 20A Tonal/concatenative Tonal/concatenative 5
20A-6 20A Ablaut/concatenative Ablaut/concatenative 6
20A-7 20A Isolating/concatenative Isolating/concatenative 7
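
To illustrate how the two tables fit together, here is a minimal Python
sketch that resolves each value's Code_ID against the code table. It uses
only the standard library and a small excerpt copied from the listings above;
the helper names read_rows and resolve_codes are made up for this example and
are not part of CLDF or csvkit.

```python
import csv
import io

# Excerpts of values.csv and codes.csv, copied from the listings above.
# The Comment column is empty, as in the original output.
VALUES_CSV = """ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Contribution_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971,20
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998,20
"""
CODES_CSV = """ID,Parameter_ID,Name,Description,Number
20A-1,20A,Exclusively concatenative,Exclusively concatenative,1
20A-2,20A,Exclusively isolating,Exclusively isolating,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def resolve_codes(values, codes, parameter_id):
    """Yield (Language_ID, code Number, code Name) for one parameter."""
    codes_by_id = {row["ID"]: row for row in codes}
    for row in values:
        if row["Parameter_ID"] != parameter_id:
            continue
        code = codes_by_id[row["Code_ID"]]
        yield row["Language_ID"], int(code["Number"]), code["Name"]


resolved = list(
    resolve_codes(read_rows(VALUES_CSV), read_rows(CODES_CSV), "20A")
)
for language_id, number, name in resolved:
    print(language_id, number, name)
```

With the unzipped dataset, the same functions should work on rows read from
the actual files, e.g. via csv.DictReader(open("values.csv", newline="",
encoding="utf-8")), assuming the column names shown above.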
Now the latter would be a rather minimal code-book. But ideally, the dataset
would link back to the paper, so that it remains useful even when separated
from it. This can be done transparently in CLDF by adding a Source [5] column
to the CodeTable. In this case, this would look as follows:
ID     Parameter_ID  Name                       Description                Number  Source
20A-1  20A           Exclusively concatenative  Exclusively concatenative  1       wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]
where the identifier "wals-20" refers to an entry in the dataset's
sources file [6]:
@incollection{wals-20,
address = {Leipzig},
author = {Balthasar Bickel and Johanna Nichols},
booktitle = {The World Atlas of Language Structures Online},
editor = {Matthew S. Dryer and Martin Haspelmath},
publisher = {Max Planck Institute for Evolutionary Anthropology},
title = {Fusion of Selected Inflectional Formatives},
url = {http://wals.info/chapter/20},
year = {2013}
}
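
As a rough sketch of how a tool might consume such a Source value, the
following Python snippet splits a reference of the form key[context] into the
BibTeX key and the optional bracketed part. The regular expression and the
function name split_source_ref are assumptions made for this example, based
only on the value shown above; they are not taken from the CLDF
specification or any CLDF library.

```python
import re

# Assumed shape: a key without brackets, optionally followed by "[...]".
SOURCE_REF = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<context>.*)\])?$")


def split_source_ref(ref):
    """Split 'key[context]' into (key, context); context may be None."""
    match = SOURCE_REF.match(ref.strip())
    if match is None:
        raise ValueError(f"unrecognised source reference: {ref!r}")
    return match.group("key"), match.group("context")


ref = "wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]"
key, context = split_source_ref(ref)
print(key)      # BibTeX key to look up in the sources file
print(context)  # here: a URL pointing into the chapter
```

The key can then be matched against the entry identifiers in the sources
file, such as the @incollection{wals-20, ...} record above.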
While this only addresses the technical issues involved in replicability and
reproducibility, I still think it could go a long way towards establishing
better integration of datasets into the traditional publication workflow; this
is mainly because it would allow a set of tools to evolve, which could help
editors and reviewers to evaluate not only the paper, but also the quality of
the data (to some extent).
[1] https://github.com/cldf/cldf/tree/master/modules/StructureDataset
[2] https://github.com/cldf/cldf/tree/master/components/codes
[3] https://cdstar.shh.mpg.de/bitstreams/EAEA0-7269-77E5-3E10-0/wals_dataset.cldf.zip
[4] https://csvkit.readthedocs.io/en/1.0.3/
[5] https://github.com/cldf/cldf/blob/master/README.md#sources
[6] https://github.com/cldf/cldf/blob/master/README.md#sources-reference-file