[Lingtyp] Empirical standards in typology: incentives

Fri Mar 23 11:10:50 UTC 2018

I just joined the list, so I cannot reply properly within the thread this
belongs to.

The CLDF specification we've been working on over the last year
(see http://cldf.clld.org) proposes a standard for the exchange of
typological datasets (among other types of data), with the explicit
goal of decoupling software tools (for analysis or visualization) from
datasets. I see this as a superset of (at least the more technical
aspects of) reproducibility, because it allows datasets to be investigated
with a broader range of tools.

For the case in point, CLDF provides a StructureDataset module [1], which
may contain a CodeTable [2], which I'd see as the machine-readable version
of the code-book. As an example, here's what a WALS feature would look like
as a CLDF StructureDataset (the whole WALS database is available as a CLDF
dataset [3]). After unzipping the WALS data, you'll see a couple of CSV files
(which can be created with any spreadsheet software). We can look at two of
these (e.g. using off-the-shelf software like csvkit [4]):

values.csv

$ csvgrep -c Parameter_ID -r "^20A$" values.csv | csvformat -T | head -n 5
ID    Language_ID    Parameter_ID    Value    Code_ID    Comment    Source    Contribution_ID
20A-cho    cho    20A    Exclusively concatenative    20A-1        Turner-and-Turner-1971    20
20A-jel    jel    20A    Exclusively isolating    20A-2        Trobs-1998    20
20A-nah    nah    20A    Exclusively concatenative    20A-1        Kuiper-1962    20
20A-wrm    wrm    20A    Exclusively concatenative    20A-1        Donohue-1999b    20
...

codes.csv

$ csvgrep -c Parameter_ID -r "^20A$" codes.csv | csvformat -T
ID    Parameter_ID    Name    Description    Number
20A-1    20A    Exclusively concatenative    Exclusively concatenative    1
20A-2    20A    Exclusively isolating    Exclusively isolating    2
20A-3    20A    Exclusively tonal    Exclusively tonal    3
20A-4    20A    Tonal/isolating    Tonal/isolating    4
20A-5    20A    Tonal/concatenative    Tonal/concatenative    5
20A-6    20A    Ablaut/concatenative    Ablaut/concatenative    6
20A-7    20A    Isolating/concatenative    Isolating/concatenative    7
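
To show how the two files fit together, here is a minimal Python sketch
(standard library only) that uses codes.csv as the code-book for values.csv.
The inline samples are copied from the rows shown above, written with commas
on the assumption that the files on disk are ordinary comma-separated CSV.
The helper name describe_values is made up for illustration. To read the
real files you would pass open file objects instead of the StringIO
wrappers.

```python
import csv
import io

# Sample rows copied from the csvkit output above. Comma-separated here on
# the assumption that this is how the files look on disk; the Comment
# column is empty in all four rows.
VALUES_CSV = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Contribution_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971,20
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998,20
20A-nah,nah,20A,Exclusively concatenative,20A-1,,Kuiper-1962,20
20A-wrm,wrm,20A,Exclusively concatenative,20A-1,,Donohue-1999b,20
"""

CODES_CSV = """\
ID,Parameter_ID,Name,Description,Number
20A-1,20A,Exclusively concatenative,Exclusively concatenative,1
20A-2,20A,Exclusively isolating,Exclusively isolating,2
"""


def describe_values(values_file, codes_file, parameter_id):
    """Return (Language_ID, code name) pairs for one parameter.

    Hypothetical helper: it looks up each value's Code_ID in the code table.
    """
    codes = {row["ID"]: row for row in csv.DictReader(codes_file)}
    result = []
    for row in csv.DictReader(values_file):
        if row["Parameter_ID"] == parameter_id:
            code = codes[row["Code_ID"]]
            result.append((row["Language_ID"], code["Name"]))
    return result


pairs = describe_values(io.StringIO(VALUES_CSV), io.StringIO(CODES_CSV), "20A")
for language_id, name in pairs:
    print(language_id, name, sep="\t")
```

The point of the design is that the script only relies on column names
(Code_ID, Parameter_ID, Name), not on anything specific to WALS.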

Now the latter would be a rather minimal code-book. Ideally, though, the
dataset would link back to the paper, so that it remains useful even when
separated from the paper. This can be done transparently in CLDF by adding a
Source [5] column to the CodeTable. In this case, this would look as follows:

ID    Parameter_ID    Name    Description    Number    Source
20A-1    20A    Exclusively concatenative    Exclusively concatenative    1    wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]

where the identifier "wals-20" refers to an entry in the dataset's sources
file [6]:

@incollection{wals-20,
   address   = {Leipzig},
   author    = {Balthasar Bickel and Johanna Nichols},
   booktitle = {The World Atlas of Language Structures Online},
   editor    = {Matthew S. Dryer and Martin Haspelmath},
   publisher = {Max Planck Institute for Evolutionary Anthropology},
   title     = {Fusion of Selected Inflectional Formatives},
   url       = {http://wals.info/chapter/20},
   year      = {2013}
}
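
A tool consuming the dataset would need to split such a Source value into
the identifier and the bracketed part, and then check the identifier against
the sources file. Here is a minimal Python sketch, assuming references have
the shape key[context] as in the example above. The function names are mine
and not part of any CLDF tooling, and the BibTeX handling is a crude regular
expression rather than a real parser.

```python
import re

# A reference is a source key, optionally followed by a bracketed context
# (assumed shape: key[context], as in the Source value shown above).
REFERENCE = re.compile(r"^(?P<key>[^\[\]]+?)(?:\[(?P<context>.*)\])?$")


def split_reference(ref):
    """Split 'key[context]' into (key, context); context is None if absent."""
    match = REFERENCE.match(ref.strip())
    if match is None:
        raise ValueError("not a recognisable source reference: %r" % ref)
    return match.group("key"), match.group("context")


def bibtex_keys(bibtex_text):
    """Collect citation keys from entry headers such as '@incollection{wals-20,'."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bibtex_text))


# Abbreviated copy of the BibTeX entry shown above.
SOURCES_BIB = """\
@incollection{wals-20,
   author    = {Balthasar Bickel and Johanna Nichols},
   title     = {Fusion of Selected Inflectional Formatives},
   url       = {http://wals.info/chapter/20},
   year      = {2013}
}
"""

source = "wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]"
key, context = split_reference(source)
print(key)
print(context)
print(key in bibtex_keys(SOURCES_BIB))
```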

While this only addresses the technical issues involved in replicability and
reproducibility, I still think it could go a long way towards establishing
better integration of datasets into the traditional publication workflow;
this is mainly because it would allow a set of tools to evolve, which could
help editors and reviewers to evaluate not only the paper, but also the
quality of the data (to some extent).
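
As one illustration of the kind of check such a tool could run, here is a
small Python sketch that reports values whose Code_ID is missing from the
code table or belongs to a different parameter. It is only a sketch of the
idea, not an existing CLDF validator: the function name check_values is
hypothetical, and the last row of the sample data (language "xxx", code
"20A-9") is deliberately made up to trigger a report.

```python
import csv
import io

VALUES_CSV = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Contribution_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971,20
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998,20
20A-xxx,xxx,20A,Exclusively tonal,20A-9,,,20
"""  # the 20A-xxx row is invented for this example and is not WALS data

CODES_CSV = """\
ID,Parameter_ID,Name,Description,Number
20A-1,20A,Exclusively concatenative,Exclusively concatenative,1
20A-2,20A,Exclusively isolating,Exclusively isolating,2
"""


def check_values(values_file, codes_file):
    """Return a list of human-readable problems found in the value table."""
    codes = {row["ID"]: row for row in csv.DictReader(codes_file)}
    problems = []
    for row in csv.DictReader(values_file):
        code = codes.get(row["Code_ID"])
        if code is None:
            # The value points at a code the code-book does not define.
            problems.append(
                f"{row['ID']}: unknown Code_ID {row['Code_ID']!r}")
        elif code["Parameter_ID"] != row["Parameter_ID"]:
            # The code exists but was defined for another parameter.
            problems.append(
                f"{row['ID']}: code {code['ID']} belongs to parameter "
                f"{code['Parameter_ID']}")
    return problems


problems = check_values(io.StringIO(VALUES_CSV), io.StringIO(CODES_CSV))
for problem in problems:
    print(problem)
```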

[1] https://github.com/cldf/cldf/tree/master/modules/StructureDataset
[2] https://github.com/cldf/cldf/tree/master/components/codes
[3] https://cdstar.shh.mpg.de/bitstreams/EAEA0-7269-77E5-3E10-0/wals_dataset.cldf.zip
[4] https://csvkit.readthedocs.io/en/1.0.3/
[5] https://github.com/cldf/cldf/blob/master/README.md#sources
[6] https://github.com/cldf/cldf/blob/master/README.md#sources-reference-file