[Lingtyp] Requests for comment: Cross-Linguistic Data Formats (CLDF)

Harald Hammarström harald at bombo.se
Fri Mar 23 13:19:08 UTC 2018


RFC: Cross-Linguistic Data Formats (CLDF), version 1.0
=====================================================

Resulting from discussions over several years, and triggered in
particular by work presented in the two workshops of the "Language
Comparison with Linguistic Databases" series [1,2], we'd like to
request your comments on version 1.0 of CLDF - a specification for
Cross-Linguistic Data Formats (see http://cldf.clld.org).

The specification proposes a standard format for
- wordlists, including cognate judgments and phonetic alignments,
- grammatical structure datasets like WALS features and other typological
  surveys.

CLDF is built upon W3C's "Tabular Data and Metadata on the Web"
recommendation [3] and can be thought of as a domain-specific adaptation
of it for linguistics.

Extensibility is built into CLDF, to allow support of evolving
standards for more complex types of linguistic data. As of version
1.0, modules for simple dictionary data and parallel-text corpora are
included for further experimentation.

CLDF datasets can be read and written with the Python library pycldf
(https://pypi.python.org/pypi/pycldf), but also with off-the-shelf
tools such as spreadsheet software or programming environments like R,
because the data file format in CLDF is based on comma-separated
values (CSV).
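
Because the data files are plain CSV, a minimal sketch of reading one
without pycldf, using only Python's standard library, might look as
follows. The rows are copied from the WALS excerpt quoted further down
(with the Comment column assumed to be empty); the helper name
rows_where is made up for this illustration, and a real dataset would be
read from its values.csv file rather than from an in-memory string.

```python
import csv
import io

# Rows in the shape of a CLDF values table, copied from the WALS excerpt
# in this message. Assumption: the Comment column is empty in these rows.
SAMPLE_VALUES = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Contribution_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971,20
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998,20
20A-nah,nah,20A,Exclusively concatenative,20A-1,,Kuiper-1962,20
20A-wrm,wrm,20A,Exclusively concatenative,20A-1,,Donohue-1999b,20
"""


def rows_where(handle, column, value):
    """Return all rows (as dicts) whose `column` equals `value`.

    Hypothetical helper for illustration; not part of pycldf.
    """
    return [row for row in csv.DictReader(handle) if row[column] == value]


# With a real dataset, open("values.csv", newline="", encoding="utf-8")
# would take the place of io.StringIO(SAMPLE_VALUES).
concatenative = rows_where(io.StringIO(SAMPLE_VALUES), "Code_ID", "20A-1")
print([row["Language_ID"] for row in concatenative])  # ['cho', 'nah', 'wrm']
```

The same filter could be written in R or done by hand in a spreadsheet,
which is the point of building on CSV.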

The CLDF specification is available at
https://github.com/cldf/cldf/blob/master/README.md

Examples of CLDF datasets and how to access CLDF data are provided at
- https://github.com/cldf/cldf/tree/master/examples and
- https://github.com/cldf/cookbook

We welcome all comments, either posted as a reply to this announcement or
as issues at https://github.com/cldf/cldf/issues


[1] http://www.mpi.nl/events/language-comparison-with-linguistic-databases-reflex-and-typological-databases
[2] http://www.eva.mpg.de/linguistics/conferences/2014-ws-lanclid2/index.html
[3] https://www.w3.org/TR/tabular-data-model/


2018-03-23 12:10 GMT+01:00 Robert Forkel <forkel at shh.mpg.de>:

> Just joined the list, so cannot respond properly to the thread this
> belongs to.
>
> The CLDF specification we've been working on over the last year
> (see http://cldf.clld.org) proposes a standard for the exchange of
> typological datasets (among other types of data), with the explicit
> goal of decoupling software tools (for analysis or visualization) from
> datasets. I see this as a superset of (at least the more technical
> aspects of) reproducibility, because it will allow datasets to be
> investigated with a broader range of tools.
>
> For the case in point, CLDF provides a StructureDataset module [1], which
> may contain a CodeTable [2], which I'd see as the machine-readable version
> of the code-book. As an example, here's what a WALS feature would look like
> as a CLDF StructureDataset (the whole WALS database is available as a CLDF
> dataset [3]). After unzipping the WALS data, you'll see a couple of CSV
> files (which can be created with any spreadsheet software). We can look at
> two of these (e.g. using off-the-shelf software like csvkit [4]):
>
> values.csv
>
> $ csvgrep -c Parameter_ID -r "^20A$" values.csv | csvformat -T | head -n 5
> ID       Language_ID  Parameter_ID  Value                      Code_ID  Comment  Source                  Contribution_ID
> 20A-cho  cho          20A           Exclusively concatenative  20A-1             Turner-and-Turner-1971  20
> 20A-jel  jel          20A           Exclusively isolating      20A-2             Trobs-1998              20
> 20A-nah  nah          20A           Exclusively concatenative  20A-1             Kuiper-1962             20
> 20A-wrm  wrm          20A           Exclusively concatenative  20A-1             Donohue-1999b           20
> ...
>
> codes.csv
>
> $ csvgrep -c Parameter_ID -r "^20A$" codes.csv | csvformat -T
> ID    Parameter_ID    Name    Description    Number
> 20A-1    20A    Exclusively concatenative    Exclusively concatenative    1
> 20A-2    20A    Exclusively isolating    Exclusively isolating    2
> 20A-3    20A    Exclusively tonal    Exclusively tonal    3
> 20A-4    20A    Tonal/isolating    Tonal/isolating    4
> 20A-5    20A    Tonal/concatenative    Tonal/concatenative    5
> 20A-6    20A    Ablaut/concatenative    Ablaut/concatenative    6
> 20A-7    20A    Isolating/concatenative    Isolating/concatenative    7
>
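
To illustrate how the two tables quoted above relate: each Code_ID in
values.csv points at an ID in codes.csv, so the code-book entry for a
value can be looked up with a simple dictionary. The following is only a
sketch in plain Python, with rows copied from the excerpts above (the
Comment column is assumed to be empty); the function names index_by_id
and describe are hypothetical and not part of pycldf.

```python
import csv
import io

# Two rows each from the codes.csv and values.csv excerpts quoted above.
CODES = """\
ID,Parameter_ID,Name,Description,Number
20A-1,20A,Exclusively concatenative,Exclusively concatenative,1
20A-2,20A,Exclusively isolating,Exclusively isolating,2
"""

VALUES = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Contribution_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971,20
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998,20
"""


def index_by_id(text):
    """Map each row's ID column to the full row (hypothetical helper)."""
    return {row["ID"]: row for row in csv.DictReader(io.StringIO(text))}


def describe(values_text, codes_text):
    """Pair every value row with the code-book entry its Code_ID names."""
    codes = index_by_id(codes_text)
    described = []
    for row in csv.DictReader(io.StringIO(values_text)):
        code = codes[row["Code_ID"]]  # KeyError signals a dangling Code_ID
        described.append((row["Language_ID"], int(code["Number"]), code["Name"]))
    return described


for language, number, name in describe(VALUES, CODES):
    print(language, number, name)
```

A dangling Code_ID raises a KeyError here, which is a crude version of
the kind of consistency check a shared format makes possible.
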
> Now the codes.csv table on its own would be a rather minimal code-book.
> But ideally, the dataset would link back to the paper to remain useful
> even if separated from the paper. This can be done transparently in CLDF
> by adding a Source [5] column to the CodeTable. In this case, this would
> look as follows:
>
> ID    Parameter_ID    Name    Description    Number    Source
> 20A-1    20A    Exclusively concatenative    Exclusively concatenative    1    wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]
>
> where the identifier "wals-20" refers to an entry in the dataset's sources
> file [6]:
>
> @incollection{wals-20,
>   address   = {Leipzig},
>   author    = {Balthasar Bickel and Johanna Nichols},
>   booktitle = {The World Atlas of Language Structures Online},
>   editor    = {Matthew S. Dryer and Martin Haspelmath},
>   publisher = {Max Planck Institute for Evolutionary Anthropology},
>   title     = {Fusion of Selected Inflectional Formatives},
>   url       = {http://wals.info/chapter/20},
>   year      = {2013}
> }
>
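
As a rough sketch of how a tool might take such a Source value apart,
the following splits the source key from the optional bracketed context.
The regular expression and the name split_source_reference are
assumptions made for this illustration; the specification [5] remains
the authority on the exact syntax.

```python
import re

# Assumed shape of a reference: a key, optionally followed by context in
# square brackets, e.g. "wals-20[...]" or just "Trobs-1998".
_REFERENCE = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<context>.*)\])?$")


def split_source_reference(reference):
    """Return (key, context); context is None when no brackets are given.

    Hypothetical helper for illustration; not part of pycldf.
    """
    match = _REFERENCE.match(reference.strip())
    if match is None:
        raise ValueError(f"unrecognised source reference: {reference!r}")
    return match.group("key"), match.group("context")


key, context = split_source_reference(
    "wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]"
)
print(key)      # wals-20
print(context)  # http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values
```

The key is what gets looked up in the sources file, here the BibTeX
entry wals-20 quoted above.
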
> While this only addresses the technical issues involved in replicability
> and reproducibility, I still think it could go a long way towards
> establishing better integration of datasets into the traditional
> publication workflow; this is mainly because it would allow a set of tools
> to evolve, which could help editors and reviewers to evaluate not only the
> paper, but also the quality of the data (to some extent).
>
>
> [1] https://github.com/cldf/cldf/tree/master/modules/StructureDataset
> [2] https://github.com/cldf/cldf/tree/master/components/codes
> [3] https://cdstar.shh.mpg.de/bitstreams/EAEA0-7269-77E5-3E10-0/wals_dataset.cldf.zip
> [4] https://csvkit.readthedocs.io/en/1.0.3/
> [5] https://github.com/cldf/cldf/blob/master/README.md#sources
> [6] https://github.com/cldf/cldf/blob/master/README.md#sources-reference-file
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> http://listserv.linguistlist.org/mailman/listinfo/lingtyp
>