[Lingtyp] Requests for comment: Cross-Linguistic Data Formats (CLDF)
Harald Hammarström
harald at bombo.se
Fri Mar 23 13:19:08 UTC 2018
RFC: Cross-Linguistic Data Formats (CLDF), version 1.0
=====================================================
Resulting from discussions over several years, and triggered in
particular by work presented in the two workshops of the "Language
Comparison with Linguistic Databases" series [1,2], we'd like to
request your comments on version 1.0 of CLDF - a specification for
Cross-Linguistic Data Formats (see http://cldf.clld.org).
The specification proposes a standard format for
- wordlists, including cognate judgments and phonetic alignments,
- grammatical structure datasets like WALS features and other typological
surveys.
CLDF is built upon W3C's "Tabular Data and Metadata on the Web"
recommendation [3] and can be thought of as a domain-specific adaptation
of it for linguistics.
Extensibility is built into CLDF, to allow support of evolving
standards for more complex types of linguistic data. As of version
1.0, modules for simple dictionary data and parallel-text corpora are
included for further experimentation.
CLDF datasets can be read and written using the Python programming
library pycldf (https://pypi.python.org/pypi/pycldf), but also using
off-the-shelf tools like spreadsheet software or programming
environments like R, because the data file format in CLDF is based on
comma-separated values (CSV).
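
As a minimal sketch of the CSV-based access mentioned above, the following reads CLDF-style rows with nothing but Python's standard csv module. The sample text is an assumption for illustration: it imitates the shape of a values.csv file and is held in a string so that the example runs on its own. pycldf is not used here.

```python
import csv
import io

# Assumed sample in the shape of a CLDF values.csv file (illustration only).
SAMPLE_VALUES_CSV = """\
ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source
20A-cho,cho,20A,Exclusively concatenative,20A-1,,Turner-and-Turner-1971
20A-jel,jel,20A,Exclusively isolating,20A-2,,Trobs-1998
"""


def read_rows(handle):
    """Return the rows of a CSV file as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAMPLE_VALUES_CSV))
for row in rows:
    print(row["Language_ID"], row["Parameter_ID"], row["Value"])
```

For a file on disk, the same function could be called with a handle from open("values.csv", newline="", encoding="utf-8"), where the file name is whatever the dataset at hand uses.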
The CLDF specification is available at
https://github.com/cldf/cldf/blob/master/README.md
Examples of CLDF datasets and how to access CLDF data are provided at
- https://github.com/cldf/cldf/tree/master/examples and
- https://github.com/cldf/cookbook
We welcome all comments, either posted as a reply to this announcement or as
issues at https://github.com/cldf/cldf/issues
[1] http://www.mpi.nl/events/language-comparison-with-linguistic-databases-reflex-and-typological-databases
[2] http://www.eva.mpg.de/linguistics/conferences/2014-ws-lanclid2/index.html
[3] https://www.w3.org/TR/tabular-data-model/
2018-03-23 12:10 GMT+01:00 Robert Forkel <forkel at shh.mpg.de>:
> Just joined the list, so cannot respond properly to the thread this
> belongs to.
>
> The CLDF specification we've been working on over the last year
> (see http://cldf.clld.org) proposes a standard for the exchange of
> typological datasets (among other types of data), with the explicit
> goal of decoupling software tools (for analysis or visualization) from
> datasets. I see this as a superset of (at least the more technical
> aspects of) reproducibility, because it will allow datasets to be
> investigated with a broader range of tools.
>
> For the case in point, CLDF provides a StructureDataset module [1], which
> may contain a CodeTable [2], which I'd see as the machine-readable version
> of the code-book. As an example, here's what a WALS feature would look like
> as a CLDF StructureDataset (the whole WALS database is available as CLDF
> dataset [3]). After unzipping the WALS data, you'll see a couple of CSV files
> (which can be created with any spreadsheet software). We can look at two of
> these (e.g. using off-the-shelf software like csvkit [4]):
>
> values.csv
>
> $ csvgrep -c Parameter_ID -r "^20A$" values.csv | csvformat -T | head -n 5
> ID       Language_ID  Parameter_ID  Value                      Code_ID  Comment  Source                  Contribution_ID
> 20A-cho  cho          20A           Exclusively concatenative  20A-1             Turner-and-Turner-1971  20
> 20A-jel  jel          20A           Exclusively isolating      20A-2             Trobs-1998              20
> 20A-nah  nah          20A           Exclusively concatenative  20A-1             Kuiper-1962             20
> 20A-wrm  wrm          20A           Exclusively concatenative  20A-1             Donohue-1999b           20
> ...
>
> codes.csv
>
> $ csvgrep -c Parameter_ID -r "^20A$" codes.csv | csvformat -T
> ID     Parameter_ID  Name                       Description                Number
> 20A-1  20A           Exclusively concatenative  Exclusively concatenative  1
> 20A-2  20A           Exclusively isolating      Exclusively isolating      2
> 20A-3  20A           Exclusively tonal          Exclusively tonal          3
> 20A-4  20A           Tonal/isolating            Tonal/isolating            4
> 20A-5  20A           Tonal/concatenative        Tonal/concatenative        5
> 20A-6  20A           Ablaut/concatenative       Ablaut/concatenative       6
> 20A-7  20A           Isolating/concatenative    Isolating/concatenative    7
>
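
The two csvgrep calls above can be approximated in Python. This is only a sketch under stated assumptions: the data is inlined as strings rather than read from the unzipped WALS files, the column set is trimmed, and the row with ID "99Z-cho" is made up purely to show the filter at work. The function name values_with_codes is hypothetical. It selects the values of one parameter and resolves each Code_ID against the code table.

```python
import csv
import io

# Rows copied from the listings above, plus one made-up row (ID "99Z-cho")
# that exists only to show the filter excluding another parameter.
VALUES_CSV = """\
ID,Language_ID,Parameter_ID,Value,Code_ID
20A-cho,cho,20A,Exclusively concatenative,20A-1
20A-jel,jel,20A,Exclusively isolating,20A-2
99Z-cho,cho,99Z,Placeholder,99Z-1
"""

CODES_CSV = """\
ID,Parameter_ID,Name,Description,Number
20A-1,20A,Exclusively concatenative,Exclusively concatenative,1
20A-2,20A,Exclusively isolating,Exclusively isolating,2
"""


def load(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def values_with_codes(values, codes, parameter_id):
    """Keep the values of one parameter and look up each value's code."""
    codes_by_id = {code["ID"]: code for code in codes}
    result = []
    for value in values:
        if value["Parameter_ID"] != parameter_id:
            continue
        code = codes_by_id[value["Code_ID"]]
        result.append((value["Language_ID"], int(code["Number"]), code["Name"]))
    return result


table = values_with_codes(load(VALUES_CSV), load(CODES_CSV), "20A")
for language_id, number, name in table:
    print(language_id, number, name)
```

Because the lookup goes through Code_ID rather than the Value text, a dataset could rename a code in codes.csv without touching values.csv.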
> Now the latter would be a rather minimal code-book. But ideally, the dataset
> would link back to the paper to remain useful even if separated from it.
> This can be done transparently in CLDF by adding a Source [5] column to the
> CodeTable. In this case, this would look as follows:
>
> ID     Parameter_ID  Name                       Description                Number  Source
> 20A-1  20A           Exclusively concatenative  Exclusively concatenative  1       wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]
>
> where the identifier "wals-20" refers to an entry in the dataset's sources
> file [6]:
>
> @incollection{wals-20,
> address = {Leipzig},
> author = {Balthasar Bickel and Johanna Nichols},
> booktitle = {The World Atlas of Language Structures Online},
> editor = {Matthew S. Dryer and Martin Haspelmath},
> publisher = {Max Planck Institute for Evolutionary Anthropology},
> title = {Fusion of Selected Inflectional Formatives},
> url = {http://wals.info/chapter/20},
> year = {2013}
> }
>
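
To illustrate how a Source cell like the one above could be resolved against the sources file, here is a small sketch. It assumes the cell has the shape key[context], with the bracketed part optional. The regular expression is a simplification rather than the specification's own grammar, and the SOURCES dict is a hand-written stand-in for a parsed BibTeX file. Both parse_source and SOURCES are hypothetical names.

```python
import re

# Assumed shape of a Source cell: a key, optionally followed by [context].
SOURCE_REF = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<context>.*)\])?$")

# Hand-written stand-in for the parsed sources file (not a BibTeX parser).
SOURCES = {
    "wals-20": {
        "title": "Fusion of Selected Inflectional Formatives",
        "year": "2013",
    },
}


def parse_source(ref):
    """Split a Source cell into (key, context); context is None if absent."""
    match = SOURCE_REF.match(ref.strip())
    if match is None:
        raise ValueError("not a source reference: %r" % ref)
    return match.group("key"), match.group("context")


cell = "wals-20[http://wals.info/chapter/20#2._Sampling_procedure_and_feature_values]"
key, context = parse_source(cell)
print(key, SOURCES[key]["title"], context)
```

A cell without brackets, such as "Trobs-1998", yields the key alone with None as its context.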
> While this only addresses the technical issues involved in replicability and
> reproducibility, I still think it could go a long way towards establishing
> better integration of datasets into the traditional publication workflow;
> this is mainly because it would allow a set of tools to evolve, which could
> help editors and reviewers to evaluate not only the paper, but also the
> quality of the data (to some extent).
>
>
> [1] https://github.com/cldf/cldf/tree/master/modules/StructureDataset
> [2] https://github.com/cldf/cldf/tree/master/components/codes
> [3] https://cdstar.shh.mpg.de/bitstreams/EAEA0-7269-77E5-3E10-0/wals_dataset.cldf.zip
> [4] https://csvkit.readthedocs.io/en/1.0.3/
> [5] https://github.com/cldf/cldf/blob/master/README.md#sources
> [6] https://github.com/cldf/cldf/blob/master/README.md#sources-reference-file
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> http://listserv.linguistlist.org/mailman/listinfo/lingtyp
>