[Lingtyp] AUTOTYP database freely available

Thu May 25 06:36:04 UTC 2017

Dear all

We are happy to announce the first full release of the AUTOTYP database system. This complements the partial releases that are scattered across the supporting materials and appendices of earlier publications and brings together, for the first time, the entire range of our data in one place.

AUTOTYP differs from traditional databases in that, in most cases, data is entered in a fairly raw format (comparable to individual reference grammar descriptions) and needs to be aggregated and reshaped for most analytical purposes. For example, we don't enter alignment statements ('S=A≠P' etc.) but enter individual case markers with the roles they cover and the conditions under which they occur. Alignment statements are then derived from such data using scripts. The raw data supports many such derivations (apart from alignment statements, one might be interested in whether or not there is a case split, or in how many cases can code the same role, etc.). As a result, AUTOTYP usually contains several alternative aggregations of the same raw data.
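To make the idea of deriving aggregations from raw entries concrete, here is a minimal sketch in Python. It is not one of the AUTOTYP scripts: the record layout, the language names, the marker names and the function names are all hypothetical, and the conditions under which a marker occurs are ignored for simplicity. It only shows how one set of raw marker entries can feed two different aggregations.

```python
from collections import defaultdict

# Hypothetical raw records (not the AUTOTYP format): one entry per case
# marker, listing the roles it covers. Conditions are left out here.
RAW_MARKERS = [
    {"language": "LangA", "marker": "nominative", "roles": {"S", "A"}},
    {"language": "LangA", "marker": "accusative", "roles": {"P"}},
    {"language": "LangB", "marker": "ergative", "roles": {"A"}},
    {"language": "LangB", "marker": "absolutive", "roles": {"S", "P"}},
]


def by_language(records):
    """Group raw marker records by language."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["language"]].append(record)
    return grouped


def markers_per_role(markers):
    """Map each role to the set of markers that can code it."""
    coding = defaultdict(set)
    for entry in markers:
        for role in entry["roles"]:
            coding[role].add(entry["marker"])
    return coding


def alignment(markers):
    """Aggregation 1: derive an alignment statement for one language."""
    coding = markers_per_role(markers)
    s, a, p = coding["S"], coding["A"], coding["P"]
    if s == a == p:
        return "S=A=P"
    if s == a:
        return "S=A≠P"
    if s == p:
        return "S=P≠A"
    if a == p:
        return "A=P≠S"
    return "S≠A≠P"


def case_count_per_role(markers):
    """Aggregation 2: how many cases can code each role."""
    return {role: len(ms) for role, ms in markers_per_role(markers).items()}


statements = {
    lang: alignment(ms) for lang, ms in by_language(RAW_MARKERS).items()
}
counts = {
    lang: case_count_per_role(ms)
    for lang, ms in by_language(RAW_MARKERS).items()
}
```

Both `statements` and `counts` come from the same raw records, which is the sense in which several alternative aggregations can coexist.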

We plan the release in several steps. This first version (version 0) includes all tables that we have already aggregated in earlier research, and the few raw tables that can be used off the shelf. Future releases will include additional aggregations that we will perform and more data that we collect. We also plan a release of the raw data together with scripts for making your own aggregations as well as for exploring and mapping the data. Version numbering follows what is known as "semantic versioning" (https://semver.org/spec/v2.0.0.html).
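For anyone scripting against the releases, the following Python sketch shows how semantic version numbers can be read. The version strings and function names are hypothetical, and pre-release or build suffixes are not handled. Under the semantic versioning specification, major version 0 is reserved for initial development, where anything may change; from major version 1 onwards, a release with the same major number is meant to be backward compatible.

```python
def parse_version(text):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def is_backward_compatible(old, new):
    """True if moving from `old` to `new` should not break existing scripts.

    Assumes plain semantic versioning: same major number, major >= 1,
    and `new` is not older than `old`.
    """
    old_v, new_v = parse_version(old), parse_version(new)
    return new_v >= old_v and old_v[0] == new_v[0] and old_v[0] > 0
```

With this sketch, `is_backward_compatible("0.1.0", "0.2.0")` is false, because compatibility is not guaranteed while the major version is 0.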

Given the discussions on this list last year, it is probably worth reiterating what such a typological database does and what it does not do: each variable (feature, trait, character) captures one extremely specific property of a phenomenon at a time, deliberately leaving out all other properties of the same phenomenon. For example, one of our variables captures whether or not the most agentive argument of a default 2-argument verb receives dependent marking. This says nothing about any other properties of the relevant verb (e.g. its syntactic transitivity) or about the relevant marker (e.g. whether it is an affix or phonologically independent, what other roles the same marker can express, whether it also expresses number or not, etc.). For all these other issues, there are other, equally specific variables.

The releases are available via a GitHub repository (https://github.com/autotyp/autotyp-data). The repository also includes an extensive readme file which describes the design and the content of the database and includes instructions for download, citation etc. as well as procedures for feature requests and error reports. We append below a quick start guide to GitHub.

Enjoy!

    Balthasar Bickel and Johanna Nichols, for the entire AUTOTYP team

Quick guide to GitHub:

 - read the readme: browse the text online (scroll down)
 - download the whole database as a zip file with the green "download" button in the upper right corner. Open individual files in a spreadsheet application (for the .csv files) or a (plain) text editor (for the .yaml and .md files).
 - a single click on any file name (including the readme.md file) lets you view the contents online, in your browser. When you view a file in this way, you can right-click on the "raw" button (in the top right corner) in order to save the file to your computer. (Note that big or binary files cannot be viewed online; in this case, clicking on the file name directs you to a page with a 'download' button.) However, we recommend that you download the entire database instead; this ensures that you won't accidentally mix different versions of the data.
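
For those who prefer a script to a spreadsheet application, the .csv files can also be read with Python's standard csv module. The sketch below is only an illustration: the sample text stands in for a downloaded file, and its column names and values are hypothetical, not taken from the database.

```python
import csv
import io

# Hypothetical stand-in for the contents of one downloaded .csv file.
SAMPLE_CSV = """language,has_case_split
LangA,TRUE
LangB,FALSE
"""


def read_table(handle):
    """Read a CSV file handle into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = read_table(io.StringIO(SAMPLE_CSV))

# Example query: languages where the hypothetical variable is TRUE.
split_languages = [
    row["language"] for row in rows if row["has_case_split"] == "TRUE"
]
```

For a real file, pass `open(path, newline="", encoding="utf-8")` to `read_table` instead of the in-memory sample, where `path` points to a file from the unzipped download.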