publishing fieldwork data

Bakker, D. D.Bakker at UVA.NL
Wed Apr 18 22:13:40 UTC 2007

Dear all,


I think that a distinction should indeed be made between typological databases, (mainly) containing analytical data in matrix form, and corpora, (mainly) containing primary language data, possibly in phonological representation, with glosses etc. This distinction should not be made because the first type is open-ended and the second fixed, in David's terms: both may be finished and frozen at some stage (end of project or publication of a book), or be updated and corrected endlessly. The owner(s) may nevertheless decide in all these cases that the database as it is, or a subset of it, may be interesting for others to explore. What is fundamentally different, however, is the way they are compiled (from descriptions or from fieldwork), the way to document them, the tools necessary to explore them, the criteria to assess their quality, and probably also the group of linguists interested in their use and the type of research they serve.

Although at the end of the day all sources of knowledge about language(s) are indispensable, and should be explored in conjunction, it may nevertheless be better to first solve the points that Martin suggested for one of them. In the case of typological databases, the type I am most familiar with, quite some activity has been seen over the last decade, culminating (among others) in the TDS system mentioned in the original posting. Such web-based initiatives mainly concentrate on accessibility (a central homepage for a number of databases), standardization (a common interface with search engine) and cross-searchability (a meta-database with an added, bottom-up derived ontology). What remains are recognition and citability, the most important incentives for linguists to make their data available to the community in the first place.
These are probably also the major reason why linguists would make the extra effort of extensive documentation, reference to data sources, foolproof interfaces, etc., that makes other people's databases usable at all. One may think of an e-journal as a platform, but since print journals still exist, and have prestige, one could also think of one of the established p-journals with a more than average interest in language typology (e.g. Linguistic Typology; Folia Linguistica; STUF; ...). One could think of a special section 'Notes on Typological Databases', not unlike the review section, where specialized editors discuss newly available databases, or fundamentally extended versions of existing ones. Only those databases should be discussed which are available online and which conform to a set of predefined standards, to be established by the journal in question. Those standards may pertain to three areas:


a. Database design

b. Database contents

c. Interface and custom-built analysis tools


Apart from the general characteristics of the database under discussion, and possibly scores on several of its aspects, a database note would mention the names, affiliations and specific contributions of the authors of the database, thus helping them to get (some) credit for their work. One could think of the following points of attention for a database note.


0. General characteristics of the database

0.0 web address

0.1 short domain description

0.2 number of languages

0.3 number of variables

0.4 total number of datapoints (N.B. completeness)

0.5 version number / last update


1. Database design

1.0 name and affiliation of designer(s)

1.1 type of database

1.2 is there a separate level for definitions of the database contents?

1.3 is the structure conducive to storage, retrieval and analysis of the type of data concerned?

1.4 is there a (structured) questionnaire on the domain available?

1.5 is the data set based on a representative sample?

1.6 if not, is there a controlled subsample?

1.7 what types of analytical variables are present?

1.8 are multiple values for variables allowed?

1.9 are multiple values scaled (e.g. for markedness)?

1.10 is there any primary data involved?

1.11 what kind of sources have been used (grammars/questionnaire/consultants/...)?

1.12 is there a link between content and data source?


2. Linguistic contents

2.0 name and affiliation of linguist(s) who provided the data

2.1 is there a clear definition of each variable?

2.2 is it clear for every datapoint what its source is?

2.3 is there an indication of the reliability of the data?

2.4 are analytical data points linked with examples?

2.5 are examples glossed and translated?

2.6 what type of glossing system is used (e.g. from source; unified)?


3. Interface/Software

3.0 name and affiliation of the software engineer(s)

3.1 is it embedded in a meta-database system, such as TDS?

3.2 what is the query language?

3.3 how general/user-friendly is the interface?

3.4 is there a database-independent set of linguistic terms for querying ('ontology')?

3.5 how good is the data presentation module?

3.6 is it possible to change/add data?

3.7 is there an option for data analysis?

3.8 are other software tools for data manipulation integrated in the system?

3.9 is there an export function?

3.10 if so, what formats are supported?
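
To make the general characteristics under point 0 a bit more concrete, here is a minimal sketch in Python of how a database note might record them. All names in it (DatabaseNote, completeness, the field names, the example address) are hypothetical and do not belong to any existing system. Reading 'completeness' in 0.4 as the share of filled language-by-variable cells is an assumption on my part, and the figures in the example are invented.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseNote:
    """Hypothetical record for points 0.0-0.5 of a database note."""
    web_address: str      # 0.0
    domain: str           # 0.1 short domain description
    n_languages: int      # 0.2
    n_variables: int      # 0.3
    n_datapoints: int     # 0.4
    version: str          # 0.5 version number / last update
    # Credits per area (design, contents, software): area -> list of names.
    credits: dict[str, list[str]] = field(default_factory=dict)

    def completeness(self) -> float:
        """Filled datapoints divided by languages x variables (assumed reading of 0.4)."""
        cells = self.n_languages * self.n_variables
        if cells == 0:
            return 0.0
        return self.n_datapoints / cells


if __name__ == "__main__":
    # Invented example values, for illustration only.
    note = DatabaseNote(
        web_address="http://example.org/db",
        domain="word order",
        n_languages=200,
        n_variables=50,
        n_datapoints=8000,
        version="1.0",
        credits={"design": ["A. Designer"], "contents": ["B. Linguist"]},
    )
    print(f"completeness: {note.completeness():.0%}")
```

Note that if multiple values per variable are allowed (point 1.8), this ratio may exceed 1 unless datapoints are counted per language-variable cell.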


Dik Bakker 
Dept. of General Linguistics 
Universities of Amsterdam & Lancaster
tel (+44) 1524 64975 & (+31) 20 5253864
Societas Linguistica Europaea 
Secretary/Treasurer