publishing fieldwork data

Peter Austin pa2 at SOAS.AC.UK
Tue Apr 17 11:06:27 UTC 2007

Dear Martin,

I don't particularly like the labels "field linguists", "typologists", "theoreticians" and the divisions they suggest (which camp do I have to assign myself to, I wonder), but be that as it may, there are communities of researchers who over the past six years at least have been discussing the issues you raise - there have been various workshops and publications from the E-MELD project and the OLAC and DELAMAN archives groups (including the workshop at Paradisec that Ashild mentioned) that have addressed many of the relevant topics of recognition, citability, accessibility, standardization, cross-searchability, and preservation.

DELAMAN has been discussing the development of a persistent identifier system and bibliographical format for archival objects that would allow them to be cited and listed on people's CVs (URLs are of course notoriously unreliable for referencing resources) - Heidi Johnson and Linda Barwick have been leading the discussion on this. E-MELD and others have discussed standardisation, resource discovery and cross-searchability issues, and preservation has been a major topic for language archives across the world. OLAC has run tutorials at many recent LSA meetings in the US to promote understanding of these issues among the general linguistic community there - one I was involved in a couple of years ago had a packed house of keen participants.

One could suggest several possible reasons why there aren't 'piles of data out there' provided by fieldworkers for other scholars to mine:

1. a concern about "drive-by typologists" who take primary data and analyses for their own use without recognising or giving anything back to the researcher who collected and prepared the data originally;

2. a need to take care with proper recognition of intellectual property rights and moral rights (of both speakers and analysts) that some researchers simply feel has been lacking in the past.

Note also that archiving (i.e. creating well-structured archival objects with associated metadata to be placed in a trusted repository) is *not* the same as publishing (including web publication) - the structure and format of archive files and publication files are likely to be quite different, for example.

To my mind, a very positive development has been the recent decision of at least one major linguistics journal to require bibliographical citation of original sources for all example sentences in papers, and the encouragement of referencing for archival materials. The development of e-journals and repositories where data sets can be uploaded and referenced is also quite recent and positive. As researchers publish their secondary, tertiary and n-ary analyses with proper citation and recognition of primary sources and archival data, an environment may well be created where those who collect and analyse primary data feel less reluctant to make their materials more widely available.

Peter K. Austin

-----Original Message-----
From: Martin Haspelmath <haspelmath at EVA.MPG.DE>
Date: Tue, 17 Apr 2007 09:58:52 +0200
Subject: Re: publishing fieldwork data

Yes, the issue of data publication also arises in field linguistics in a 
similar way. It has been my impression that since there are many more 
field linguists than typologists, and since there are some large-scale 
initiatives such as DoBeS (Volkswagen Foundation) and ELP (Rausing 
Foundation/SOAS), field linguists have talked much more about these 
issues. At least they have invested a lot of effort into creating 
archives for field data such as AILLA, the DoBeS archive and the ELAR 
(ELP archive).

However, it is unclear to me whether or how these archives address the 
need to fulfill the five traditional roles of paper publication: 
recognition, citability, accessibility, standardization, and 
cross-searchability. It seems that they mostly address a sixth role 
(that I had forgotten to mention in my original posting), permanence 
(though fieldworkers also seem to have discussed the issue of 

So I wonder whether someone can explain why those fieldworkers that do 
care about modern electronic methods (in my perception, the vast 
majority) have not devoted a lot of energy to electronic publication. 
Wouldn't it be great if anyone could read (and even cross-search) all 
those texts that fieldworkers have gathered and annotated? If one could 
refer to these texts as real publications, and if the researchers could 
put them on their CV along with the other publications?


Stuart Robinson wrote:
> Ashild has a good point. Part of the problem is the culture of descriptive
> linguistics, where there is still a fair bit of indifference and even
> hostility towards the technological investment required to support
> sustainable digital fieldwork data. I'm thinking, for example, of Bob
> Dixon's statement on this list when he received the Leonard Bloomfield
> award:
> "A word addressed to junior colleagues who think that it will
> improve their work to immerse it in the latest electronic technology.
> Don't. Because it won't. I worked on the Jarawara grammar as I did on
> previous grammars of Dyirbal, of Yidiɲ, of Boumaa Fijian (and of
> English). I used pencil, pen and spiral-bound notebooks, plus a couple of
> good-quality tape recorders. No video camera (to have employed
> one would have compromised my role in the community). No lap-top. No
> shoebox or anything of that nature. And also no grammatical
> elicitation from the lingua franca."
> This passed without comment when it was posted roughly a year ago, but if
> people are serious about recognizing the value of electronic data, it
> shouldn't have.
> Best,
> Stuart Robinson
> On Mon, 16 Apr 2007, Ashild Naess wrote:
>> Dear Martin,
>> the question you raise is just as relevant for descriptive linguistics; 
>> properly annotated corpora of descriptive data require an enormous 
>> amount of analysis work, but are generally not recognised as research 
>> output by those who count such things. Finding ways of having electronic 
>> data sets recognised as publications would be a great benefit to the 
>> whole field.
>> There was some discussion of the question at a recent conference in 
>> Sydney on electronic data collection, annotation and archiving. The 
>> following paper from the conference proceedings may be of interest:
>> Coleman, Ross. 2006. Field, file, data, conference: Towards new modes of 
>> scholarly publication. In Linda Barwick and Nicholas Thieberger (eds): 
>> Sustainable data from digital fieldwork. Sydney: Sydney University 
>> Press. 163-174.
>> The paper is available online at 
>> Best,
>> Åshild
>> On 13.04.2007 16:21, Martin Haspelmath wrote:
>>> Dear typologists,
>>> Last week at an informal meeting of the European Typology Network in 
>>> Leipzig, we discussed the issue of publishing typological databases. In 
>>> the past, this was a practical problem, because journals and book 
>>> publishers were reluctant to print many pages of tabular data. The basic 
>>> practical problem has disappeared with modern information technology, 
>>> but many problems remain, and it would be good if typologists made a 
>>> joint effort to address them.
>>> Traditional paper publication simultaneously fulfills at least four 
>>> distinct functions:
>>> (i) giving *recognition* (or even prestige) to a researcher's work, so 
>>> that they can list it on their CV as the visible outcome of their work
>>> (ii) *citability*, i.e. allowing users of published work to build on 
>>> this work without having to vouch for it personally, without having to 
>>> mention all the details, etc.
>>> (iii) *accessibility*, i.e. allowing users in many different places (in 
>>> principle, at any institution devoted to research, and beyond) to access 
>>> the results of the work
>>> (iv) *standardization*, i.e. things like uniform glossing, 
>>> bibliographical references, section organization, or even uniform 
>>> terminology (in some particular context, e.g. an edited volume)
>>> All of these functions are important also for typological databases, but 
>>> while some progress has been made with regard to (iii) (accessibility), 
>>> the other requirements (recognition, citability, and standardization) 
>>> still need a lot of thinking and work on our part. You can access some 
>>> typological databases such as the Surrey morphology databases,
>>> the Berlin-Utrecht Reciprocals Survey, and the Graz Reduplication
>>> database, but these websites
>>> generally don't say how to cite data from these databases, so they do 
>>> not give enough recognition to the authors.
>>> Standardization has been addressed by the Typological Database System,
>>> and this project additionally aims
>>> for a fifth function, *cross-searchability*, that was not possible with 
>>> traditional paper publication at all.
>>> Another problem is how to divide databases into units: Some databases 
>>> (such as the database of the World Atlas of Language Structures, which 
>>> will become available on the web in 2008) are aggregates of datasets 
>>> contributed by many different authors, which should be citable 
>>> separately. Also for the databases created by a smaller team, it may be 
>>> desirable to specify more precisely which author did what. In
>>> traditional paper publications, we had two kinds of units, articles and 
>>> books, which could be single-authored or multi-authored (occasionally 
>>> with some ranking of the authors). Maybe it would be desirable to allow 
>>> more different units, and more different roles (e.g. content provider 
>>> vs. database designer?).
>>> Any ideas how typologists should go about solving these problems?
>>> Martin

Martin Haspelmath (haspelmath at
Max-Planck-Institut fuer evolutionaere Anthropologie, Deutscher Platz 6	
D-04103 Leipzig      
Tel. (MPI) +49-341-3550 307, (priv.) +49-341-980 1616

Prof Peter K. Austin
Marit Rausing Chair in Field Linguistics
Director, Endangered Languages Academic Program
Department of Linguistics, SOAS
Thornhaugh Street, Russell Square
London WC1H 0XG
United Kingdom


More information about the Lingtyp mailing list