publishing fieldwork data

Vanhove vanhove at VJF.CNRS.FR
Tue Apr 17 09:33:39 UTC 2007

Dear all,

You have a wonderful electronic archive of annotated texts (+ morphological 
data, etc.) done by field linguists accessible at and at
The original schema was set up by 
the Lacito (CNRS France) and Michel Jacobson, and it is open to other 

There is a growing awareness in France that electronic publications have to 
count as publications in our evaluation system, but citability is still a 
problem (not always though). At the CNRS, we are encouraged to put this 
kind of publication on our CVs.

There is an ongoing joint project of field linguists (just started, a 
website should be available soon) on "Oral corpuses in Afro-Asiatic 
languages: Prosodic and morphosyntactic analysis" funded by the Agence 
Nationale de la Recherche (led by Amina Mettouchi, Nantes University), 
which seems to fulfill all five traditional roles of paper publication. 
The description of the project follows:

The aim of this project is to establish a methodology in order to unify and 
share spoken field data in one phylum, Afroasiatic. This methodology is 
based on the linguistic analysis of the prosodic and morphosyntactic structure
of the languages studied in the project. We aim at compiling a pilot corpus 
accessible on-line to the scientific community, in particular for 
typological studies. The term ‘corpus’ implies that we are not compiling an 
archive for conservation purposes, but a structured body of systematically 
unified transcripts, accompanied by morphosyntactic annotations, and 
associating sound and text. This creation is grounded in the theoretical 
analysis of spoken field data.
This effort towards the unification of the data and its sharing is linked 
to two levels of analysis, implying both a theoretical stake and a 
practical one.
• the level of prosodic analysis: which units of spoken language are 
relevant for the languages under study, and on which principles are they 
founded (cognitive, phonological, pragmatic)?
• the level of morphosyntactic analysis: how can we code in a unified 
manner the minimal segmental units of the languages, for the whole sample?
Through this project, we would like to contribute to answering the 
following questions:
• What are the units of spoken language?
• Do those units differ on the basis of the tonal or accentual nature of 
the intonation systems of the languages?
• How are prosody and morphosyntax articulated (especially at 
information-structure level)?
• What is the optimal degree of unification of the annotations, in order to 
both respect the specificities of languages, and provide a comparative 
basis for typology?
In order to provide answers to those questions, we will compile a 
pilot-corpus built according to the following criteria:
o it will be freely accessible on-line in XML format,
o it will consist of languages belonging to the Afroasiatic phylum, 
with three hours of recorded materials per language,
o it will be segmented into prosodic units,
o it will minimally contain: a transcript, a translation, interlinear 
glossing, and the sound (downloadable on-line), which will be indexed to the texts.
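
As a rough illustration only: the project's actual XML schema is not given here, so the sketch below simply shows how one prosodic unit meeting the criteria above (transcript, translation, interlinear glossing, sound indexed to the text) might be encoded. All element names, attribute names, the file name and the data are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_unit(unit_id, start_s, end_s, words, translation, audio_href):
    """Build one prosodic unit as an XML element (hypothetical schema)."""
    unit = ET.Element("unit", id=unit_id)
    # Time offsets (in seconds) index this unit to the sound file.
    ET.SubElement(unit, "audio", href=audio_href,
                  start=f"{start_s:.2f}", end=f"{end_s:.2f}")
    # The transcript is the plain sequence of word forms.
    ET.SubElement(unit, "transcript").text = " ".join(form for form, _ in words)
    # Interlinear glossing: one <w> element per word, pairing form and gloss.
    gloss = ET.SubElement(unit, "gloss")
    for form, label in words:
        ET.SubElement(gloss, "w", form=form, gloss=label)
    # Free translation of the whole unit.
    ET.SubElement(unit, "translation", lang="en").text = translation
    return unit


# Placeholder data, not taken from any real language or recording.
unit = build_unit("u1", 0.0, 1.85,
                  [("FORM1", "GLOSS1"), ("FORM2", "GLOSS2")],
                  "placeholder translation", "recording01.wav")
xml_text = ET.tostring(unit, encoding="unicode")
print(xml_text)
```

Keeping the time offsets on each unit, rather than in a separate file, is one simple way to let a reader go from any line of text straight to the matching stretch of sound; other designs are of course possible.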



At 09:58 17/04/2007 +0200, Martin Haspelmath wrote:
>Yes, the issue of data publication also arises in field linguistics in a 
>similar way. It has been my impression that since there are many more 
>field linguists than typologists, and since there are some large-scale 
>initiatives such as DoBeS (Volkswagen Foundation) and ELP (Rausing 
>Foundation/SOAS), field linguists have talked much more about these 
>issues. At least they have invested a lot of effort into creating archives 
>for field data such as AILLA, the DoBeS archive and the ELAR (ELP archive).
>However, it is unclear to me whether or how these archives address the 
>need to fulfill the five traditional roles of paper publication: 
>recognition, citability, accessibility, standardization, and 
>cross-searchability. It seems that they mostly address a sixth role (that 
>I had forgotten to mention in my original posting), permanence (though 
>fieldworkers also seem to have discussed the issue of standardization).
>So I wonder whether someone can explain why those fieldworkers that do 
>care about modern electronic methods (in my perception, the vast majority) 
>have not devoted a lot of energy to electronic publication. Wouldn't it be 
>great if anyone could read (and even cross-search) all those texts that 
>fieldworkers have gathered and annotated? If one could refer to these 
>texts as real publications, and if the researchers could put them on their 
>CV along with the other publications?
>Stuart Robinson wrote:
>>Ashild has a good point. Part of the problem is the culture of descriptive
>>linguistics, where there is still a fair bit of indifference and even
>>hostility towards the technological investment required to support
>>sustainable digital fieldwork data. I'm thinking, for example, of Bob
>>Dixon's statement on this list when he received the Leonard Bloomfield
>>"A word addressed to junior colleagues who think that it 
>>improve their work to immerse it in the latest electronic 
>>Don't. Because it won't. I worked on the Jarawara grammar as I did 
>>previous grammars of Dyirbal, of Yidiɲ, of Boumaa Fijian (and 
>>English). I used pencil, pen and spiral-bound notebooks, plus a couple 
>>good-quality tape recorders. No video camera (to have 
>>one would have compromised my role in the community). No lap-top. 
>>shoebox or anything of that nature. And no also 
>>elicitation from the lingua 
>>This passed without comment when it was posted roughly a year ago, but if
>>people are serious about recognizing the value of electronic data, it
>>shouldn't have.
>>Stuart Robinson
>>On Mon, 16 Apr 2007, Ashild Naess wrote:
>>>Dear Martin,
>>>the question you raise is just as relevant for descriptive linguistics; 
>>>properly annotated corpora of descriptive data require an enormous 
>>>amount of analysis work, but are generally not recognised as research 
>>>output by those who count such things. Finding ways of having electronic 
>>>data sets recognised as publications would be a great benefit to the 
>>>whole field.
>>>There was some discussion of the question at a recent conference in 
>>>Sydney on electronic data collection, annotation and archiving. The 
>>>following paper from the conference proceedings may be of interest:
>>>Coleman, Ross. 2006. Field, file, data, conference: Towards new modes of 
>>>scholarly publication. In Linda Barwick and Nicholas Thieberger (eds): 
>>>Sustainable data from digital fieldwork. Sydney: Sydney University 
>>>Press. 163-174.
>>>The paper is available online at 
>>>On 13.04.2007 16:21, Martin Haspelmath wrote:
>>>>Dear typologists,
>>>>Last week at an informal meeting of the European Typology Network in 
>>>>Leipzig, we discussed the issue of publishing typological databases. In 
>>>>the past, this was a practical problem, because journals and book 
>>>>publishers were reluctant to print many pages of tabular data. The 
>>>>basic practical problem has disappeared with modern information 
>>>>technology, but many problems remain, and it would be good if 
>>>>typologists made a joint effort to address them.
>>>>Traditional paper publication simultaneously fulfills at least four 
>>>>distinct functions:
>>>>(i) giving *recognition* (or even prestige) to a researcher's work, so 
>>>>that they can list it on their CV as the visible outcome of their work
>>>>(ii) *citability*, i.e. allowing users of published work to build on 
>>>>this work without having to vouch for it personally, without having to 
>>>>mention all the details, etc.
>>>>(iii) *accessibility*, i.e. allowing users in many different places (in 
>>>>principle, at any institution devoted to research, and beyond) to 
>>>>access the results of the work
>>>>(iv) *standardization*, i.e. things like uniform glossing, 
>>>>bibliographical references, section organization, or even uniform 
>>>>terminology (in some particular context, e.g. an edited volume)
>>>>All of these functions are important also for typological databases, 
>>>>but while some progress has been made with regard to (iii) 
>>>>(accessibility), the other requirements (recognition, citability, and 
>>>>standardization) still need a lot of thinking and work on our part. You 
>>>>can access some typological databases such as the Surrey morphology 
>>>>databases, the Berlin-Utrecht Reciprocals Survey, and the Graz 
>>>>Reduplication database, but these websites generally don't say how to 
>>>>cite data from these databases, so they do not give enough recognition 
>>>>to the authors.
>>>>Standardization has been addressed by the Typological Database System, 
>>>>and this project additionally 
>>>>aims for a fifth function, *cross-searchability*, that was not possible 
>>>>with traditional paper publication at all.
>>>>Another problem is how to divide databases into units: Some databases 
>>>>(such as the database of the World Atlas of Language Structures, which 
>>>>will become available on the web in 2008) are aggregates of datasets 
>>>>contributed by many different authors, which should be citable 
>>>>separately. Also for the databases created by a smaller team, it may be 
>>>>desirable to specify more precisely which author did what. In 
>>>>traditional paper publications, we had two kinds of units, articles and 
>>>>books, which could be single-authored or multi-authored (occasionally 
>>>>with some ranking of the authors). Maybe it would be desirable to allow 
>>>>more different units, and more different roles (e.g. content provider 
>>>>vs. database designer?).
>>>>Any ideas how typologists should go about solving these problems?
>Martin Haspelmath (haspelmath at
>Max-Planck-Institut fuer evolutionaere Anthropologie, Deutscher Platz 6
>D-04103 Leipzig
>Tel. (MPI) +49-341-3550 307, (priv.) +49-341-980 1616