publishing fieldwork data

Stuart Robinson stuart at ZAPATA.ORG
Tue Apr 17 22:52:32 UTC 2007


Hi, Martin.

> However, it is unclear to me whether or how these archives address the 
> need to fulfill the five traditional roles of paper publication: 
> recognition, citability, accessibility, standardization, and 
> cross-searchability. It seems that they mostly address a sixth role 
> (that I had forgotten to mention in my original posting), permanence 
> (though fieldworkers also seem to have discussed the issue of 
> standardization).

In order to address the five roles of paper publication that you identify,
there are a number of hurdles that have to be overcome. Some of these
hurdles have to do with academic culture, as I pointed out in my last
posting. Others are technical.

I think that recognition, citability, and accessibility are primarily
issues of academic culture, while standardization and cross-searchability
are primarily technical issues. Permanence is a mix of both (as are some
of the others, but to a lesser degree, I think).

Recognition is a hard problem, because it has to do with how the job
market works. That is to say, if you don't get recognized for that work,
then it will be hard to do it and get a job. And if you can't do it and
get a job, then you can't expect anyone but volunteers to advance it.

This relates to the issue of accessibility. There are two kinds of
accessibility: access to the data and access to the code that serves up
the data. Both types of access need to be addressed. The former is thorny
because you need good access control to respect the wishes of the
communities from whom the data is obtained. The latter is important
because proprietary software is no good for volunteer work. (I think open
source is the key.)

Now, as for citability, I think it is the easiest hurdle to overcome. The
people who create the databases simply need to say how to cite them. There
are some issues, such as date of publication and version control, but I
don't think they're as difficult as some of the other issues out there.

The big problem is technical, I think, because most linguists don't know
how to program and therefore can't get involved in tool building.
Standardization and cross-searchability are related technical issues,
since in order to search across different archives, or even within the
same one, there need to be consistent data formats. Interdependent
problems can be hard to solve, since you can't solve one at a time, but
have to solve both simultaneously. One additional wrinkle is that
standardization also requires that people make a good-faith effort to
adhere to standards. If there are good tools that work once you do so,
then people will do it because it's in their best interest.

So, the big question is, how can people (typologists or otherwise)
contribute? Learning to program is, I suppose, one way to contribute, so
that you can get involved in tool building, but that may be too much to
ask of people who are already busy enough with other things. I've been
writing a book with Harald Baayen that teaches linguists with no prior
programming experience how to program. You can find a rough draft here:

http://www.zapata.org/stuart/python/python-textbook/
 
Also, to help with the problem of standardization, I got involved with
the Natural Language Toolkit for Python (nltk.sourceforge.net) and helped
create a code library that allows someone with basic programming skills
in Python to manipulate Shoebox/Toolbox data, so that they can easily
convert it to whatever standard format the archives need. You can find
more info here:

http://nltk.sourceforge.net/lite/doc/en/data.html
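
To give a rough idea of the kind of conversion I mean, here is a minimal
sketch in plain Python. It does not use the NLTK library itself. The
function names (parse_toolbox, to_json), the JSON layout, and the sample
entries are all made up for illustration, and it assumes the usual
Shoebox/Toolbox layout in which each field starts with a backslash marker
and \lx opens a new record.

```python
import json

# Invented sample data in Shoebox/Toolbox "standard format" (not real entries).
SAMPLE = r"""\_sh v3.0  400  Dictionary

\lx kaa
\ps n
\ge house

\lx pito
\ps v
\ge walk
   slowly
"""


def parse_toolbox(text, record_marker="lx"):
    """Split Toolbox-style text into records, each a list of (marker, value) pairs.

    Assumption: a line starting with a backslash opens a field, and the
    marker named by record_marker opens a new record.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # blank lines only separate records visually
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = []
                records.append(current)
            if current is None:
                continue  # header fields before the first record are skipped
            current.append((marker, value.strip()))
        elif current:
            # A line without a marker continues the previous field.
            marker, value = current[-1]
            current[-1] = (marker, (value + " " + line.strip()).strip())
    return records


def to_json(records):
    """Serialize records to JSON, keeping field order within each record."""
    data = [[{"marker": m, "value": v} for m, v in rec] for rec in records]
    return json.dumps(data, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(to_json(parse_toolbox(SAMPLE)))
```

JSON is only a stand-in here for whatever format an archive actually
asks for; once the records are in a plain Python structure, writing them
out in another format is a matter of swapping the last function.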

-Stuart


