[Lingtyp] Fwd: Re: Empirical standards in typology: incentives

Sun Mar 25 15:37:30 UTC 2018

Dear Dorothee,

As a member of the CLDF initiative, I can say that we are planning to
extend our machine- and human-readable standards to also address
interlinear glossed texts. (Robert Forkel, our chief researcher on CLDF,
introduced the initiative in another thread, as he didn't have access to
the list before and could not answer directly.) We already cover
parallel texts, as you can see from our draft proposal online at
http://cldf.clld.org. We're very keen on discussing this further with
the community. We ourselves are strong on wordlists and grammatical
feature datasets, as you know from the numerous CLLD applications (like
WALS, etc.), but we very much hope for a broader discussion on how to
integrate further cross-linguistic data types.

We launched several calls to different lists, asking for our colleagues
in diversity linguistics to attend and share their opinions (we do this
via GitHub, so making an account is required, but it's easy to use and
free of charge).

Here's our call on LINGUIST List, FYI (launched by Harald Hammarström):

* https://linguistlist.org/issues/29/29-1295.html

I hope that as many of you as possible who are interested in this will
join the discussion. We can't have another decade of articles where the
bulk of the data is not shared openly, nor can we afford for our
community to put more and more largely incomparable datasets out there.
CLDF is an attempt to propose concrete solutions for several of these
issues, but we can't do it without the community, and we have no
intention of trying to.

Best,

Mattis

-- 
Dr. Johann-Mattis List
Max Planck Institute for the Science of Human History
Kahlaische Straße 10
07743 Jena
Germany
http://lingulist.de

On 25.03.2018 16:51, Dorothee Beermann wrote:
> Dear all,
> 
> I have followed the discussion on this thread with interest. Let me ask
> you, would any of what you discuss and suggest here also apply to
> Interlinear Glossed Data?
> 
> Sebastian talked about making "typological research more replicable". A
> related issue is reproducible research in linguistics. I guess a good
> starting point for whatever we do as linguists is to keep things
> transparent, and to give public access to data collections. Especially
> for languages with little to no public resources (except for what one
> finds in articles), this seems essential.
> 
> Here is an example of what I have in mind:  We just released 41
> Interlinear Glossed Texts in Akan. The data can be downloaded as XML from:
> 
> https://typecraft.org/tc2wiki/The_TypeCraft_Akan_Corpus
> 
> The corpus is described on the download page, and also in the notes
> contained in the download. (Note that we can offer the material in
> several other formats.)
> 
> 
> Dorothee
> 
> Professor Dorothee Beermann, PhD
> Norwegian University of Science and Technology (NTNU)
> Dept. of Language and Literature
> Surface mail to: NO-7491 Trondheim, Norway/Norge
> 
> Visit: Building 4, level 5, room 4512, Dragvoll,
> E-mail:  dorothee.beermann at ntnu.no
> 
> Homepage: http://www.ntnu.no/ansatte/dorothee.beermann
> TypeCraft: http://typecraft.org/tc2wiki/User:Dorothee_Beermann
> 
> 
> 
> 
> 
> -------- Forwarded Message --------
> Subject: 	Re: [Lingtyp] Empirical standards in typology: incentives
> Date: 	Fri, 23 Mar 2018 11:59:18 +1100
> From: 	Hedvig Skirgård <hedvig.skirgard at gmail.com>
> To: 	Johanna NICHOLS <johanna at berkeley.edu>
> CC: 	Linguistic Typology <lingtyp at listserv.linguistlist.org>
> 
> 
> 
> Dear all, 
> 
> I think Sebastian's suggestion is very good. 
> 
> Is this something LT would consider, Masja?
> 
> Johanna's point is good as well, but it shouldn't matter for Sebastian's
> suggestion as I understand it. We're not being asked to submit the
> coding criteria prior to the survey being completed, but only at the
> time of publication. There are initiatives in STEM that encourage
> research teams to submit what they're planning to do prior to doing it
> (to avoid biases), but that's not baked into what Sebastian is
> suggesting, from what I can tell.
> 
> I would also add a four-star category which includes inter-coder
> reliability tests, i.e. the original author(s) have given different
> people the same instructions and tested how often they do the same
> thing with the same grammar.
> 
> /Hedvig
> 
> With kind regards,
> 
> Hedvig Skirgård
> 
> 
> PhD Candidate
> 
> The Wellsprings of Linguistic Diversity
> 
> ARC Centre of Excellence for the Dynamics of Language
> 
> School of Culture, History and Language
> College of Asia and the Pacific
> 
> The Australian National University
> 
> Website <https://sites.google.com/site/hedvigskirgard/>
> 
> 
> 
> 
> 2018-03-23 0:49 GMT+11:00 Johanna NICHOLS <johanna at berkeley.edu>:
> 
>     What's in the codebook -- the coding categories and the criteria? 
>     That much is usually in the body of the paper.
> 
>     Also, a minor but I think important point:  Ordinarily the codebook
>     doesn't in fact chronologically precede the spreadsheet.  A draft or
>     early version of it does, and that gets revised many times as you
>     run into new and unexpected things.  (And every previous entry in
>     the spreadsheet gets checked and edited too.)  By the time you've
>     finished your survey the categories and typology can look different
>     from what you started with.  You publish when you're comfortably
>     past the point of diminishing returns.  In most sciences this is bad
>     method, but in linguistics it's common and I'd say normal.  The
>     capacity to handle it needs to be built into the method in advance. 
> 
>     Johanna
> 
>     On Thu, Mar 22, 2018 at 2:10 PM, Sebastian Nordhoff
>     <sebastian.nordhoff at glottotopia.de> wrote:
> 
>         Dear all,
>         taking up a thread from last November, I would like to start a
>         discussion about how to make typological research more
>         replicable, where
>         replicable means "less dependent on the original researcher". This
>         includes coding decisions, tabular data, quantitative analyses etc.
> 
>         Volker Gast wrote (full quote at bottom of mail):
>         > Let's assume that self-annotation cannot be avoided for financial
>         > reasons. What about establishing a standard saying that, for
>         > instance, when you submit a quantitative-typological paper to LT
>         > you have to provide the data in such a way that the coding
>         > decisions are made sufficiently transparent for readers to see if
>         > they can go along with the argument?
> 
>         I see two possibilities for that. Option 1: editors will refuse
>         papers which do not adhere to this standard. That will not work in
>         my view. What might work (Option 2) is a star/badge system. I could
>         imagine the following:
> 
>         - no stars: only standard bibliographical references
>         - *         raw tabular data (spreadsheet) available as a supplement
>         - **        as above, + code book available as a supplement
>         - ***       as above, + computer code in R or similar available
> 
>         For a three-star article, an unrelated researcher could then take
>         the original grammars and the code book and replicate the
>         spreadsheet to see if it matches. They could then run the computer
>         code to see if they arrive at the same results.
> 
>         This will not be practical for every research project, but some
>         might find it easier than others, and, in the long run, it will
>         require good arguments to submit a 0-star (i.e. non-replicable)
>         quantitative article.
> 
>         Any thoughts?
>         Sebastian
> 
>         PS: Note that the codebook would actually chronologically precede
>         the spreadsheet, but I feel that spreadsheets are more easily
>         available than codebooks, so in order to keep the entry barrier
>         low, this order is reversed for the stars.
> 
> 
> 
> 
> 
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> http://listserv.linguistlist.org/mailman/listinfo/lingtyp
> 

-- 
Dr. Johann-Mattis List
DFG Forschungsstipendiat
Centre de recherches linguistiques sur l'Asie Orientale
École des Hautes Études en Sciences Sociales
2 Rue de Lille
75007 Paris

Team Adaptation, Integration, Reticulation, Evolution
Université Pierre et Marie Curie
9 quai St Bernard
75005 Paris

Tel: +49-1575-2057010
Email: mattis.list at lingpy.org
Homepage: http://lingulist.de