[Corpora-List] Q: openly available data sets from corpus studies and linguistic experiments

Roland Schäfer roland.schaefer at fu-berlin.de
Fri Jul 18 10:16:37 UTC 2014


Hi all,

for a small survey, I am looking for any openly available linguistic
data set which fulfills the following criteria:

1. used in original, published work (paper, book, thesis, contribution
to a collection) from 1995 or later (roughly the past 20 years)

2. the publication is written in English, French, German or Swedish,
regardless of the target language of the study/experiment

3. the data is corpus data and/or data from a linguistic experiment

4. the authors apply some kind of inferential statistics (a significance
test, (G)L(M)M, (G)A(M)M, etc.) in order to test hypotheses from
theoretical linguistics in the broadest sense (descriptive statistics
won't suffice)

5. the released data is complete inasmuch as it allows for (an attempt
at) a reproduction of the results in the paper

– ideally also –

6. the data is published under a license compatible with variants of CC-BY

Please allow me to clarify that I am NOT looking for corpora, but rather
for SAMPLES (e.g., annotated concordances) used in published studies,
on the basis of which specific linguistic hypotheses were tested. I have
already taken full account of the data sets mentioned or used in the
available introductions to corpus linguistics and/or statistics for
linguists (Baayen, Johnson, Gries, etc.).

If there are any recent surveys cataloging such data sets, I'd be keen
to learn about them.

Any suggestions are highly appreciated. If you prefer to reply off-list,
please feel free to do so. I will post a summary.

Regards,
Roland
