CfP: SLE 2018 workshop 'Comparative corpus linguistics'

Annemarie Verkerk verkerk at shh.mpg.de
Thu Oct 12 15:00:15 EDT 2017

*Call for papers*

*Comparative corpus linguistics: new perspectives and applications*

Workshop proposal for the 51st Meeting of the SLE (Tallinn, 29 August –
1 September 2018)


Convenors: Natalia Levshina, Annemarie Verkerk, and Steven Moran

Although the bulk of existing corpus-based research probably consists of 
language-specific descriptive studies, corpora have long been used 
successfully for large-scale language comparison and for testing 
linguistic generalizations, e.g. by Zipf (1935) and Greenberg (1960). 
Nowadays, linguists can enjoy the abundance of large comparable and 
parallel corpora and other multilingual resources, such as the Universal 
Dependencies Corpora (Nivre et al. 2017), the parallel Bible 
translations (Mayer & Cysouw 2014), the OPUS corpus (Tiedemann 2012), 
Multi-CAST (Haig & Schnell 2016) and Google Books Ngrams. The 
availability of such resources provides functional linguists, 
typologists, historical linguists and psycholinguists with exciting new 
opportunities to answer big theoretical questions, exemplified by 
successful applications of comparative corpus-based approaches such as 
the following:

- formulation, refinement and explanation of linguistic generalizations, 
e.g. Zipf’s Law of Abbreviation (Piantadosi et al. 2011; Bentz & 
Ferrer-i-Cancho 2016), the principle of dependency length minimization 
(Futrell et al. 2015) and the principle of economy in morphosyntactic 
alternations (Haspelmath et al. 2014);

- computation of corpus-based measures that represent typological 
parameters, such as analyticity, syntheticity and complexity (e.g. Juola 
1998; Szmrecsanyi 2009; Ehret & Szmrecsanyi 2016);

- using massively parallel and comparable corpora for unsupervised 
pattern detection, e.g. finding the universal conceptual dimensions of 
motion verbs (Wälchli & Cysouw 2012) and automatic extraction of 
typological features (Virk et al. 2017);

- development of new statistical methods, and probabilistic and 
connectionist approaches to the study of language acquisition (e.g. 
Chater & Manning 2006, Behrens 2008), in particular from a 
cross-linguistic perspective (MacWhinney & Snow 1985; Moran et al. 2016);

- quantitative diachronic typology, e.g. development of manner and path 
verbs in Indo-European (Verkerk 2015);

- detection of areal patterns in genealogically related languages (e.g. 
van der Auwera et al. 2005; von Waldenfels 2015);

- usage-based explanations of the evolution of linguistic types, e.g. 
studies related to the Preferred Argument Structure hypothesis (Du Bois 
1987; Haig & Schnell 2016);

- cross-linguistic comparison of probabilistic constraints on 
multifactorial language variation, e.g. the use of analytic and lexical 
causatives (Levshina 2016).

The aim of this workshop is to bring together typologists, functional 
linguists, psycholinguists and other specialists who use 
cross-linguistic corpora for testing their hypotheses, and corpus 
linguists who build and use such corpora to address research questions 
in linguistic diversity. We want to discuss the recent developments, 
perspectives and challenges of corpus-based language comparison. We seek 
contributions that sample a sizable number of the world’s languages, 
whether at the global level, or within particular families or areas. A 
list of potential contributions includes, but is not limited to, the 
following:

- case studies showing how one can use the information derived from 
corpora for the purposes of typological classification;

- corpus investigations of linguistic generalizations and explaining 
these findings in terms of processing-related, communicative and 
learning constraints or biases;

- corpus-based language comparison from a genealogical and/or areal 
perspective;

- corpus-based studies in diachronic typology and historical linguistics;

- studies addressing the problem of comparative concepts (Haspelmath 
2010) and its consequences for comparative corpus linguistics, in 
particular, for the development of cross-linguistic annotation schemas;

- presentation of newly developed cross-linguistic corpora, preferably 
with a case study revealing their possibilities;

- discussion of statistical methods and visualization tools for 
analysing cross-linguistic corpus data.

If you are interested in participating in this workshop, please send 
your short abstract (up to 300 words), along with the name(s), 
affiliation(s) and contact information of all co-authors, to Natalia 
Levshina (natalevs at gmail.com) before 
*November 10* 2017. Earlier inquiries are also welcome. If the proposal 
is accepted, the contributors will have to submit full versions of their 
abstracts by January 15, 2018, which will be reviewed by the SLE 
scientific committee. We will keep you informed of all practical steps.


Behrens, H. (ed.). (2008). /Corpora in language acquisition research: 
History, methods, perspectives/ (Vol. 6). Amsterdam: John Benjamins.

Bentz, Ch., & Ferrer-i-Cancho, R. (2016). Zipf’s law of abbreviation as 
a language universal. In Bentz, Christian, Gerhard Jäger and Igor 
Yanovich (eds.), /Proceedings of the Leiden Workshop on Capturing 
Phylogenetic Algorithms for Linguistics/. University of Tübingen, online 
publication system: 

Chater, N., & Manning, C. D. (2006). Probabilistic models of language 
processing and acquisition. /Trends in Cognitive Sciences/, 10(7), 335–344.

Du Bois, John W. (1987). The discourse basis of ergativity. /Language/, 
63, 805–855.

Ehret, K. & Szmrecsanyi, B. (2016). An information-theoretic approach to 
assess linguistic complexity. In R. Baechler & G. Seiler 
(eds.), /Complexity and Isolation/, 71–94. Berlin: de Gruyter.

Futrell, R., Mahowald, K., & Gibson, E. (2015). Large-scale evidence of 
dependency length minimization in 37 languages. /Proceedings of the 
National Academy of Sciences/, 112(33), 10336–10341.

Haig, G. & Schnell, S. (eds.). (2016). Multi-CAST (Multilingual Corpus 
of Annotated Spoken Texts), https://lac.uni-koeln.de/multicast/.

Haspelmath, M. (2010). Comparative concepts and descriptive categories 
in crosslinguistic studies. /Language/, /86/(3), 663–687.

Haspelmath, M., Calude, A., Spagnol, M., Narrog, H., & Bamyaci, E. 
(2014). Coding causal–noncausal verb alternations: A form–frequency 
correspondence explanation. /Journal of Linguistics/, /50/, 587–625. 

Greenberg, J. H. (1960). A quantitative approach to the morphological 
typology of language. /International Journal of American Linguistics/, 
26(3), 178–194.

Juola, P. (1998). Measuring linguistic complexity: the morphological 
tier. /Journal of Quantitative Linguistics/, 5(3), 206–213.

Levshina, N. (2016). Why we need a token-based typology: A case study of 
analytic and lexical causatives in fifteen European languages. /Folia 
Linguistica,/ 50(2), 507–542.

Mayer, T., & Cysouw, M. (2014). Creating a massively parallel Bible 
corpus. /Proceedings of the International Conference on Language 
Resources and Evaluation (LREC)/, Reykjavik, 3158–3163.

MacWhinney, B., & Snow, C. (1985). The child language data exchange 
system. /Journal of Child Language/, /12/(2), 271–295.

Moran, S., Schikowski, R., Pajovic, D., Hysi, C., & Stoll, S. (2016). 
The ACQDIV Database: Min(d)ing the Ambient Language. In /Proceedings 
of the Tenth International Conference on Language Resources and 
Evaluation (LREC 2016)/, Portorož, Slovenia, 23–28 May 2016, 

Nivre, J., Agić, Ž., Ahrenberg, L. et al. (2017). Universal Dependencies 
2.0, LINDAT/CLARIN digital library at the Institute of Formal and 
Applied Linguistics, Charles University in Prague, 

Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are 
optimized for efficient communication. /Proceedings of the National 
Academy of Sciences of the United States of America/, /108/(9), 3526–3529.

Szmrecsanyi, B. (2009). Typological parameters of intralingual 
variability: grammatical analyticity versus syntheticity in varieties of 
English. /Language Variation and Change/, 21(3), 319–353.

Tiedemann, J. (2012). Parallel Data, Tools and Interfaces in OPUS. In 
/Proceedings of the 8th International Conference on Language Resources 
and Evaluation (LREC'2012)/

van der Auwera, J., Schalley, E., & Nuyts, J. (2005). Epistemic 
possibility in a Slavonic parallel corpus: A pilot study. In B. Hansen & 
P. Karlik (eds.), /Modality in Slavonic Languages. New Perspectives/, 
201–217. München: Sagner.

Verkerk, A. (2015). Where do all the motion verbs come from? The speed 
of development of manner verbs and path verbs in Indo-European. 
/Diachronica/, 32(1), 69–104.

Virk, Sh. M., Borin, L., Saxena, A. & Hammarström, H. (2017). Automatic 
Extraction of Typological Linguistic Features from Descriptive Grammars. 
In Kamil Ekštein & Václav Matoušek (eds.), /TSD 2017: Text, Speech, and 
Dialogue,/ 111–119. Cham: Springer.

von Waldenfels, R. (2015). Inner-Slavic contact from a corpus driven 
perspective. In E. Kelih, S. M. Newerkla, & J. Fuchsbauer (eds.), 
/Lehnwörter im Slawischen: Empirische und crosslinguistische 
Perspektiven/, 237–263. Frankfurt: Peter Lang.

Zipf, G. K. (1935). /The Psycho-Biology of Language: An Introduction to 
Dynamic Philology/. Cambridge, MA: MIT Press.

