2nd CfP: Workshop on Corpus-based Quantitative Typology (CoQuaT 2013)

Wed Feb 27 14:52:30 UTC 2013

Workshop on Corpus-based Quantitative Typology (CoQuaT 2013)

Pre-conference workshop in connection with the 10th Biennial
Conference of the Association of Linguistic Typology (ALT X), Leipzig,
Germany, August 15-18, 2013

Location: University of Leipzig, Germany
Date: August 14, 2013
URL: http://paralleltext.info/coquat2013/

Submission deadline: March 31, 2013
Notification of acceptance: May 1, 2013

CONVENORS

Michael Cysouw (Philipps University of Marburg)
Dirk Goldhahn (University of Leipzig)
Thomas Mayer (Philipps University of Marburg)
Uwe Quasthoff (University of Leipzig)

INVITED SPEAKERS

Östen Dahl (Stockholm University)
Kevin Scannell (Saint Louis University)

WORKSHOP DESCRIPTION

The number of available (textual) corpora of the world’s languages is
currently rising rapidly. The aim of this workshop is to
bring together researchers dealing with corpus-based quantitative
language comparison and to encourage typological studies that rely on
corpus data.

A growing body of research uses corpora to investigate the structure
of individual languages. There is also a large body of research
on world-wide linguistic diversity, though mostly on the basis of
information manually extracted from published sources. In contrast,
the combination of the two is still rare. There are only a few
quantitative typological investigations with a world-wide scope that
use corpora to infer cross-linguistic generalizations and insights.
Some previous work compiled quantitative data through manual corpus
annotation (e.g. Greenberg 1960; Wälchli 2005) or automatically with
the help of computer programs (e.g. Mayer and Cysouw 2012). In
addition, there is some relevant work using corpora to compare a
smaller number of (genealogically related) languages (e.g. Bickel
2003; van der Auwera et al. 2005).

Cross-linguistic corpora, in particular (massively) parallel corpora
(cf. Cysouw and Wälchli 2007) or comparable corpora compiled through
web crawling (e.g. Scannell 2007; Goldhahn et al. 2012), provide an
enormous amount of information about the world's languages. Although
such data are often not ideal from a linguistic point of view
(involving problems of translationese, or being restricted to special
textual genres), it would be a waste not to at least try to use them
for comparative linguistic purposes.

One of the reasons for the shortage of quantitative cross-linguistic
work is the lack of adequate resources for a representative sample of
languages. Consequently, on top of the laborious manual analysis,
typologically interested researchers are faced with the time-consuming
task of building their own corpora from scratch. One goal of this
workshop is therefore to collect (online) resources (especially for
lesser-studied languages) and to exchange experiences with crawling
texts from the web. Furthermore, we intend to discuss in which formats
cross-linguistic corpora should be made publicly available so that
typologists can best benefit from them without violating copyright
laws.

2nd CALL FOR PAPERS

For this workshop, we welcome any type of cross-linguistic
quantitative corpus-based work. We are interested both in the
collection and preparation of (massively) cross-linguistic corpora and
in investigations that rely on such a resource for language
comparison.

A) Possible topics concerning the collection and preparation of text
data for a larger number of languages:

- presentations about projects collecting and organizing (massively)
parallel or comparable corpora
- presentations about projects crawling web data to build a
cross-linguistic corpus
- approaches to (semi-)automatic annotation of corpora for typological research
- proposals of corpus formats that are useful for typological research
and can easily be imported into standard formats

B) Specific examples of corpus-based language comparison, focusing on
a particular linguistic topic of choice, using approaches like:

- (massively) parallel text analysis
- corpus-based multivariate quantitative comparison of languages
- unsupervised or semi-supervised language analysis for language comparison
- evaluation of cross-linguistic corpus-based studies

SUBMISSION PROCEDURE

Please send an abstract of approx. 500 words (excluding references) to
coquat2013 at gmail.com. Abstracts should contain the author's name,
affiliation and contact email. The deadline for the submission of
proposals is March 31, 2013. Notifications of acceptance will be sent
on May 1, 2013.

REFERENCES

Bickel, B. 2003. Referential density in discourse and syntactic
typology. Language 79. 708-739.

Cysouw, M. and B. Wälchli (eds.). 2007. Parallel Texts. Using
Translational Equivalents in Linguistic Typology. Theme issue in
Sprachtypologie & Universalienforschung STUF 60.2.

Goldhahn, D., T. Eckart and U. Quasthoff. 2012. Building large
monolingual dictionaries at the Leipzig Corpora Collection: From 100
to 200 languages. In Proceedings of the Eighth International
Conference on Language Resources and Evaluation (LREC’12), 23-25.

Greenberg, J. H. 1960. A quantitative approach to the morphological
typology of language. International Journal of American Linguistics
26. 178-194.

Mayer, T. and M. Cysouw. 2012. Language comparison through sparse
multilingual word alignment. In Proceedings of the EACL 2012 Joint
Workshop of LINGVIS & UNCLH, 54-62.

Scannell, K. P. 2007. The Crúbadán Project: Corpus building for
under-resourced languages. In C. Fairon, H. Naets, A. Kilgarriff, and
G-M. de Schryver (eds.), Building and exploring web corpora:
proceedings of the 3rd Web as Corpus Workshop, Cahiers du Cental: 4,
5-15. Louvain: Presses Universitaires de Louvain.

van der Auwera, J., E. Schalley and J. Nuyts. 2005. Epistemic
possibility in a Slavonic parallel corpus - a pilot study. In B.
Hansen and P. Karlik (eds.), Modality in Slavonic Languages: New
Perspectives, 201-217. München: Sagner.

Wälchli, B. 2005. Co-compounds and Natural Coordination. Oxford:
Oxford University Press.