workshop - annotation of multilingual corpora - 3/10/2011

Isabelle LEGLISE leglise at VJF.CNRS.FR
Tue Sep 13 15:24:36 UTC 2011


Please find below, and attached, the programme of the International 
Workshop on the annotation of multilingual corpora, to be held on Monday 
3 October as part of the cross-disciplinary workshop Corpus O' 

Best regards,
Isabelle Léglise

*International workshop on Multilingual Corpora annotation*


*Monday 3 October - conference room - Building D - Campus CNRS de 


Multilingual corpora represent an interesting concentrated mixture of 
most of the problems raised by monolingual corpora and some extra 
challenges. Central issues are related to problems of variation and 
non-standard forms, often ignored by big national corpora or controlled 
by rather general criteria (like quite general typologies of texts and 
discourses defining the kind of data collected). These variations 
transcend internal variation observable within a single language; they 
often even question the categorization of linguistic forms as belonging 
to a given language or a given variety /vs./ another. Corpora containing 
code-switching, code-mixing, hybridization phenomena, heterogeneous uses 
of a lingua franca variably mobilized within diverse social practices 
and linguistic competences, raise a range of methodological and 
theoretical questions - such as problems of identification, notation, 
transcription, and categorization of hybrid forms. These problems have 
crucial consequences for the annotation of corpora, for the definition 
and delimitation of what a multilingual corpus is, for the choice of 
relevant contexts of practice to be documented, etc.

The workshop aims at debating these problems, on the basis of:

a) databases of multilingual corpora that have already been completed, for 
which examples of the problems and solutions will be given.

b) excerpts of multilingual corpora on which problems of transcription, 
annotation, and exploitation will be illustrated and discussed.

*9h30 Coffee & Sweets - Welcome & Introduction - Isabelle Léglise & 
Lorenza Mondada*


*10h-10h45 Yaron Matras* (Univ. Manchester) *Documenting languages and 
dialects in contact*

The presentation will outline several database resources developed in 
recent years at the University of Manchester for the documentation of 
languages in contact, societal multilingualism, and the dialects of 
Romani. The latter online resource, the Romani Morpho-Syntax (RMS) 
Database, is in many ways the most advanced technological tool for the 
online documentation of related varieties, and special emphasis will be 
put on the history of its emergence and its online functionality.


*10h50-11h35 Naomi Nagy* (Univ. Toronto) *The Toronto Heritage Language 
Documentation Corpus (HerLD)*

Since 2009, the Heritage Language Variation and Change in Toronto 
Project has been building a corpus of conversational speech in a range 
of Heritage Languages in Toronto. Our aim is to bring together elements 
of code-switching theory, which looks at when each language is selected 
but not at which forms of the language are selected, with the 
variationist approach, which quantifies the effects of various 
contextual forces on the selection of forms within one language. I'll 
describe how we indicate use of multiple languages within one 
conversation and efforts to maintain consistency across protocols from 
six different languages/communities, developed by teams of students from 
each community.


*11h40-12h25 Thomas Schmidt* (Univ. Hamburg) *Multilingual corpora - 
technical aspects and a wishlist*

In its twelve years of existence, the Research Centre on Multilingualism 
at the University of Hamburg has built up a large database of spoken and 
written corpora. These corpora have been used (and will be further used) 
to study various aspects of multilingualism in individuals and in 
society. In my talk, I will give an overview of the corpora and address 
a couple of technical questions regarding their computer-assisted 
creation, analysis and dissemination. I will also discuss some of the 
lessons learned in the attempt to make these resources sustainable 
beyond the lifetime of the projects in which they were created. I will 
conclude with a set of desiderata for future developments in the field 
of multilingual corpora.

*12h30-14h Lunch - Buffet* (room 511)


*14h-14h45 Carole Etienne* (CNRS, ICAR), *Lorenza Mondada* (Univ. Lyon 
2, ICAR), *Véronique Traverso* (CNRS, ICAR)

On the basis of the corpus data bank and workbench CLAPI (Corpus de 
LAngue Parlée en Interaction), which has been developed at the ICAR 
research lab for the last decade, we will discuss some problems related 
to the treatment of heterogeneous, hybrid, plurilingual data in such an 
environment. On the one hand, our presentation focuses on what has been 
achieved for CLAPI and on the search tools developed for a semi-automatic 
exploration of large corpora. On the other hand, we make explicit some 
of our analytical concerns regarding the transcription of plurilingual 
data and the problems they raise for a data bank.


*14h50-15h30 Sophie Alby* (UAG & SEDYL), *Isabelle Léglise* (CNRS, 
SEDYL) & *Pascal Vaillant* (Paris 13) - *From linguistic annotation in 
multilingual corpora to the annotation of language contact phenomena*

Taking examples from our ANR CLAPOTY Project (Towards a multi-level, 
typological and computer-assisted analysis of contact-induced language 
change), we will illustrate how difficult and counter-productive it can 
be to annotate languages in multilingual corpora, and mention some 
solutions we are experimenting with for annotating and analyzing 
remarkable language contact phenomena at the morphosyntactic, discursive 
and interactional levels.


*15h35-16h20 Peter Auer* (Univ. Freiburg) *Languages leak. Some reasons 
why it is difficult to label bilingual talk in a data base*

In my presentation I will argue for a non-binary approach to 
"bi"-lingual talk, i.e. one in which there is more than language A and 
language B. This raises a number of issues for the compilation and 
labelling of large-scale, electronically searchable corpora. In 
particular, I want to show that

- the distance between the same two languages is not the same at all 
points in an emerging utterance, due to structural overlaps, and due to 
on-the-spot as well as longue durée convergences,

- the alternation between A- and B-language materials can have 
different status, depending on the amount of grammaticisation.

This suggests a cautious approach to labelling in order not to fall into 
the trap of a crypto-structuralist assumption of the alternating 
languages as self-contained systems.


*16h25 Coffee break*


*16h35-17h30 Final discussion*
