Workshop - annotation of multilingual corpora - 3 October 2011
Isabelle LEGLISE
leglise at VJF.CNRS.FR
Tue Sep 13 15:24:36 UTC 2011
Hello,
Please find below and attached the programme of the International
Workshop on the annotation of multilingual corpora, to be held on Monday
3 October as part of the cross-disciplinary working group Corpus O'
(TUL-ILF).
Best regards,
Isabelle Léglise
*International workshop on Multilingual Corpora annotation*
*Monday 3 October - conference room - Building D - CNRS campus,
Villejuif.*
Multilingual corpora represent an interesting concentrated mixture of
most of the problems raised by monolingual corpora and some extra
challenges. Central issues are related to problems of variation and
non-standard forms, often ignored by big national corpora or controlled
by rather general criteria (like quite general typologies of texts and
discourses defining the kind of data collected). These variations
transcend internal variation observable within a single language; they
often even question the categorization of linguistic forms as belonging
to a given language or a given variety vs. another. Corpora containing
code-switching, code-mixing, hybridization phenomena, or heterogeneous uses
of a lingua franca variably mobilized within diverse social practices
and linguistic competences raise a range of methodological and
theoretical questions - such as problems of identification, notation,
transcription, and categorization of hybrid forms. These problems have
crucial consequences for the annotation of corpora, for the definition
and delimitation of what a multilingual corpus is, for the choice of
relevant contexts of practice to be documented, etc.
The workshop aims to debate these problems on the basis of:
a) existing databases of multilingual corpora, for which examples of
problems and solutions will be given;
b) excerpts of multilingual corpora on which problems of transcription,
annotation, and exploitation will be illustrated and discussed.
*9h30 Coffee & Sweets - Welcome & Introduction - Isabelle Léglise &
Lorenza Mondada*
*10h-10h45 Yaron Matras* (Univ. Manchester) *Documenting languages and
dialects in contact*
The presentation will outline several database resources developed in
recent years at the University of Manchester for the documentation of
languages in contact, societal multilingualism, and the dialects of
Romani. The latter online resource, the Romani Morpho-Syntax (RMS)
Database, is in many ways the most advanced technological tool for the
online documentation of related varieties, and special emphasis will be
put on the history of its emergence and its online functionality.
*10h50-11h35 Naomi Nagy* (Univ. Toronto) *The Toronto Heritage Language
Documentation Corpus (HerLD)*
Since 2009, the Heritage Language Variation and Change in Toronto
Project has been building a corpus of conversational speech in a range
of Heritage Languages in Toronto. Our aim is to bring together elements
of code-switching theory, which looks at when each language is selected,
but not at which forms of the language are selected, with the
variationist approach, which quantifies the effects of various
contextual forces on the selection of forms within one language. I'll
describe how we indicate the use of multiple languages within one
conversation and our efforts to maintain consistency across protocols from
six different languages/communities, developed by teams of students from
each community.
*11h40-12h25 Thomas Schmidt* (Univ. Hamburg) *Multilingual corpora -
technical aspects and a wishlist*
In its twelve years of existence, the Research Centre on Multilingualism
at the University of Hamburg has built up a large database of spoken and
written corpora. These corpora have been used (and will be further used)
to study various aspects of multilingualism in individuals and in
society. In my talk, I will give an overview of the corpora and address
a couple of technical questions regarding their computer-assisted
creation, analysis and dissemination. I will also discuss some of the
lessons learned in the attempt to make these resources sustainable
beyond the lifetime of the projects in which they were created. I will
conclude with a set of desiderata for future developments in the field
of multilingual corpora.
*12h30-14h Lunch - Buffet* (room 511)
*14h-14h45 Carole Etienne* (CNRS, ICAR), *Lorenza Mondada* (Univ. Lyon
2, ICAR), *Véronique Traverso* (CNRS, ICAR)
On the basis of the corpus data bank and workbench CLAPI (Corpus de
LAngue Parlée en Interaction), which has been developed at the ICAR
research lab for the last decade, we will discuss some problems related
to the treatment of heterogeneous, hybrid, plurilingual data in such an
environment. Our presentation focuses, on the one hand, on what has been
achieved for CLAPI and the search tools developed for the semi-automatic
exploration of large corpora. On the other hand, we make explicit some
of our analytical concerns regarding the transcription of plurilingual
data and the problems it raises for a data bank.
*14h50-15h30 Sophie Alby* (UAG & SEDYL), *Isabelle Léglise* (CNRS,
SEDYL) & *Pascal Vaillant* (Paris 13) - *From linguistic annotation in
multilingual corpora to the annotation of language contact phenomena*
Taking examples from our ANR CLAPOTY project (Towards a multi-level,
typological and computer-assisted analysis of contact-induced language
change), we will illustrate how difficult and counter-productive it can
be to annotate languages in multilingual corpora, and mention some
solutions we are experimenting with: annotating and analyzing remarkable
language contact phenomena at the morphosyntactic, discursive and
interactional levels.
*15h35-16h20 Peter Auer* (Univ. Freiburg) *Languages leak. Some reasons
why it is difficult to label bilingual talk in a data base*
In my presentation I will argue for a non-binary approach to
"bi"-lingual talk, i.e. one in which there is more than language A and
language B. This raises a number of issues for the compilation and
labelling of large-scale, electronically searchable corpora. In
particular, I want to show that
- the distance between the same two languages is not the same at all
points in an emerging utterance, due to structural overlaps, and due to
on-the-spot as well as longue durée convergences,
- the alternation between A- and B-language materials can have
different status, depending on the amount of grammaticisation.
This suggests a cautious approach to labelling in order not to fall into
the trap of a crypto-structuralist assumption of the alternating
languages as self-contained systems.
*16h25 Coffee break*
*16h35-17h30 Final discussion*