<html>

<body>

<div align="center">This message was posted to several lists. We

apologize for any cross-postings.<br>

 <br>

</div>


<div align="center"><h1><font size=4><b>FINAL CALL FOR PAPERS - Note:

EXTENDED DEADLINE 18 February <br><br>

<br>

Workshop on</b></font></h1></div>

 <br>

<div align="center"><font size=5><b>COMPILING AND PROCESSING SPOKEN

LANGUAGE CORPORA<br>

</b></font></div>

 <br>

<div align="center"><a href="http://lands.let.kun.nl/CPSLC/" eudora="autourl">http://lands.let.kun.nl/CPSLC/<br>

</a></div>

 <br>

<div align="center"><b>Centro Cultural de Belem, Lisbon, Portugal<br>

24th May 2004<br>

</b></div>

 <br>

 <br>

<div align="center">Workshop to be held in conjunction with <br>

the 4th International Conference on Language Resources and Evaluation

(LREC 2004)<br>

Main conference: 26-28 May 2004<br>

<a href="http://www.lrec-conf.org/lrec2004/" eudora="autourl">http://www.lrec-conf.org/lrec2004/<br>

</a></div>


 <br><br>

<br>

<h2><font size=4><b>Aim</b></font></h2>The aim of the workshop is to

bring together people working on the development (compilation and

processing) of spoken language corpora.* The workshop will provide

participants with the opportunity to exchange views and share

experiences. Moreover, the workshop is instrumental in taking stock of

and evaluating the present state of the art. The workshop thus aims to

contribute to the development of a future roadmap that will guide the

development of standards, tools, etc. for use with spoken language

corpora.<br>

 <br>

*The term ‘spoken language corpora’ is used here to distinguish such

corpora from speech corpora or speech databases: speech corpora are

collections of spoken data that are typically recorded for specific

purposes by specific users (e.g. speech corpora/databases such as SpeechDat

Car, which are used for developing consumer applications). Usually such

databases lack the richness of linguistic annotations that is pursued for

spoken language corpora.<br>

 <br>

  <br><br>

<br>

<h2><font size=4><b>Background and motivation</b></font></h2>Despite the

wide experience gained in the compilation of written language corpora,

working with spoken language data is not straightforward, as

spoken language involves many novel aspects that need to be addressed.

The fact that spoken language is transient is sometimes offered as an

explanation for why it is more difficult to collect spoken data than it

is to compile a corpus of written data. However, capturing the data is

not the only step that is far from trivial. Once the (audio) data

have been collected and stored, the next step is to produce some kind of

transcript (whether orthographic or phonetic). Further annotations such

as POS tagging, lemmatisation, syntactic annotation, and prosodic

annotation may then build upon this transcription. Among the problems

encountered in the processing of spoken language data are the following:

<br>

<ul>

<li>       There is as yet little

experience with the large scale transcription of spoken language data.

Procedures and guidelines must be developed, and tools implemented. 

<li>      Well-established practices that have

originated from working on written language corpora do not hold up when

trying to cope with the idiosyncrasies of spoken language. This is

true for all levels of linguistic annotation. Annotation schemes need to

be reconsidered and tools must be adapted. 

<li>       In so far as standards have

emerged (e.g. CES), they need to be adapted in order to be able to cater

for the needs of spoken language corpora. 

<li>         By their very

nature, spoken language corpora bring together speech and language

technologists and linguists from various backgrounds. Ideally, such

corpora should address the needs of all these different user groups.

Often, however, there is a conflict of interest. For example, recordings

of spontaneous conversations in noisy environments, although highly

interesting and worthwhile from a linguistic perspective, will often be

of too poor a quality to be of any use to someone doing research into

speech recognition. 

</ul> <br><br>

<font size=4><b>Workshop topics<br><br>

</b></font>Topics of interest include orthographic transcription,

phonetic transcription, prosodic annotation, segmentation, POS tagging

and lemmatisation, parsing, and discourse analysis. Contributions on the

development and implementation of standards or guidelines for spoken

language corpora (annotation schemes, meta-data descriptions) are also

invited, as are contributions describing software for the exploitation of

spoken language corpora.<br>

 <br>

 <br><br>

<font size=4><b>Format of the Workshop</b></font> <br><br>

The workshop will consist of oral presentations of previously

submitted papers that went through a double peer review process. The

proceedings of the workshop will be published by the local organising

committee.<br><br>

<br>

 <br>

<font size=4><b>Important dates<br><br></b></font>

<dl>

<dd><b>18th February 2004</b>       

<b>Extended deadline</b> for submission of (full) papers 

<dd>1st March

2004             

Notification of acceptance and preliminary programme 

<dd>21st March

2004            

Deadline for submission of final versions of accepted papers for the

proceedings 

<dd>3rd April

2004               

 Definitive programme 

<dd>24th May

2004               

Workshop 

</dl> <br>

 <br><br>

<br>

<h3><font size=4><b>Submissions</b></font></h3>Prospective authors are

invited to submit papers for oral presentation. Only full papers in

English will be accepted, and the length of the paper should not exceed

6000 words (or the equivalent in space for diagrams). Papers

in MS Word, PostScript, PDF or RTF format should be submitted through the

workshop website:

<a href="http://lands.let.kun.nl/CPSLC/" eudora="autourl">http://lands.let.kun.nl/CPSLC/<br>

</a> <br>

 <br>

<font size=4><b>Registration<br><br>

</b></font>Workshop participants need to register through the LREC

website:

<a href="http://www.lrec-conf.org/lrec2004/" eudora="autourl">http://www.lrec-conf.org/lrec2004/<br>

</a>The fee for this half-day workshop is 50 Euro for conference

participants and 85 Euro for others; it includes a coffee break and the

workshop proceedings.<br>

 <br>

 <br>

<font size=4><b>Organising committee<br><br>

</b></font>Nelleke OOSTDIJK, University of Nijmegen<br>

Gjert KRISTOFFERSEN, University of Bergen<br>

Geoffrey SAMPSON, University of Sussex<br>

 <br>

 <br>

<font size=4><b>Programme committee<br><br>

</b></font>Daan

BROEDER                               

Max Planck Institute<br>

Emanuela

CRESTI                            

University of Florence<br>

Gjert

KRISTOFFERSEN                   

University of Bergen<br>

Tony

MCENERY                              

University of Lancaster<br>

Nelleke

OOSTDIJK                            

University of Nijmegen<br>

Pavel

IRCING                                   

University of Western Bohemia<br>

Geoffrey

SAMPSON                          

University of Sussex<br>

Antonio Moreno

SANDOVAL              

University of Madrid<br>

Jean

VERÓNIS                                

Université de Provence<br>

</body>


</html>