<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>

<META name=GENERATOR content="MSHTML 8.00.6001.19170">

<STYLE></STYLE>

</HEAD>

<BODY bgColor=#ffffff>

<DIV><FONT size=2 face=Arial>Apologies for multiple postings<BR>Please 

distribute to colleagues</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 

face=Arial>============================================================</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  5th WORKSHOP ON BUILDING AND USING 

COMPARABLE CORPORA</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  Language Resources for Machine 

Translation<BR>  in Less-Resourced Languages and Domains</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  Co-located with LREC 2012<BR>  Lütfi 

Kirdar Istanbul Exhibition and Congress Centre<BR>  Saturday, 26 May 

2012</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  DEADLINE FOR PAPERS: 15 February 

2012</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  <A 

href="http://hnk.ffzg.hr/5bucc2012">http://hnk.ffzg.hr/5bucc2012</A></FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  Endorsed by<BR>   * ACL SIGWAC 

(Special Interest Group on Web as Corpus)<BR>   * FLaReNet (Fostering 

Language Resources Network)<BR>   * META-NET (Multilingual Europe 

Technology Alliance)</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>  INVITED SESSION ON PROJECTS INVOLVING 

COMPARABLE CORPORA:</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>   * ACCURAT - Inguna Skadina (<A 

href="http://www.accurat-project.eu/">http://www.accurat-project.eu/</A>)<BR>   

* LetsMT! - Andrejs Vasiljevs (<A 

href="https://www.letsmt.eu/">https://www.letsmt.eu/</A>)<BR>   * 

PANACEA - Núria Bel (<A 

href="http://panacea-lr.eu/">http://panacea-lr.eu/</A>)<BR>   * 

PRESEMT - Adam Kilgarriff (<A 

href="http://www.presemt.eu/">http://www.presemt.eu/</A>)<BR>   * TTC 

- Béatrice Daille (<A 

href="http://www.ttc-project.eu/">http://www.ttc-project.eu/</A>)</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 

face=Arial>============================================================</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>MOTIVATION</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>In the language engineering and the linguistics 

communities,<BR>research in comparable corpora has been motivated by two 

main<BR>reasons. In language engineering, it is chiefly motivated by the<BR>need 

to use comparable corpora as training data for statistical<BR>NLP applications 

such as statistical machine translation or<BR>cross-lingual retrieval. In 

linguistics, on the other hand,<BR>comparable corpora are of interest in 

themselves because they make<BR>inter-linguistic discoveries and comparisons possible. 

It is<BR>generally accepted in both communities that comparable corpora<BR>are 

documents in one or several languages that are comparable in<BR>content and form 

to various degrees and along various dimensions. We believe<BR>that the linguistic definitions 

and observations related to<BR>comparable corpora can improve methods to mine 

such corpora for<BR>applications of statistical NLP. As such, it is of great 

interest<BR>to bring together builders and users of such corpora.</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>The scarcity of parallel corpora has motivated 

research concerning<BR>the use of comparable corpora: pairs of monolingual 

corpora selected<BR>according to the same set of criteria, but in different 

languages<BR>or language varieties. Non-parallel yet comparable corpora 

overcome<BR>the two limitations of parallel corpora (coverage of only a few language<BR>pairs and a few domains), since sources for 

original,<BR>monolingual texts are much more abundant than translated 

texts.<BR>However, because of their nature, mining translations in 

comparable<BR>corpora is much more challenging than in parallel corpora. 

What<BR>constitutes a good comparable corpus, for a given task or per 

se,<BR>also requires specific attention: while the definition of a 

parallel<BR>corpus is fairly straightforward, building a non-parallel 

corpus<BR>requires control over the selection of source texts in both 

languages.</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>Parallel corpora are a key resource as training 

data for statistical<BR>machine translation, and for building or extending 

bilingual lexicons<BR>and terminologies. However, beyond a few language pairs 

such as English-<BR>French or English-Chinese and a few contexts such as 

parliamentary debates<BR>or legal texts, they remain a scarce resource, despite 

the creation of<BR>automated methods to collect parallel corpora from the Web. 

To exemplify<BR>such issues in a practical setting, this year's special focus 

will be on</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>   Language Resources for Machine 

Translation<BR>   in Less-Resourced Languages and Domains</FONT></DIV>

<DIV> </DIV>

<DIV><FONT size=2 face=Arial>with the aim of overcoming the shortage of parallel 

resources<BR>when building MT systems for less-resourced languages and 

domains,<BR>particularly by using comparable corpora to find parallel 

data<BR>within them and by reaching out for "hidden" parallel data. The lack of 

sufficient<BR>language resources for many language pairs and domains is 

currently one<BR>of the major obstacles to the further advancement of machine 

translation.</FONT></DIV>

<DIV> </DIV><FONT size=2 face=Arial>

<DIV><BR>TOPICS</DIV>

<DIV> </DIV>

<DIV>We solicit contributions including but not limited to the following 

topics:</DIV>

<DIV> </DIV>

<DIV>Topics related to the special theme:</DIV>

<DIV> </DIV>

<DIV>* comparable corpora use in MT<BR>* comparable corpora processing 

tools/kits for MT<BR>* parallel corpora usage<BR>* parallel corpora processing 

tools/platforms<BR>* MT for less-resourced languages<BR>* MT for less-resourced 

domains<BR>* open source SMT systems (Moses, etc.)<BR>* publicly available 

SMT</DIV>

<DIV> </DIV>

<DIV>Building Comparable Corpora:</DIV>

<DIV> </DIV>

<DIV> * Human translations<BR> * Automatic and semi-automatic 

methods<BR> * Methods to mine parallel and non-parallel corpora from the 

Web<BR> * Tools and criteria to evaluate the comparability of 

corpora<BR> * Parallel vs non-parallel corpora, monolingual 

corpora<BR> * Rare and minority languages<BR> * Across language 

families<BR> * Multi-media/multi-modal comparable corpora</DIV>

<DIV> </DIV>

<DIV>Applications of comparable corpora:</DIV>

<DIV> </DIV>

<DIV> * Human translations<BR> * Language learning<BR> * 

Cross-language information retrieval & document categorization<BR> * 

Bilingual projections<BR> * Machine translation<BR> * Writing 

assistance</DIV>

<DIV> </DIV>

<DIV>Mining from Comparable Corpora:</DIV>

<DIV> </DIV>

<DIV> * Extraction of parallel segments or paraphrases from 

comparable<BR>   corpora<BR> * Extraction of bilingual and 

multilingual translations of single<BR>   words and multi-word 

expressions, proper names, named entities,<BR>   etc.</DIV>

<DIV> </DIV>

<DIV><BR>IMPORTANT DATES (TENTATIVE)</DIV>

<DIV> </DIV>

<DIV>  15 February 2012: Deadline for submission of full papers<BR>  10 March 2012: Notification of acceptance<BR>  20 March 2012: Camera-ready papers due<BR>  26 May 2012: Workshop date</DIV>

<DIV> </DIV>

<DIV><BR>SUBMISSION INFORMATION</DIV>

<DIV> </DIV>

<DIV>Papers should follow the LREC main conference formatting details (to 

be<BR>announced on the conference website <A 

href="http://www.lrec-conf.org/lrec2012/">http://www.lrec-conf.org/lrec2012/</A>)<BR>and 

should be submitted as a PDF file of no more than ten pages via the<BR>START 

workshop manager: <A 

href="https://www.softconf.com/lrec2012/BUCC2012/">https://www.softconf.com/lrec2012/BUCC2012/</A><BR>Reviewing 

will be double-blind, so papers should not reveal the<BR>authors' identity. 

Accepted papers will be published in the workshop<BR>proceedings.</DIV>

<DIV> </DIV>

<DIV>Double submission policy: Parallel submissions to other meetings 

or<BR>publications are possible, but the workshop organizers must be<BR>notified immediately.</DIV>

<DIV> </DIV>

<DIV>When submitting a paper through the START page, authors will be asked<BR>to 

provide information about the resources that have been used for the 

work<BR>described in their paper or that are an outcome of their research. For 

details on<BR>this initiative, please refer to <A 

href="http://www.lrec-conf.org/lrec2012/?LRE-Map-2012">http://www.lrec-conf.org/lrec2012/?LRE-Map-2012</A>.<BR>Authors 

will also be asked to contribute to the Language Library, a new<BR>initiative 

of LREC 2012.</DIV>

<DIV> </DIV>

<DIV>For further information, please contact<BR>   Reinhard Rapp 

reinhardrapp (at) gmx (dot) de<BR>   or Marko Tadic marko.tadic (at) 

ffzg (dot) hr</DIV>

<DIV> </DIV>

<DIV><BR>ORGANISERS</DIV>

<DIV> </DIV>

<DIV>  Reinhard Rapp, Universities of Mainz (Germany) and Leeds 

(UK)<BR>  Marko Tadic,  University of Zagreb (Croatia)<BR>  Serge 

Sharoff, University of Leeds (UK)<BR>  Andrejs Vasiljevs, Tilde SIA, Riga 

(Latvia)<BR>  Pierre Zweigenbaum, LIMSI, CNRS, Orsay, and ERTIM, INALCO, 

Paris (France)</DIV>

<DIV> </DIV>

<DIV><BR>SCIENTIFIC COMMITTEE</DIV>

<DIV> </DIV>

<DIV>* Srinivas Bangalore (AT&T Labs, USA)<BR>* Caroline Barrière (National 

Research Council Canada)<BR>* Chris Biemann (Microsoft / Powerset, San 

Francisco, USA)<BR>* Lynne Bowker (University of Ottawa, Canada)<BR>* Hervé 

Déjean (Xerox Research Centre Europe, Grenoble, France)<BR>* Andreas Eisele 

(DFKI, Saarbrücken, Germany)<BR>* Rob Gaizauskas (University of Sheffield, 

UK)<BR>* Éric Gaussier (Université Joseph Fourier, Grenoble, France)<BR>* Nikos 

Glaros (ILSP, Athens, Greece)<BR>* Gregory Grefenstette (Exalead/Dassault 

Systemes, Paris, France)<BR>* Silvia Hansen-Schirra (University of Mainz, 

Germany)<BR>* Kyo Kageura (University of Tokyo, Japan)<BR>* Adam Kilgarriff 

(Lexical Computing Ltd, UK)<BR>* Natalie Kübler (Université Paris Diderot, 

France)<BR>* Philippe Langlais (Université de Montréal, Canada)<BR>* Tony 

McEnery (Lancaster University, UK)<BR>* Emmanuel Morin (Université de Nantes, 

France)<BR>* Dragos Stefan Munteanu (Language Weaver Inc., USA)<BR>* Lene 

Offersgaard (University of Copenhagen, Denmark)<BR>* Reinhard Rapp (Universities 

of Mainz, Germany, and Leeds, UK)<BR>* Sujith Ravi (Yahoo! Research, Santa 

Clara, CA, USA)<BR>* Serge Sharoff (University of Leeds, UK)<BR>* Michel Simard 

(National Research Council Canada)<BR>* Inguna Skadina (Tilde, Riga, 

Latvia)<BR>* Monique Slodzian (INALCO, Paris, France)<BR>* Benjamin Tsou (The 

Hong Kong Institute of Education, China)<BR>* Dan Tufis (Romanian Academy, 

Bucharest, Romania)<BR>* Justin Washtell (University of Leeds, UK)<BR>* Oliver 

Wilson (University of Edinburgh, UK)<BR>* Michael Zock (LIF, CNRS Marseille, 

France)<BR>* Pierre Zweigenbaum (LIMSI-CNRS, Orsay, 

France)<BR></DIV></FONT></BODY></HTML>