<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
[Apologies for cross-postings]<br>
<br>
<b>CALL FOR PAPERS</b><br>
Workshop on Language Technology for Patent Data: Language Resources
and Evaluation<br>
To be held in conjunction with the 8th International Language
Resources and Evaluation Conference (LREC 2012)<br>
<br>
27 May 2012 (afternoon)<br>
<br>
Lütfi Kirdar Istanbul Exhibition and Congress Centre, Istanbul,
Turkey<br>
<br>
<a class="moz-txt-link-freetext" href="http://workshops.elda.org/ltpd2012/">http://workshops.elda.org/ltpd2012/</a><br>
<br>
<b>Workshop Description</b><br>
In recent years, interest in the automatic processing of patents has grown considerably in the NLP community, particularly in the context of Machine Translation (MT) and Cross-Lingual Information Retrieval (CLIR). It has now become a major topic and, beyond the development of the technology itself, key questions remain regarding the resources available and how the quality of the technology should be evaluated.<br>
A large number of language resources is already available to the community, but the development of systems, in particular statistical ones, requires ever more data. Given the growing interest in patents and their processing, a workshop gathering all those involved in the different aspects concerned is a good opportunity to move the field forward.<br>
The patent domain itself keeps expanding, and the amount of potential material continues to grow; it is this material that gives the community hope of improving its systems. In China, for instance, the number of patents has tripled in five years and now exceeds 1 million published documents per year. The EPO (European Patent Office) handles more than 150 translation pairs per day. Every patent office receives more and more patents each day, makes daily use of automatic tools to translate documents, searches for existing patents and their translations, manages complex content, etc. This is clearly a domain in considerable demand, and since patent content is technical and requires strong domain-specific expertise, providing documents that are sufficiently understandable to end users is very complex. This is a real challenge for all NLP developers.<br>
Above all, this challenge concerns corpora and their management. The main topic is their acquisition: how to collect useful data. For most researchers, this means harvesting web pages, cleaning them, extracting the content relevant to a specific task, aligning sentences, etc. Acquisition may also be done by applying OCR tools to PDF documents. Monolingual corpora are easier to retrieve (e.g. from databases) than parallel corpora; however, parallel translations do exist, as do aligned corpora, or corpora that could easily be aligned. Beyond the acquisition of such documents comes the question of database management. One could argue that these questions are not specific to patent data; nevertheless, this workshop intends to focus on this particular domain and make some effort to improve things.<br>
Currently, the corpora are mainly used for MT. For a technical end user in a patent office, the goal is to understand the content of a document. This may not require a very high-quality translation, since such a person only needs to grasp the relevance of the document. In MT, however, we still need to measure system performance quantitatively. This is typically done using automatic and/or human measures, with most system developers relying on standard automatic metrics such as BLEU. Even if the drawbacks of such metrics are well known, they can still be relevant, for instance, for comparing different versions of a system. However, even when using BLEU, the content of patent documents is very particular, which means that several kinds of linguistic specificity need to be tackled: the expected terminological level, but also the syntactic and semantic levels, and even the document structure, which may differ from that of other documents (for instance, patents typically comprise a title, an abstract, a technical description of the invention, and a list of novel claims). Human measures may also be difficult to apply, as patent documents are written in a way that makes them difficult for the layman to read. Furthermore, both automatic and human evaluations should allow a deep analysis of the results, which is not trivial when working with patents. However, given the often formulaic nature of the text found in patents – which is enforced on the author by legal constraints – there may be opportunities to exploit this for evaluation. For instance, claims are constructed as a single sentence, with an introductory phrase and a body linked by frequently occurring terms such as “in a certain embodiment” or “consisting essentially of”, and clauses and lists introduced using colons, e.g. “comprising: …”<br>
The use of patents in CLIR suffers from the same kinds of issues, whether for the evaluation of systems or the collection of corpora. Sentence alignment may also face specific problems related to the content of the documents, and many other types of tools raise their own questions when applied to patents. Across all these technologies, their usage implies several challenges, such as the integration of tools into patent information applications. The different tools should help end users search, examine or classify patent documents, most of the time working from translations of documents not available in English. Web services should also extend these tools and be connected through workflows, supporting end users in their daily work.<br>
Through all the topics mentioned above, we would like to contribute to progress in this challenging patent field by sharing knowledge across the whole community.<br>
<br>
The topics addressed during the workshop will include (but are not limited to):<br>
- Corpora aspects: collecting data, cleaning, alignment, parallel
corpora, etc.;<br>
- Evaluation of technologies: definition of metrics, patent
specificity;<br>
- Integration of patent applications: web services, end-user
applications;<br>
- IPR issues and licensing.<br>
<br>
<b>Organising committee</b><br>
Heidi Depraetere (Crosslang, Belgium)<br>
Olivier Hamon (ELDA – Evaluations and Language resources
Distribution Agency, France)<br>
John Tinsley (PLUTO – Patent Language Translations Online, Ireland)<br>
<br>
<b>Programme committee</b><br>
Victoria Arranz (ELDA – Evaluations and Language resources
Distribution Agency, France)<br>
Alexandru Ceausu (PLUTO – Patent Language Translations Online, Ireland)<br>
Khalid Choukri (ELDA, France)<br>
Terumasa Ehara (Yamanashi Eiwa College, Japan)<br>
Cristina España-Bonet (UPC, Spain)<br>
Mihai Lupu (IRF and ESTeam, Austria)<br>
Bertrand Le Chapelain (EPO, Netherlands)<br>
Bente Maegaard (University of Copenhagen, Denmark)<br>
Bruno Pouliquen (World Intellectual Property Organization,
Switzerland)<br>
Lucia Specia (University of Sheffield, United Kingdom)<br>
Gregor Thurmair (Linguatec, Germany)<br>
Dan Wang (China Patent Information Center, China)<br>
Shoichi Yokoyama (Yamagata University, Japan)<br>
More to follow...<br>
<br>
<b>Important dates</b><br>
Deadline for submission: Friday 24 February 2012<br>
Notification of acceptance: Friday 23 March 2012<br>
Final version due: Friday 30 March 2012<br>
Workshop: 27 May 2012 (afternoon)<br>
<br>
<b>Submission Format</b><br>
Full papers of up to 8 pages should be formatted according to the LREC 2012 guidelines and be submitted<br>
through the online submission form
(<a class="moz-txt-link-freetext" href="https://www.softconf.com/lrec2012/PATENT2012/">https://www.softconf.com/lrec2012/PATENT2012/</a>) on<br>
START. For further queries, please contact Olivier Hamon at
hamon_at_elda_dot_org.<br>
When submitting a paper from the START page, authors will be asked
to provide essential<br>
information about resources (in a broad sense, i.e. also
technologies, standards, evaluation kits, etc.)<br>
that have been used for the work described in the paper or are a new result of the research. For further information on this new initiative, please refer to
<a class="moz-txt-link-freetext" href="http://www.lrec-conf.org/lrec2012/?LREMap-2012">http://www.lrec-conf.org/lrec2012/?LREMap-2012</a>.
</body>
</html>