<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
[Apologies for cross-postings]<br>
<br>
<b>CALL FOR PAPERS</b><br>
Workshop on Language Technology for Patent Data: Language Resources
and Evaluation<br>
To be held in conjunction with the 8th International Language
Resources and Evaluation Conference (LREC 2012)<br>
<br>
27 May 2012 (afternoon)<br>
<br>
Lütfi Kirdar Istanbul Exhibition and Congress Centre, Istanbul,
Turkey<br>
<br>
<a class="moz-txt-link-freetext" href="http://workshops.elda.org/ltpd2012/">http://workshops.elda.org/ltpd2012/</a><br>
<br>
<b>Workshop Description</b><br>
In recent years, interest in the automatic processing of patents has grown considerably in the NLP community, particularly in the context of Machine Translation (MT) and Cross-Lingual Information Retrieval (CLIR). It has now become a major topic and, beyond the development of the technology itself, key questions remain regarding the resources available and how the quality of the technology should be evaluated.<br>
A large number of language resources is already available to the community, but the development of systems, in particular statistical ones, requires ever more data. Given the growing interest in patents and their processing, a workshop gathering all those involved in the different aspects concerned is a good opportunity to move the field forward.<br>
The patent domain itself keeps expanding, and the amount of potential material continues to grow; it is this material that gives the community hope of improving its systems. In China, for instance, the number of patents has tripled in five years and now exceeds 1 million published documents per year. The EPO (European Patent Office) handles more than 150 translation pairs per day. Every patent office receives more and more patents each day, makes daily use of automatic tools to translate documents, searches for existing patents and their translations, manages complex content, etc. This is clearly a domain in considerable demand, and since patent content is technical and requires strong domain-specific expertise, providing documents that are sufficiently understandable to end users is very complex. This is a real challenge for all NLP developers.<br>
Above all, this challenge concerns corpora and their management. The main topic is their acquisition: how to collect useful data. For most researchers, this means harvesting web pages, cleaning them, extracting the content relevant to a specific task, aligning sentences, etc. Acquisition may also be done by applying OCR tools to PDF documents. Monolingual corpora are easier to retrieve (e.g. from databases) than parallel corpora; however, parallel translations do exist, as do aligned corpora, or corpora that could easily be aligned. Beyond the acquisition of such documents comes the question of database management. One could argue that these questions are not specific to patent data; nevertheless, this workshop intends to focus on this particular domain and make some effort to improve things.<br>
Currently, the corpora are mainly used for MT. For a technical end user in a patent office, the goal is to understand the content of a document. This may not require a very high-quality translation, since such a person only needs to grasp the relevance of the document. In MT, however, we still need to measure system performance quantitatively. This is typically done using automatic and/or human measures, with most system developers relying on standard automatic metrics such as BLEU. Even if the drawbacks of such metrics are well known, they can still be relevant, for instance, for comparing different versions of a system. However, even when using BLEU, the content of patent documents is very particular, which means that several kinds of linguistic specificity need to be tackled: the expected terminological level, but also the syntactic and semantic levels, and even the document structure, which may differ from that of other documents (for instance, patents typically comprise a title, an abstract, a technical description of the invention, and a list of novel claims). Human measures may also be difficult to apply, as patent documents are written in a way that makes them difficult for the layman to read. Furthermore, both automatic and human evaluations should allow a deep analysis of the results, which is not trivial when working with patents. However, given the often formulaic nature of the text found in patents – which is enforced on the author by legal constraints – there may be opportunities to exploit this for evaluation. For instance, claims are constructed as a single sentence, with an introductory phrase and a body linked by frequently occurring terms such as “in a certain embodiment” or “consisting essentially of”, and clauses and lists introduced using colons, e.g. “comprising: …”<br>
The use of patents in CLIR suffers from the same kinds of issues, whether for the evaluation of systems or the collection of corpora. Sentence alignment may also face specific problems related to the content of the documents, and many other types of tools raise their own questions when applied to patents. Across all these technologies, their usage implies several challenges, such as the integration of tools into patent information applications. The different tools should help end users search, examine or classify patent documents, most of the time working from translations of documents not available in English. Web services should also extend these tools and be connected through workflows, supporting end users in their daily work.<br>
Through all the topics mentioned above, we would like to contribute to progress in this challenging patent field by sharing knowledge across the whole community.<br>
<br>
The topics addressed during the workshop will include (but are not limited to):<br>
- Corpora aspects: collecting data, cleaning, alignment, parallel
corpora, etc.;<br>
- Evaluation of technologies: definition of metrics, patent
specificity;<br>
- Integration of patent applications: web services, end-user
applications;<br>
- IPR issues and licensing.<br>
<br>
<b>Organising committee</b><br>
Heidi Depraetere (Crosslang, Belgium)<br>
Olivier Hamon (ELDA – Evaluations and Language resources
Distribution Agency, France)<br>
John Tinsley (PLUTO – Patent Language Translations Online, Ireland)<br>
<br>
<b>Programme committee</b><br>
Victoria Arranz (ELDA – Evaluations and Language resources
Distribution Agency, France)<br>
Alexandru Ceausu (PLUTO – Patent Language Translations Online, Ireland)<br>
Khalid Choukri (ELDA, France)<br>
Terumasa Ehara (Yamanashi Eiwa College, Japan)<br>
Cristina España-Bonet (UPC, Spain)<br>
Mihai Lupu (IRF and ESTeam, Austria)<br>
Bertrand Le Chapelain (EPO, Netherlands)<br>
Bente Maegaard (University of Copenhagen, Denmark)<br>
Bruno Pouliquen (World Intellectual Property Organization,
Switzerland)<br>
Lucia Specia (University of Sheffield, United Kingdom)<br>
Gregor Thurmair (Linguatec, Germany)<br>
Dan Wang (China Patent Information Center, China)<br>
Shoichi Yokoyama (Yamagata University, Japan)<br>
More to follow...<br>
<br>
<b>Important dates</b><br>
Deadline for submission: Friday 24 February 2012<br>
Notification of acceptance: Friday 23 March 2012<br>
Final version due: Friday 30 March 2012<br>
Workshop: 27 May 2012 (afternoon)<br>
<br>
<b>Submission Format</b><br>
Full papers of up to 8 pages should be formatted according to the LREC 2012 guidelines and be submitted<br>
through the online submission form
(<a class="moz-txt-link-freetext" href="https://www.softconf.com/lrec2012/PATENT2012/">https://www.softconf.com/lrec2012/PATENT2012/</a>) on<br>
START. For further queries, please contact Olivier Hamon at
hamon_at_elda_dot_org.<br>
When submitting a paper from the START page, authors will be asked
to provide essential<br>
information about resources (in a broad sense, i.e. also
technologies, standards, evaluation kits, etc.)<br>
that have been used for the work described in the paper or are a new result of the research. For further information on this new initiative, please refer to
<a class="moz-txt-link-freetext" href="http://www.lrec-conf.org/lrec2012/?LREMap-2012">http://www.lrec-conf.org/lrec2012/?LREMap-2012</a>.
</body>
</html>