Call: Workshop on Language Technology for Patent Data: Language Resources and Evaluation, LREC 2012 Workshops

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Sun Dec 18 13:33:36 UTC 2011

Date: Fri, 16 Dec 2011 18:25:42 +0100
From: ELRA ELDA Information <info at>
Message-ID: <4EEB7F16.9070301 at>

[Apologies for cross-postings]

Workshop on Language Technology for Patent Data: Language Resources and
Evaluation

To be held in conjunction with the 8th International Language Resources
and Evaluation Conference (LREC 2012)

27 May 2012 (afternoon)

Lütfi Kirdar Istanbul Exhibition and Congress Centre, Istanbul, Turkey

*Workshop Description*

In the last few years, the NLP community has shown a growing interest in
the automatic processing of patents. This has been particularly the case
in the context of Machine Translation (MT) and Cross-Lingual Information
Retrieval (CLIR). This has now become a major topic and, besides the
development of the technology itself, some key points remain regarding
the resources available and the way of evaluating the quality of the
technology.

A large number of language resources are already available to the
community, but the development of systems, in particular statistical
ones, always requires more and more data. As there is a growing interest
in patents and their processing, a workshop on the topic which gathers
all those involved in the different aspects concerned is a good
opportunity to move forward.
The domain of patents itself is growing and the amount of potential
material keeps increasing. It is this potential material that gives hope
to the community for improving the systems.

For instance, in China, the number of patents has tripled in 5 years and
now exceeds 1 million published documents per year. EPO (the European
Patent Office) uses more than 150 translation pairs per day. Every
patent office receives more and more patents every day, needs to use
automatic tools daily to translate the documents, looks for existing
patents and their translations, manages complex content, etc. As we can
see, this is a domain in considerable demand, and since the content of
patents is technical and requires high skills in a specific domain,
providing documents that are sufficiently understandable to end users is
very complex. This is a real challenge for all NLP developers.

Above all, this challenge is about corpora and their management. The
main topic concerns their acquisition and how to collect useful data.
For most researchers, this consists in harvesting web pages, cleaning
them, extracting the useful content according to a specific task,
aligning the sentences, etc. The acquisition task may also be done using
OCR tools on PDF files. Monolingual corpora are easier to retrieve (e.g.
from databases) than parallel corpora. However, parallel translations
exist, as do aligned corpora or corpora that could be easily aligned.
Following the question of the acquisition of such documents, there is
that of database management. One could say that all these questions are
not only related to patent data; however, this workshop would like to
focus on this particular domain and make some effort to improve things.

Currently, the corpora are mainly used for MT. For a technical end-user
in a patent office, the end goal is to understand the content of a
document. This may not require a very high quality translation since
this person only needs to grasp the relevance of the document. However,
in MT, we still need to measure the performance of the systems
quantitatively.

This is basically done using automatic and/or human measures, and most
system developers use typical automatic metrics such as BLEU to get
their results. Even if the drawbacks of such metrics are well known,
they can still be relevant, for instance, to compare different versions
of a system. However, even when using BLEU, the content of patent
documents is very particular, which implies that different kinds of
linguistic specificity need to be tackled: these include the already
expected terminological level, but also a syntactic level and a semantic
one, and even the structure of the documents may be different from that
of other documents (for instance, patents typically comprise a title, an
abstract, a technical description of the invention, and a list of novel
claims).

Human measures may also be difficult to apply as patent documents are
written in a way which makes them difficult to read for the layman.
Furthermore, both automatic and human evaluations should allow a deep
analysis of the results, which is not trivial when working with patents.
However, given the often formulaic nature of the text found in patents
-- which is enforced on the author due to legal constraints -- there may
be opportunities to exploit this for evaluation. For instance, claims
are constructed as a single sentence with an introductory phrase and a
body linked by frequently occurring terms such as "in a certain
embodiment", "consisting essentially of", and clauses and lists
introduced using colons, e.g. "comprising: ..."

The use of patents in CLIR suffers from the same kind of issues, either
for the evaluation of systems or for the collection of corpora. Sentence
alignment may also have specific issues related to the content of the
documents, and many other types of tools may have their own difficulties
when using patents.

Across all those technologies, one can see that their usage implies
several challenges, such as the integration of tools into patent
information applications. The different tools should help end-users to
search, examine or classify patent documents, most of the time working
from translations of documents that are not available in English. Web
services should also be an extension of the tools and should be
connected through workflows, helping end-users in their daily work.

Among all the topics previously mentioned, we would like to contribute
to the improvement of the challenging patent field, by sharing the
knowledge from the whole

The topics addressed during the workshop will include (but are not
limited to):
- Corpora aspects: collecting data, cleaning, alignment, parallel
  corpora, etc.;
- Evaluation of technologies: definition of metrics, patent specificity;
- Integration of patent applications: web services, end-user
- IPR issues and licensing.

*Organising committee*
Heidi Depraetere (Crosslang, Belgium)
Olivier Hamon (ELDA -- Evaluations and Language resources Distribution
Agency, France)
John Tinsley (PLUTO -- Patent Language Translations Online, Ireland)

*Programme committee*
Victoria Arranz (ELDA -- Evaluations and Language resources Distribution
Agency, France)
Alexandru Ceasusu (PLUTO -- Patent Language Translations Online, Ireland)
Khalid Choukri (ELDA, France)
Terumasa Ehara (Yamanashi Eiwa College, Japan)
Cristina España-Bonet (UPC, Spain)
Mihai Lupu (IRF and ESTeam, Austria)
Bertrand Le Chapelain (EPO, Netherlands)
Bente Maegaard (University of Copenhagen, Denmark)
Bruno Pouliquen (World Intellectual Property Organization, Switzerland)
Lucia Specia (University of Sheffield, United Kingdom)
Gregor Thurmair (Linguatec, Germany)
Dan Wang (China Patent Information Center, China)
Shoichi Yokoyama (Yamagata University, Japan)
More to follow...

*Important dates*
Deadline for submission: Friday 24 February 2012
Notification of acceptance: Friday 23 March 2012
Final version due: Friday 30 March 2012
Workshop: 27 May 2012 (afternoon)

*Submission Format*
Full papers of up to 8 pages should be formatted according to LREC 2012
guidelines and be submitted through the online submission form on START.
For further queries, please contact Olivier Hamon at
hamon_at_elda_dot_org.

When submitting a paper from the START page, authors will be asked to
provide essential information about resources (in a broad sense, i.e.
also technologies, standards, evaluation kits, etc.) that have been used
for the work described in the paper or are a new result of their
research. For further information on this new initiative, please refer
