[Corpora-List] **EXTENDED DEADLINE** CfP Workshop on Language Technology for Patent Data LREC 2012
Olivier Hamon
hamon at elda.org
Wed Feb 22 23:15:32 UTC 2012
[Apologies for multiple postings]
*EXTENDED DEADLINE: FRI 2 MARCH 2012*
*FINAL CALL FOR PAPERS*
Workshop on Language Technology for Patent Data: Language Resources and Evaluation
To be held in conjunction with the 8th International
Language Resources and Evaluation Conference (LREC 2012)
27 May 2012 (afternoon)
Lütfi Kirdar Istanbul Exhibition and Congress Centre, Istanbul, Turkey
http://workshops.elda.org/ltpd2012/
*Workshop Description*
In the last few years, the NLP community has shown a growing interest in
the use of *patents* in automatic processing. This has been particularly
the case in the context of *Machine Translation (MT)* and *Cross-Lingual
Information Retrieval (CLIR)*. This has now become a major topic and,
besides the development of the technology itself, open questions remain
regarding the available resources and how to evaluate the quality of the
technology.
A large number of language resources are already available to the
community, but the development of systems, in particular statistical
ones, always requires more and more data. As there is a growing interest
in patents and their processing, a workshop on the topic which gathers
all those involved in the different aspects concerned is a good
opportunity to move forward.
The patent domain itself is growing, and the amount of potential material
keeps increasing. It is this potential material that gives the community
hope of improving the systems. For instance, in China, the number of
patents has tripled in 5 years and now exceeds 1 million published
documents per year. EPO (the European Patent Office) uses more than 150
translation pairs per day. Every patent office receives more and more
patents every day and needs automatic tools on a daily basis to translate
documents, look for existing patents and their translations, manage
complex content, etc. As we can see, this is a domain in considerable
demand, and since the content of patents is technical and requires
advanced skills in a specific domain, providing documents that are
sufficiently understandable to end users is very complex. This is a real
challenge for all NLP developers.
Above all, this challenge is about corpora and their management. The
main topic concerns their acquisition and how to collect useful data. For
most researchers, this consists of harvesting web pages, cleaning them,
extracting the useful content according to a specific task, aligning the
sentences, etc. The acquisition task may also be done using *OCR tools on
PDF*. Monolingual corpora are easier to retrieve (e.g. from databases)
than parallel corpora. However, parallel translations do exist, as do
aligned corpora or corpora that could be easily aligned. Following the
question of the acquisition of such documents, there is that of database
management. One could say that all these questions are not only related
to patent data; however, this workshop would like to focus on this
particular domain and make some effort to improve things.
Currently, the corpora are mainly used for MT. For a technical end-user
in a patent office, the end goal is to understand the content of a
document. This may not require a very high quality translation since this
person only needs to grasp the relevance of the document. However, in MT,
we still need to measure the performance of the systems quantitatively.
This is usually done using automatic and/or human measures, and most
system developers rely on typical automatic metrics such as BLEU to get
their results. Even if the drawbacks of such metrics are well known, they
can still be relevant, for instance, to compare different versions of a
system. However, even when using BLEU, the content of patent documents is
very particular, which implies that different kinds of linguistic
specificity need to be tackled: these include the expected terminological
level, but also a syntactic level and a semantic one, and even the
structure of the documents may differ from that of other documents (for
instance, patents typically comprise a title, an abstract, a technical
description of the invention, and a list of novel claims). Human measures
may also be difficult to apply, as patent documents are written in a way
which makes them difficult for the layman to read. Furthermore, both
automatic and human evaluations should allow a deep analysis of the
results, which is not trivial when working with patents. However, given
the often formulaic nature of the text found in patents -- which is
imposed on the author by legal constraints -- there may be opportunities
to exploit this for evaluation. For instance, claims are constructed as a
single sentence with an introductory phrase and a body linked by
frequently occurring terms such as "in a certain embodiment" or
"consisting essentially of", and by clauses and lists introduced using
colons, e.g. "comprising: ..."
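As a rough illustration of how this formulaic structure could be exploited,
the following minimal Python sketch splits a claim into its introductory
phrase, its linking term, and the elements of its body. It is only a sketch
under stated assumptions: the function name split_claim, the short list of
linking terms (including "consisting of", which is not mentioned above), the
use of semicolons as element separators, and the sample claim are all
hypothetical and chosen for illustration.

```python
import re

# Illustrative, non-exhaustive list of linking terms (an assumption).
TRANSITIONS = ["consisting essentially of", "consisting of", "comprising"]

_TRANSITION_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TRANSITIONS) + r")\b:?",
    re.IGNORECASE,
)


def split_claim(claim: str) -> dict:
    """Split a claim sentence into preamble, linking term and body elements."""
    match = _TRANSITION_RE.search(claim)
    if match is None:
        # No known linking term: treat the whole sentence as the preamble.
        return {"preamble": claim.strip(), "transition": None, "elements": []}
    preamble = claim[: match.start()].strip(" ,")
    body = claim[match.end():].strip().rstrip(".")
    # Assumption: elements of the body are separated by semicolons.
    elements = [e.strip() for e in body.split(";") if e.strip()]
    # Drop a leading "and" on an element (e.g. "...; and a lid").
    elements = [re.sub(r"^and\s+", "", e) for e in elements]
    return {
        "preamble": preamble,
        "transition": match.group(1).lower(),
        "elements": elements,
    }


# Hypothetical claim, invented for this example.
claim = "A container, comprising: a base; a wall attached to the base; and a lid."
result = split_claim(claim)
print(result)
```

Once a claim is segmented this way, an evaluation could, for example, compare
source and translated claims element by element rather than as one very long
sentence. Real claims are far more varied, so a practical tool would need a
richer inventory of linking terms and separators.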
The use of patents in CLIR suffers from the same kind of issues, either
for the evaluation of systems or for the collection of corpora. Sentence
alignment may also have specific issues related to the content of the
documents, and many other types of tools may have their own difficulties
when applied to patents. Across all these technologies, practical use
raises several challenges, such as the integration of tools into patent
information applications. The different tools should help end-users to
search, examine or classify patent documents, most of the time working
from translations of documents that are not available in English. Web
services should also extend these tools and be connected through
workflows, helping end-users in their daily work.
Across all the topics mentioned above, we would like to contribute to
improving the challenging patent field by sharing knowledge from the
whole community.
Topics addressed during the workshop include (but are not limited to):
- Corpora aspects: collecting data, cleaning, alignment, parallel
corpora, etc.;
- Evaluation of technologies: definition of metrics, patent specificity;
- Integration of patent applications: web services, end-user applications;
- IPR issues and licensing.
*Organising committee*
Heidi Depraetere (Crosslang, Belgium)
Olivier Hamon (ELDA -- Evaluations and Language Resources Distribution
Agency, France)
John Tinsley (PLUTO -- Patent Language Translations Online, Ireland)
*Programme committee*
Victoria Arranz (ELDA -- Evaluations and Language Resources Distribution
Agency, France)
Alexandru Ceausu (PLUTO -- Patent Language Translations Online, Ireland)
Khalid Choukri (ELDA, France)
Terumasa Ehara (Yamanashi Eiwa College, Japan)
Cristina España-Bonet (UPC, Spain)
Mihai Lupu (IRF and ESTeam, Austria)
Bertrand Le Chapelain (EPO, Netherlands)
Bente Maegaard (University of Copenhagen, Denmark)
Walid Magdy (Dublin City University, Ireland)
Bruno Pouliquen (World Intellectual Property Organization, Switzerland)
Lucia Specia (University of Sheffield, United Kingdom)
Gregor Thurmair (Linguatec, Germany)
Dan Wang (China Patent Information Center, China)
Shoichi Yokoyama (Yamagata University, Japan)
More TBC...
*Important dates*
Deadline for submission: Friday 2 March 2012
Notification of acceptance: Friday 23 March 2012
Final version due: Friday 30 March 2012
Workshop: 27 May 2012 (afternoon)
*Submission Format*
Full papers up to 8 pages should be formatted according to LREC 2012
guidelines and be submitted
through the online submission form
(https://www.softconf.com/lrec2012/PATENT2012/) on
START. For further queries, please contact Olivier Hamon at
hamon_at_elda_dot_org.
When submitting a paper from the START page, authors will be asked to
provide essential information about resources (in a broad sense, i.e.
also technologies, standards, evaluation kits, etc.) that have been used
for the work described in the paper or are a new result of their
research. For further information on this new initiative, please refer to
http://www.lrec-conf.org/lrec2012/?LREMap-2012.
--
---------------------------------------------------------------------------------------------------
Dr. Olivier HAMON hamon at elda.org
Project Manager - ELDA
55-57, rue Brillat Savarin Tel : +33 1 43 13 33 43
75013 Paris - France Fax : +33 1 43 13 33 30
http://www.elda.org http://www.lrec-conf.org
http://catalog.elra.info http://www.hlt-evaluation.org
---------------------------------------------------------------------------------------------------