Corpora: job: computational linguistics/lexicography

Kees Koster kees at cs.kun.nl
Tue Aug 22 09:38:58 UTC 2000


Dear Jan,

I know that you have a good job, but perhaps you would like to return to
the Netherlands. Are you interested in the project described below, or do
you know of a suitable linguist (or a very good programmer)?

Kind regards,

  -- Kees Koster

----------------------------------------------------------------------

The PEKING project (People and Knowledge Information Gathering) is a 5th
framework project, addressing the problems of supervised and unsupervised
classification and (cross-lingual) matching of documents in organizations.

The proposal was submitted to the EC in May 2000 by the following partners:
 - META4 R&D (coordinator), Univ. of Barcelona, Univ. of Madrid Carlos III
   and CINDOC in Spain
 - Quinary and CRF-FIAT in Italy
 - Univ. of Nijmegen (KUN), Edmond bv and Fiscaal up to Date in The
   Netherlands.
It has been positively received by the Reviewers of the Commission regarding
its scientific and commercial merits, and contract negotiations are
taking place. The project will start at the end of 2000.

In the PEKING project KUN and Edmond will address the real-life situation
of one Dutch User (the FISCAAL firm) which is typical for many firms and
institutions which derive their income from providing access to a large
amount of systematically collected documents. The documents are presently
manually classified according to a hierarchical thesaurus, which is hard to
keep up to date and to modify. Furthermore, certain index terms have been
added to the documents manually, and a conventional keyword-based search
facility is available.
Since manual classification and index term assignment are
expensive, inflexible and rather subjective, there is a pressing need for
an automatic disclosure mechanism to replace or at least support the manual
classification process.

The key questions on the application side are:

 - Can an automatically learning system be made to provide a hierarchical
   classification which is good enough for the users?
 - Can the consistency and quality of automatic classification approximate
   the experience and insight of experts performing manual classification?
 - In reality, it is to be expected that an automatic system may process
   the bulk of the documents leaving only a few hard cases to the human
   experts. Can such a mixed system provide an economically attractive
   solution to the disclosure problems of firms like FISCAAL?

The technical problems to be solved are:

 - learning reliably from unreliably classified documents
 - exploiting the notion of uncertainty in improving classification results
 - deriving normalized phrasal representations from documents, and
 - using those phrase representations in conjunction with statistical
   learning methods to increase precision in learning.

The use of phrases also presents new potentials and problems in
interlinguality which have to be addressed.

KUN proposes to extend the existing LCS prototype into a system capable of
dealing with the requirements of the Dutch User FISCAAL, which should provide
ample opportunity for inventing, implementing and evaluating novel ideas in
term representations and classification strategies.

KUN is now looking for two postdocs:
 - a computer scientist with an interest in Information Retrieval and a
   solid experience in C++ programming
 - a computational linguist with an interest in Information Retrieval and
   a specialization in syntax of natural languages.
Contracts are for two years, with a possible extension.


