[Corpora-List] PhD studentship at the Open University, UK
Alistair Willis
A.G.Willis at open.ac.uk
Mon Apr 21 15:54:54 UTC 2008
The Department of Computing at the Open University has a number of
competitive funded studentships. The studentships are for 3 years of
full-time doctoral research starting in October 2008. Students will be
based in the Department of Computing, part of the Centre for Research in
Computing in Milton Keynes, UK.
One of the possible topic areas is titled Information Extraction from
the Taxonomic Literature, and is concerned with the automatic
identification, curation and management of species names from legacy
literature. The project description is attached below, and full details
are available at:
http://www.computing.open.ac.uk/research-degrees/studentships
Suitable candidates will have previous experience in Natural Language
Processing or Information Extraction, and some interest in applying this
to the domain of biological taxonomy.
The position is competitive, so please contact us at the earliest
opportunity if you wish to apply.
The closing date is 30th April 2008.
For more information, or to discuss an application, please feel free to
contact Dr David Morse by email or phone.
Dr David Morse,
Senior Lecturer in Computing
E-mail: d.r.morse at open.ac.uk
Tel: +44 1908 658463
--
Information extraction from the taxonomic literature
One of the major challenges that biodiversity informatics (the
application of computer science to the management of information about
living things) faces is the creation and maintenance of a complete list
of the world’s known species. In recognition of the fact that no
complete list of published names exists, there are now several
international projects seeking to establish one (see, for example, GBIF,
Species 2000, and the Encyclopaedia of Life).
In order to build a catalogue of taxon names, first it is necessary to
capture them from the legacy printed literature, together with their
taxonomic relationships, and then integrate the entire list into a
coherent whole. This curation cannot be completed manually – the
taxonomic literature is too large and too widely dispersed, and there are
simply too few trained taxonomists to complete the task. Text mining –
the process of automatically discovering high quality information from
text – holds considerable promise as an enabling technology to automate
aspects of the knowledge discovery task.
Text mining has been successfully applied to some areas of the
biological sciences, particularly molecular biology. It has been used to
extract information on genes, proteins and their interactions from
scientific papers. The information extracted may be used by scientists
in their research, or it may be used to maintain or update biological
databases.
The Project
This studentship will investigate the feasibility of automating the
capture of taxonomic names and the relationships between them from the
scientific literature by text mining. Within this overall goal there are
many unsolved issues and research questions. For example:
1. Investigating the applicability of existing techniques from
natural language processing, information extraction and information
retrieval to automating the extraction of taxonomic names.
2. With what level of precision and recall can taxonomic names be
recognised? Can the relationships between the names be identified?
3. Much of the taxonomic literature is not available in digital
form, so source documents may have to be scanned and then passed through
Optical Character Recognition (OCR) to produce machine-readable text.
Since the first two steps in this chain (scanning and OCR) will result
in text that is not error-free, one question that could be investigated
is the extent to which the text mining techniques developed are robust
to errors in the source text.
4. Mapping the information extracted into models of taxonomic
nomenclature in order to facilitate incorporation of the information
into taxonomic information systems such as the Encyclopaedia of Life.
5. Species’ common names are widely used, highly variable and
language-specific. Developing techniques to support the curation of
common names would be of considerable benefit to the wider community.
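To make questions 2 and 3 above concrete, here is a minimal sketch in Python of how precision and recall for name recognition might be measured, and how a simple fuzzy-matching step might recover names damaged by OCR. It is an illustration only, not part of the project plan: the gold-standard names, the extracted names, the helper functions and the 0.85 similarity cutoff are all assumptions made for this example. For simplicity the lexicon is built from the gold list itself; a real study would need an independent name list.

```python
import difflib


def precision_recall(predicted, gold):
    """Return (precision, recall) of a set of predicted names against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def normalise(name, lexicon, cutoff=0.85):
    """Map a possibly OCR-damaged name to its closest lexicon entry.

    The name is returned unchanged when no entry is similar enough.
    The cutoff is an arbitrary value chosen for illustration.
    """
    matches = difflib.get_close_matches(name, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else name


# Hypothetical gold-standard names for one scanned document.
gold = {"Homo sapiens", "Panthera leo", "Quercus robur", "Apis mellifera"}

# Hypothetical extractor output: one name is missed entirely and one
# contains an OCR error ("rohur" instead of "robur").
extracted = ["Homo sapiens", "Panthera leo", "Quercus rohur"]

raw_p, raw_r = precision_recall(extracted, gold)

# Assumption: the lexicon is taken from the gold list to keep the sketch short.
lexicon = sorted(gold)
repaired = [normalise(name, lexicon) for name in extracted]
fixed_p, fixed_r = precision_recall(repaired, gold)

print(f"raw:      precision={raw_p:.2f} recall={raw_r:.2f}")
print(f"repaired: precision={fixed_p:.2f} recall={fixed_r:.2f}")
```

In this toy case the OCR error costs one correct match, and the fuzzy-matching step recovers it. A looser cutoff would recover more damaged names but would also risk merging distinct, similarly spelled taxa, which is one of the trade-offs question 3 would have to examine.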
The precise nature of the project will be developed and refined by the
successful candidate in collaboration with the supervisors.