[Corpora-List] PhD studentship at the Open University, UK

Mon Apr 21 15:54:54 UTC 2008

The Department of Computing at the Open University has a number of 
competitive funded studentships. The studentships are for 3 years of 
full-time doctoral research starting in October 2008. Students will be 
based in the Department of Computing, part of the Centre for Research in 
Computing in Milton Keynes, UK.

One of the possible topic areas is titled Information Extraction from 
the Taxonomic Literature, and is concerned with the automatic 
identification, curation and management of species names from legacy 
literature. The project description is attached below, and full details 
are available at:

	http://www.computing.open.ac.uk/research-degrees/studentships

Suitable candidates will have previous experience in Natural Language 
Processing or Information Extraction, and some interest in applying this 
to the domain of biological taxonomy.

The position is competitive, so please contact us at the earliest 
opportunity if you wish to apply. The closing date is 30th April.

For more information, or to discuss an application, please feel free to 
contact Dr David Morse by email or phone.

Dr David Morse,
Senior Lecturer in Computing

E-mail: d.r.morse at open.ac.uk
Tel: +44 1908 658463

--

Information extraction from the taxonomic literature

One of the major challenges that biodiversity informatics (the 
application of computer science to the management of information about 
living things) faces is the creation and maintenance of a complete list 
of the world’s known species. In recognition of the fact that no 
complete list of published names exists, there are now several 
international projects seeking to establish one (see, for example, GBIF, 
Species 2000, and the Encyclopaedia of Life).

In order to build a catalogue of taxon names, first it is necessary to 
capture them from the legacy printed literature, together with their 
taxonomic relationships, and then integrate the entire list into a 
coherent whole. This curation cannot be completed manually – the 
taxonomic literature is too large and too widely dispersed, and there 
are simply too few trained taxonomists to complete the task. Text mining – 
the process of automatically discovering high quality information from 
text – holds considerable promise as an enabling technology to automate 
aspects of the knowledge discovery task.

Text mining has been successfully applied to some areas of the 
biological sciences, particularly molecular biology. It has been used to 
extract information on genes, proteins and their interactions from 
scientific papers. The information extracted may be used by scientists 
in their research, or it may be used to maintain or update biological 
databases.

The Project

This studentship will investigate the feasibility of automating the 
capture of taxonomic names and the relationships between them from the 
scientific literature by text mining. Within this overall goal there are 
many unsolved issues and research questions. For example:

    1. Investigating the applicability of existing techniques from 
natural language processing, information extraction and information 
retrieval to automating the extraction of taxonomic names.
    2. With what level of precision and recall can taxonomic names be 
recognised? Can the relationships between the names be identified?
    3. Much of the taxonomic literature is not available in digital 
form, so source documents may have to be scanned and then converted to 
text by Optical Character Recognition (OCR). Since neither of these 
steps (scanning and OCR) yields error-free text, one question that 
could be investigated is the extent to which the text mining techniques 
developed are robust to errors in the source text.
    4. Mapping the information extracted into models of taxonomic 
nomenclature in order to facilitate incorporation of the information 
into taxonomic information systems such as the Encyclopaedia of Life.
    5. Species’ common names are widely used, highly variable and 
language-specific. Developing techniques to support the curation of 
common names would be of considerable benefit to the wider community.
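
As a rough illustration of question 2 above, the sketch below shows one 
way precision and recall might be measured for a name recogniser. The 
regular expression, the sample sentence and the gold-standard set are 
all invented for illustration and are not part of the project; a real 
recogniser would need to be far more sophisticated.

```python
import re

# Hypothetical, deliberately naive pattern (an assumption for this sketch):
# a capitalised word followed by a lowercase word of three or more letters.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def extract_names(text):
    """Return the set of candidate binomial names found in the text."""
    return set(BINOMIAL.findall(text))


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted names against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented sample text and hand-made gold standard, for illustration only.
text = ("Specimens of Quercus robur were compared with Q. petraea. "
        "The oak Quercus cerris was also recorded.")
gold = {"Quercus robur", "Q. petraea", "Quercus cerris"}

predicted = extract_names(text)
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this sample the naive pattern wrongly accepts "The oak" and misses 
the abbreviated "Q. petraea", so both measures fall below 1. Errors of 
this kind are what an evaluation against a hand-curated gold standard 
would be expected to expose.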

The precise nature of the project will be developed and refined by the 
successful candidate in collaboration with the supervisors.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora