[Corpora-List] PhD studentship

Paul Bennett paul.bennett at manchester.ac.uk
Fri May 23 10:19:14 UTC 2008


See below for details of an ESRC PhD studentship

University of Manchester
School of Languages, Linguistics and Cultures

GerManC Plus. A representative historical corpus of German 1650-1800
ESRC PhD studentship

The Economic and Social Research Council (ESRC) is funding the above 
project, led by Professor Martin Durrell, for three years from July 
2008. Full details on the project may be found on the website:
http://www.llc.manchester.ac.uk/research/projects/germanc/

The aim of the project is to compile a representative historical corpus 
of written German for the years 1650-1800 which will be comparable with 
extant English corpora for this period, and constitute a research tool 
for comparative studies of the development of the two languages. It 
will build on an earlier ESRC-funded project (GerManC) completed in 
spring 2007 which was restricted to newspaper texts and which will 
ultimately be incorporated into GerManC Plus. GerManC Plus will consist 
of a collection of sample texts of 2000 words each from the main text 
types attested at this time, selected with a view to achieving 
chronological and regional representativeness. The finished corpus will 
consist of about 800,000 words in total and will present a picture of 
the development of the language in time and space which will be as 
representative as possible. The corpus will be fully annotated in 
accordance with Text Encoding Initiative (TEI) standards.

Building on the achievements of the pilot project, software programs 
will be developed to assist the linguistic analysis of the corpus 
material. In particular, we shall be aiming first to lemmatize the 
corpus fully, which involves solving problems associated with spelling 
variation and inflectional forms. It is also intended to adapt software 
in order to tag the corpus for parts of speech, and where appropriate 
to classify words according to grammatical category. Finally we shall 
be aiming to develop programs capable of undertaking syntactic parsing 
of the corpus.
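
As a rough illustration of the lemmatization problem described above,
the following minimal sketch normalizes a few historical spellings
before a lemma lookup. The rewrite rules and the toy lexicon are
assumptions made up for this example; they are not the project's
software or data.

```python
import re

# Illustrative rewrite rules for historical spellings (assumptions,
# not the project's actual rules). Applied in order to lower-cased tokens.
RULES = [
    (re.compile(r"^v(?=[^aeiouäöü])"), "u"),  # vnd  -> und
    (re.compile(r"th"), "t"),                  # thun -> tun
    (re.compile(r"ey"), "ei"),                 # seyn -> sein
]

# Toy lexicon mapping normalized word forms to lemmas (hypothetical).
LEXICON = {
    "und": "und",
    "tun": "tun",
    "sein": "sein",
    "teile": "Teil",   # inflected form -> lemma
    "ging": "gehen",   # irregular form -> lemma
}


def normalize(token):
    """Map a historical spelling to a modern-looking form."""
    form = token.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def lemmatize(token):
    """Return the lemma for a token, or None if the form is unknown."""
    return LEXICON.get(normalize(token))
```

Calling lemmatize("Theile") first normalizes the spelling to "teile"
and then resolves the inflected form to the lemma "Teil". A realistic
system would need far more rules and a full lexicon, and rules this
blunt would also rewrite words they should leave alone.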

The project team will consist of Professor Martin Durrell (Principal 
Investigator), Dr. Paul Bennett and two Research Associates, one 
specialising in the history of the German language and one in 
computational linguistics.

To run concurrently with this Project, the ESRC is offering one PhD 
studentship for three years of full-time funding (fees + maintenance at 
c. £12,600 per annum).

The provisional proposed title for the PhD thesis is “Historical Text 
Processing for Pre-standard German: Tools for Morphosyntactic Corpus 
Analysis”. It will involve, first, a comparative evaluation of a number 
of part-of-speech taggers that have been developed for modern German in 
terms of how well they handle the texts in our corpus. The evaluation 
would include an examination of how the taggers may be adapted to 
process pre-modern texts, e.g. by extending their lexica. It would also 
cover taggers that operate on the basis of machine learning. The 
outcome of this part of the work would be (a) an understanding of which 
tagger(s) is/are best suited to tagging the texts in our corpus, (b) a 
tagged version of the GerManC Plus corpus with as few errors as 
possible, and (c) increased general appreciation of the procedures for 
adapting/building taggers for pre-modern corpora. As the pilot GerManC 
corpus is already available, the student can start working with this 
from the outset without having to wait until the other part of the 
corpus is complete.
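
A comparative evaluation of this kind typically rests on per-token
accuracy against a hand-corrected gold standard. The sketch below shows
one way to compute and rank such scores. The tagger names and the tag
labels (written in the style of the STTS tagset for German) are
assumptions chosen for illustration, not outputs of any real tagger.

```python
def accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("tag sequences must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def rank_taggers(gold, outputs):
    """Return (tagger name, accuracy) pairs, best tagger first."""
    scores = {name: accuracy(gold, tags) for name, tags in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical gold tags for a five-token sentence and two tagger outputs.
gold_tags = ["ART", "NN", "VVFIN", "APPR", "NN"]
tagger_outputs = {
    "tagger_a": ["ART", "NN", "VVFIN", "APPR", "NE"],  # one error
    "tagger_b": ["ART", "NE", "VVINF", "APPR", "NN"],  # two errors
}

ranking = rank_taggers(gold_tags, tagger_outputs)
```

Here ranking lists tagger_a (4 of 5 tags correct) ahead of tagger_b
(3 of 5). In practice one would also report accuracy separately for
words missing from a tagger's lexicon, since these are where
pre-modern spellings are likely to cause the most errors.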

The second aspect of the student’s research would involve, if possible, 
the development of tools for further syntactic analysis of our texts. 
This would involve in the first instance chunking the texts, i.e. 
dividing sentences into their main non-recursive phrases. If possible 
it would extend to fuller parsing, i.e. assignment of a full syntactic 
tree structure. A plausible methodology would again be to adapt 
existing tools for chunking and/or parsing modern German and to see how 
useful they would be for the pre-modern language. The outcome of this 
work would be (a) a chunked or parsed version of all or part of our 
corpus, and (b) appreciation of how existing tools can be adapted for 
processing earlier stages of German in particular and of languages 
more generally.
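
To make the notion of chunking concrete, here is a minimal sketch of a
rule-based chunker that groups part-of-speech-tagged tokens into flat,
non-recursive noun phrases. The single pattern (optional article, any
number of attributive adjectives, one or more nouns) and the tag names
are assumptions for illustration; this is not one of the existing tools
the project would adapt.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into flat chunks.

    A noun phrase is matched as: ART? ADJA* (NN|NE)+.
    Tokens outside any noun phrase are returned as one-word chunks
    labelled with their own tag.
    """
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "ART":           # optional article
            j += 1
        while j < n and tagged[j][1] == "ADJA":   # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in ("NN", "NE"):  # nouns
            k += 1
        if k > j:
            # At least one noun was found: emit the whole span as an NP.
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:
            # No noun follows: emit the current token on its own.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


sentence = [
    ("der", "ART"), ("alte", "ADJA"), ("König", "NN"),
    ("schrieb", "VVFIN"),
    ("einen", "ART"), ("Brief", "NN"),
]
chunks = chunk_noun_phrases(sentence)
```

For the example sentence this yields the chunks "der alte König",
"schrieb" and "einen Brief". Fuller parsing would go further and
relate these chunks to one another in a tree.
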

This thesis, the precise focus of which may be revised in the light of 
the research as it progresses or the interests and qualifications of 
the award holder, is effectively a discrete component of the project. 
The Research Associate specializing in linguistic engineering will have 
overall responsibility for developing the software necessary for the 
analysis of the corpus. The PhD student will be working closely with 
this Research Associate, but independently from him/her, in evaluating 
the current tools in detail (effectively surveying current research 
critically), and then developing one specific set of tools on his/her 
own. The student will be supervised by Dr Paul Bennett and another 
colleague in the School of Languages, Linguistics and Cultures.

Applications are invited from students who hold a first-class or good 
upper second-class degree (or equivalent) in a relevant discipline 
(including, but not restricted to, Linguistics, Informatics or Computer 
Science), and who either hold or are about to complete a Masters degree 
(or equivalent) in Computational Linguistics, Language Engineering, 
Speech and Language Processing or a similar discipline. Successful 
candidates must also possess a good knowledge of German. Non-native 
speakers of English must meet the respective postgraduate IELTS/TOEFL 
requirements of the School.

Due to regulations governing postgraduate funding, the applicant must 
satisfy ESRC eligibility criteria, as given in the ‘Guidance Notes for 
Applicants’ on the ESRC website www.esrc.ac.uk. In particular, anyone 
who has not been resident in the UK for three years prior to taking up 
the award, whether a UK/EU citizen or not, is eligible for an award to 
cover fees, but not for the maintenance grant. All ESRC funding is 
subject to satisfactory progress, reviewed annually.

The student will go through the standard research and skills training 
provided by the Faculty of Humanities, and will work alongside a lively 
cohort of research students in the School. Supervision will follow good 
practice as established by the Faculty of Humanities, fully complying 
with ESRC guidelines. The student's progress will be monitored every 
six months by a PhD supervisory panel, and at the end of the first year 
the student will go through a thorough probation review.

Applicants are invited to address informal inquiries to Professor 
Martin Durrell (e-mail Martin.Durrell at manchester.ac.uk). To apply, 
please send a full curriculum vitae (including statement of linguistic 
competence) and a 500-1000 word outline of why you are interested in 
the project and how you might approach it. Please include the names of 
two referees.

The closing date for applications is 15 July 2008 (with a view to the 
successful candidate registering in September 2008).

Applications should be made directly to:
Professor Martin Durrell
School of Languages, Linguistics and Cultures
University of Manchester, M13 9PL, United Kingdom

