[Corpora-List] PhD studentship
Paul Bennett
paul.bennett at manchester.ac.uk
Fri May 23 10:19:14 UTC 2008
See below for details of an ESRC PhD studentship.
University of Manchester
School of Languages, Linguistics and Cultures
GerManC Plus. A representative historical corpus of German 1650-1800
ESRC PhD studentship
The Economic and Social Research Council (ESRC) is funding the above
project, led by Professor Martin Durrell, for three years from July
2008. Full details on the project may be found on the website:
http://www.llc.manchester.ac.uk/research/projects/germanc/
The aim of the project is to compile a representative historical corpus
of written German for the years 1650-1800 which will be comparable with
extant English corpora for this period, and constitute a research tool
for comparative studies of the development of the two languages. It
will build on an earlier ESRC-funded project (GerManC) completed in
spring 2007 which was restricted to newspaper texts and which will
ultimately be incorporated into GerManC Plus. GerManC Plus will consist
of a collection of sample texts of 2000 words each from the main text
types attested at this time, selected with a view to achieving
chronological and regional representativeness. The finished corpus will
consist of about 800,000 words in total and will present a picture of
the development of the language in time and space which will be as
representative as possible. The corpus will be fully annotated in
accordance with Text Encoding Initiative (TEI) standards.
Building on the achievements of the pilot project, software programs
will be developed to assist the linguistic analysis of the corpus
material. In particular, we shall be aiming first to lemmatize the
corpus fully, which involves solving problems associated with spelling
variation and inflectional forms. It is also intended to adapt software
in order to tag the corpus for parts of speech, and where appropriate
to classify words according to grammatical category. Finally, we shall
be aiming to develop programs capable of undertaking syntactic parsing
of the corpus.
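By way of illustration only, the lemmatization problem described above (spelling variation on top of inflection) might be approached by normalising historical spellings before a lexicon lookup. The sketch below is not the project's software: the two rewrite rules and the small lexicon are invented for the example, and rules this crude would over-apply to real text.

```python
import re
from typing import Optional

# Hypothetical rewrite rules mapping older spellings to modern ones.
# They are illustrative assumptions, not rules used by the project.
RULES = [
    (re.compile(r"ey"), "ei"),  # seyn -> sein, bey -> bei
    (re.compile(r"th"), "t"),   # thun -> tun, Theil -> Teil
]

# Hypothetical lexicon from normalised (lower-case) form to lemma.
LEXICON = {"sein": "sein", "bei": "bei", "tun": "tun", "teil": "Teil"}


def normalise(token: str) -> str:
    """Lower-case the token and apply each rewrite rule in order."""
    form = token.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def lemmatise(token: str) -> Optional[str]:
    """Return the lemma for a token, or None if the form is unknown."""
    return LEXICON.get(normalise(token))


lemmas = [lemmatise(t) for t in ["seyn", "Theil", "Kayser"]]
```

Forms that the rules do not cover (here "Kayser") come back as None, which is where manual checking or a richer lexicon would be needed.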
The project team will consist of Professor Martin Durrell (Principal
Investigator), Dr. Paul Bennett and two Research Associates, one
specialising in the history of the German language and one in
computational linguistics.
To run concurrently with this Project, the ESRC is offering one PhD
studentship for three years of full-time funding (fees + maintenance at
c. £12,600 per annum).
The provisional proposed title for the PhD thesis is “Historical Text
Processing for Pre-standard German: Tools for Morphosyntactic Corpus
Analysis”. It will involve, first, a comparative evaluation of a number
of part-of-speech taggers that have been developed for modern German in
terms of how well they handle the texts in our corpus. The evaluation
would include an examination of how the taggers may be adapted to
process pre-modern texts, e.g. by extending their lexica. It would also
cover taggers that operate on the basis of machine learning. The
outcome of this part of the work would be (a) an understanding of which
tagger(s) is/are best suited to tagging the texts in our corpus, (b) a
tagged version of the GerManC Plus corpus with as few errors as
possible, and (c) increased general appreciation of the procedures for
adapting/building taggers for pre-modern corpora. As the pilot GerManC
corpus is already available, the student can start working with this
from the outset without having to wait until the other part of the
corpus is complete.
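A comparative evaluation of this kind usually comes down to scoring each tagger's output against a manually corrected sample. The following is a minimal sketch of that step, assuming token-aligned output; the tagger names and the sample sentence are invented, and the tags follow the STTS tagset commonly used for German.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs


def accuracy(gold: Tagged, predicted: Tagged) -> float:
    """Proportion of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


def rank_taggers(gold: Tagged, outputs: Dict[str, Tagged]) -> List[Tuple[str, float]]:
    """Score every tagger against the gold sample, best first."""
    scores = {name: accuracy(gold, out) for name, out in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample: "Statt" is an older spelling of "Stadt" (noun), which a
# tagger built for modern German may mistake for the preposition "statt".
gold = [("Die", "ART"), ("Statt", "NN"), ("ist", "VAFIN"), ("groß", "ADJD")]
outputs = {
    "tagger_a": [("Die", "ART"), ("Statt", "APPR"), ("ist", "VAFIN"), ("groß", "ADJD")],
    "tagger_b": [("Die", "ART"), ("Statt", "NN"), ("ist", "VAFIN"), ("groß", "ADJD")],
}
ranking = rank_taggers(gold, outputs)
```

Per-token accuracy is only one possible measure; accuracy on unknown words or per-tag confusion counts would be natural additions for historical material.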
The second aspect of the student’s research would involve, if possible,
the development of tools for further syntactic analysis of our texts.
This would involve in the first instance chunking the texts, i.e.
dividing sentences into their main non-recursive phrases. If possible
it would extend to fuller parsing, i.e. assignment of a full syntactic
tree structure. A plausible methodology would again be to adapt
existing tools for chunking and/or parsing modern German and to see how
useful they would be for the pre-modern language. The outcome of this
work would be (a) a chunked or parsed version of all or part of our
corpus, and (b) appreciation of how existing tools can be adapted for
processing earlier stages of German in particular and of languages
more generally.
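To make the notion of chunking concrete, here is a deliberately simple sketch that groups tagged tokens into flat, non-recursive noun phrases. It assumes STTS part-of-speech tags as input; the rule set and the example sentence are invented for illustration and stand in for the existing chunkers the thesis would actually evaluate.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]       # (token, tag) pairs
Chunk = Tuple[str, List[str]]        # (label, tokens)

# Tags allowed inside a simple noun phrase (an illustrative assumption).
NP_TAGS = {"ART", "ADJA", "NN", "NE"}


def chunk_noun_phrases(sentence: Tagged) -> List[Chunk]:
    """Split a tagged sentence into flat NP chunks and single 'O' tokens."""
    chunks: List[Chunk] = []
    current: List[str] = []
    for token, tag in sentence:
        if tag in NP_TAGS:
            # An article always opens a new phrase.
            if tag == "ART" and current:
                chunks.append(("NP", current))
                current = []
            current.append(token)
        else:
            # Any other tag closes the phrase currently being built.
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append(("O", [token]))
    if current:
        chunks.append(("NP", current))
    return chunks


sentence = [
    ("Der", "ART"), ("alte", "ADJA"), ("Kayser", "NN"),
    ("schrieb", "VVFIN"), ("einen", "ART"), ("Brieff", "NN"),
]
chunks = chunk_noun_phrases(sentence)
```

Because the chunker works on tags rather than word forms, its quality on pre-modern texts depends directly on the quality of the tagging that precedes it.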
This thesis, the precise focus of which may be revised in the light of
the research as it progresses or the interests and qualifications of
the award holder, is effectively a discrete component of the project.
The Research Associate specialising in linguistic engineering will have
overall responsibility for developing the software necessary for the
analysis of the corpus. The PhD student will be working closely with
this Research Associate, but independently from him/her, in evaluating
the current tools in detail (effectively surveying current research
critically), and then developing one specific set of tools on his/her
own. The student will be supervised by Dr Paul Bennett and another
colleague in the School of Languages, Linguistics and Cultures.
Applications are invited from students who hold a first-class or good
upper second-class degree (or equivalent) in a relevant discipline
(including, but not restricted to, Linguistics, Informatics or Computer
Science), and who either hold or are about to complete a Master's degree
(or equivalent) in Computational Linguistics, Language Engineering,
Speech and Language Processing or a similar discipline. Successful
candidates must also possess a good knowledge of German. Non-native
speakers of English must meet the respective postgraduate IELTS/TOEFL
requirements of the School.
Due to regulations governing postgraduate funding, the applicant must
satisfy ESRC eligibility criteria, as given in the ‘Guidance Notes for
Applicants’ on the ESRC website www.esrc.ac.uk. In particular, anyone
who has not been resident in the UK for three years prior to taking up
the award, whether a UK/EU citizen or not, is eligible for an award to
cover fees, but not for the maintenance grant. All ESRC funding is
subject to satisfactory progress, reviewed annually.
The student will go through the standard research and skills training
provided by the Faculty of Humanities, and will work alongside a lively
cohort of research students in the School. Supervision will follow good
practice as established by the Faculty of Humanities, fully complying
with ESRC guidelines. The student's progress will be monitored every
six months by a PhD supervisory panel, and at the end of the first year
the student will go through a thorough probation review.
Applicants are invited to address informal inquiries to Professor
Martin Durrell (e-mail Martin.Durrell at manchester.ac.uk). To apply,
please send a full curriculum vitae (including statement of linguistic
competence) and a 500-1000 word outline of why you are interested in
the project and how you might approach it. Please include the names of
two referees.
The closing date for applications is 15 July 2008 (with a view to the
successful candidate registering in September 2008).
Applications should be made directly to:
Professor Martin Durrell
School of Languages, Linguistics and Cultures
University of Manchester, M13 9PL, United Kingdom