14.3351, Diss: Translation: Tiedemann: 'Recycling...'

LINGUIST List linguist at linguistlist.org
Fri Dec 5 18:22:33 UTC 2003


LINGUIST List:  Vol-14-3351. Fri Dec 5 2003. ISSN: 1068-4875.

Subject: 14.3351, Diss: Translation: Tiedemann: 'Recycling...'

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Takako Matsui <tako at linguistlist.org>
==========================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
=================================Directory=================================

1)
Date:  Fri, 5 Dec 2003 08:10:41 -0500 (EST)
From:  joerg at stp.ling.uu.se
Subject:  Recycling Translations

-------------------------------- Message 1 -------------------------------

Date:  Fri, 5 Dec 2003 08:10:41 -0500 (EST)
From:  joerg at stp.ling.uu.se
Subject:  Recycling Translations

Institution: Uppsala University
Program: Department of Linguistics
Dissertation Status: Completed
Degree Date: 2003

Author: Jörg Tiedemann

Dissertation Title: Recycling Translations - Extraction of Lexical
Data from Parallel Corpora and their Application in Natural Language
Processing

Dissertation URL: http://stp.ling.uu.se/~joerg/phd/

Linguistic Field: Translation, Text/Corpus Linguistics,
Applied Linguistics

Dissertation Director 1: Anna Sågvall Hein

Dissertation Abstract:

The focus of this thesis is on re-using translations in natural
language processing. It involves the collection of documents and
their translations in an appropriate format, the automatic
extraction of translation data, and the application of the
extracted data to different tasks in natural language processing.

Five parallel corpora containing more than 35 million words in 60
languages have been collected within co-operative projects. All
corpora are sentence aligned and parts of them have been analyzed
automatically and annotated with linguistic markup.

Lexical data are extracted from the corpora by means of word
alignment. Two automatic word alignment systems have been developed,
the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an
iterative "knowledge-poor" word alignment approach using association
measures and alignment heuristics. The Clue Aligner provides an
innovative framework for the combination of statistical and linguistic
resources in aligning single words and multi-word units. Both aligners
have been applied to several corpora. Detailed evaluations of the
alignment results have been carried out for three of them using
fine-grained evaluation techniques.

A corpus processing toolbox, Uplug, has been developed. It includes
the implementation of UWA and is freely available for research
purposes. A new version, Uplug II, includes the Clue Aligner. It can
be used via an experimental web interface (UplugWeb).

Lexical data extracted by the word aligners have been applied to
different tasks in computational lexicography and machine
translation. The use of word alignment in monolingual lexicography has
been investigated in two studies. In a third study, the feasibility of
using the extracted data in interactive machine translation has
been demonstrated. Finally, extracted lexical data have been used for
enhancing the lexical components of two machine translation systems.

---------------------------------------------------------------------------
LINGUIST List: Vol-14-3351


