[Corpora-List] NSF-supported Summer Internships

Fred Jelinek via listmember Jason Eisner jason at cs.jhu.edu
Tue Feb 11 12:50:11 UTC 2003


Dear Colleague:

The Center for Language and Speech Processing at the Johns Hopkins
University is offering a unique summer internship opportunity, which we
would like you to bring to the attention of your best students in the
current junior class.  Preliminary applications for these internships
are due at the end of this week.

This internship is unique in the sense that the selected students will
participate in cutting edge research as full members alongside leading
scientists from industry, academia, and government.  What makes the
internship exciting is the exposure it gives undergraduate students to
the emerging fields of language engineering, such as automatic speech
recognition (ASR), natural language processing (NLP), and machine
translation (MT).

We are specifically looking to attract new talent into the field and, as
such, do not require the students to have prior knowledge of language
engineering technology.  Please take a few moments to nominate suitable
bright students for this internship.  On-line applications for the program
can be found at http://www.clsp.jhu.edu/ along with additional information
regarding plans for the 2003 Workshop and information on past workshops.
The application deadline is February 15, 2003.

If you have questions, please contact us by phone (410-516-4237), e-mail
(sec at clsp.jhu.edu), or via the Web at http://www.clsp.jhu.edu.


Sincerely,

Frederick Jelinek
J.S. Smith Professor and Director

---------------------------------------------------------------------------
Team Project Descriptions for this Summer
---------------------------------------------------------------------------

1. Syntax for Statistical Machine Translation

In recent evaluations of machine translation systems, statistical systems
based on probabilistic models have outperformed classical approaches based
on interpretation, transfer, and generation. Nonetheless, the output of
statistical systems often contains obvious grammatical errors. This can be
attributed to the fact that syntactic well-formedness is influenced only
by local n-gram language models and simple alignment models. We
aim to integrate syntactic structure into statistical models to address
this problem. A convenient and promising approach to this integration is
the maximum entropy framework, which allows many different knowledge
sources to be integrated into an overall model and the combination
weights to be trained discriminatively. This approach will allow us to
extend a baseline system easily by adding new feature functions.

The workshop will start with a strong baseline -- the alignment template
statistical machine translation system that obtained the best results in
the 2002 DARPA MT evaluations. During the workshop, we will incrementally add
new features representing syntactic knowledge that deal with specific
problems of the underlying baseline. We want to investigate a broad range
of possible feature functions, from very simple binary features to
sophisticated tree-to-tree translation models. Simple feature functions
might test if a certain constituent occurs in the source and the target
language parse tree. More sophisticated features will be derived from an
alignment model where whole sub-trees in source and target can be aligned
node by node. We also plan to investigate features based on projection of
parse trees from one language onto strings of another, a useful technique
when parses are available for only one of the two languages. We will
extend previous tree-based alignment models by allowing partial tree
alignments when the two syntactic structures are not isomorphic.
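A binary feature of the simple kind described above might be sketched as
follows; the nested-tuple encoding of parse trees and the example labels
are our own illustration:

```python
def labels(tree):
    # Collect constituent labels from a parse tree given as nested
    # tuples, e.g. ('S', ('NP', 'he'), ('VP', 'goes')). Leaves are strings.
    if isinstance(tree, str):
        return set()
    label, *children = tree
    out = {label}
    for child in children:
        out |= labels(child)
    return out

def h_constituent(label, src_tree, tgt_tree):
    # Binary feature: 1 if the constituent label occurs in both the
    # source and the target parse tree, else 0.
    return int(label in labels(src_tree) and label in labels(tgt_tree))
```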

We will work with the Chinese-English data from the recent evaluations,
since large amounts of sentence-aligned training corpora, as well as
multiple reference translations, are available. This will also allow us to compare
our results with the various systems participating in the evaluations. In
addition, annotation is underway on a Chinese-English parallel tree-bank.
We plan to evaluate the improvement of our system using both automatic
metrics for comparison with reference translations (BLEU and NIST) and
subjective evaluations of adequacy and fluency. We hope both to improve
machine translation performance and advance the understanding of how
linguistic representations can be integrated into statistical models of
language.
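For reference, the BLEU metric mentioned above combines clipped n-gram
precisions with a brevity penalty. A simplified sentence-level sketch
(real BLEU is computed at the corpus level):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, references, max_n=4):
    # Geometric mean of clipped n-gram precisions times a brevity
    # penalty. This toy sentence-level version returns 0 if any n-gram
    # order has no match; real BLEU smooths over a whole test corpus.
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        clipped = sum(min(cnt, max(ngrams(r, n)[ng] for r in refs))
                      for ng, cnt in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * math.exp(log_prec / max_n)
```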


---------------------------------------------------------------------------

2. Semantic Analysis Over Sparse Data

The aim of the task is to verify the feasibility of a machine
learning-based semantic approach to the data sparseness problem that is
encountered in many areas of natural language processing such as language
modeling, text classification, question answering and information
extraction.  The suggested approach takes advantage of several
technologies for supervised and unsupervised sense disambiguation that
have been developed in the last decade and of several resources that have
been made available.

The task is motivated by the fact that current language processing models
are considerably affected by sparseness of training data, and current
solutions, like class-based approaches, do not elicit appropriate
information: the semantic nature and linguistic expressiveness of
automatically derived word classes are unclear. Many of these limitations
originate from the fact that fine-grained automatic sense disambiguation
is not applicable on a large scale.

The workshop will develop a weakly supervised method for sense modeling
(i.e., reducing the set of possible word senses in a corpus according to
its genre) and apply it to a very large corpus in order to coarsely
sense-disambiguate it. This can be viewed as an incremental step towards
fine-grained sense disambiguation. The created semantic repository as well
as the developed techniques will be made available as resources for future
work on language modeling, semantic acquisition for text extraction,
question answering, summarization, and most other natural language
processing tasks.
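As a miniature illustration of genre-driven sense modeling, one can
restrict each word's sense inventory to the senses compatible with the
corpus genre. The words, sense labels, and genre tags below are invented
for illustration:

```python
# Hypothetical sense inventory: each sense is tagged with the genres in
# which it plausibly occurs. A real system would induce these genre
# associations from corpora rather than hand-code them.
SENSES = {
    "bank": [("bank/finance", {"news", "business"}),
             ("bank/river", {"travel", "fiction"})],
}

def coarse_disambiguate(word, genre):
    # Sense modeling in miniature: keep only those senses of `word`
    # that are compatible with the corpus genre.
    return [sense for sense, genres in SENSES.get(word, []) if genre in genres]
```

When the genre leaves exactly one sense, the word is coarsely
disambiguated without any token-level analysis.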


---------------------------------------------------------------------------

3. Dialectal Chinese Speech Recognition

In addition to Mandarin (Northern China), there are eight major dialectal
regions in China: Wu (Southern Jiangsu, Zhejiang, and Shanghai),
Yue (Guangdong, Hong Kong, Nanning Guangxi), Min (Fujian, Shantou
Guangdong, Haikou Hainan, Taipei Taiwan), Hakka (Meixian Guangdong,
Hsin-chu Taiwan), Xiang (Hunan), Gan (Jiangxi), Hui (Anhui), and Jin
(Shanxi). These dialects can be further divided into more than 40
sub-categories. Although the Chinese dialects share a written language and
standard Chinese (Putonghua) is widely spoken in most regions, speech is
still strongly influenced by the native dialects. This great linguistic
diversity poses problems for automatic speech and language technology.
Automatic speech recognition relies to a great extent on the consistent
pronunciation and usage of words within a language. In Chinese, word
usage, pronunciation, and grammar vary depending on the speaker's
dialect. As a result, speech recognition systems constructed to
process standard Chinese (Putonghua) perform poorly for the great majority
of the population.

The goal of our summer project is to develop a general framework to model
phonetic, lexical, and pronunciation variability in dialectal Chinese
automatic speech recognition tasks. The baseline system is a standard
Chinese recognizer. The goal of our research is to find suitable methods
that employ dialect-related knowledge and training data (in relatively
small quantities) to modify the baseline system to obtain a dialectal
Chinese recognizer for the specific dialect of interest. For practical
reasons during the summer, we will focus on one specific dialect, for
example the Wu dialect or the Chuan dialect. However the techniques we
intend to develop should be broadly applicable.

Our project will build on established ASR tools and systems developed for
standard Chinese. In particular, our previous studies in pronunciation
modeling have established baseline Mandarin ASR systems along with their
component lexicons and language model collections. However, little
previous work and few resources are available to support research on
Chinese dialect variation for ASR. Our pre-workshop effort will therefore
focus on further infrastructure development:

  * Dialectal Lexicon Construction. We will establish an electronic
  dialect dictionary for the chosen dialect. The lexicon will be
  constructed to represent both standard and dialectal pronunciations.

  * Dialectal Chinese Database Collection. We will set up a dialectal
  Chinese speech database with canonical pinyin level and dialectal
  pinyin level transcriptions. The database will contain two parts:
  read speech and spontaneous speech. For the spontaneous speech part,
  a generalized initial/final (GIF) level transcription will also be
  included.
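A dialectal lexicon of the kind described might, in miniature, map each
word to both its standard and dialectal pronunciation variants. The
entries and field names below are hypothetical placeholders, not real
transcriptions:

```python
# Minimal sketch of a pronunciation lexicon holding both standard
# (Putonghua) and dialectal pronunciations per word.
lexicon = {
    "word_1": {"standard": ["pron_std_a"],
               "dialect": ["pron_dia_a", "pron_dia_b"]},
    "word_2": {"standard": ["pron_std_b"],
               "dialect": []},
}

def pronunciations(word):
    # All variants the recognizer should allow for `word`: the union
    # of standard and dialectal forms.
    entry = lexicon.get(word, {})
    return entry.get("standard", []) + entry.get("dialect", [])
```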

Our effort at the workshop will be to employ these materials to develop
ASR system components that can be adapted from standard Chinese to the
chosen dialect. Emphasis will be placed on developing techniques that work
robustly with relatively small (or even no) dialect data. Research will
focus primarily on acoustic phenomena, rather than syntax or grammatical
variation, which we intend to pursue after establishing baseline ASR
experiments.


---------------------------------------------------------------------------

4. Confidence Estimation for Natural Language Applications

Significant progress has been made in natural language processing (NLP)
technologies in recent years, but most still do not match human
performance. Since many applications of these technologies require
human-quality results, some form of manual intervention is necessary.

The success of such applications therefore depends heavily on the extent
to which errors can be automatically detected and signaled to a human
user. In our project we will attempt to devise a generic method for NLP
error detection by studying the problem of Confidence Estimation (CE) in
NLP results within a Machine Learning (ML) framework.

Although widely used in Automatic Speech Recognition (ASR) applications,
this approach has not yet been extensively pursued in other areas of NLP.
In ASR, error recovery is entirely based on confidence measures: results
with a low level of confidence are rejected and the user is asked to
repeat his or her statement. We argue that a large number of other NLP
applications could benefit from such an approach. For instance, when
post-editing MT output, a human translator could revise only those
automatic translations that have a high probability of being wrong. Apart
from improving user interactions, CE methods could also be used to improve
the underlying technologies. For example, bootstrap learning could be
based on outputs with a high confidence level, and NLP output re-scoring
could depend on probabilities of correctness.

Our basic approach will be to use a statistical Machine Learning (ML)
framework to post-process NLP results: an additional ML layer will be
trained to discriminate between correct and incorrect NLP results and
compute a confidence measure (CM) that is an estimate of the probability
of an output being correct. We will test this approach on a statistical MT
application using a very strong baseline MT system. Specifically, we will
start off with the same training corpus (Chinese-English data from the
recent NIST evaluations) and baseline system as the Syntax for Statistical
Machine Translation team.
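The extra ML layer can be as simple as a logistic model mapping output
features to an estimated probability of correctness; the features,
weights, and threshold below are toy assumptions:

```python
import math

def confidence(features, weights, bias):
    # Logistic regression as a confidence layer: map feature values to
    # an estimated probability that the NLP output is correct.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def flag_for_review(features, weights, bias, threshold=0.5):
    # Outputs whose confidence falls below the threshold are routed
    # to a human post-editor.
    return confidence(features, weights, bias) < threshold
```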

During the workshop we will investigate a variety of confidence features
and test their effects on the discriminative power of our CM using
Receiver Operating Characteristic (ROC) curves. We will investigate
features intended to capture the amount of overlap, or consensus, among
the system's n-best translation hypotheses, features focusing on the
reliability of estimates from the training corpus, ones intended to
capture the inherent difficulty of the source sentence under translation,
and those that exploit information from the base statistical MT system.
Other themes for investigation include a comparison of different ML
frameworks such as Neural Nets or Support Vector Machines, and a
determination of the optimal granularity for confidence estimates
(sentence-level, word-level, etc.).
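An ROC curve of the sort used here can be traced by sweeping a threshold
over the confidence scores and recording true- and false-positive rates;
a minimal sketch:

```python
def roc_points(scores, labels):
    # Compute (false-positive-rate, true-positive-rate) points by
    # sweeping a threshold over the confidence scores. `labels` marks
    # which outputs were actually correct (1) or incorrect (0).
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

A confidence measure with more discriminative power pushes the curve
toward the top-left corner (high TPR at low FPR).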

Two methods will be used to evaluate final results. First, we will perform
a re-scoring experiment where the n-best translation alternatives output
by the baseline system will be re-ordered according to their confidence
estimates. The results will be measured using the standard automatic
evaluation metric BLEU, and should be directly comparable to those
obtained by the Syntax for Statistical Machine Translation team. We expect
this to lead to many insights about the differences between our approach
and theirs. Another method of evaluation will be to estimate the tradeoff
between final translation quality and amount of human effort invested, in
a simulated post-editing scenario.
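The re-scoring experiment amounts to re-ordering the n-best list by
confidence; sketched minimally:

```python
def rescore_nbest(hypotheses, confidences):
    # Re-order an n-best list of translation hypotheses by their
    # confidence estimates, highest first; the top hypothesis after
    # re-ordering becomes the system output scored with BLEU.
    ranked = sorted(zip(hypotheses, confidences),
                    key=lambda pair: pair[1], reverse=True)
    return [hyp for hyp, _ in ranked]
```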


---------------------------------------------------------------------------


