14.433, FYI: Summer Internships, Jamaican Creole Newsletter
LINGUIST List
linguist at linguistlist.org
Wed Feb 12 22:25:42 UTC 2003
LINGUIST List: Vol-14-433. Wed Feb 12 2003. ISSN: 1068-4875.
Subject: 14.433, FYI: Summer Internships, Jamaican Creole Newsletter
Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>
Reviews (reviews at linguistlist.org):
Simin Karimi, U. of Arizona
Terence Langendoen, U. of Arizona
Home Page: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.
Editor for this issue: James Yuells <james at linguistlist.org>
==========================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
=================================Directory=================================
1)
Date: Tue, 11 Feb 2003 06:59:14 -0500
From: Fred Jelinek <jelinek at jhu.edu>
Subject: NSF-supported Summer Internships
2)
Date: Mon, 10 Feb 2003 19:42:20 +0000
From: "Joseph T. Farquharson" <bakansa at yahoo.com>
Subject: Jamaican Creole Newsletter
-------------------------------- Message 1 -------------------------------
Date: Tue, 11 Feb 2003 06:59:14 -0500
From: Fred Jelinek <jelinek at jhu.edu>
Subject: NSF-supported Summer Internships
Dear Colleague:
The Center for Language and Speech Processing at the Johns Hopkins
University is offering a unique summer internship opportunity, which
we would like you to bring to the attention of your best students in
the current junior class. Preliminary applications for these
internships are due at the end of this week.
This internship is unique in that the selected students will
participate in cutting-edge research as full team members alongside
leading scientists from industry, academia, and government. A
particular attraction of the internship is the exposure it gives
undergraduate students to emerging fields of language engineering,
such as automatic speech recognition (ASR), natural language
processing (NLP), and machine translation (MT).
We are specifically looking to attract new talent into the field and,
as such, do not require the students to have prior knowledge of
language engineering technology. Please take a few moments to
nominate suitable bright students for this internship. On-line
applications for the program can be found at http://www.clsp.jhu.edu/
along with additional information regarding plans for the 2003
Workshop and information on past workshops. The application deadline
is February 15, 2003.
If you have questions, please contact us by phone (410-516-4237),
e-mail (sec at clsp.jhu.edu) or via the Internet http://www.clsp.jhu.edu
Sincerely,
Frederick Jelinek
J.S. Smith Professor and Director
-------------------------------------------------------------------------
Team Project Descriptions for this Summer
-------------------------------------------------------------------------
1. Syntax for Statistical Machine Translation
In recent evaluations of machine translation systems, statistical
systems based on probabilistic models have outperformed classical
approaches based on interpretation, transfer, and
generation. Nonetheless, the output of statistical systems often
contains obvious grammatical errors. This can be attributed to the
fact that syntactic well-formedness is influenced only by local
n-gram language models and simple alignment models. We aim to
integrate syntactic structure into statistical models to address this
problem. A very convenient and promising approach for this integration
is the maximum entropy framework, which allows many different
knowledge sources to be integrated into an overall model and the
combination weights to be trained discriminatively. This approach will
allow us to extend a baseline system easily by adding new feature
functions.
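As an illustration only (not part of the proposal itself), the
log-linear combination of feature functions described above might be
sketched as follows; the feature functions here are invented toy
examples, not the syntactic features planned for the workshop:

```python
# Illustrative sketch of a log-linear (maximum entropy) model that
# combines several feature functions into one translation score.
# The feature functions below are toy examples.

def features(source, candidate):
    return [
        float(len(candidate.split())),   # length of the candidate
        float(candidate[:1].isupper()),  # candidate starts capitalized
    ]

def score(source, candidate, weights):
    # Weighted sum of feature values; in the maximum entropy framework
    # the weights would be trained discriminatively.
    return sum(w * f for w, f in zip(weights, features(source, candidate)))

def best_translation(source, candidates, weights):
    # Decoding: pick the candidate with the highest combined score.
    return max(candidates, key=lambda c: score(source, c, weights))
```

Adding a new knowledge source then amounts to appending one more
feature function and one more weight, which is what makes the
framework easy to extend.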
The workshop will start with a strong baseline -- the alignment
template statistical machine translation system that obtained best
results in the 2002 DARPA MT evaluations. During the workshop, we will
incrementally add new features representing syntactic knowledge that
deal with specific problems of the underlying baseline. We want to
investigate a broad range of possible feature functions, from very
simple binary features to sophisticated tree-to-tree translation
models. Simple feature functions might test if a certain constituent
occurs in the source and the target language parse tree. More
sophisticated features will be derived from an alignment model where
whole sub-trees in source and target can be aligned node by node. We
also plan to investigate features based on projection of parse trees
from one language onto strings of another, a useful technique when
parses are available for only one of the two languages. We will extend
previous tree-based alignment models by allowing partial tree
alignments when the two syntactic structures are not isomorphic.
We will work with the Chinese-English data from the recent
evaluations, since large amounts of sentence-aligned training corpora,
as well as multiple reference translations, are available. This will
also allow us to compare our results with the various systems that
participated in the evaluations. In addition, annotation is underway on a
Chinese-English parallel tree-bank. We plan to evaluate the
improvement of our system using both automatic metrics for comparison
with reference translations (BLEU and NIST) and subjective
evaluations of adequacy and fluency. We hope both to improve machine
translation performance and advance the understanding of how
linguistic representations can be integrated into statistical models
of language.
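For readers unfamiliar with the BLEU metric mentioned above, a
minimal single-reference sketch follows; this is a simplification of
the actual metric, which uses multiple references and corpus-level
statistics:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams occurring in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # Simplified single-reference BLEU: geometric mean of clipped
    # n-gram precisions, times a brevity penalty.
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec / max_n)
```

A perfect match scores 1.0; shortened or reordered candidates are
penalized by the brevity penalty and the clipped n-gram precisions.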
-------------------------------------------------------------------------
2. Semantic Analysis Over Sparse Data
The aim of the task is to verify the feasibility of a machine
learning-based semantic approach to the data sparseness problem that
is encountered in many areas of natural language processing such as
language modeling, text classification, question answering and
information extraction. The suggested approach takes advantage of
several technologies for supervised and unsupervised sense
disambiguation that have been developed in the last decade and of
several resources that have been made available.
The task is motivated by the fact that current language processing
models are considerably affected by sparseness of training data, and
current solutions, like class-based approaches, do not elicit
appropriate information: the semantic nature and linguistic
expressiveness of automatically derived word classes are unclear. Many
of these limitations originate from the fact that fine-grained
automatic sense disambiguation is not applicable on a large scale.
The workshop will develop a weakly supervised method for sense
modeling (i.e. reduction of possible word senses in corpora according
to their genre) and apply it to a huge corpus in order to coarsely
sense-disambiguate it. This can be viewed as an incremental step
towards fine-grained sense disambiguation. The created semantic
repository as well as the developed techniques will be made available
as resources for future work on language modeling, semantic
acquisition for text extraction, question answering, summarization,
and most other natural language processing tasks.
-------------------------------------------------------------------------
3. Dialectal Chinese Speech Recognition
There are eight major dialectal regions in addition to Mandarin
(Northern China) in China, including Wu (Southern Jiangsu, Zhejiang,
and Shanghai), Yue (Guangdong, Hong Kong, Nanning Guangxi), Min
(Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan), Hakka
(Meixian Guangdong, Hsin-chu Taiwan), Xiang (Hunan), Gan (Jiangxi),
Hui (Anhui), and Jin (Shanxi). These dialects can be further divided
into more than 40 sub-categories. Although the Chinese dialects share
a written language and standard Chinese (Putonghua) is widely spoken
in most regions, speech is still strongly influenced by the native
dialects. This great linguistic diversity poses problems for automatic
speech and language technology. Automatic speech recognition relies
to a great extent on the consistent pronunciation and usage of words
within a language. In Chinese, word usage, pronunciation, and
grammar vary depending on the speaker's dialect. As a result,
speech recognition systems constructed to process standard Chinese
(Putonghua) perform poorly for the great majority of the population.
The goal of our summer project is to develop a general framework to
model phonetic, lexical, and pronunciation variability in dialectal
Chinese automatic speech recognition tasks. The baseline system is a
standard Chinese recognizer. The goal of our research is to find
suitable methods that employ dialect-related knowledge and training
data (in relatively small quantities) to modify the baseline system to
obtain a dialectal Chinese recognizer for the specific dialect of
interest. For practical reasons during the summer, we will focus on
one specific dialect, for example the Wu dialect or the Chuan
dialect. However the techniques we intend to develop should be broadly
applicable.
Our project will build on established ASR tools and systems developed
for standard Chinese. In particular, our previous studies in
pronunciation modeling have established baseline Mandarin ASR systems
along with their component lexicons and language model
collections. However, little previous work or resources are available
to support research in Chinese dialect variation for ASR. Our
pre-workshop will therefore focus on further infrastructure
development:
* Dialectal Lexicon Construction. We will establish an electronic
dialect dictionary for the chosen dialect. The lexicon will be
constructed to represent both standard and dialectal pronunciations.
* Dialectal Chinese Database Collection. We will set up a dialectal
Chinese speech database with canonical pinyin level and dialectal
pinyin level transcriptions. The database could contain two parts:
read speech and spontaneous speech. For the spontaneous speech part,
the generalized initial/final (GIF) level transcription should also be
included.
Our effort at the workshop will be to employ these materials to
develop ASR system components that can be adapted from standard
Chinese to the chosen dialect. Emphasis will be placed on developing
techniques that work robustly with relatively small (or even no)
dialect data. Research will focus primarily on acoustic phenomena,
rather than syntax or grammatical variation, which we intend to pursue
after establishing baseline ASR experiments.
-------------------------------------------------------------------------
4. Confidence Estimation for Natural Language Applications
Significant progress has been made in natural language processing
(NLP) technologies in recent years, but most still do not match human
performance. Since many applications of these technologies require
human-quality results, some form of manual intervention is necessary.
The success of such applications therefore depends heavily on the
extent to which errors can be automatically detected and signaled to a
human user. In our project we will attempt to devise a generic method
for NLP error detection by studying the problem of Confidence
Estimation (CE) in NLP results within a Machine Learning (ML)
framework.
Although widely used in Automatic Speech Recognition (ASR)
applications, this approach has not yet been extensively pursued in
other areas of NLP. In ASR, error recovery is entirely based on
confidence measures: results with a low level of confidence are
rejected and the user is asked to repeat his or her statement. We
argue that a large number of other NLP applications could benefit from
such an approach. For instance, when post-editing MT output, a human
translator could revise only those automatic translations that have a
high probability of being wrong. Apart from improving user
interactions, CE methods could also be used to improve the underlying
technologies. For example, bootstrap learning could be based on
outputs with a high confidence level, and NLP output re-scoring could
depend on probabilities of correctness.
Our basic approach will be to use a statistical Machine Learning (ML)
framework to post-process NLP results: an additional ML layer will be
trained to discriminate between correct and incorrect NLP results and
compute a confidence measure (CM) that is an estimate of the
probability of an output being correct. We will test this approach on
a statistical MT application using a very strong baseline MT
system. Specifically, we will start off with the same training corpus
(Chinese-English data from recent NIST evaluations), and baseline
system as the Syntax for Statistical Machine Translation team.
During the workshop we will investigate a variety of confidence
features and test their effects on the discriminative power of our CM
using Receiver Operating Characteristic (ROC) curves. We will
investigate features intended to capture the amount of overlap, or
consensus, among the system's n-best translation hypotheses, features
focusing on the reliability of estimates from the training corpus,
ones intended to capture the inherent difficulty of the source
sentence under translation, and those that exploit information from
the base statistical MT system. Other themes for investigation
include a comparison of different ML frameworks such as Neural Nets or
Support Vector Machines, and a determination of the optimal
granularity for confidence estimates (sentence-level, word-level,
etc).
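A ROC curve of the kind mentioned above can be traced by sweeping a
threshold over the confidence scores; a minimal sketch, assuming
binary correct/incorrect labels and at least one example of each
class:

```python
def roc_points(scores, labels):
    # Sort outputs by descending confidence, then record one
    # (false-positive rate, true-positive rate) point per output as
    # the acceptance threshold is lowered. The area under this curve
    # summarizes the discriminative power of the confidence measure.
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

A confidence measure that perfectly separates correct from incorrect
outputs passes through the point (0.0, 1.0); a useless one follows
the diagonal.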
Two methods will be used to evaluate final results. First, we will
perform a re-scoring experiment where the n-best translation
alternatives output by the baseline system will be re-ordered
according to their confidence estimates. The results will be measured
using the standard automatic evaluation metric BLEU, and should be
directly comparable to those obtained by the Syntax for Statistical
Machine Translation team. We expect this to lead to many insights
about the differences between our approach and theirs. Another method
of evaluation will be to estimate the tradeoff between final
translation quality and amount of human effort invested, in a
simulated post-editing scenario.
-------------------------------------------------------------------------
-------------------------------- Message 2 -------------------------------
Date: Mon, 10 Feb 2003 19:42:20 +0000
From: "Joseph T. Farquharson" <bakansa at yahoo.com>
Subject: Jamaican Creole Newsletter
Those linguists working on English-lexifier Creoles, and those who
have a particular interest in Jamaican Creole, should find this new
development of interest.
I, along with a group of other young scholars, have started to produce
a newsletter called Bak Ansa, which is written in Jamaican Creole
using the Cassidy/LePage orthographic system. The newsletter is an
attempt to get people to use the language to engage with topics which
were formerly the domain of the official language (English). Linguists
with an interest in language planning, language attitudes, and
literature and linguistics should find the newsletter useful.
In order to receive issues of the newsletter which will be
disseminated via email, we invite all interested persons to subscribe
to bak_ansa at yahoogroups.com.
Thank you.
Language Name:
Southwestern Caribbean Creole English
Code: JAM
---------------------------------------------------------------------------
LINGUIST List: Vol-14-433