Corpora: JHU CLSP Workshop 2000 - Summer Internships
Amy Berdann
berdann at jhu.edu
Thu Jan 27 16:16:04 UTC 2000
Dear Colleague:
The Center for Language and Speech Processing at the Johns Hopkins
University is offering a unique summer internship opportunity, which
we would like you to bring to the attention of your best students in
the current junior class.
This internship is unique in the sense that the selected students will
participate in cutting edge research as full members alongside leading
scientists from industry, academia, and government. What makes the
internship exciting is that it exposes undergraduate students to
emerging fields of language engineering such as automatic speech
recognition (ASR), natural language processing (NLP), machine
translation (MT), and speech synthesis (TTS).
We are specifically looking to attract new talent into the field and,
as such, do not require the students to have prior knowledge of
language engineering technology. Please take a few moments to
nominate suitable bright students who may be interested in this
internship. On-line applications for the program can be found at
http://www.clsp.jhu.edu/workshops along with additional information
regarding plans for the 2000 Workshop and information on past
workshops. The application deadline is January 28, 2000.
If you have questions, please contact us by phone (410-516-7730),
e-mail (sec at clsp.jhu.edu), or via the Internet
(http://www.clsp.jhu.edu).
Sincerely,
Frederick Jelinek
J.S. Smith Professor and Director
Project Descriptions*
1. Reading Comprehension
Building a computer system that can acquire information by reading
texts has been a long-standing goal of computer science. Consider
designing a computer system that can take the following third grade
reading comprehension exam.
How Maple Syrup is Made
Maple syrup comes from sugar maple trees. At one time, maple syrup
was used to make sugar. This is why the tree is called a "sugar"
maple tree. Sugar maple trees make sap. Farmers collect the sap.
The best time to collect sap is in February and March. The nights
must be cold and the days warm. The farmer drills a few small holes
in each tree. He puts a spout in each hole. Then he hangs a bucket
on the end of each spout.
The bucket has a cover to keep rain and snow out. The sap drips into
the bucket. About 10 gallons of sap come from each hole.
1. Who collects maple sap? (Farmers)
2. What does the farmer hang from a spout? (A bucket)
3. When is sap collected? (February and March)
4. Where does the maple sap come from? (Sugar maple trees)
5. Why is the bucket covered? (To keep rain and snow out)
Such exams measure understanding by asking a variety of questions.
Different types of questions probe different aspects of understanding.
Existing techniques currently earn roughly a 40% grade: still a
failing mark, but an encouraging one. We will investigate methods by
which a computer can understand the text better, and we hope that by
the end of the workshop the computer will be ready to move on to the
fourth grade!
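As a rough illustration of the shallow techniques behind such grades,
here is a minimal word-overlap baseline in Python: answer a question
by returning the passage sentence that shares the most content words
with it. The stopword list and the crude plural-stripping "stemmer"
are our own simplifications for this sketch, not the systems the
workshop will study.

    import re

    STOP = {"the", "a", "an", "is", "he", "then", "what", "who",
            "when", "where", "why", "does", "from", "in", "of",
            "on", "to"}

    def content_words(text):
        # Lowercase, drop stopwords, and crudely strip a plural "s"
        # as a stand-in for real stemming.
        words = re.findall(r"[a-z]+", text.lower())
        return {w[:-1] if w.endswith("s") else w
                for w in words if w not in STOP}

    def answer(question, passage):
        # Return the passage sentence whose content words overlap
        # most with the question's content words.
        sentences = re.split(r"(?<=[.!?])\s+", passage)
        return max(sentences, key=lambda s:
                   len(content_words(s) & content_words(question)))

    passage = ("The farmer drills a few small holes in each tree. "
               "He puts a spout in each hole. "
               "Then he hangs a bucket on the end of each spout.")
    print(answer("What does the farmer hang from a spout?", passage))
    # -> "Then he hangs a bucket on the end of each spout."

Here the overlap on "hang" and "spout" singles out the correct
sentence; questions whose answers share few words with any one
sentence are exactly where such baselines fail and deeper
understanding is needed.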
2. Mandarin-English Information (MEI)
Our globally interconnected world increasingly demands technologies to
support on-demand retrieval of relevant information in any medium and
in any language. If we search the web for, say, the loss of life in
an earthquake in Turkey by entering keywords in English, the most
relevant stories are likely to be in Turkish or even Greek.
Furthermore, the latest information may be in the form of audio files
of the evening's news. One would like first to find such information
and then to translate it into English. Finding such information is
beyond
the capabilities of most commercially available search engines; good
automatic translation is even harder. In this project, we will extend
the state-of-the-art for searching audio and on-line text in one
language for a user who speaks another language.
A very large corpus of concurrent Mandarin and English textual and
spoken news stories is available for conducting such research. These
textual and spoken documents in both languages will be automatically
indexed; in the case of spoken documents, this will involve automatic
speech recognition. Given a query in either language, we will then
investigate
systems that retrieve relevant documents in both languages for the
user. Such cross-lingual and cross-media (CLCM) information retrieval
is a novel problem with many technical challenges. Several schemes
for recognizing the audio, indexing the text, and for estimating
translation models to match queries in one language with documents in
another language will be investigated in the summer. Applications of
this research include audio and video browsing, spoken document
retrieval, automated routing of information, and automatically
alerting the user when special events occur.
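As a small, self-contained sketch of the simplest such scheme,
dictionary-based query translation followed by term matching,
consider the Python fragment below. The toy lexicon, the
pre-segmented documents, and the resulting scores are invented for
illustration and are not workshop data:

    from collections import Counter

    EN_ZH = {                       # toy English-to-Mandarin lexicon
        "earthquake": ["地震"],
        "turkey": ["土耳其"],
        "casualties": ["伤亡", "死伤"],
    }

    docs = {                        # pre-segmented Mandarin "stories"
        "doc1": "土耳其 发生 地震 伤亡 惨重",
        "doc2": "股市 今日 上涨",
    }

    def translate_query(query):
        # Map each English query word to its Mandarin translations;
        # words missing from the lexicon are simply dropped.
        terms = []
        for w in query.lower().split():
            terms.extend(EN_ZH.get(w, []))
        return terms

    def rank(query):
        # Score each document by how many translated terms it contains.
        terms = translate_query(query)
        scores = Counter()
        for name, text in docs.items():
            tokens = text.split()
            scores[name] = sum(tokens.count(t) for t in terms)
        return scores.most_common()

    print(rank("earthquake casualties Turkey"))
    # -> [('doc1', 3), ('doc2', 0)]

A real system would replace the toy lexicon with translation
probabilities estimated from the parallel corpus, and the exact-match
scoring with weighted retrieval over automatically indexed text and
speech.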
3. Audio-Visual Speech Recognition
It is well known that humans have the ability to lip-read: we combine
audio and visual information in deciding what has been spoken,
especially in noisy environments. A dramatic example is the so-called
McGurk effect: when the spoken sound /ba/ is superimposed on the
video of a person uttering /ga/, most people perceive the speaker as
uttering the sound /da/.
We will strive to achieve automatic lip-reading by computers, i.e., to
make computers recognize human speech even better than is now possible
from the audio input alone, by using the video of the speaker's face.
There are many difficult research problems on the way to succeeding
in this task, e.g., tracking the speaker's head as she moves in the
video frame, identifying the type of lip movement, guessing the
spoken words independently from the video and from the audio, and
combining the information from the two signals to make a better guess
of what was spoken. In the summer, we will focus on a specific
problem: how best to combine the information from the audio and video
signals.
For example, using visual cues to decide whether a person said /ba/
rather than /ga/ can be easier than making the decision based on audio
cues, which can sometimes be confusing. On the other hand, deciding
between /ka/ and /ga/ is more reliably done from the audio than the
video. Therefore our confidence in the audio-based and video-based
hypotheses depends on the kinds of sounds being confused. We will
invent and test algorithms for combining the automatic speech
classification decisions based on the audio and visual stimuli,
resulting in audio-visual speech recognition that significantly
improves on traditional audio-only performance.
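One simple combination rule, sketched below purely for illustration
(the stream weight and the log-likelihood scores are invented, and
the workshop will explore which rules actually work best), is to
weight each stream's per-class log-likelihoods by how much that
stream is trusted for the sounds being confused:

    def fuse(audio_ll, video_ll, audio_weight):
        # Weighted combination of per-class log-likelihoods:
        # equivalent to P(a|c)^w * P(v|c)^(1-w) in probability space.
        fused = {c: audio_weight * audio_ll[c]
                    + (1.0 - audio_weight) * video_ll[c]
                 for c in audio_ll}
        return max(fused, key=fused.get)

    # /ba/ vs /ga/: the video is informative here, so trust it more
    # by giving the audio stream a low weight.
    audio_ll = {"ba": -4.1, "ga": -4.0}   # audio nearly ambiguous
    video_ll = {"ba": -1.0, "ga": -6.0}   # lips clearly closed: /ba/
    print(fuse(audio_ll, video_ll, audio_weight=0.3))   # -> ba

For a /ka/ versus /ga/ decision, the same rule would be applied with
a high audio weight; making the weight depend on the competing sound
classes is the kind of algorithm the team will invent and test.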
4. Pronunciation Modeling of Mandarin Casual Speech
When people speak casually in daily life, they are not consistent in
their pronunciation. In listening to such casual speech, it is quite
common to find many different pronunciations of individual words.
Current automatic speech recognition systems can reach word
accuracies above 90% when evaluated on carefully produced standard
speech, but in recognizing casual, unplanned speech, performance drops
to 75% or even lower. There are many reasons for this. In casual
speech, one phoneme can shift to another. In Mandarin, for example,
the initial /sh/ in "wo shi" (I am) is often pronounced weakly and
shifts into an /r/. In other cases, sounds are dropped: phonemes such
as /b/, /p/, /d/, /t/, and /k/ are often reduced and as a result are
recognized as silence. These problems are made
especially severe in Mandarin casual speech since most Chinese are
non-native Mandarin speakers. Chinese languages such as Cantonese
are as different from standard Mandarin as French is from English.
As a result, pronunciation varies even more widely under the
influence of speakers' native languages. We propose to study
and model such pronunciation differences in casual speech using
interviews found in Mandarin news broadcasts. We hope to include
experienced researchers from both China and the US in the areas of
pronunciation modeling, Mandarin speech recognition, and Chinese
phonology.
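To make the modeling task concrete, here is a minimal Python sketch
of rule-based pronunciation-variant generation; the two rules and the
phone inventory are illustrative stand-ins for what would actually be
learned from the broadcast interviews:

    SUBSTITUTIONS = {"sh": ["sh", "r"]}    # /sh/ may weaken to /r/
    DELETABLE = {"b", "p", "d", "t", "k"}  # stops may reduce to silence

    def variants(phones):
        # Expand one baseline pronunciation into the set of
        # casual-speech variants licensed by the rules above.
        results = [[]]
        for p in phones:
            options = list(SUBSTITUTIONS.get(p, [p]))
            if p in DELETABLE:
                options.append(None)       # None marks a deleted phone
            results = [r + [o] for r in results for o in options]
        return {tuple(x for x in r if x is not None) for r in results}

    # "shi" as in "wo shi" (I am); tones omitted for simplicity.
    for v in sorted(variants(["sh", "i"])):
        print(" ".join(v))
    # -> "r i" and "sh i"

A recognition lexicon augmented with such variants lets the system
consider the casual pronunciations it would otherwise misrecognize.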
* Proposed projects for WS00, Center for Language and Speech
Processing, Johns Hopkins University, Baltimore, Maryland 21218-2686.
--
Amy Berdann 410-516x4778
Center Administrator berdann at jhu.edu
320 Barton Hall http://www.clsp.jhu.edu
Center for Language and Speech Processing
Johns Hopkins University