<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<div align="center">LDC2009V01<b><br>
- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01">Audiovisual
Database of Spoken American English</a> -<br>
</b></div>
<p align="center">LDC2009T03<br>
- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03"><b>GALE
Phase 1 Arabic Newsgroup Parallel Text -
Part 1</b></a> -<br>
</p>
<div align="center"><b>- <a
href="http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1">LDC's
Corpus Catalog Receives Top OLAC
Rating</a></b>
-<br>
</div>
<p align="center">- <a
href="http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#2"><b>2009
Publications Pipeline</b></a> -</p>
<hr size="2" width="100%">
<p style="text-align: center;" align="center"><b>New Publications</b></p>
<p>(1) The <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01">Audiovisual
Database of Spoken American English</a> was developed at Butler
University, Indianapolis, IN in 2007 for use by a variety of researchers
to evaluate speech production and speech recognition. It contains
approximately seven hours of audiovisual recordings of fourteen American
English speakers producing syllables, word lists and sentences used in
both academic and clinical settings.</p>
<p>All talkers were from the North Midland dialect region -- roughly
defined as Indianapolis and north within the state of Indiana -- and had
lived in that region for the majority of the time from birth to 18 years
of age. Each participant read 238 different words and 166 different
sentences. The sentences spoken were drawn from the following sources:</p>
<ul type="disc">
<li>Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)</li>
<li>Northwestern University Auditory Test No. 6 (Lists I-IV)</li>
<li>Vowels in /hVd/ context (separate words)</li>
<li>Texas Instruments/Massachusetts Institute of Technology <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">(TIMIT)</a>
sentences</li>
</ul>
<p>The Audiovisual Database of Spoken American English will be of
interest in various disciplines: to linguists for studies of phonetics,
phonology, and prosody of American English; to speech scientists for
investigations of motor speech production and auditory-visual speech
perception; to engineers and computer scientists for investigations of
machine audio-visual speech recognition (AVSR); and to speech and hearing
scientists for clinical purposes, such as the examination and improvement
of speech perception by listeners with hearing loss.</p>
<p>Participants were recorded individually during a single session with a
Panasonic DVC-80 digital video camera to miniDV digital video cassette
tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid
lapel microphone throughout the recordings. Each speaker produced a total
of 94 segmented files which were converted from Final Cut Express to
QuickTime (.mov) files.</p>
<p style="text-align: center;" align="center"><b>*</b></p>
<p>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03">GALE
Phase 1 Arabic Newsgroup Parallel Text - Part 1</a> was prepared by LDC
and contains a total of 178,000 words (264 files) of Arabic newsgroup
text and its English translation selected from thirty-five sources.
Newsgroups consist of posts to electronic bulletin boards, Usenet
newsgroups, discussion groups and similar forums. This release was used
as training data in Phase 1 (year 1) of the DARPA-funded GALE program.
Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting and data selection.</p>
<p>Data scouting involved manually searching the web for suitable
newsgroup text. Data scouts were assigned particular topics and genres
along with a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.</p>
<p>Data scouts logged their decisions about potential text of interest to
a database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so that
the data could be more easily integrated into downstream annotation
processes. Typically, a new script was required for each new domain name
that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems.</p>
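<p>As a rough illustration of what such a nightly harvesting pass could
look like (the database schema, table name and column names below are
hypothetical, not those of LDC's actual annotation toolkit):</p>
<pre>
# Hypothetical sketch of a nightly harvest: read URLs that data scouts
# flagged in an annotation database and download each one. The table
# "scouted_urls" and its columns are illustrative assumptions.
import sqlite3
import urllib.request

def nightly_harvest(db_path, out_dir):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, url FROM scouted_urls WHERE status = 'designated'"
    ).fetchall()
    for row_id, url in rows:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open("%s/%d.html" % (out_dir, row_id), "wb") as f:
                f.write(data)
            new_status = "harvested"
        except OSError:
            new_status = "failed"
        conn.execute("UPDATE scouted_urls SET status = ? WHERE id = ?",
                     (new_status, row_id))
    conn.commit()
</pre>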
<p>The selected documents were then reviewed for content-suitability
using a semi-automatic process. A statistical approach was used to rank a
document's relevance to a set of already-selected documents labeled as
"good." An annotator then reviewed the list of relevance-ranked documents
and selected those which were suitable for a particular annotation task
or for annotation in general. These newly-judged documents in turn
provided additional input for the generation of new ranked lists.</p>
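<p>The announcement does not name the statistical model used, but a
minimal sketch of one common approach -- ranking candidates by cosine
similarity between simple term-frequency vectors and the centroid of the
"good" set -- could look like this:</p>
<pre>
# Minimal sketch of relevance ranking against a seed set labeled "good".
# Term-frequency vectors and cosine similarity are assumptions; the
# actual model used for GALE data selection is not specified here.
import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a).intersection(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_candidates(good_docs, candidates):
    centroid = Counter()                  # centroid of the "good" set
    for doc in good_docs:
        centroid.update(tf_vector(doc))
    scored = [(cosine(tf_vector(d), centroid), d) for d in candidates]
    return sorted(scored, reverse=True)   # best-scoring documents first
</pre>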
<p>Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best practices
for translating certain linguistic features, and quality control
procedures applied to completed translations.</p>
<p>All final data are presented in Tab Delimited Format (TDF). TDF is
compatible with other transcription formats, such as the Transcriber
format and the AG format, making it easy to process.</p>
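<p>TDF files are plain text with one record per line and tab-separated
fields, so they are straightforward to read with standard tools. A
minimal reader might look like the following; the assumptions that the
first line names the columns and that comment lines begin with ";;" are
illustrative rather than a statement of the official TDF specification:</p>
<pre>
# Minimal sketch of reading a Tab Delimited Format (TDF) file.
# Column names are taken from the file's own header line; treating
# lines that start with ";;" as comments is an assumption.
import csv

def read_tdf(path):
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)     # assume the first line names columns
        for row in reader:
            if row and not row[0].startswith(";;"):
                yield dict(zip(header, row))
</pre>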
<br>
<p align="center"><b>LDC's Corpus Catalog Receives Top OLAC
Rating</b></p>
<p>LDC is pleased to announce that <a
href="http://www.ldc.upenn.edu/Catalog/">The LDC Corpus Catalog</a> has
been awarded a five-star quality rating, the highest rating available, by
the <a href="http://www.language-archives.org/">Open Language Archives
Community (OLAC)</a>. OLAC is an international partnership of
institutions and individuals who are creating a worldwide virtual library
of language resources by: (i) developing consensus on best current
practice for the digital archiving of language resources, and (ii)
developing a network of interoperating repositories and services for
housing and accessing such resources. LDC supports OLAC and is among the
37 participating archives that have contributed over 36,000 records to
the combined catalog of language resources. OLAC seeks to improve the
quality of the metadata in catalog records in order to improve the
quality of the searches that users can run over that catalog. When
resources are described following the best practice guidelines
established by OLAC, it increases the likelihood that all the resources
returned by a query are relevant (precision) and that all relevant
resources are returned (recall).</p>
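<p>Concretely, for a single query these two measures can be computed
directly from the set of returned resources and the set of truly relevant
ones, as in this small illustration:</p>
<pre>
# Precision: fraction of returned resources that are relevant.
# Recall: fraction of relevant resources that are returned.
def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = len(returned.intersection(relevant))
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A query that returns 4 corpora, 3 of them among the 5 truly relevant,
# has precision 3/4 = 0.75 and recall 3/5 = 0.60.
</pre>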
<p style="margin-bottom: 12pt;">For several fields, metadata in the LDC
catalog was missing, inaccurate, or non-compliant with OLAC standards.
Over a period of a few months, a team at LDC took several steps to make
that metadata OLAC-compliant. Most significantly, the language name and
the language ID for over 400 corpora were reviewed and changed when
required to conform to the new standard for language identification, <a
href="http://www.sil.org/iso639-3/">ISO 639-3</a>. Additional efforts
focused on providing author information for all corpora and fixing dead
links. Finally, the team added a new metadata field to consistently
document the "type" of each resource, using a standard vocabulary from
the digital libraries community called DCMI-Type, reliably distinguishing
text and sound resources. The benefits of these revisions include
improving LDC's management of resources in the catalog as well as helping
LDC users to quickly identify all corpora which are relevant to their
research.</p>
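<p>A simplified sketch of the kind of language-name normalization this
review involved (the mapping below is a tiny illustrative subset; the
actual cleanup worked from the complete ISO 639-3 code tables):</p>
<pre>
# Illustrative sketch of normalizing free-text language names to
# ISO 639-3 identifiers. This four-entry mapping is a hypothetical
# subset of the full ISO 639-3 code tables.
ISO_639_3 = {
    "english": "eng",
    "mandarin chinese": "cmn",
    "standard arabic": "arb",
    "japanese": "jpn",
}

def normalize_language(name):
    code = ISO_639_3.get(name.strip().lower())
    if code is None:
        raise ValueError("no ISO 639-3 code on file for %r" % name)
    return code
</pre>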
<p align="center"><b>2009 Publications Pipeline<br>
</b></p>
<p>For Membership Year 2009 (MY2009), we anticipate releasing a varied
selection of publications. Many publications are still in development,
but here is a glimpse of what is in the pipeline for MY2009. Please note
that this list is tentative and subject to modification. Our planned
publications include:<br>
</p>
<blockquote>
<p><i>Arabic Gigaword Fourth Edition</i> ~ this edition includes our
recent newswire collections as well as the contents of Arabic Gigaword
Third Edition (LDC2007T40). In addition to sources found in previous
releases, such as Xinhua, Agence France Presse, An Nahar, and Al Hayat,
this release includes data from several new sources, such as Al Quds,
Asharq Al-Awsat, and Al Ahram.<br>
</p>
<p><i>Chinese Gigaword Fourth Edition</i> ~ this edition includes our
recent newswire collections as well as the contents of the Chinese
Gigaword Third Edition (LDC2007T38). In addition to sources found in
previous releases, such as Agence France Presse, Central News Agency
(Taiwan), Xinhua and Zaobao, this release includes data from several new
sources, such as People's Liberation Army Daily, Guangming Daily, and
China News Service.</p>
</blockquote>
<blockquote>
<p><i>Chinese Web 5-gram Corpus Version 1</i> ~ contains n-grams
(unigrams to five-grams) and their observed counts in 880 billion tokens
of Chinese web data collected in March 2008. All text was converted to
UTF-8. A simple segmenter using the same algorithm used to generate the
data is included. The set contains 3.9 billion n-grams total (a minimal
sketch of this style of n-gram counting appears after this list).</p>
<p><i>CoNLL 2008 Shared Task Corpus</i> ~ includes syntactic and semantic
dependencies for Treebank-3 (LDC99T42) data. This corpus was developed
for the 2008 shared task of the Conference on Natural Language Learning
(CoNLL 2008). The syntactic information was created by converting
constituent trees from Treebank-3 to dependencies using a set of head
percolation rules and a series of other transformations; for example,
named entity boundaries are included from the BBN Pronoun Coreference and
Entity Type Corpus (LDC2005T33). The semantic dependencies were created
by converting semantic propositions to a dependency representation. The
corpus includes propositions centered around both verbal predicates -
from Proposition Bank I (LDC2004T14) - and nominal predicates - from
NomBank 1.0 (LDC2008T24).</p>
<p><i>English Gigaword Fourth Edition</i> ~ this edition includes our
recent collections as well as the contents of the English Gigaword Third
Edition (LDC2007T07). The sources of text data include Agence France
Presse, Associated Press, Central News Agency (Taiwan), New York Times,
Xinhua, and Salon.com.</p>
<p><i>GALE Phase 1 Arabic Newsgroup Parallel Text Part 2</i> ~ 145K words
(263 files) of Arabic newsgroup text and its English translation selected
from thirty sources. Newsgroups consist of posts to electronic bulletin
boards, Usenet newsgroups, discussion groups and similar forums. This
release was used as training data in Phase 1 of the DARPA-funded GALE
program.</p>
<p><i>GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2</i>
~ a total of 24 hours of Chinese broadcast conversation selected from
three sources: China Central TV (CCTV), Phoenix TV, and Voice of America.
This release was used as training data in Phase 1 of the DARPA-funded
GALE program.</p>
<p><i>GALE Phase 1 Chinese Newsgroup Parallel Text Part 1</i> ~ 240K
characters (112 files) of Chinese newsgroup text and its English
translation selected from twenty-five sources. Newsgroups consist of
posts to electronic bulletin boards, Usenet newsgroups, discussion groups
and similar forums. This release was used as training data in Phase 1 of
the DARPA-funded GALE program.</p>
<p><i>Japanese Web N-gram Corpus Version 1</i> ~ contains n-grams
(unigrams to seven-grams) and their observed counts in 250 billion tokens
of Japanese web data collected in July 2007. All text was converted to
UTF-8 and segmented using the publicly available segmenter MeCab. The set
contains 3.2 billion n-grams total.</p>
<p><i>NIST MetricsMATR08 Development Data</i> ~ contains sample data
extracted from the NIST Open Machine Translation (MT) 2006 evaluation.
Data includes the English machine translations from 8 systems and the
human reference translations for 25 Arabic source language newswire
documents, along with corresponding human assessments of adequacy and
preference. This data set was originally provided to NIST MetricsMATR08
participants for the purpose of MT metric development.</p>
</blockquote>
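<p>For the n-gram corpora above, the counting itself is conceptually
simple: slide a window of each order over the segmented tokens and tally
the occurrences. The following is a minimal single-machine sketch
(production pipelines over hundreds of billions of tokens shard this work
across many machines and apply count cutoffs):</p>
<pre>
# Minimal sketch of n-gram counting over pre-segmented tokens, in the
# style of the web n-gram corpora described above. Single-machine and
# illustrative only; real pipelines are distributed.
from collections import Counter

def count_ngrams(tokens, max_order=5):
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[" ".join(tokens[i:i + order])] += 1
    return counts

# count_ngrams("我 喜欢 吃 苹果".split()) tallies every unigram
# through 4-gram of the already-segmented sentence.
</pre>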
2009 Subscription Members are automatically sent all MY2009 data as it
is released. 2009 Standard Members are entitled to request 16 corpora
for free from MY2009. Non-members may license most data for research
use.<br>
<br>
<hr size="2" width="100%"><br>
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>