<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center">LDC2009V01<b><br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01">Audiovisual

Database of Spoken American English</a>  -<br>

</b></div>

<p align="center">LDC2009T03<br>

-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03"><b>GALE

Phase 1 Arabic Newsgroup Parallel Text -

Part 1</b></a>  -<br>

</p>

<div align="center"><b>-  <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1">LDC's

Corpus Catalog Receives Top OLAC

Rating</a></b> 

-<br>

</div>

<p align="center">-  <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#2"><b>2009

Publications Pipeline</b></a>  -</p>

<hr size="2" width="100%">

<p style="text-align: center;" align="center"><b>New Publications</b></p>

<p>(1) The <span style="color: black;"><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01">Audiovisual

Database of Spoken American English</a> </span>was developed at Butler

University,

Indianapolis, IN in 2007 for use by a variety of researchers to

evaluate

speech production and speech recognition. It contains approximately

seven hours

of audiovisual recordings of fourteen American English speakers

producing

syllables, word lists and sentences used in both academic and clinical

settings.</p>

<p>All talkers were from the North Midland dialect region -- roughly

defined as

Indianapolis and north within the state of Indiana -- and had lived in

that

region for the majority of the time from birth to 18 years of age. Each

participant read 238 different words and 166 different sentences. The
words and sentences were drawn from the following sources:</p>

<ul type="disc">

  <li class="MsoNormal" style="">Central Institute for the Deaf (CID)
Everyday Sentences (Lists A-J)</li>

  <li class="MsoNormal" style="">Northwestern University Auditory Test
No. 6 (Lists I-IV)</li>

  <li class="MsoNormal" style="">Vowels in /hVd/ context (separate
words)</li>

  <li class="MsoNormal" style="">Texas Instruments/Massachusetts
Institute of Technology <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1">(TIMIT)</a>
sentences</li>

</ul>

<p>The Audiovisual Database of Spoken American English will be of

interest in

various disciplines: to linguists for studies of phonetics, phonology,

and

prosody of American English; to speech scientists for investigations of

motor

speech production and auditory-visual speech perception; to engineers

and

computer scientists for investigations of machine audio-visual speech

recognition (AVSR); and to speech and hearing scientists for clinical

purposes,

such as the examination and improvement of speech perception by

listeners with

hearing loss.</p>

<p>Participants were recorded individually during a single session with

a

Panasonic DVC-80 digital video camera to miniDV digital video cassette

tapes.

All participants wore a Sennheiser MKE-2060 directional/cardioid lapel

microphone throughout the recordings.  Each speaker produced a total of

94

segmented files, which were converted from Final Cut Express to
QuickTime (.mov)
files.<br>
</p>

<p class="MsoNormal" style="text-align: center;" align="center"><br>
<b>*</b></p>

<p>(2) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03">GALE

Phase 1 Arabic Newsgroup Parallel Text - Part 1</a> was prepared by LDC

and

contains a total of 178,000 words (264 files) of Arabic newsgroup text
and its
English translation selected from thirty-five sources. Newsgroups consist of

posts to

electronic bulletin boards, Usenet newsgroups, discussion groups and

similar

forums. This release was used as training data in Phase 1 (year 1) of

the

DARPA-funded GALE program. Preparing the source data
involved four stages of work: data scouting, data
harvesting, formatting and data selection.</p>

<p class="MsoNormal">Data scouting involved manually searching the web

for

suitable newsgroup text. Data scouts were assigned particular topics

and genres

along with a production target in order to focus their web search.

Formal

annotation guidelines and a customized annotation toolkit helped data

scouts to

manage the search process and to track progress.</p>

<p>Data scouts logged their decisions about potential text of interest

to a

database. A nightly process queried the annotation database and

harvested all

designated URLs. Whenever possible, the entire site was downloaded, not

just

the individual thread or post located by the data scout. Once the text

was

downloaded, its format was standardized so that the data could be more

easily

integrated into downstream annotation processes. Typically, a new

script was

required for each new domain name that was identified. After scripts

were run,

an optional manual process corrected any remaining formatting problems.<br>

<br>

The selected documents were then reviewed for content-suitability using

a

semi-automatic process. A statistical approach was used to rank a

document's

relevance to a set of already-selected documents labeled as "good."

An annotator then reviewed the list of relevance-ranked documents and

selected

those which were suitable for a particular annotation task or for

annotation in

general. These newly-judged documents in turn provided additional input

for the

generation of new ranked lists.</p>

<p class="MsoNormal">Manual sentence unit/segment (SU) annotation was
also
performed as part of the transcription task. Three types of end-of-sentence
SU
were identified: statement SU, question SU, and incomplete SU. After

transcription and SU annotation, files were reformatted into a

human-readable

translation format and assigned to professional translators for careful

translation. Translators followed LDC's GALE Translation guidelines,
which
describe the makeup of the translation team, the source data format,
the
translation data format, best practices for translating certain
linguistic
features, and quality control procedures applied to completed
translations.</p>

<p>All final data are presented in Tab Delimited Format (TDF). TDF is
compatible with other transcription formats, such as the Transcriber
format and
AG format, which makes it easy to process.</p>

<br>

<p align="center"><b>LDC's Corpus Catalog Receives Top OLAC

Rating</b></p>

<p>LDC is pleased to announce that <a

 href="http://www.ldc.upenn.edu/Catalog/">The

LDC Corpus Catalog</a> has been awarded a five-star quality rating, the

highest

rating available, by the <a href="http://www.language-archives.org/">Open

Language Archives Community (OLAC)</a>. OLAC is an international

partnership of institutions and individuals who are creating a

worldwide

virtual library of language resources by: (i) developing consensus on

best

current practice for the digital archiving of language resources, and

(ii)

developing a network of interoperating repositories and services for

housing

and accessing such resources. LDC supports OLAC and is among the 37
participating archives that have
contributed over 36,000 records to the combined catalog of language
resources. OLAC seeks to refine the quality of the

metadata in catalog records in order to improve the quality of

searching that

users can do over that catalog. When resources are described following

the best

practice guidelines established by OLAC, it increases the likelihood

that all

the resources returned by a query are relevant (precision) and that all

relevant resources are returned (recall).</p>
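<p>As an illustration only, the two measures named above can be sketched in a few lines of Python. The function name <code>precision_recall</code> and the query results below are hypothetical and are not drawn from OLAC's or LDC's actual search tools; the catalog IDs are used purely as example labels.</p>

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    precision: share of returned resources that are relevant
    recall:    share of relevant resources that were returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned.intersection(relevant)
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four records come back, three are actually relevant,
# and two of the relevant ones are among those returned.
returned = ["LDC2009V01", "LDC2009T03", "LDC93S1", "LDC99T42"]
relevant = ["LDC2009T03", "LDC93S1", "LDC2007T40"]

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

<p>In this sketch, more accurate metadata would raise precision by keeping irrelevant records out of the result set, and raise recall by making sure every relevant record can be matched.</p>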

<p style="margin-bottom: 12pt;">Certain metadata in the LDC catalog was missing, inaccurate and/or
non-compliant with OLAC standards for several fields.  Over a period of

a

few months, a team at LDC took several steps to make that metadata

OLAC-compliant. 

Most significantly, the language name and the language ID for over 400

corpora were

reviewed and changed when required to conform to the new standard for

language identification, <a href="http://www.sil.org/iso639-3/">ISO

639-3</a>.  Additional efforts focused on providing author information

for all

corpora and fixing dead links.  Finally, the team added a new metadata

field to consistently document the "type" of each resource, using a

standard vocabulary from the digital libraries community called

DCMI-Type, reliably distinguishing text and sound resources.  The

benefits of these revisions include

improving LDC's management of resources in the catalog as well as

assisting LDC

users to quickly identify all corpora which are relevant to their

research.</p>

<p align="center"><b>2009 Publications Pipeline<br>

</b></p>

<p>For Membership Year 2009 (MY2009), we

anticipate releasing a varied selection of

publications. Many publications are still in

development, but here is a glimpse of what is in the pipeline for

MY2009.  Please note that this list is tentative and subject to

modifications.  Our planned publications include:<br>

</p>

<blockquote>

  <p><i>Arabic Gigaword Fourth Edition</i> ~ this edition includes our
recent newswire
collections as well as the contents of Arabic Gigaword Third Edition
(LDC2007T40).  In addition to sources found in previous releases, such
as
Xinhua, Agence France Presse, An Nahar, and Al Hayat, this release
includes data from several new sources, such as Al Quds, Asharq Al
Awsat, and Al Ahram. <br>

  </p>

  <p><i>Chinese Gigaword Fourth Edition </i>~ this edition includes our
recent newswire
collections as well as the contents of the Chinese Gigaword
Third Edition (LDC2007T38). In addition to sources found in previous
releases, such as
Agence France Presse, Central News Agency (Taiwan), Xinhua and Zaobao,
this release includes data from several new sources, such as People's
Liberation Army Daily, Guangming Daily, and China News Service.</p>

</blockquote>

<blockquote><i><span
 style="font-size: 12pt; font-family: 'Times New Roman';">Chinese Web
5-gram Corpus Version 1</span></i><span
 style="font-size: 12pt; font-family: 'Times New Roman';"> ~ contains

n-grams (unigrams to five-grams) and their observed counts

in 880 billion tokens of Chinese web data collected in March 2008. All

text was

converted to UTF-8. A simple segmenter using the same algorithm used to

generate the data is included. The set contains 3.9 billion n-grams

total.<br>

  <br>

  <i>CoNLL 2008 Shared Task Corpus</i> ~ includes syntactic and

semantic

dependencies for Treebank-3 (LDC99T42) data. This corpus was developed

for the

2008 shared task of the Conference on Computational Natural Language Learning (CoNLL

2008).

The syntactic information was created by converting constituent trees

from

Treebank-3 to dependencies using a set of head percolation rules and a

series

of other transformations, e.g., named entity boundaries are included

from the

BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33). The

semantic

dependencies were created by converting semantic propositions to a

dependency

representation. The corpus includes propositions centered around both

verbal

predicates - from Proposition Bank I (LDC2004T14) - and around nominal

predicates - from NomBank 1.0 (LDC2008T24).<br>
</span><font
 face="Times New Roman, Times, serif"><br>
  </font><i>English Gigaword Fourth Edition</i> ~ this edition includes our

recent collections as

well as the contents of the English Gigaword Third Edition

(LDC2007T07).  The sources

of text data include Agence France Presse, Associated Press,

Central News Agency (Taiwan), NY Times, Xinhua and Salon.com.<br>

  <p class="MsoNormal" style="margin-bottom: 12pt;"><i>GALE Phase 1

Arabic

Newsgroup Parallel Text Part 2</i> ~ 145K words (263 files) of Arabic

newsgroup

text and its English translation selected from thirty sources.

Newsgroups

consist of posts to electronic bulletin boards, Usenet newsgroups,

discussion

groups and similar forums. This release was used as training data in

Phase 1 of

the DARPA-funded GALE program.<br>

  <br>

  <i>GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2</i>

~ a total
of 24 hours of Chinese broadcast conversation selected from three
sources:
China Central TV (CCTV), Phoenix TV, and Voice of America.  This

release was used as training data in Phase 1 of the DARPA-funded GALE

program.<br>

  <br>

  <i>GALE Phase 1 Chinese Newsgroup Parallel Text Part 1</i> ~  240K

characters (112 files) of Chinese newsgroup text and its English

translation

selected from twenty-five sources.   Newsgroups consist of posts to

electronic bulletin boards, Usenet newsgroups, discussion groups and

similar forums.

This release was used as training data in Phase 1 of the DARPA-funded

GALE

program.<br>

  <br>

  <i>Japanese Web N-gram Corpus Version 1</i> ~ contains n-grams

(unigrams to

seven-grams) and their observed counts in 250 billion tokens of

Japanese web

data collected in July 2007. All text was converted to UTF-8 and

segmented

using the publicly available segmenter MeCab. The set contains 3.2

billion

n-grams total.<br>

  <br>

  <i>NIST MetricsMATR08 Development Data</i> ~ contains sample data

extracted

from the NIST Open Machine Translation (MT) 2006 evaluation.  Data

includes the English machine translations from 8 systems and the human

reference translations for 25 Arabic source language newswire

documents, along

with corresponding human assessments of adequacy and preference.  This

data set was originally provided to NIST MetricsMATR08 participants for

the

purpose of MT metric development.</p>

  </blockquote>

2009 Subscription Members are automatically sent all MY2009 data as it

is released.  2009 Standard Members are entitled to request 16 corpora

for free from MY2009.   Non-members may license most data for research

use.<br>

<br>

<hr size="2" width="100%"><br>

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>