<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center">LDC2008S05<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05">2005

NIST Language Recognition Evaluation</a>  -</b><br>

<br>

LDC2008T09<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09">GALE

Phase 1 Arabic Broadcast News Parallel Text - Part 2</a>  -<br>

</b></p>

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><br>

<b>The Linguistic Data Consortium (LDC) would like to announce the

availability of two new publications.</b><br>

</p>

<hr size="2" width="100%">

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center"><b><br>

New

Publications<br>

<br>

</b></p>

<p>(1) <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09">GALE

Phase 1 Arabic Broadcast News Parallel Text - Part 2</a> is the second

part of

the three-part GALE Phase 1 Arabic Broadcast News Parallel Text, which,

along

with other corpora, was used as training data in year 1 (Phase 1) of

the

DARPA-funded GALE program. The corpus contains transcripts and English

translations of 10.7 hours of Arabic broadcast news programming

selected from

various sources. This corpus does not contain the audio files from

which the

transcripts and translations were generated.</p>

<p>The Arabic broadcast news recordings were selected from four sources

and

four different programs.   A manual selection procedure was used to

choose

data appropriate for the GALE program, namely, news and conversation

programs

focusing on current events. Stories on topics such as sports,

entertainment

news, and stock market reports were excluded from the data set. Manual
sentence unit/segment (SU) annotation was also performed on
a subset of files following LDC's Quick Rich Transcription
specification. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU.</p>

<p>After transcription and SU annotation, the files were reformatted into a
human-readable translation format and then assigned to
professional translators for careful translation. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as
names and speech disfluencies), and the quality control procedures applied
to completed translations.</p>


<p style="text-align: center;" align="center"><b>*</b></p>

<p>(2) The <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05">2005

NIST Language Recognition Evaluation</a> corpus contains the evaluation

data, portions of the training

data, the evaluation plan, answer keys and scoring script for the 2005

NIST

(National Institute of Standards and Technology) Language Recognition

Evaluation (LRE). The goal of the LRE is to establish the baseline of

current

performance capability for language recognition of conversational

telephone

speech and to lay the groundwork for further research efforts in the

field.

NIST conducted two previous evaluations in <a

 href="http://www.nist.gov/speech/tests/lang/1996/LRE96EvalPlan.pdf">1996</a>

and <a

 href="http://www.nist.gov/speech/tests/lang/2003/LRE03EvalPlan-v1.pdf">2003</a>.

For the 2005 NIST LRE, the emphasis was on research directed
toward a general base of technology that could be ported to various language
recognition tasks with minimal effort, and on developing the ability to
make more difficult discriminations between similar languages and between
dialects of the same language.</p>

<p class="MsoNormal">The task evaluated was the detection of a given
target language or dialect. Given a test segment of speech and a target
language or dialect, the system under evaluation determined whether the
speech was from the target language or dialect. The evaluation consisted
of speech from the following languages and dialects:</p>

<ul type="disc">

  <li class="MsoNormal" style="">English (American)</li>

  <li class="MsoNormal" style="">English (Indian)</li>

  <li class="MsoNormal" style="">Hindi</li>

  <li class="MsoNormal" style="">Japanese</li>

  <li class="MsoNormal" style="">Korean</li>

  <li class="MsoNormal" style="">Mandarin (Mainland)</li>

  <li class="MsoNormal" style="">Mandarin (Taiwan)</li>

  <li class="MsoNormal" style="">Spanish (Mexican)</li>

  <li class="MsoNormal" style="">Tamil</li>

</ul>

<p>The 2005 NIST Language Recognition Evaluation Plan, which includes a

description of the evaluation tasks, is included with this release.

Further

information regarding this evaluation is also available at the <a

 href="http://www.nist.gov/speech/tests/lang/"><span

 style="text-decoration: none;">NIST Language Recognition Evaluation</span></a>

website.</p>

<p>Each speech file is one side of a telephone conversation. There are
11,106 speech files in SPHERE (.sph) format, for a total of 44.2 hours of
speech. The speech data was compiled from LDC's CALLFRIEND corpora and from
data collected by Oregon Health and Science University.</p>

<p>Each test segment was prepared using an automatic speech activity

detection

algorithm to identify areas and durations of speech. Segments were

chosen to

contain a specified approximate duration of actual speech. The test

segments

contain three nominal durations of speech: 3 seconds, 10 seconds, and

30

seconds. Performance was evaluated separately for test segments of each

duration. Auxiliary information was included in the SPHERE headers to

document

the source file, start time, and duration of all excerpts that were

used to

construct the segment. <br>

</p>

<br>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

</body>

</html>