<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal"><b>- <a href="#scholar">Fall 2014 Data
Scholarship Program</a> -</b><o:p></o:p></p>
<p class="MsoNormal"><i>New publications:</i><o:p></o:p></p>
<p class="MsoNormal"><b>- <a href="#lre">2009 NIST Language
Recognition Evaluation Test Set</a> -</b><o:p></o:p></p>
<p class="MsoNormal"><b>- <a href="#gale">GALE Arabic-English Word
Alignment Training Part 3 -- Web</a> -</b><o:p></o:p></p>
<p class="MsoNormal"><b>- <a href="#g2">GALE Phase 2 Chinese
Newswire Parallel Text Part 1</a> -</b></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal"><a name="scholar"></a><b>Fall 2014 Data
Scholarship Program</b><o:p></o:p></p>
<p class="MsoNormal">Applications are now being accepted through
Monday, September 15, 2014, 11:59PM EST for the Fall 2014 LDC Data
Scholarship program! The LDC Data Scholarship program provides
university students with access to LDC data at no cost.<br>
<br>
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research
agenda and a bona fide inability to pay. The selection process is
highly competitive.<br>
<br>
The application consists of two parts:<br>
<br>
(1) Data Use Proposal. Applicants must submit a proposal
describing their intended use of the data. The proposal should
state which data the student plans to use, how the data will
benefit their research project, and what methodology or
algorithm the student proposes.<br>
<br>
Applicants should consult the <a
href="https://catalog.ldc.upenn.edu/"><span style="color:blue">LDC
Catalog</span></a> for a complete list of data distributed by
LDC. Due to certain restrictions, a handful of LDC corpora are
available only to members of the Consortium. Applicants are advised
to select no more than two databases.<br>
<br>
(2) Letter of Support. Applicants must submit one letter of
support from their thesis adviser or department chair. The letter
must confirm that the department or university lacks the funding
to pay the full non-member fee for the data and verify the
student's need for data.<br>
<br>
For further information on application materials and program
rules, please visit the <a
href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships"><span
style="color:blue">LDC Data Scholarship</span></a> page.<br>
</p>
<p class="MsoNormal"><br>
<o:p></o:p></p>
<p class="MsoNormal"><b>New publications<br>
</b></p>
<p class="MsoNormal"><a name="lre"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014S06"><span
style="color:blue">2009 NIST Language Recognition Evaluation
Test Set</span></a> contains approximately 215 hours of
conversational telephone speech and radio broadcast conversation
collected by LDC in the following 23 languages and dialects:
Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari,
English (American), English (Indian), Farsi, French, Georgian,
Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian,
Spanish, Turkish, Ukrainian, Urdu and Vietnamese.<o:p></o:p></p>
<p class="MsoNormal">The goal of the <a
href="http://www.itl.nist.gov/iad/"><span style="color:blue">NIST
(National Institute of Standards and Technology)</span></a> <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/"><span
style="color:blue">Language Recognition Evaluation (LRE)</span></a>
is to establish the baseline of current performance capability for
language recognition of conversational telephone speech and to lay
the groundwork for further research efforts in the field. NIST
conducted language recognition evaluations in <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/1996/"><span
style="color:blue">1996</span></a>, <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/2003/"><span
style="color:blue">2003</span></a>, <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/2005/"><span
style="color:blue">2005</span></a> and <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/2007/"><span
style="color:blue">2007</span></a>. The <a
href="http://www.itl.nist.gov/iad/mig/tests/lre/2009/"><span
style="color:blue">2009</span></a> evaluation increased the
number of target languages. Most of the test data came from
multilingual Voice of America (VOA) radio broadcasts, using
segments assessed as being of telephone bandwidth; the remainder
is conversational telephone speech. Further information about this
evaluation can be found in the evaluation plan, which is included
in the documentation for this release.<o:p></o:p></p>
<p class="MsoNormal">LDC released the prior LREs as:<o:p></o:p></p>
<blockquote>
<p class="MsoNormal">2003 NIST Language Recognition Evaluation (<a
href="https://catalog.ldc.upenn.edu/LDC2006S31"><span
style="color:blue">LDC2006S31</span></a>)<o:p></o:p></p>
<p class="MsoNormal">2005 NIST Language Recognition Evaluation (<a
href="https://catalog.ldc.upenn.edu/LDC2008S05"><span
style="color:blue">LDC2008S05</span></a>)<o:p></o:p></p>
<p class="MsoNormal">2007 NIST Language Recognition Evaluation
Test Set (<a href="https://catalog.ldc.upenn.edu/LDC2009S04"><span
style="color:blue">LDC2009S04</span></a>)<o:p></o:p></p>
<p class="MsoNormal">2007 NIST Language Recognition Evaluation
Supplemental Training Set (<a
href="https://catalog.ldc.upenn.edu/LDC2009S05"><span
style="color:blue">LDC2009S05</span></a>)<o:p></o:p></p>
</blockquote>
<p class="MsoNormal">The VOA speech data was collected by LDC in
2000 and 2001 and constitutes approximately 75% of the test set.
The telephone speech was taken from LDC's Mixer 3 collection
recorded between 2005 and 2007.<o:p></o:p></p>
<p class="MsoNormal">All test speech segments are presented as a
sampled data stream in standard 8-bit 8-kHz μ-law format. Each
segment is stored separately in a single-channel SPHERE format
file. The test segments contain three nominal durations of speech:
3 seconds, 10 seconds and 30 seconds. Actual speech durations
vary, but were constrained to be within the ranges of 2-4 seconds,
7-13 seconds and 23-35 seconds, respectively. <o:p></o:p></p>
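<p class="MsoNormal">As an illustration of the format and duration constraints described above, here is a minimal Python sketch; it is not part of the release. It assumes one byte per sample (8-bit μ-law) at 8,000 samples per second on a single channel. The function names <code>audio_seconds</code> and <code>nominal_duration</code> are hypothetical. The byte count passed in should exclude any file header, and the length of the audio is not necessarily the amount of speech it contains.</p>

```python
# Allowed range of actual speech (seconds) for each nominal duration,
# taken from the description of the test segments.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}

SAMPLE_RATE_HZ = 8000  # 8-kHz sampling
BYTES_PER_SAMPLE = 1   # 8-bit mu-law, single channel


def audio_seconds(num_sample_bytes):
    """Length in seconds of an 8-bit, 8-kHz, single-channel sample stream."""
    return num_sample_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def nominal_duration(speech_seconds):
    """Return 3, 10 or 30 for a speech duration, or None if it fits no range."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

<p class="MsoNormal">For example, 80,000 sample bytes correspond to 10 seconds of audio, and a speech duration of 10 seconds falls in the 7-13 second range, so <code>nominal_duration(10.0)</code> returns 10.</p>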
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="gale"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T14"><span
style="color:blue">GALE Arabic-English Word Alignment Training
Part 3 -- Web</span></a> was developed by LDC and contains
217,158 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training
data in the DARPA GALE (Global Autonomous Language Exploitation)
program. <o:p></o:p></p>
<p class="MsoNormal">Some approaches to statistical machine
translation incorporate linguistic knowledge into word-aligned
text to improve automatic word alignment and machine translation
quality. This is accomplished with two annotation schemes:
alignment and tagging. Alignment identifies minimum translation
units and translation relations by using minimum-match and
attachment annotation approaches. The tagging scheme defines a set
of word tags and alignment link tags to describe these translation
units and relations. Tagging adds contextual, syntactic and
language-specific features to the alignment annotation.<o:p></o:p></p>
<p class="MsoNormal">Other releases available in this series are:<o:p></o:p></p>
<blockquote>
<p class="MsoNormal">GALE Chinese-English Word Alignment and
Tagging Training Part 1 -- Newswire and Web (<a
href="http://catalog.ldc.upenn.edu/LDC2012T16"><span
style="color:blue">LDC2012T16</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Chinese-English Word Alignment and
Tagging Training Part 2 -- Newswire (<a
href="http://catalog.ldc.upenn.edu/LDC2012T20"><span
style="color:blue">LDC2012T20</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Chinese-English Word Alignment and
Tagging Training Part 3 -- Web (<a
href="http://catalog.ldc.upenn.edu/LDC2012T24"><span
style="color:blue">LDC2012T24</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Chinese-English Word Alignment and
Tagging Training Part 4 -- Web (<a
href="http://catalog.ldc.upenn.edu/LDC2013T05"><span
style="color:blue">LDC2013T05</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Chinese-English Word Alignment and
Tagging -- Broadcast Training Part 1 (<a
href="http://catalog.ldc.upenn.edu/LDC2013T23"><span
style="color:blue">LDC2013T23</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Arabic-English Word Alignment Training
Part 1 -- Newswire and Web (<a
href="http://catalog.ldc.upenn.edu/LDC2014T05"><span
style="color:blue">LDC2014T05</span></a>)<o:p></o:p></p>
<p class="MsoNormal">GALE Arabic-English Word Alignment Training
Part 2 -- Newswire (<a
href="http://catalog.ldc.upenn.edu/LDC2014T10"><span
style="color:blue">LDC2014T10</span></a>)<o:p></o:p></p>
</blockquote>
<p class="MsoNormal">This release consists of Arabic source web data
collected by LDC. The distribution by genre, words, character
tokens and segments appears below:<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Genre<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">CharTokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">WB<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">2,449<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">154,144<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">217,158<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">7,332<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">Note that word count is based on the
untokenized Arabic source, and token count is based on the
tokenized Arabic source.<o:p></o:p></p>
<p class="MsoNormal">The Arabic word alignment tasks consisted of
the following components:<o:p></o:p></p>
<blockquote>
<p class="MsoNormal">Normalizing tokens as needed<o:p></o:p></p>
<p class="MsoNormal">Identifying different types of links<o:p></o:p></p>
<p class="MsoNormal">Identifying sentence segments not suitable
for annotation<o:p></o:p></p>
<p class="MsoNormal">Tagging unmatched words attached to other
words or phrases<o:p></o:p></p>
</blockquote>
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="g2"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014T15"><span
style="color:blue">GALE Phase 2 Chinese Newswire Parallel Text
Part 1</span></a> was developed by LDC. Along with other
corpora, the parallel text in this release comprised training data
for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains 117,173 tokens of
Chinese source text and corresponding English translations
selected from newswire data collected by LDC in 2007 and
transcribed by LDC or under its direction.<o:p></o:p></p>
<p class="MsoNormal">This release includes 167 source-translation
document pairs, comprising 117,173 tokens of translated data. Data
is drawn from four distinct Chinese newswire sources: China News
Service, Guangming Daily, People's Daily and People's Liberation
Army Daily.<o:p></o:p></p>
<p class="MsoNormal">The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.<o:p></o:p></p>
<p class="MsoNormal">Source data and translations are distributed in
TDF format. TDF files are tab-delimited files containing one
segment of text along with metadata about that segment.
Each field in the TDF file is described in TDF_format.text. All
data is encoded in UTF-8.<o:p></o:p></p>
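<p class="MsoNormal">To illustrate reading tab-delimited, UTF-8 text of the kind described above, here is a minimal Python sketch using the standard <code>csv</code> module. It makes no assumptions about the actual TDF field layout, which is documented in the release. The helper name <code>read_tab_delimited</code> and the two-field sample rows are made up for illustration.</p>

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Made-up rows for illustration only; these are not real TDF fields.
sample = io.StringIO("field_a\tfield_b\nsegment text\t12.5\n")
rows = list(read_tab_delimited(sample))
```

<p class="MsoNormal">For a file on disk, the same helper could be given a handle opened with <code>open(path, encoding="utf-8", newline="")</code>, where <code>path</code> is whatever file the reader wants to inspect.</p>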
<br>
<br>
<hr size="2" width="100%"><br>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</body>
</html>