<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal"><i>New publications:</i><br>
<br>
<b>- <a href="#domain">Domain-Specific Hyponym Relations</a></b><b>
-<br>
</b><b> </b><b><br>
</b><b> - <a href="#gale">GALE Arabic-English Parallel Aligned
Treebank -- Web Training</a></b><b> -<br>
</b><b> </b><b><br>
</b><b> - <a href="#wsj">Multi-Channel WSJ Audio</a> -</b><b></b><o:p></o:p></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal"><b>New publications<br>
</b></p>
<p class="MsoNormal"><a name="domain"></a>(1) <a
href="http://catalog.ldc.upenn.edu/LDC2014T07">Domain-Specific
Hyponym Relations</a> was developed by the Shaanxi Province Key
Laboratory of Satellite and Terrestrial Network Technology at <a
href="http://www.xjtu.edu.cn/en/">Xi’an Jiaotung University</a>,
Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym
relations in five domains including data mining, computer
networks, data structures, Euclidean geometry and microbiology.
All hypernym and hyponym words were taken from Wikipedia article
titles. <o:p></o:p></p>
<p class="MsoNormal">A hyponym relation is a word sense relation
that is an IS-A relation. For example, dog is a hyponym of animal
and binary tree is a hyponym of tree structure. Among the
applications for domain-specific hyponym relations are taxonomy
and ontology learning, query result organization in a faceted
search and knowledge organization and automated reasoning in
knowledge-rich applications. <o:p></o:p></p>
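<p class="MsoNormal">As a minimal illustration of the IS-A structure
described above, the sketch below (Python) builds a small taxonomy
from hyponym-hypernym pairs and checks indirect IS-A relationships.
The example pairs and function names are illustrative only and are
not drawn from the corpus.</p>
<pre>
# Minimal sketch: hyponym (IS-A) relations as a small taxonomy.
# The example pairs below are illustrative, not taken from the corpus.
from collections import defaultdict

# (hyponym, hypernym) pairs, i.e. "X IS-A Y"
pairs = [
    ("binary tree", "tree structure"),
    ("tree structure", "data structure"),
    ("hash table", "data structure"),
]

hypernyms = defaultdict(set)
for hypo, hyper in pairs:
    hypernyms[hypo].add(hyper)

def is_a(term, ancestor):
    """Return True if 'term' is a direct or indirect hyponym of 'ancestor'."""
    stack = list(hypernyms[term])
    seen = set()
    while stack:
        parent = stack.pop()
        if parent == ancestor:
            return True
        if parent not in seen:
            seen.add(parent)
            stack.extend(hypernyms[parent])
    return False

print(is_a("binary tree", "data structure"))  # True, via "tree structure"
</pre>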
<p class="MsoNormal">The data is presented in XML format, and each
file provides hyponym relations in one domain. Within each file,
the term, Wikipedia URL, hyponym relation and the names of the
hyponym and hypernym words are included. The distribution of terms
and relations is set forth in the table below:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Dataset<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Terms<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Hyponym Relations<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Data Mining<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">278<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">364<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Computer Network<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">336<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">399<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Data Structure<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">315<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">578<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:4">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Euclidean Geometry<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">455<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">690<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:5;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Microbiology<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1,028<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">3,533<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
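<p class="MsoNormal">For readers who want to work with the files
programmatically, the sketch below (Python) loads one domain file into
hyponym-hypernym pairs. The element and attribute names used here
("relation", "hypernym", "hyponym", "term") are assumptions about the
schema, not the corpus's documented format, and should be checked
against the actual XML files.</p>
<pre>
# Minimal sketch for loading one per-domain XML file into word pairs.
# ASSUMPTION: the element names ("relation", "hypernym", "hyponym") and the
# attribute name ("term") are placeholders; verify them against the corpus.
import xml.etree.ElementTree as ET

def load_relations(path):
    pairs = []
    root = ET.parse(path).getroot()
    for rel in root.iter("relation"):
        hyper = rel.find("hypernym")
        hypo = rel.find("hyponym")
        if hyper is not None and hypo is not None:
            pairs.append((hypo.get("term"), hyper.get("term")))
    return pairs

# Example usage (the file name is hypothetical):
# for hypo, hyper in load_relations("data_mining.xml"):
#     print(hypo, "IS-A", hyper)
</pre>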
<p class="MsoNormal">This data is made available at no cost under the
<a href="http://creativecommons.org/licenses/by-nc-sa/3.0/">Creative
Commons Attribution-NonCommercial-ShareAlike 3.0</a> license.</p>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><br>
<a name="gale"></a>(2) <a
href="http://catalog.ldc.upenn.edu/LDC2014T08">GALE
Arabic-English Parallel Aligned Treebank -- Web Training</a> was
developed by LDC and contains 69,766 tokens of word-aligned
Arabic-English parallel text with treebank annotations. This material
was used as training data in the DARPA GALE (Global Autonomous
Language Exploitation) program. <o:p></o:p></p>
<p class="MsoNormal">Parallel aligned treebanks are treebanks
annotated with morphological and syntactic structures aligned at
the sentence level and the sub-sentence level. Such data sets are
useful for natural language processing and related fields,
including automatic word alignment system training and evaluation,
transfer-rule extraction, word sense disambiguation, translation
lexicon extraction and cultural heritage and cross-linguistic
studies. For machine translation system development, parallel aligned
treebanks may improve system performance through enhanced syntactic
parsers, better rules and knowledge about language pairs, and reduced
word error rates.</p>
<p class="MsoNormal">In this release, the source Arabic data was
translated into English. Arabic and English treebank annotations
were performed independently. The parallel texts were then word
aligned. <o:p></o:p></p>
<p class="MsoNormal">LDC previously released Arabic-English Parallel
Aligned Treebanks as follows:<o:p></o:p></p>
<ul>
<li><a href="http://catalog.ldc.upenn.edu/LDC2013T10">Newswire</a></li>
<li><a href="http://catalog.ldc.upenn.edu/LDC2013T14">Broadcast
News Part 1</a></li>
<li><a href="http://catalog.ldc.upenn.edu/LDC2014T03">Broadcast
News Part 2</a><o:p></o:p></li>
</ul>
<p class="MsoNormal">This release consists of Arabic source web data
(newsgroups, weblogs) collected by LDC in 2004 and 2005. All data
is encoded as UTF-8. A count of files, words, tokens and segments
is below.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Tokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">162<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">46,710<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">69,766<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">3,178<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">Note: Word count is based on the untokenized
Arabic source, token count is based on the ATB-tokenized Arabic
source.<o:p></o:p></p>
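<p class="MsoNormal">To make the distinction concrete, the toy sketch
below (Python) contrasts the two counts. The strings are romanized
placeholders invented for illustration, not corpus data, and the
splitting shown is only a stand-in for ATB tokenization, which
separates clitics from their host words.</p>
<pre>
# Toy illustration of why the token count (ATB-tokenized) exceeds the
# word count (untokenized source): clitic segmentation turns one
# whitespace-delimited word into several tokens.
# The strings below are romanized placeholders, not real corpus data.

untokenized = "wktbhA fy Aldftr"        # 3 whitespace-delimited words
atb_tokenized = "w+ ktb +hA fy Aldftr"  # 5 tokens after clitic splitting

print(len(untokenized.split()), len(atb_tokenized.split()))  # 3 5
</pre>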
<p class="MsoNormal">The purpose of the GALE word alignment task was
to find correspondences between words, phrases or groups of words
in a set of parallel texts. Arabic-English word alignment
annotation consisted of the following tasks:<o:p></o:p></p>
<ul>
<li>Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)</li>
<li>Identifying sentence segments not suitable for annotation,
e.g., blank segments, incorrectly segmented segments and segments
containing foreign-language text</li>
<li>Tagging unmatched words attached to other words or phrases<o:p></o:p></li>
</ul>
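<p class="MsoNormal">The sketch below (Python) shows one way
annotations of the kinds listed above might be represented and
consumed in memory. The field names and link-type labels are
illustrative assumptions and do not reflect the corpus's actual file
format or label inventory.</p>
<pre>
# Minimal sketch of consuming word-alignment links of the kinds listed above.
# ASSUMPTION: the dataclass fields and type labels are illustrative; consult
# the corpus documentation for the real file format and labels.
from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentLink:
    source_indices: List[int]   # Arabic token positions (0-based)
    target_indices: List[int]   # English token positions (0-based)
    link_type: str              # e.g. "translated-correct", "translated-incorrect",
                                # "not-translated-correct", "not-translated-incorrect"

def translated_pairs(links):
    """Yield (source, target) index pairs for links marked as translated."""
    for link in links:
        if link.link_type.startswith("translated"):
            for s in link.source_indices:
                for t in link.target_indices:
                    yield (s, t)

# Example usage with made-up indices:
links = [AlignmentLink([0], [0, 1], "translated-correct"),
         AlignmentLink([1], [], "not-translated-correct")]
print(list(translated_pairs(links)))  # [(0, 0), (0, 1)]
</pre>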
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="wsj"></a>(3) <a
href="http://catalog.ldc.upenn.edu/LDC2014S03">Multi-Channel WSJ
Audio</a> was developed by the <a
href="http://www.cstr.ed.ac.uk/">Centre for Speech Technology
Research</a> at the University of Edinburgh and contains
approximately 100 hours of recorded speech from 45 British English
speakers. Participants read Wall Street Journal texts published in
1987-1989 under three recording scenarios: a single stationary
speaker, two stationary overlapping speakers and a single moving
speaker.</p>
<p class="MsoNormal">This corpus was designed to address the
challenges of speech recognition in meetings, which often occur in
rooms with non-ideal acoustic conditions and significant
background noise, and may contain large sections of overlapping
speech. Using headset microphones represents one approach, but
meeting participants may be reluctant to wear them. Microphone
arrays are another option. MCWSJ supports research in large
vocabulary tasks using microphone arrays. The news sentences read
by speakers are taken from <a
href="http://catalog.ldc.upenn.edu/LDC95S24">WSJCAM0 Cambridge
Read News</a>, a corpus originally developed for large
vocabulary continuous speech recognition experiments, which in
turn was based on <a href="http://catalog.ldc.upenn.edu/LDC93S6A">CSR-I
(WSJ0) Complete</a>, made available by LDC to support large
vocabulary continuous speech recognition initiatives. <o:p></o:p></p>
<p class="MsoNormal">Speakers reading news text from prompts were
recorded using a headset microphone, a lapel microphone and an
eight-channel microphone array. In the single speaker scenario,
participants read from six fixed positions. Fixed positions were
assigned for the entire recording in the overlapping scenario. For
the moving scenario, participants moved from one position to the
next while reading. <o:p></o:p></p>
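<p class="MsoNormal">Since the corpus is intended for microphone-array
research, a common first step with the eight-channel recordings is
delay-and-sum beamforming. The sketch below (Python) is a generic
illustration: the channel file names, sample rates and per-channel
delays are placeholders, not values supplied with the corpus, and real
use requires the array geometry and a delay-estimation step.</p>
<pre>
# Minimal delay-and-sum beamforming sketch for an 8-channel recording.
# ASSUMPTIONS: the WAV file names and the integer sample delays below are
# placeholders; in practice the delays come from the array geometry and a
# source-localization / time-delay-estimation step.
import numpy as np
import soundfile as sf  # third-party: pip install soundfile

channel_files = [f"array1_ch{i}.wav" for i in range(1, 9)]  # hypothetical names
delays = [0, 3, 5, 8, 8, 5, 3, 0]  # per-channel delays in samples (placeholders)

channels = []
sr = None
for path, d in zip(channel_files, delays):
    x, sr = sf.read(path)
    channels.append(np.roll(x, -d))  # advance each channel by its delay

n = min(len(c) for c in channels)                            # common length
beamformed = np.mean([c[:n] for c in channels], axis=0)      # average aligned channels
sf.write("beamformed.wav", beamformed, sr)
</pre>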
<p class="MsoNormal">Fifteen speakers were recorded for the single
scenario, nine pairs for the overlapping scenario and nine
individuals for the moving scenario. Each read approximately 90
sentences. <o:p></o:p></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal"><o:p> </o:p></p>
<div class="moz-text-html" lang="x-western">
<link rel="File-List"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_filelist.xml">
<link rel="Edit-Time-Data"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_editdata.mso">
<link rel="themeData"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_themedata.thmx">
<link rel="colorSchemeMapping"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_colorschememapping.xml">
<pre class="moz-signature" cols="72">--
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</div>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>