<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal"><i>New publications:</i><br>
<br>
<b>- <a href="#domain">Domain-Specific Hyponym Relations</a></b><b>
-<br>
</b><b> </b><b><br>
</b><b> - <a href="#gale">GALE Arabic-English Parallel Aligned
Treebank -- Web Training</a></b><b> -<br>
</b><b> </b><b><br>
</b><b> - <a href="#wsj">Multi-Channel WSJ Audio</a> -</b><b></b><o:p></o:p></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal"><b>New publications<br>
</b></p>
<p class="MsoNormal"><a name="domain"></a>(1) <a
href="http://catalog.ldc.upenn.edu/LDC2014T07">Domain-Specific
Hyponym Relations</a> was developed by the Shaanxi Province Key
Laboratory of Satellite and Terrestrial Network Technology at <a
href="http://www.xjtu.edu.cn/en/">Xi’an Jiaotung University</a>,
Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym
relations in five domains including data mining, computer
networks, data structures, Euclidean geometry and microbiology.
All hypernym and hyponym words were taken from Wikipedia article
titles. <o:p></o:p></p>
<p class="MsoNormal">A hyponym relation is a word sense relation
that is an IS-A relation. For example, dog is a hyponym of animal
and binary tree is a hyponym of tree structure. Among the
applications for domain-specific hyponym relations are taxonomy
and ontology learning, query result organization in a faceted
search and knowledge organization and automated reasoning in
knowledge-rich applications. <o:p></o:p></p>
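<p class="MsoNormal">As a minimal illustration of the IS-A structure
described above, the sketch below (Python) builds a small taxonomy
from hyponym-hypernym pairs and checks indirect IS-A relationships.
The example pairs and function names are illustrative only and are
not drawn from the corpus.</p>
<pre>
# Minimal sketch: hyponym (IS-A) relations as a small taxonomy.
# The example pairs below are illustrative, not taken from the corpus.
from collections import defaultdict

# (hyponym, hypernym) pairs, i.e. "X IS-A Y"
pairs = [
    ("binary tree", "tree structure"),
    ("tree structure", "data structure"),
    ("hash table", "data structure"),
]

hypernyms = defaultdict(set)
for hypo, hyper in pairs:
    hypernyms[hypo].add(hyper)

def is_a(term, ancestor):
    """Return True if 'term' is a direct or indirect hyponym of 'ancestor'."""
    stack = list(hypernyms[term])
    seen = set()
    while stack:
        parent = stack.pop()
        if parent == ancestor:
            return True
        if parent not in seen:
            seen.add(parent)
            stack.extend(hypernyms[parent])
    return False

print(is_a("binary tree", "data structure"))  # True, via "tree structure"
</pre>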
<p class="MsoNormal">The data is presented in XML format, and each
file provides hyponym relations in one domain. Within each file,
the term, Wikipedia URL, hyponym relation and the names of the
hyponym and hypernym words are included. The distribution of terms
and relations is set forth in the table below:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Dataset<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Terms<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Hyponym Relations<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Data Mining<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">278<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">364<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:2">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Computer Network<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">336<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">399<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:3">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Data Structure<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">315<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">578<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:4">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Euclidean Geometry<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">455<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">690<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:5;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Microbiology<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">1,028<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">3,533<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
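<p class="MsoNormal">For readers who want to work with the files
programmatically, the sketch below (Python) loads one domain file into
hyponym-hypernym pairs. The element and attribute names used here
("relation", "hypernym", "hyponym", "term") are assumptions about the
schema, not the corpus's documented format, and should be checked
against the actual XML files.</p>
<pre>
# Minimal sketch for loading one per-domain XML file into word pairs.
# ASSUMPTION: the element names ("relation", "hypernym", "hyponym") and the
# attribute name ("term") are placeholders; verify them against the corpus.
import xml.etree.ElementTree as ET

def load_relations(path):
    pairs = []
    root = ET.parse(path).getroot()
    for rel in root.iter("relation"):
        hyper = rel.find("hypernym")
        hypo = rel.find("hyponym")
        if hyper is not None and hypo is not None:
            pairs.append((hypo.get("term"), hyper.get("term")))
    return pairs

# Example usage (the file name is hypothetical):
# for hypo, hyper in load_relations("data_mining.xml"):
#     print(hypo, "IS-A", hyper)
</pre>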
<p class="MsoNormal">This data is made available at no cost under the
<a href="http://creativecommons.org/licenses/by-nc-sa/3.0/">Creative
Commons Attribution-NonCommercial-ShareAlike 3.0</a> license.</p>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><br>
<a name="gale"></a>(2) <a
href="http://catalog.ldc.upenn.edu/LDC2014T08">GALE
Arabic-English Parallel Aligned Treebank -- Web Training</a> was
developed by LDC and contains 69,766 tokens of word-aligned
Arabic-English parallel text with treebank annotations. This material
was used as training data in the DARPA GALE (Global Autonomous
Language Exploitation) program. <o:p></o:p></p>
<p class="MsoNormal">Parallel aligned treebanks are treebanks
annotated with morphological and syntactic structures aligned at
the sentence level and the sub-sentence level. Such data sets are
useful for natural language processing and related fields,
including automatic word alignment system training and evaluation,
transfer-rule extraction, word sense disambiguation, translation
lexicon extraction and cultural heritage and cross-linguistic
studies. For machine translation system development, parallel aligned
treebanks may improve system performance through enhanced syntactic
parsers, better rules and knowledge about language pairs, and reduced
word error rates.</p>
<p class="MsoNormal">In this release, the source Arabic data was
translated into English. Arabic and English treebank annotations
were performed independently. The parallel texts were then word
aligned. <o:p></o:p></p>
<p class="MsoNormal">LDC previously released Arabic-English Parallel
Aligned Treebanks as follows:<o:p></o:p></p>
<ul>
<li><a href="http://catalog.ldc.upenn.edu/LDC2013T10">Newswire</a></li>
<li><a href="http://catalog.ldc.upenn.edu/LDC2013T14">Broadcast
News Part 1</a></li>
<li><a href="http://catalog.ldc.upenn.edu/LDC2014T03">Broadcast
News Part 2</a><o:p></o:p></li>
</ul>
<p class="MsoNormal">This release consists of Arabic source web data
(newsgroups, weblogs) collected by LDC in 2004 and 2005. All data
is encoded as UTF-8. A count of files, words, tokens and segments
is below.<o:p></o:p></p>
<table class="MsoNormalTable" style="mso-cellspacing:1.5pt;
mso-yfti-tbllook:1184" border="1" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow:0;mso-yfti-firstrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Language<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Files<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Words<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Tokens<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Segments<o:p></o:p></p>
</td>
</tr>
<tr style="mso-yfti-irow:1;mso-yfti-lastrow:yes">
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">Arabic<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">162<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">46,710<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">69,766<o:p></o:p></p>
</td>
<td style="padding:.75pt .75pt .75pt .75pt">
<p class="MsoNormal">3,178<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal">Note: Word count is based on the untokenized
Arabic source, token count is based on the ATB-tokenized Arabic
source.<o:p></o:p></p>
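<p class="MsoNormal">To make the distinction concrete, the toy sketch
below (Python) contrasts the two counts. The strings are romanized
placeholders invented for illustration, not corpus data, and the
splitting shown is only a stand-in for ATB tokenization, which
separates clitics from their host words.</p>
<pre>
# Toy illustration of why the token count (ATB-tokenized) exceeds the
# word count (untokenized source): clitic segmentation turns one
# whitespace-delimited word into several tokens.
# The strings below are romanized placeholders, not real corpus data.

untokenized = "wktbhA fy Aldftr"        # 3 whitespace-delimited words
atb_tokenized = "w+ ktb +hA fy Aldftr"  # 5 tokens after clitic splitting

print(len(untokenized.split()), len(atb_tokenized.split()))  # 3 5
</pre>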
<p class="MsoNormal">The purpose of the GALE word alignment task was
to find correspondences between words, phrases or groups of words
in a set of parallel texts. Arabic-English word alignment
annotation consisted of the following tasks:<o:p></o:p></p>
<ul>
<li>Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)</li>
<li>Identifying sentence segments not suitable for annotation,
e.g., blank segments, incorrectly segmented segments and segments
containing foreign-language text</li>
<li>Tagging unmatched words attached to other words or phrases<o:p></o:p></li>
</ul>
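<p class="MsoNormal">The sketch below (Python) shows one way
annotations of the kinds listed above might be represented and
consumed in memory. The field names and link-type labels are
illustrative assumptions and do not reflect the corpus's actual file
format or label inventory.</p>
<pre>
# Minimal sketch of consuming word-alignment links of the kinds listed above.
# ASSUMPTION: the dataclass fields and type labels are illustrative; consult
# the corpus documentation for the real file format and labels.
from dataclasses import dataclass
from typing import List

@dataclass
class AlignmentLink:
    source_indices: List[int]   # Arabic token positions (0-based)
    target_indices: List[int]   # English token positions (0-based)
    link_type: str              # e.g. "translated-correct", "translated-incorrect",
                                # "not-translated-correct", "not-translated-incorrect"

def translated_pairs(links):
    """Yield (source, target) index pairs for links marked as translated."""
    for link in links:
        if link.link_type.startswith("translated"):
            for s in link.source_indices:
                for t in link.target_indices:
                    yield (s, t)

# Example usage with made-up indices:
links = [AlignmentLink([0], [0, 1], "translated-correct"),
         AlignmentLink([1], [], "not-translated-correct")]
print(list(translated_pairs(links)))  # [(0, 0), (0, 1)]
</pre>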
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="wsj"></a>(3) <a
href="http://catalog.ldc.upenn.edu/LDC2014S03">Multi-Channel WSJ
Audio</a> was developed by the <a
href="http://www.cstr.ed.ac.uk/">Centre for Speech Technology
Research</a> at the University of Edinburgh and contains
approximately 100 hours of recorded speech from 45 British English
speakers. Participants read Wall Street Journal texts published in
1987-1989 under three recording scenarios: a single stationary
speaker, two stationary overlapping speakers and a single moving
speaker.</p>
<p class="MsoNormal">This corpus was designed to address the
challenges of speech recognition in meetings, which often occur in
rooms with non-ideal acoustic conditions and significant
background noise, and may contain large sections of overlapping
speech. Using headset microphones represents one approach, but
meeting participants may be reluctant to wear them. Microphone
arrays are another option. MCWSJ supports research in large
vocabulary tasks using microphone arrays. The news sentences read
by speakers are taken from <a
href="http://catalog.ldc.upenn.edu/LDC95S24">WSJCAM0 Cambridge
Read News</a>, a corpus originally developed for large
vocabulary continuous speech recognition experiments, which in
turn was based on <a href="http://catalog.ldc.upenn.edu/LDC93S6A">CSR-I
(WSJ0) Complete</a>, made available by LDC to support large
vocabulary continuous speech recognition initiatives. <o:p></o:p></p>
<p class="MsoNormal">Speakers reading news text from prompts were
recorded using a headset microphone, a lapel microphone and an
eight-channel microphone array. In the single speaker scenario,
participants read from six fixed positions. Fixed positions were
assigned for the entire recording in the overlapping scenario. For
the moving scenario, participants moved from one position to the
next while reading. <o:p></o:p></p>
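<p class="MsoNormal">Since the corpus is intended for microphone-array
research, a common first step with the eight-channel recordings is
delay-and-sum beamforming. The sketch below (Python) is a generic
illustration: the channel file names, sample rates and per-channel
delays are placeholders, not values supplied with the corpus, and real
use requires the array geometry and a delay-estimation step.</p>
<pre>
# Minimal delay-and-sum beamforming sketch for an 8-channel recording.
# ASSUMPTIONS: the WAV file names and the integer sample delays below are
# placeholders; in practice the delays come from the array geometry and a
# source-localization / time-delay-estimation step.
import numpy as np
import soundfile as sf  # third-party: pip install soundfile

channel_files = [f"array1_ch{i}.wav" for i in range(1, 9)]  # hypothetical names
delays = [0, 3, 5, 8, 8, 5, 3, 0]  # per-channel delays in samples (placeholders)

channels = []
sr = None
for path, d in zip(channel_files, delays):
    x, sr = sf.read(path)
    channels.append(np.roll(x, -d))  # advance each channel by its delay

n = min(len(c) for c in channels)                            # common length
beamformed = np.mean([c[:n] for c in channels], axis=0)      # average aligned channels
sf.write("beamformed.wav", beamformed, sr)
</pre>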
<p class="MsoNormal">Fifteen speakers were recorded for the single
scenario, nine pairs for the overlapping scenario and nine
individuals for the moving scenario. Each read approximately 90
sentences. <o:p></o:p></p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr size="2" width="100%" align="center"> </div>
<p class="MsoNormal"><o:p> </o:p></p>
<div class="moz-text-html" lang="x-western">
<link rel="File-List"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_filelist.xml">
<link rel="Edit-Time-Data"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_editdata.mso">
<link rel="themeData"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_themedata.thmx">
<link rel="colorSchemeMapping"
href="file:///C:%5CUsers%5Celefthea%5CAppData%5CLocal%5CTemp%5Cmsohtmlclip1%5C01%5Cclip_colorschememapping.xml">
<pre class="moz-signature" cols="72">--
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</div>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>