<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    <div class="moz-text-html" lang="x-western">

      <div align="center"><i>New publications:</i></div>

      <p class="MsoNormal" style="text-align: center;" align="center">LDC2012T02<br>

        <b> -  <a href="#tb">English Translation Treebank: An Nahar
            Newswire</a>  -</b></p>

      <p class="MsoNormal" style="text-align: center;" align="center">LDC2012S04<br>

        <b> -  <a href="#malto">Malto Speech and Transcripts</a>  -</b></p>

      <hr width="100%" size="2"><br>

      <p class="MsoNormal" style="margin-bottom: 12pt; text-align:

        center;" align="center"><b>New Publications</b></p>

      <p class="MsoNormal" style=""><a name="tb"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T02">English

          Translation Treebank: An Nahar Newswire</a> was developed by

        LDC and consists of 599 distinct newswire stories from the

        Lebanese publication An Nahar translated from Arabic to English

        and annotated for part-of-speech and syntactic structure. </p>

      <p class="MsoNormal" style="">This corpus is part of an ongoing

        effort at LDC to produce parallel Arabic and English treebanks.

        Both part-of-speech and syntactic annotation follow the Penn
        Treebank II guidelines, with three kinds of changes: revised
        tokenization of hyphenated words; part-of-speech and tree
        changes necessitated by that tokenization; and revisions to
        the syntactic annotation to comply with the updated annotation
        guidelines (including the "Treebank-PropBank merge" or
        "Treebank IIa" and "treebank c" changes). The original Penn

        Treebank II guidelines, addenda describing changes to the

        guidelines and the tokenization specifications can be found on

        LDC's <a

href="http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/">website</a>.</p>

      <p class="MsoNormal" style="">The data consists of 461,489 tokens

        in 599 individual files. The news stories in this release were

        published in An Nahar in 2002.</p>

      <p class="MsoNormal" style="">The English source files

        (translated from the Arabic) were automatically tokenized,

        part-of-speech tagged and parsed; the tokens, tags and parses

        were manually corrected. The quality control process consisted

        of a series of specific searches for over 100 types of potential

        inconsistency and parse or annotation error. Any errors found in

        those searches were manually corrected. </p>

      <p class="MsoNormal" style="">Annotations are in the following two

        formats:</p>

      <ul type="disc">

        <li class="MsoNormal" style="line-height: normal;">Penn Style

          Trees </li>

        <ul type="circle">

          <li class="MsoNormal" style="line-height: normal;">Bracketed

            tree files following the basic form (NODE (TAG token)). Each

            sentence is surrounded by a pair of empty parentheses.</li>

        </ul>

        <li class="MsoNormal" style="line-height: normal;">AG xml </li>

        <ul type="circle">

          <li class="MsoNormal" style="line-height: normal;">TreeEditor

            .xml stand-off annotation files. These files contain the POS

            and Treebank annotation and reference the source files by

            character offset. The DTD files for the AG xml files have
            been moved from the location given in the readme, for
            consistency with LDC publications.</li>

        </ul>

      </ul>
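      <p class="MsoNormal" style="">As an illustration of the bracketed
        form (NODE (TAG token)) described above, the following is a
        minimal Python sketch of how such a tree might be read. It is
        not part of the release. The function names parse_tree and
        leaves are hypothetical, the example sentence is invented rather
        than taken from the corpus, and the sketch assumes that one
        sentence, wrapped in its outer pair of unlabeled parentheses, is
        passed in at a time.</p>

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    # Split into "(", ")" and bare symbols (node labels, tags, tokens).
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return items

    return node()

def leaves(node):
    """Collect (tag, token) pairs from the parsed tree, left to right."""
    # A leaf is a two-item node made only of strings: [TAG, token].
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        return [(node[0], node[1])]
    pairs = []
    for child in node:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs

# Invented example sentence (an assumption, not corpus data).
example = "( (S (NP (DT The) (NN story)) (VP (VBD ran))) )"
print(leaves(parse_tree(example)))
```

      <p class="MsoNormal" style="">The outer unlabeled pair parses to a
        list with a single child, so the same code handles the sentence
        wrapper without special cases.</p>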

      <div align="center">*<br>

      </div>

      <p class="MsoNormal" style=""><a name="malto"></a>(2) <a

href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012S04">Malto

          Speech and Transcripts</a> was developed by Masato Kobayashi,

        Associate Professor in Linguistics at the University of Tokyo

        (Japan), and Bablu Tirkey, research scholar at the Tribal and

        Regional Languages Department, Ranchi University (India). It

        contains approximately 8 hours of Malto speech data collected

        between 2005 and 2009 from 27 speakers (22 male, 5 female).

        Also included are accompanying transcripts, English translations

        and glosses for 6 hours of the collection. Speakers were asked

        to talk about themselves, their lives, rituals and folklore;

        elicitation interviews were then conducted. The goal of the work

        was to present the current state and dialectal variation of

        Malto.</p>

      <p class="MsoNormal" style="">Malto is a Dravidian language spoken

        in northeastern India (principally the states of Bihar,

        Jharkhand and West Bengal) and Bangladesh by people called the

        Pahariyas. Indian census data place the number of Malto
        speakers between 100,000 and 200,000.

        Most Malto speakers live in the three northeastern districts of

        Jharkhand, namely Sahebganj, Godda and Pakur; the fieldwork that

        resulted in this corpus was conducted in those districts. Of the

        Pahariyas in that area, three subtribes, the Sawriya Pahariyas,

        the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak

        Malto. </p>

      <p class="MsoNormal" style="">The transcribed data accounts for 6

        hours of the collection and contains 21 speakers (17 male, 4

        female). The untranscribed data accounts for 2 hours of the

        collection and contains 10 speakers (9 male, 1 female). Four of
        the male speakers are present in both groups, which is why the
        collection has 27 distinct speakers rather than 31.</p>

      <p class="MsoNormal" style="">All audio is presented in .wav

        format. Each audio file name includes a subject number, village

        name, speaker name and the topic discussed. The transcripts and

        glossary are UTF-8 text files. Because of ambiguities that occur

        when writing Malto in Devanagari script, the transcripts were

        developed using Roman script with symbols adapted from the

        International Phonetic Alphabet (IPA) but are not considered

         phonetic transcripts.</p>

      <p class="MsoNormal" style="">The first 100 copies distributed to
        non-LDC member organizations are available free of charge.
        Shipping and handling fees apply.</p>

      <hr width="100%" size="2"><br>

      <pre class="moz-signature" cols="72">--

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

    </div>


  </body>

</html>