<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <i>New publications:</i><b><br>

      <br>

      -  <a href="#chinese">Chinese Discourse Treebank 0.5</a>  -<br>

      <br>

      -  <a href="#gale">GALE Arabic-English Word Alignment --

        Broadcast Training Part 2</a>  -<br>

      <br>

      -  <a href="#un">United Nations Proceedings Speech</a>  -</b>

    <hr size="2" width="100%"><b>New publications</b>

    <p class="MsoNormal"><a name="chinese"></a>(1) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T21">Chinese

        Discourse Treebank 0.5</a> was developed at Brandeis University

      as part of the <a href="http://www.cs.brandeis.edu/%7Eclp/ctb/">Chinese

        Treebank Project</a> and consists of approximately 73,000 words

      of Chinese newswire text annotated for discourse relations. It

      follows the lexically grounded approach of the Penn Discourse

      Treebank (PDTB) (<a

        href="https://catalog.ldc.upenn.edu/LDC2008T05">LDC2008T05</a>)

      with adaptations based on the linguistic and statistical

      characteristics of Chinese text. Discourse relations are lexically

      anchored by discourse connectives (e.g., because, but, therefore),

      which are viewed as predicates that take abstract objects such as

      propositions, events and states as their arguments. Along with

      PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese

      Discourse Treebank provides an additional perspective on how the

      PDTB approach can be extended for cross-lingual annotation of

      discourse relations.</p>

    <p class="MsoNormal">Data was selected from the newswire material in

      Chinese Treebank 8.0 (<a

        href="https://catalog.ldc.upenn.edu/LDC2013T21">LDC2013T21</a>),

      specifically, from Xinhua News Agency stories. There are

      approximately 5,500 annotation instances. Following the PDTB

      format, each annotation instance consists of 27

      vertical-bar-delimited fields. The fields specify the attributes of the

      discourse relation as a whole, as well as the attributes of its

      two arguments. Not all fields are filled in this release. Filled

      fields are indicated by a pair of angle brackets; the remaining

      fields are placeholders for future releases.</p>
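    <p class="MsoNormal">As a rough illustration of the field layout
      described above, the following Python sketch splits one annotation
      instance on the vertical bar and checks that it has 27 fields. The
      function names and the sample line are hypothetical and invented
      for this example; they are not taken from the release, and the
      sketch does not interpret the angle-bracket marking of filled
      fields.</p>

    <pre>
```python
EXPECTED_FIELDS = 27  # field count stated in the description above


def split_instance(line):
    """Split one annotation instance into its vertical-bar-delimited fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


def non_empty_positions(fields):
    """Return the zero-based positions of fields that contain any text."""
    return [i for i, value in enumerate(fields) if value.strip()]


# Hypothetical line: the values and their positions are invented for illustration.
sample = "|".join(["Explicit"] + [""] * 25 + ["example"])
fields = split_instance(sample)
print(non_empty_positions(fields))
```
    </pre>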

    <p class="MsoNormal">*</p>

    <p class="MsoNormal"><a name="gale"></a>(2) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T22">GALE

        Arabic-English Word Alignment -- Broadcast Training Part 2</a>

      was developed by LDC and contains 215,923 tokens of word-aligned

      Arabic and English parallel text enriched with linguistic tags.

      This material was used as training data in the DARPA GALE (Global

      Autonomous Language Exploitation) program. Some approaches to

      statistical machine translation include the incorporation of

      linguistic knowledge in word-aligned text as a means to improve

      automatic word alignment and machine translation quality. This is

      accomplished with two annotation schemes: alignment and tagging.

      Alignment identifies minimum translation units and translation

      relations by using minimum-match and attachment annotation

      approaches. The tagging scheme defines a set of word tags and

      alignment link tags to describe these translation units

      and relations. Tagging adds contextual, syntactic and

      language-specific features to the alignment annotation.</p>

    <p class="MsoNormal">This release consists of Arabic source

      broadcast news and broadcast conversation data collected by LDC

      from 2007 to 2009. The Arabic word alignment tasks consisted of the

      following components:</p>

    <ul>

      <li>Normalizing tokenized tokens as needed</li>

      <li>Identifying different types of links</li>

      <li>Identifying sentence segments not suitable for annotation</li>

      <li>Tagging unmatched words attached to other words or phrases</li>

    </ul>

    <p class="MsoNormal">*</p>

    <p class="MsoNormal"><a name="un"></a>(3) <a

        href="https://catalog.ldc.upenn.edu/LDC2014S08">United Nations

        Proceedings Speech</a> was developed by the <a href="http://www.un.org/">United

        Nations</a> (UN) and contains approximately 8,500 hours of

      recorded proceedings in the six official UN languages: Arabic,

      Chinese, English, French, Russian and Spanish. The data was

      recorded in 2009-2012 from sessions 64-66 of the <a

        href="http://www.un.org/en/ga/">General Assembly</a> (GA) and <a

        href="http://www.un.org/en/ga/first/">First Committee</a> (FC)

      (Disarmament and International Security), and meetings 6434-6763

      of the <a href="http://www.un.org/en/sc/">Security Council</a>.</p>

    <p class="MsoNormal">Recordings were made using a customized system

      following a daily, internally circulated instruction from the <a

        href="http://www.un.org/depts/DGACM/mms.shtml">Meetings

        Management Section</a>. Most of the subjects and information

      related to a particular meeting or session are published in a UN

      Journal, which can be found <a

        href="http://www.un.org/en/documents/journal.asp">here</a>.</p>

    <p class="MsoNormal">Data is presented as either MP3 or

      FLAC-compressed WAV, in 16-bit single-channel files sampled at

      either 22,050 or 8,000 Hz, and is organized by committee and

      session number, then by language. The folder labeled "Floor" holds

      the recordings from the microphone used by the particular speaker.

      Those files may include other languages, for instance, if the

      speaker's language was not among the six official UN languages.

    </p>

    <hr size="2" width="100%">

    <pre class="moz-signature" cols="72">-- 


Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>