<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<i>New publications:</i><b><br>
<br>
- <a href="#chinese">Chinese Discourse Treebank 0.5</a> -<br>
<br>
- <a href="#gale">GALE Arabic-English Word Alignment --
Broadcast Training Part 2</a> -<br>
<br>
- <a href="#un">United Nations Proceedings Speech</a> -</b>
<hr size="2" width="100%"><b>New publications</b>
<p class="MsoNormal"><a name="chinese"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014T21">Chinese
Discourse Treebank 0.5</a> was developed at Brandeis University
as part of the <a href="http://www.cs.brandeis.edu/%7Eclp/ctb/">Chinese
Treebank Project</a> and consists of approximately 73,000 words
of Chinese newswire text annotated for discourse relations. It
follows the lexically grounded approach of the Penn Discourse
Treebank (PDTB) (<a
href="https://catalog.ldc.upenn.edu/LDC2008T05">LDC2008T05</a>)
with adaptations based on the linguistic and statistical
characteristics of Chinese text. Discourse relations are lexically
anchored by discourse connectives (e.g., because, but, therefore),
which are viewed as predicates that take abstract objects such as
propositions, events and states as their arguments. Along with
PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese
Discourse Treebank provides an additional perspective on how the
PDTB approach can be extended for cross-lingual annotation of
discourse relations.<o:p></o:p></p>
<p class="MsoNormal">Data was selected from the newswire material in
Chinese Treebank 8.0 (<a
href="https://catalog.ldc.upenn.edu/LDC2013T21">LDC2013T21</a>),
specifically from Xinhua News Agency stories. There are
approximately 5,500 annotation instances. Following the PDTB
format, each annotation instance consists of 27
vertical-bar-delimited fields. The fields specify the attributes of the
discourse relation as a whole, as well as the attributes of its
two arguments. Not all fields are filled in this release. Filled
fields are indicated by a pair of angle brackets; the remaining
fields are placeholders for future releases.</p>
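<p class="MsoNormal">As a rough illustration of the field layout described above, the following Python sketch splits one annotation line into its 27 fields and picks out the filled ones. It assumes that a filled field is one whose text is wrapped in angle brackets; the helper names, the field positions and the sample line are hypothetical and are not taken from the corpus documentation.</p>

```python
EXPECTED_FIELDS = 27  # field count stated in the release description


def parse_instance(line):
    """Split one annotation line into its vertical-bar-delimited fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def filled_fields(fields):
    """Return {index: content} for fields wrapped in angle brackets.

    Assumption: a filled field looks like "<content>"; anything else
    is treated as an unfilled placeholder.
    """
    filled = {}
    for index, field in enumerate(fields):
        text = field.strip()
        if len(text) >= 2 and text.startswith("<") and text.endswith(">"):
            filled[index] = text[1:-1]
    return filled


# Synthetic example line (positions and values are made up).
sample_fields = [""] * EXPECTED_FIELDS
sample_fields[0] = "<Explicit>"
sample_fields[5] = "<because>"
sample = "|".join(sample_fields)

result = filled_fields(parse_instance(sample))
print(result)
```

<p class="MsoNormal">Rejecting lines that do not have exactly 27 fields is a design choice that surfaces malformed input early; the actual files should be checked against the documentation shipped with the corpus.</p>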
<p class="MsoNormal">*</p>
<p class="MsoNormal"><a name="gale"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T22">GALE
Arabic-English Word Alignment -- Broadcast Training Part 2</a>
was developed by LDC and contains 215,923 tokens of word-aligned
Arabic and English parallel text enriched with linguistic tags.
This material was used as training data in the DARPA GALE (Global
Autonomous Language Exploitation) program. Some approaches to
statistical machine translation incorporate linguistic knowledge
into word-aligned text as a means to improve
automatic word alignment and machine translation quality. This is
accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation
relations by using minimum-match and attachment annotation
approaches. The tagging scheme defines a set of word tags and
alignment link tags to describe these translation units
and relations. Tagging adds contextual, syntactic and
language-specific features to the alignment annotation.</p>
<p class="MsoNormal">This release consists of Arabic source
broadcast news and broadcast conversation data collected by LDC
from 2007 to 2009. The Arabic word alignment tasks consisted of the
following components:</p>
<p class="MsoNormal">Normalizing tokenized tokens as needed<o:p></o:p></p>
<p class="MsoNormal">Identifying different types of links<o:p></o:p></p>
<p class="MsoNormal">Identifying sentence segments not suitable for
annotation<o:p></o:p></p>
<p class="MsoNormal">Tagging unmatched words attached to other words
or phrases<o:p></o:p></p>
<p class="MsoNormal">*<o:p></o:p></p>
<p class="MsoNormal"><a name="un"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014S08">United Nations
Proceedings Speech</a> was developed by the <a href="http://www.un.org/">United
Nations</a> (UN) and contains approximately 8,500 hours of
recorded proceedings in the six official UN languages: Arabic,
Chinese, English, French, Russian and Spanish. The data was
recorded in 2009-2012 from sessions 64-66 of the <a
href="http://www.un.org/en/ga/">General Assembly</a> (GA) and <a
href="http://www.un.org/en/ga/first/">First Committee</a> (FC)
(Disarmament and International Security), and meetings 6434-6763
of the <a href="http://www.un.org/en/sc/">Security Council</a>.<o:p></o:p></p>
<p class="MsoNormal">Recordings were made using a customized system
following a daily internally circulated instruction from the <a
href="http://www.un.org/depts/DGACM/mms.shtml">Meetings
Management Section</a>. Most of the subjects and information
related to a particular meeting or session are published in the UN
Journal, which can be found <a
href="http://www.un.org/en/documents/journal.asp">here</a>.</p>
<p class="MsoNormal">Data is presented as either mp3 or
flac-compressed wav. Files are 16-bit, single-channel, sampled at
either 22,050 or 8,000 Hz, and organized by committee and session
number, then language. The folder labeled "Floor" contains the
recordings from the microphone used by the particular speaker.
Those files may include other languages, for instance, if the
speaker's language was not among the six official UN languages.</p>
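<p class="MsoNormal">The audio parameters above can be checked programmatically once a file has been decoded to plain WAV. The Python sketch below uses only the standard <code>wave</code> module. It assumes the mp3 or flac files have already been decoded with an external tool (not shown); the function names are hypothetical, and the test signal is generated in memory rather than taken from the corpus.</p>

```python
import io
import wave

ALLOWED_RATES = {22050, 8000}  # sample rates stated in the release description


def matches_release_format(wav_file):
    """Return True if a WAV file is 16-bit, single-channel, 22,050 or 8,000 Hz.

    `wav_file` may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as reader:
        return (
            reader.getnchannels() == 1
            and reader.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and reader.getframerate() in ALLOWED_RATES
        )


def make_test_wav(rate, channels=1, sampwidth=2, seconds=1):
    """Build an in-memory WAV file of silence for demonstration purposes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sampwidth)
        writer.setframerate(rate)
        writer.writeframes(b"\x00" * (rate * channels * sampwidth * seconds))
    buffer.seek(0)
    return buffer


print(matches_release_format(make_test_wav(8000)))
print(matches_release_format(make_test_wav(44100)))
```

<p class="MsoNormal">A check like this is only a sanity test on decoded audio; it says nothing about the directory layout or the language of a given recording.</p>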
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>