<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div align="center"><i>New publications:</i></div>
<p class="MsoNormal" align="center">LDC2012T05<b><br>
</b>- <b><a href="#depend">Chinese Dependency Treebank 1.0</a> </b>
-
<b><br>
<br>
</b>LDC2012T06 <b><br>
</b>- <b> <a href="#gale">GALE Phase 2 Arabic Broadcast
Conversation Parallel Text Part 1</a></b><b> -</b><br>
<br>
<b> </b>LDC2012S06 <b><br>
</b><a
href="#turk">- <b>Turkish Broadcast News Speech and
Transcripts</b></a> -</p>
<div class="MsoNormal" style="text-align:center" align="center">
<hr align="center" size="2" width="100%"></div>
<div align="center"><b>New Publications</b><br
style="mso-special-character:line-break">
</div>
<p class="MsoNormal"> <br style="mso-special-character:line-break">
</p>
<p class="MsoNormal"><a name="depend"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T05">Chinese
Dependency
Treebank 1.0</a> was developed by the <a
href="http://en.hit.edu.cn/">Harbin Institute of Technology's</a>
<a href="http://ir.hit.edu.cn/english/">Research Center for Social
Computing and Information Retrieval</a> (HIT-SCIR). It contains
49,996 Chinese sentences (902,191 words) randomly selected from
People's Daily newswire stories published between 1992 and 1996
and annotated with syntactic dependency structures. Ill-formed or
short sentences were removed from the random sample before
annotation. The data was segmented and
annotated for part of speech (POS), syntactic structures, verb
subclasses and noun compounds. Word segmentation and POS tagging
were accomplished automatically using statistical models trained
on a larger, annotated corpus of People's Daily newswire stories.
Humans manually annotated the syntactic structures and corrected
word segmentation errors. POS tags were not corrected.</p>
<p class="MsoNormal">The data is provided in CoNLL-X format and
encoded in UTF-8. </p>
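<p class="MsoNormal">As an illustration of working with data in this
form, the following Python sketch reads CoNLL-X text, which by the
format's definition has ten tab-separated columns per token and a
blank line between sentences. The function names and the two-token
sample are hypothetical, invented here for illustration; they are
not taken from the corpus or its documentation.</p>
<pre>
```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-X format.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLLX_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


def dependency_arcs(sentence):
    """Return (head form, relation, dependent form) triples.

    A head ID with no matching token (such as 0) is shown as "ROOT".
    """
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(forms.get(tok["head"], "ROOT"), tok["deprel"], tok["form"])
            for tok in sentence]


# Hypothetical two-token sentence with made-up tags (not corpus data).
SAMPLE = ("1\two\t_\tr\tr\t_\t2\tsubj\t_\t_\n"
          "2\tlai\t_\tv\tv\t_\t0\troot\t_\t_\n"
          "\n")

sentences = list(read_conllx(SAMPLE.splitlines(keepends=True)))
arcs = dependency_arcs(sentences[0])
```
</pre>
<p class="MsoNormal">To read a file instead of the in-memory sample,
an open file object can be passed to read_conllx, opened with
encoding="utf-8" to match the encoding stated above.</p>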
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal" align="center">*</p>
<p class="MsoNormal"><br>
<a name="gale"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T06">GALE
Phase 2 Arabic Broadcast Conversation Parallel Text Part 1</a> was
developed by LDC. Along with other corpora, the parallel text in
this release comprised machine translation training data for Phase
2 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text
and corresponding English translations selected from broadcast
conversation (BC) data collected by LDC between 2004 and 2007 and
transcribed by LDC or under its direction. </p>
<p class="MsoNormal">GALE Phase 2 Arabic Broadcast Conversation
Parallel Text Part 1 includes 36 source-translation document
pairs, comprising 169,109 words of Arabic source text and its
English translation. Data is drawn from thirteen distinct Arabic
programs broadcast between 2004 and 2007 from the following
sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and
Radio Sawa. Broadcast conversation programming is generally more
interactive than traditional news broadcasts and includes talk
shows, interviews, call-in programs and roundtable discussions.
The programs in this release focus on current events topics. </p>
<p class="MsoNormal">The files in this release were transcribed by
LDC staff and/or transcription vendors under contract to LDC in
accordance with <a
href="http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf">Quick
Rich
Transcription</a> guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Arabic-to-English translation guidelines, which are included with this
release. Bilingual LDC staff performed quality control procedures
on the completed translations.</p>
<p class="MsoNormal">Source data and translations are distributed in
TDF format. All data is encoded in UTF-8.</p>
<p class="MsoNormal"><br>
<br>
</p>
<p class="MsoNormal" align="center">*</p>
<p class="MsoNormal"><br>
<a name="turk"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S06">Turkish
Broadcast News Speech and Transcripts</a> was developed by <a
href="http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx">Boğaziçi
University</a>, Istanbul, Turkey and contains approximately 130
hours of Voice of America (VOA) Turkish radio broadcasts and
corresponding transcripts. This release is part of a larger corpus of
Turkish broadcast news data collected and transcribed to facilitate
research in Turkish automatic speech recognition and its
applications, such as speech retrieval. </p>
<p class="MsoNormal">The VOA material was collected between December
2006 and June 2009 using a PC and TV/radio card setup. The data
collected during the period 2006-2008 was recorded from analog FM
radio; the 2009 broadcasts were recorded from digital satellite
transmissions. A quick manual segmentation and transcription
approach was followed.</p>
<p class="MsoNormal">The data was recorded at 32 kHz and resampled
to 16 kHz. After screening for recording quality, the files were
segmented, transcribed, and verified. Segmentation occurred in
two steps: an initial automatic segmentation, followed by manual
correction and annotation, which included information such as
background conditions and speaker boundaries. </p>
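<p class="MsoNormal">For readers unfamiliar with the rate change
mentioned above, halving a sample rate from 32 kHz to 16 kHz can be
sketched in Python as follows. This is not the procedure used for
the corpus, which is not described here. It uses a crude two-sample
average as the low-pass step, whereas practical resamplers use a
better filter. The function name and the constant test signal are
hypothetical placeholders.</p>
<pre>
```python
def decimate_by_two(samples):
    """Average each consecutive pair of samples, halving the sample rate.

    A trailing unpaired sample is dropped.
    """
    pairs = zip(samples[0::2], samples[1::2])
    return [(a + b) / 2 for a, b in pairs]


SOURCE_RATE_HZ = 32000
TARGET_RATE_HZ = SOURCE_RATE_HZ // 2

# One second of a constant placeholder signal (not real audio).
one_second = [0.5] * SOURCE_RATE_HZ
resampled = decimate_by_two(one_second)
```
</pre>
<p class="MsoNormal">One second of input at the source rate yields
one second of output at the target rate, with half as many
samples.</p>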
<p class="MsoNormal">The transcription guidelines were adapted from
the LDC HUB4 and quick transcription guidelines. An English
version of the adapted guidelines is provided with the data.
Manual segmentation and transcripts were created by native Turkish
speakers at Boğaziçi University using <a
href="http://trans.sourceforge.net/en/presentation.php">Transcriber</a>.
The transcriptions are provided in the ISO-8859-9 (Latin5)
character set.</p>
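<p class="MsoNormal">Because these transcripts use ISO-8859-9
rather than UTF-8, some users may want to re-encode them. A minimal
Python sketch, assuming the standard library codec name
"iso-8859-9"; the function name is hypothetical, and the sample
word is written with escapes so that the code stays ASCII:</p>
<pre>
```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin-5) bytes as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


# The university name with its Turkish letters (g with breve, c with cedilla).
word = "Bo\u011fazi\u00e7i"
latin5_bytes = word.encode("iso-8859-9")   # one byte per character
utf8_bytes = latin5_to_utf8(latin5_bytes)

# Decoding the same bytes as ISO-8859-1 does not fail, but the byte for
# g with breve is read as a different letter, so the text is silently wrong.
misread = latin5_bytes.decode("iso-8859-1")
```
</pre>
<p class="MsoNormal">When reading a transcript file directly,
passing encoding="iso-8859-9" to Python's built-in open() has the
same effect as the explicit decode step.</p>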
<br>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</body>
</html>