<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div align="center"><i>New publications:</i></div>

    <p class="MsoNormal" align="center">LDC2012T05<b><br>

      </b>-  <b><a href="#depend">Chinese Dependency Treebank 1.0</a> </b>

      -

      <b><br>

        <br>

      </b>LDC2012T06 <b><br>

      </b>- <b> <a href="#gale">GALE Phase 2 Arabic Broadcast

          Conversation Parallel Text Part 1</a></b><b>  -</b><br>

      <br>

      <b> </b>LDC2012S06   <b><br>

      </b>- <b><a href="#turk">Turkish Broadcast News Speech and
          Transcripts</a></b>  -</p>

    <div class="MsoNormal" style="text-align:center" align="center">

      <hr align="center" size="2" width="100%"></div>

    <div align="center"><b>New Publications</b><br

        style="mso-special-character:line-break">

    </div>

    <p class="MsoNormal"> <br style="mso-special-character:line-break">

    </p>

    <p class="MsoNormal"><a name="depend"></a>(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T05">Chinese

Dependency

        Treebank 1.0</a> was developed by the <a

        href="http://en.hit.edu.cn/">Harbin Institute of Technology's</a>

      <a href="http://ir.hit.edu.cn/english/">Research Center for Social

        Computing and Information Retrieval</a> (HIT-SCIR). It contains

      49,996 Chinese sentences (902,191 words) randomly selected from

      People's Daily newswire stories published between 1992 and 1996

      and annotated with syntactic dependency structures. Ill-formed or
      short sentences were removed from the random sample
      prior to annotation. The data was segmented and

      annotated for part of speech (POS), syntactic structures, verb

      subclasses and noun compounds. Word segmentation and POS tagging

      were accomplished automatically using statistical models trained

      on a larger, annotated corpus of People's Daily newswire stories.

      Human annotators marked up the syntactic structures and corrected
      word segmentation errors. POS tags were not corrected.</p>

    <p class="MsoNormal">The data is provided in CoNLL-X format and
      encoded in UTF-8. </p>
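    <p class="MsoNormal">For readers new to the format, the following is
      a minimal sketch of reading CoNLL-X data in Python, assuming the
      standard ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG,
      POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines
      between sentences. The function name <code>read_conllx</code>, the
      sample sentence and its tags are illustrative assumptions and are
      not taken from the corpus.</p>

```python
from collections import namedtuple

# The ten tab-separated columns defined by the CoNLL-X shared task format.
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]
Token = namedtuple("Token", FIELDS)


def read_conllx(lines):
    """Yield each sentence as a list of Token tuples.

    Sentences are separated by blank lines; every token line must have
    exactly ten tab-separated columns.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(cols))
        token = Token(*cols)
        # ID and HEAD are integers; HEAD 0 marks the root of the sentence.
        sentence.append(token._replace(id=int(token.id), head=int(token.head)))
    if sentence:
        yield sentence


# Hypothetical three-token sentence written for this sketch only;
# the tags are illustrative and may not match the corpus tag set.
SAMPLE = (
    "1\t我\t我\tr\tr\t_\t2\tSBV\t_\t_\n"
    "2\t爱\t爱\tv\tv\t_\t0\tHED\t_\t_\n"
    "3\t北京\t北京\tns\tns\t_\t2\tVOB\t_\t_\n"
    "\n"
)

# Print each token with its relation label and the form of its head.
for sent in read_conllx(SAMPLE.splitlines()):
    for tok in sent:
        parent = "ROOT" if tok.head == 0 else sent[tok.head - 1].form
        print(tok.form, tok.deprel, parent)
```

    <p class="MsoNormal">To read a file from the release rather than an
      in-memory sample, the same function could be passed a file object
      opened with <code>encoding="utf-8"</code>.</p>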

    <p class="MsoNormal"><br>

    </p>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><br>

      <a name="gale"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T06">GALE
        Phase 2 Arabic Broadcast Conversation Parallel Text Part 1</a> was

      developed by LDC. Along with other corpora, the parallel text in

      this release comprised machine translation training data for Phase

      2 of the DARPA GALE (Global Autonomous Language Exploitation)

      Program. This corpus contains Modern Standard Arabic source text

      and corresponding English translations selected from broadcast

      conversation (BC) data collected by LDC between 2004 and 2007 and

      transcribed by LDC or under its direction. </p>

    <p class="MsoNormal">GALE Phase 2 Arabic Broadcast Conversation

      Parallel Text Part 1 includes 36 source-translation document

      pairs, comprising 169,109 words of Arabic source text and its

      English translation. Data is drawn from thirteen distinct Arabic

      programs broadcast between 2004 and 2007 from the following

      sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and

      Radio Sawa. Broadcast conversation programming is generally more

      interactive than traditional news broadcasts and includes talk

      shows, interviews, call-in programs and roundtable discussions.

      The programs in this release focus on current events topics. </p>

    <p class="MsoNormal">The files in this release were transcribed by

      LDC staff and/or transcription vendors under contract to LDC in

      accordance with <a

href="http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf">Quick

Rich

        Transcription</a> guidelines developed by LDC. Transcribers

      indicated sentence boundaries in addition to transcribing the

      text. Data was manually selected for translation according to

      several criteria, including linguistic features, transcription

      features and topic features. The transcribed and segmented files

      were then reformatted into a human-readable translation format and

      assigned to translation vendors. Translators followed LDC's
      Arabic-to-English translation guidelines, which are included with this
      release. Bilingual LDC staff performed quality control procedures

      on the completed translations.</p>

    <p class="MsoNormal">Source data and translations are distributed in
      TDF format. All data is encoded in UTF-8.</p>

    <p class="MsoNormal"><br>

      <br>

    </p>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><br>

      <a name="turk"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S06">Turkish
        Broadcast News Speech and Transcripts</a> was developed by <a

        href="http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx">Boğaziçi

        University</a>, Istanbul, Turkey and contains approximately 130

      hours of Voice of America (VOA) Turkish radio broadcasts and

      corresponding transcripts. This is part of a larger corpus of

      Turkish broadcast news data collected and transcribed with the
      goal of facilitating research in Turkish automatic speech

      recognition and its applications, such as speech retrieval. </p>

    <p class="MsoNormal">The VOA material was collected between December

      2006 and June 2009 using a PC and TV/radio card setup. The data

      collected during the period 2006-2008 was recorded from analog FM

      radio; the 2009 broadcasts were recorded from digital satellite

      transmissions. A quick manual segmentation and transcription

      approach was followed.</p>

    <p class="MsoNormal">The data was recorded at 32 kHz and resampled
      to 16 kHz. After screening for recording quality, the files were
      segmented, transcribed, and verified. Segmentation occurred in
      two steps: an initial automatic segmentation, followed by manual
      correction and annotation, which included information such as
      background conditions and speaker boundaries. </p>

    <p class="MsoNormal">The transcription guidelines were adapted from

      the LDC HUB4 and quick transcription guidelines. An English

      version of the adapted guidelines is provided with the data.

      Manual segmentation and transcripts were created by native Turkish

      speakers at Boğaziçi University using <a

        href="http://trans.sourceforge.net/en/presentation.php">Transcriber</a>.

      The transcriptions are provided in the ISO-8859-9 (Latin5)

      character set.</p>
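    <p class="MsoNormal">Because the transcripts use ISO-8859-9 rather
      than UTF-8, they may need to be decoded explicitly. The following
      is a minimal Python sketch of that conversion; the helper name
      <code>latin5_to_utf8</code> and the sample text are assumptions
      made for illustration and are not part of the release.</p>

```python
def latin5_to_utf8(data):
    """Decode ISO-8859-9 (Latin5) bytes and re-encode the text as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


# Hypothetical sample line; real transcript bytes would be read from a file.
text = "Boğaziçi Üniversitesi"
latin5_bytes = text.encode("iso-8859-9")

# Round trip: the UTF-8 output decodes back to the original text.
utf8_bytes = latin5_to_utf8(latin5_bytes)
print(utf8_bytes.decode("utf-8"))

# Decoding the same bytes as ISO-8859-1 silently corrupts Turkish letters,
# because the two character sets assign different letters to six byte values.
print(latin5_bytes.decode("iso-8859-1"))
```

    <p class="MsoNormal">When reading a transcript file directly,
      passing <code>encoding="iso-8859-9"</code> to Python's
      <code>open</code> should have the same effect as the explicit
      decode step.</p>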

    <br>

    <hr size="2" width="100%">

    <pre class="moz-signature" cols="72">

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>


  </body>

</html>