<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p class="MsoNormal"><b><a href="#scholar">Fall 2014 LDC Data Scholarship program: September 15 deadline approaching</a></b></p>

    <p class="MsoNormal"><i>New publications:</i><b><br>

      </b></p>

    <p class="MsoNormal"><b><a href="#speech">GALE Phase 2 Arabic

          Broadcast News Speech Part 1</a></b><b><br>

      </b></p>

    <p class="MsoNormal"><b><a href="#trans">GALE Phase 2 Arabic

          Broadcast News Transcripts Part 1</a></b><b><br>

      </b></p>

    <p class="MsoNormal"><b><a href="#tac">TAC KBP Reference Knowledge

          Base</a></b></p>

    <hr size="2" width="100%">


    <p class="MsoNormal"><a name="scholar"></a><b>Fall 2014 LDC Data Scholarship program: September 15 deadline approaching</b></p>

    <p class="MsoNormal">Student applications for the Fall 2014 LDC Data
      Scholarship program are being accepted now through Monday,
      September 15, 2014, 11:59 PM EST. The LDC Data Scholarship program
      provides university students with access to LDC data at no cost.
      The program is open to students pursuing undergraduate or
      graduate studies at an accredited college or university. LDC Data

      Scholarships are not restricted to any particular field of study;

      however, students must demonstrate a well-developed research

      agenda and a bona fide inability to pay.  <br>

      <br>

      Students must complete an application consisting of a
      data use proposal and a letter of support from their adviser. For
      further information on application materials and program rules,
      please visit the <a
        href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships">LDC
        Data Scholarship</a> page.</p>

    <p class="MsoNormal">Applicants can email their materials to the <a

        href="mailto:datascholarships@ldc.upenn.edu">LDC Data

        Scholarship program</a>. Decisions will be sent by email from

      the same address.<br>

    </p>

    <p class="MsoNormal"><b>New publications</b></p>

    <p class="MsoNormal"><a name="speech"></a>(1) <a

        href="https://catalog.ldc.upenn.edu/LDC2014S07">GALE Phase 2

        Arabic Broadcast News Speech Part 1</a> was developed by LDC and
      comprises approximately 165 hours of Arabic broadcast news
      speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia),
      and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global
      Autonomous Language Exploitation) program. Corresponding
      transcripts are released as GALE Phase 2 Arabic Broadcast News
      Transcripts Part 1 (<a
        href="https://catalog.ldc.upenn.edu/LDC2014T17">LDC2014T17</a>).</p>

    <p class="MsoNormal">Broadcast audio for the GALE program was
      collected at LDC’s Philadelphia, PA, USA facilities and at three
      remote collection sites: Hong Kong University of Science and
      Technology (Hong Kong; Chinese), MediaNet (Tunis, Tunisia;
      Arabic), and MTC (Rabat, Morocco; Arabic). The combined local
      and outsourced broadcast collection supported GALE at a rate of
      approximately 300 hours per week of programming from more than 50
      broadcast sources, for a total of over 30,000 hours of collected
      broadcast audio over the life of the program.</p>

    <p class="MsoNormal">The broadcast recordings in this release

      feature news programs focusing principally on current events from

      the following sources: Abu Dhabi TV, a television station based

      in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in

      Iran; Alhurra, a U.S. government-funded regional broadcaster;

      Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai

      TV, a broadcast station in the United Arab Emirates; Al Iraqiyah,

      an Iraqi television station; Kuwait TV, a national broadcast

      station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese

      television station; Nile TV, a broadcast programmer based in

      Egypt; Saudi TV, a national television station based in Saudi

      Arabia; and Syria TV, the national television station in Syria.</p>

    <p class="MsoNormal">This release contains 200 audio files presented

      in <a href="http://flac.sourceforge.net">FLAC</a>-compressed

      Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit

      PCM. Each file was audited by a native Arabic speaker following

      Audit Procedure Specification Version 2.0, which is included in

      this release. The broadcast auditing process served three

      principal goals: as a check on the operation of the broadcast

      collection system equipment by identifying failed, incomplete or

      faulty recordings; as an indicator of broadcast schedule changes

      by identifying instances when the incorrect program was recorded;

      and as a guide for data selection by retaining information about a

      program’s genre, data type and topic.</p>

    <br>


    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="trans"></a>(2) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T17">GALE Phase 2

        Arabic Broadcast News Transcripts Part 1</a> was developed by
      LDC and contains transcriptions of approximately 165 hours of
      Arabic broadcast news speech collected in 2006 and 2007 by LDC,
      MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of
      the DARPA GALE (Global Autonomous Language Exploitation) program.
      Corresponding audio data is released as GALE Phase 2 Arabic
      Broadcast News Speech Part 1 (<a
        href="https://catalog.ldc.upenn.edu/LDC2014S07">LDC2014S07</a>).</p>

    <p class="MsoNormal">The transcript files are in plain-text,

      tab-delimited format (TDF) with UTF-8 encoding, and the

      transcribed data totals 897,868 tokens. The transcripts were

      created with the LDC-developed transcription tool, <a

        href="https://www.ldc.upenn.edu/language-resources/tools/xtrans">XTrans</a>,

      a multi-platform, multilingual, multi-channel transcription tool

      that supports manual transcription and annotation of audio

      recordings.</p>
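    <p class="MsoNormal">As a rough illustration of working with
      tab-delimited UTF-8 transcripts like these, the Python sketch below
      reads one file and counts whitespace-separated tokens in a chosen
      column. It is a minimal sketch, not LDC tooling: the function names,
      the treatment of lines starting with ";;" as comments, and the
      assumption that the first remaining line is a header row are all
      hypothetical and should be checked against the documentation in the
      release.</p>
<pre>
```python
def read_tdf(path, comment_prefix=";;"):
    """Read a tab-delimited UTF-8 file into (header, rows).

    Assumptions (not confirmed by the announcement): lines that start
    with comment_prefix are comments, and the first remaining line is
    a header row naming the columns.
    """
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Skip blank lines and assumed comment lines.
            if not line or line.startswith(comment_prefix):
                continue
            rows.append(line.split("\t"))
    if not rows:
        return [], []
    return rows[0], rows[1:]


def count_tokens(rows, text_column):
    """Count whitespace-separated tokens in the column at index text_column."""
    # row[text_column:] is empty for short rows, so they are skipped.
    return sum(len(row[text_column].split()) for row in rows if row[text_column:])
```
</pre>
    <p class="MsoNormal">Whitespace tokenization is only one possible
      convention, so a count obtained this way need not match the 897,868
      tokens reported for the corpus.</p>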

    <p class="MsoNormal">The files in this corpus were transcribed by

      LDC staff and/or by transcription vendors under contract to LDC.

      Transcribers followed LDC's quick transcription guidelines (QTR)

      and quick rich transcription specification (QRTR), both of which

      are included in the documentation with this release. QTR

      transcription consists of quick (near-)verbatim, time-aligned

      transcripts plus speaker identification with minimal additional

      mark-up. It does not include sentence unit annotation. QRTR

      annotation adds structural information such as topic boundaries

      and manual sentence unit annotation to the core components of a

      quick transcript. Files with QTR in the filename were
      developed using QTR transcription; files with QRTR in the filename
      were developed using QRTR transcription.</p>

    <br>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="tac"></a>(3) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T16">TAC KBP

        Reference Knowledge Base</a> was developed by LDC in support of

      the NIST-sponsored TAC-KBP evaluation series. It is a knowledge

      base built from English Wikipedia articles and their associated

      infoboxes and covers over 800,000 entities.</p>

    <p class="MsoNormal"><a href="http://www.nist.gov/tac/">TAC</a>

      (Text Analysis Conference) is a series of workshops organized by <a

        href="http://www.nist.gov/">NIST</a> (the National Institute of

      Standards and Technology) to encourage research in natural

      language processing and related applications by providing a large

      test collection, common evaluation procedures, and a forum for

      researchers to share their results. TAC's KBP (Knowledge
      Base Population) track encourages the development of systems that can
      match entities mentioned in natural texts with those appearing in
      a knowledge base, extract novel information about entities from
      a document collection, and add it to a new or existing knowledge
      base.</p>

    <p class="MsoNormal">Consult the LDC <a

        href="https://www.ldc.upenn.edu/collaborations/current-projects/tac-kbp">TAC-KBP</a>

      project page for further information about LDC's resource

      development for the TAC-KBP program.</p>

    <p class="MsoNormal">The source data, Wikipedia infoboxes and

      articles, was taken from an October 2008 snapshot of Wikipedia.</p>

    <p class="MsoNormal">TAC KBP Reference Knowledge Base contains a set

      of entities, each with a canonical name and title for the

      Wikipedia page, an entity type, an automatically parsed version of

      the data from the infobox in the entity's Wikipedia article, and a

      stripped version of the text of the Wikipedia article. Each entity is
      assigned one of four types: PER (person), ORG (organization), GPE
      (geo-political entity), or UKN (unknown). All data files are
      presented as UTF-8 encoded XML.</p>
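    <p class="MsoNormal">Because the data files are XML and each entity
      carries a type, a per-type tally is a natural first check. The
      Python sketch below streams one file and counts entities by type.
      It is a sketch under stated assumptions: the element name "entity"
      and the attribute name "type" are guesses used for illustration, not
      names confirmed by this announcement, so adjust both parameters to
      match the corpus documentation.</p>
<pre>
```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_entity_types(path, entity_tag="entity", type_attr="type"):
    """Tally entity types in one XML file, parsing it incrementally.

    entity_tag and type_attr are assumed names (hypothetical defaults);
    entities lacking the attribute are counted under "MISSING".
    """
    counts = Counter()
    for _event, element in ET.iterparse(path, events=("end",)):
        if element.tag == entity_tag:
            counts[element.get(type_attr, "MISSING")] += 1
            # Drop the finished subtree to keep memory use low.
            element.clear()
    return counts
```
</pre>
    <p class="MsoNormal">Parsing incrementally rather than loading the
      whole tree is a deliberate choice here, since a knowledge base of
      over 800,000 entities with article text may be large.</p>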

    <br>

    <hr size="2" width="100%">

    <pre class="moz-signature" cols="72">-- 
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

  </body>

</html>