<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p class="MsoNormal"><b><a href="#scholar">Fall 2014 LDC Data
Scholarship program: September 15 deadline approaching</a></b></p>
<p class="MsoNormal"><i>New publications:</i><b><br>
</b></p>
<p class="MsoNormal"><b><a href="#speech">GALE Phase 2 Arabic
Broadcast News Speech Part 1</a></b><b><br>
</b></p>
<p class="MsoNormal"><b><a href="#trans">GALE Phase 2 Arabic
Broadcast News Transcripts Part 1</a></b><b><br>
</b></p>
<p class="MsoNormal"><b><a href="#tac">TAC KBP Reference Knowledge
Base</a></b></p>
<hr size="2" width="100%">
<p class="MsoNormal"><a name="scholar"></a><b>Fall 2014 LDC Data
Scholarship program: September 15 deadline approaching</b><o:p></o:p></p>
<p class="MsoNormal">Student applications for the Fall 2014 LDC Data
Scholarship program are being accepted now through Monday,
September 15, 2014, 11:59PM EST. The LDC Data Scholarship program
provides university students with access to LDC data at no cost.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research
agenda and a bona fide inability to pay. <br>
<br>
Students will need to complete an application that consists of a
data use proposal and a letter of support from their adviser. For
further information on application materials and program rules,
please visit the <a
href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships">LDC
Data
Scholarship</a> page. <o:p></o:p></p>
<p class="MsoNormal">Applicants can email their materials to the <a
href="mailto:datascholarships@ldc.upenn.edu">LDC Data
Scholarship program</a>. Decisions will be sent by email from
the same address.<br>
</p>
<p class="MsoNormal"><br>
<b>New publications</b><o:p></o:p></p>
<p class="MsoNormal"><a name="speech"></a>(1) <a
href="https://catalog.ldc.upenn.edu/LDC2014S07">GALE Phase 2
Arabic Broadcast News Speech Part 1</a> was developed by LDC and
comprises approximately 165 hours of Arabic broadcast news
speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia)
and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding
transcripts are released as GALE Phase 2 Arabic Broadcast News
Transcripts Part 1 (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014T17">LDC2014T17</a>).<o:p></o:p></p>
<p class="MsoNormal">Broadcast audio for the GALE program was
collected at LDC’s Philadelphia, PA USA facilities and at three
remote collection sites: Hong Kong University of Science and
Technology, Hong Kong (Chinese), MediaNet (Tunis, Tunisia)
(Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local
and outsourced broadcast collection supported GALE at a rate of
approximately 300 hours per week of programming from more than 50
broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program.<o:p></o:p></p>
<p class="MsoNormal">The broadcast recordings in this release
feature news programs focusing principally on current events from
the following sources: Abu Dhabi TV, a television station based
in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in
Iran; Alhurra, a U.S. government-funded regional broadcaster;
Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai
TV, a broadcast station in the United Arab Emirates; Al Iraqiyah,
an Iraqi television station; Kuwait TV, a national broadcast
station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese
television station; Nile TV, a broadcast programmer based in
Egypt; Saudi TV, a national television station based in Saudi
Arabia; and Syria TV, the national television station in Syria.<o:p></o:p></p>
<p class="MsoNormal">This release contains 200 audio files presented
in <a href="http://flac.sourceforge.net">FLAC</a>-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit
PCM. Each file was audited by a native Arabic speaker following
Audit Procedure Specification Version 2.0, which is included in
this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or
faulty recordings; as an indicator of broadcast schedule changes
by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about a
program’s genre, data type and topic.<o:p></o:p></p>
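<p class="MsoNormal">As a rough sizing aid, the audio parameters stated
above (16000 Hz, single channel, 16-bit PCM, approximately 165 hours)
imply a decompressed size that can be estimated as follows. This is a
minimal sketch: the function name is illustrative, and the result is an
estimate derived from the approximate duration, not a published size
for the corpus.</p>
<pre>
```python
def pcm_size_bytes(hours, sample_rate_hz=16000, channels=1, bits_per_sample=16):
    """Return the size in bytes of raw PCM audio of the given duration."""
    bytes_per_second = sample_rate_hz * channels * (bits_per_sample // 8)
    return int(hours * 3600 * bytes_per_second)


# Approximate duration taken from the release description.
estimated = pcm_size_bytes(165)
print(f"{estimated / 1e9:.1f} GB of uncompressed PCM")
```
</pre>
<p class="MsoNormal">The FLAC files themselves will be smaller, since
FLAC compression is lossless but the achieved ratio depends on the
audio content.</p>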
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="trans"></a>(2) <a
href="https://catalog.ldc.upenn.edu/LDC2014T17">GALE Phase 2
Arabic Broadcast News Transcripts Part 1</a> was developed by
LDC and contains transcriptions of approximately 165 hours of
Arabic broadcast news speech collected in 2006 and 2007 by LDC,
MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of
the DARPA GALE (Global Autonomous Language Exploitation) program.
Corresponding audio data is released as GALE Phase 2 Arabic
Broadcast News Speech Part 1 (<a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014S07">LDC2014S07</a>).<o:p></o:p></p>
<p class="MsoNormal">The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 897,868 tokens. The transcripts were
created with the LDC-developed transcription tool, <a
href="https://www.ldc.upenn.edu/language-resources/tools/xtrans">XTrans</a>,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings. <o:p></o:p></p>
<p class="MsoNormal">The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR as part of the filename were
developed using QTR transcription. Files with QRTR in the filename
were developed using QRTR transcription.<o:p></o:p></p>
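<p class="MsoNormal">The filename convention described above can be
used to sort transcript files by transcription style. The following is
a minimal sketch only: the function name and the example filenames are
hypothetical, and the case-insensitive match is an assumption, since
the exact filename pattern is not specified here.</p>
<pre>
```python
from pathlib import Path


def transcription_style(filename):
    """Classify a transcript file as 'QRTR', 'QTR', or 'unknown' by its name."""
    # Assumption: the label may appear in either upper or lower case.
    name = Path(filename).name.upper()
    # Check QRTR first so the two labels are never confused.
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"


# Hypothetical filenames, for illustration only.
for example in ["SHOW_A_20070101.qrtr.tdf", "SHOW_B_20070102.qtr.tdf"]:
    print(example, transcription_style(example))
```
</pre>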
<br>
<p class="MsoNormal" align="center">*<o:p></o:p></p>
<p class="MsoNormal"><a name="tac"></a>(3) <a
href="https://catalog.ldc.upenn.edu/LDC2014T16">TAC KBP
Reference Knowledge Base</a> was developed by LDC in support of
the NIST-sponsored TAC-KBP evaluation series. It is a knowledge
base built from English Wikipedia articles and their associated
infoboxes and covers over 800,000 entities.<o:p></o:p></p>
<p class="MsoNormal"><a href="http://www.nist.gov/tac/">TAC</a>
(Text Analysis Conference) is a series of workshops organized by <a
href="http://www.nist.gov/">NIST</a> (the National Institute of
Standards and Technology) to encourage research in natural
language processing and related applications by providing a large
test collection, common evaluation procedures, and a forum for
researchers to share their results. TAC's KBP track (Knowledge
Base Population) encourages the development of systems that can
match entities mentioned in natural texts with those appearing in
a knowledge base, extract novel information about entities from
a document collection, and add it to a new or existing knowledge
base.<o:p></o:p></p>
<p class="MsoNormal">Consult the LDC <a
href="https://www.ldc.upenn.edu/collaborations/current-projects/tac-kbp">TAC-KBP</a>
project page for further information about LDC's resource
development for the TAC-KBP program.<o:p></o:p></p>
<p class="MsoNormal">The source data, Wikipedia infoboxes and
articles, was taken from an October 2008 snapshot of Wikipedia.<o:p></o:p></p>
<p class="MsoNormal">TAC KBP Reference Knowledge Base contains a set
of entities, each with a canonical name and title for the
Wikipedia page, an entity type, an automatically parsed version of
the data from the infobox in the entity's Wikipedia article, and a
stripped version of the text of the Wikipedia article. Each entity is
assigned one of four types: PER (person), ORG (organization), GPE
(geo-political entity) and UKN (unknown). All data files are
presented as UTF-8 encoded XML.<o:p></o:p></p>
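<p class="MsoNormal">One way to picture an entity record as described
above is the in-memory structure below. This is a sketch under stated
assumptions: the class and field names are hypothetical and do not
reflect the element or attribute names used in the XML files; only the
four entity types and the listed components come from the description.</p>
<pre>
```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    """The four entity types named in the release description."""
    PER = "person"
    ORG = "organization"
    GPE = "geo-political entity"
    UKN = "unknown"


@dataclass
class KBEntity:
    """Hypothetical in-memory view of one knowledge base entity."""
    name: str                  # canonical name
    wiki_title: str            # title of the Wikipedia page
    entity_type: EntityType
    infobox_facts: dict = field(default_factory=dict)  # parsed infobox data
    wiki_text: str = ""        # stripped article text


# Made-up example values, for illustration only.
entity = KBEntity(name="Example Person", wiki_title="Example_Person",
                  entity_type=EntityType["PER"])
print(entity.entity_type.name, entity.entity_type.value)
```
</pre>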
<br>
<hr size="2" width="100%">
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</body>
</html>