[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Aug 25 20:18:24 UTC 2014


*Fall 2014 LDC Data Scholarship program: September 15 deadline approaching*

/New publications:/

*GALE Phase 2 Arabic Broadcast News Speech Part 1*

*GALE Phase 2 Arabic Broadcast News Transcripts Part 1*

*TAC KBP Reference Knowledge Base*

------------------------------------------------------------------------

*Fall 2014 LDC Data Scholarship program: September 15 deadline approaching*

Student applications for the Fall 2014 LDC Data Scholarship program are 
being accepted now through Monday, September 15, 2014, 11:59PM EST.  The 
LDC Data Scholarship program provides university students with access to 
LDC data at no cost. This program is open to students pursuing both 
undergraduate and graduate studies in an accredited college or 
university. LDC Data Scholarships are not restricted to any particular 
field of study; however, students must demonstrate a well-developed 
research agenda and a bona fide inability to pay.

Students will need to complete an application that consists of a data 
use proposal and a letter of support from their adviser.  For further 
information on application materials and program rules, please visit the 
LDC Data Scholarship 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.

Applicants can email their materials to the LDC Data Scholarship program 
<mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent by email 
from the same address.


*New publications*

(1) GALE Phase 2 Arabic Broadcast News Speech Part 1 
<https://catalog.ldc.upenn.edu/LDC2014S07> was developed by LDC and 
comprises approximately 165 hours of Arabic broadcast news speech 
collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC 
(Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous 
Language Exploitation) Program. Corresponding transcripts are released 
as GALE Phase 2 Arabic Broadcast News Transcripts Part 1 (LDC2014T17 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014T17>).

Broadcast audio for the GALE program was collected at LDC's 
Philadelphia, PA USA facilities and at three remote collection sites: 
Hong Kong University of Science and Technology, Hong Kong (Chinese), 
MediaNet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). 
The combined local and outsourced broadcast collection supported GALE at 
a rate of approximately 300 hours per week of programming from more than 
50 broadcast sources for a total of over 30,000 hours of collected 
broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing 
principally on current events from the following sources: Abu Dhabi TV, 
a television station based in Abu Dhabi, United Arab Emirates; Al Alam 
News Channel, based in Iran; Alhurra, a U.S. government-funded regional 
broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; 
Dubai TV, a broadcast station in the United Arab Emirates; Al Iraqiyah, 
an Iraqi television station; Kuwait TV, a national broadcast station in 
Kuwait; Lebanese Broadcasting Corporation, a Lebanese television 
station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a 
national television station based in Saudi Arabia; and Syria TV, the 
national television station in Syria.

This release contains 200 audio files presented in FLAC 
<http://flac.sourceforge.net>-compressed Waveform Audio File format 
(.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a 
native Arabic speaker following Audit Procedure Specification Version 
2.0, which is included in this release. The broadcast auditing process 
served three principal goals: as a check on the operation of the 
broadcast collection system equipment by identifying failed, incomplete 
or faulty recordings; as an indicator of broadcast schedule changes by 
identifying instances when the incorrect program was recorded; and as a 
guide for data selection by retaining information about a program's 
genre, data type and topic.
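
As a rough illustration of what the stated audio format implies for storage, the sketch below estimates the uncompressed PCM size of the release. It relies only on the figures given above (about 165 hours, 16000 Hz, single channel, 16-bit); the function name is made up for this example, and the actual FLAC-compressed size on disk will be smaller by an amount not stated in this announcement.

```python
# Figures taken from the release description above.
SAMPLE_RATE_HZ = 16_000   # 16000 Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel


def uncompressed_pcm_bytes(hours: float) -> int:
    """Return the raw PCM size in bytes for `hours` of audio (hypothetical helper)."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# "Approximately 165 hours" is itself an estimate, so this is only a ballpark.
total = uncompressed_pcm_bytes(165)
print(f"about {total / 1e9:.1f} GB of raw PCM before FLAC compression")
```

This works out to roughly 19 GB of raw audio, which may help when planning disk space for decoding the files.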



(2) GALE Phase 2 Arabic Broadcast News Transcripts Part 1 
<https://catalog.ldc.upenn.edu/LDC2014T17> was developed by LDC and 
contains transcriptions of approximately 165 hours of Arabic broadcast 
news speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) 
and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global 
Autonomous Language Exploitation) program. Corresponding audio data is 
released as GALE Phase 2 Arabic Broadcast News Speech Part 1 (LDC2014S07 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014S07>).

The transcript files are in plain-text, tab-delimited format (TDF) with 
UTF-8 encoding, and the transcribed data totals 897,868 tokens. The 
transcripts were created with the LDC-developed transcription tool, 
XTrans <https://www.ldc.upenn.edu/language-resources/tools/xtrans>, a 
multi-platform, multilingual, multi-channel transcription tool that 
supports manual transcription and annotation of audio recordings.
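
For readers who want to load the transcripts programmatically, here is a minimal sketch of reading a tab-delimited, UTF-8 file in Python. This announcement does not spell out the column layout, so everything about the layout is an assumption made for illustration: the column names in the sample, a header row whose fields look like "name;type", and comment lines starting with ";;". Please check the documentation shipped with the release for the real layout.

```python
import csv
import io


def read_tdf(stream):
    """Yield one dict per data row from a tab-delimited text stream.

    Assumptions (not confirmed by the announcement): the first
    non-comment row is a header whose fields look like "name;type",
    and rows whose first field starts with ";;" are comments.
    """
    header = None
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith(";;"):
            continue  # skip blank lines and comment lines
        if header is None:
            # Keep only the column name, dropping any ";type" suffix.
            header = [field.split(";", 1)[0] for field in row]
            continue
        yield dict(zip(header, row))


# Invented sample data; the column names here are hypothetical.
SAMPLE = (
    "file;unicode\tstart;float\tend;float\tspeaker;unicode\ttranscript;unicode\n"
    ";;MM this line is treated as a comment\n"
    "example_file\t0.00\t2.50\tspeaker1\tfirst segment\n"
    "example_file\t2.50\t6.00\tspeaker2\tsecond segment\n"
)

rows = list(read_tdf(io.StringIO(SAMPLE)))
total_seconds = sum(float(r["end"]) - float(r["start"]) for r in rows)
print(len(rows), "segments,", total_seconds, "seconds")
```

For a real file, the stream would be something like open(path, encoding="utf-8", newline="") in place of the in-memory sample.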

The files in this corpus were transcribed by LDC staff and/or by 
transcription vendors under contract to LDC. Transcribers followed LDC's 
quick transcription guidelines (QTR) and quick rich transcription 
specification (QRTR), both of which are included in the documentation 
with this release. QTR transcription consists of quick (near-)verbatim, 
time-aligned transcripts plus speaker identification with minimal 
additional mark-up. It does not include sentence unit annotation. QRTR 
annotation adds structural information such as topic boundaries and 
manual sentence unit annotation to the core components of a quick 
transcript. Files with QTR as part of the filename were developed using 
QTR transcription; files with QRTR in the filename were developed using 
QRTR transcription.




(3) TAC KBP Reference Knowledge Base 
<https://catalog.ldc.upenn.edu/LDC2014T16> was developed by LDC in 
support of the NIST-sponsored TAC-KBP evaluation series. It is a 
knowledge base built from English Wikipedia articles and their 
associated infoboxes and covers over 800,000 entities.

TAC <http://www.nist.gov/tac/> (Text Analysis Conference) is a series of 
workshops organized by NIST <http://www.nist.gov/> (the National 
Institute of Standards and Technology) to encourage research in natural 
language processing and related applications by providing a large test 
collection, common evaluation procedures, and a forum for researchers to 
share their results. TAC's KBP track (Knowledge Base Population) 
encourages the development of systems that can match entities mentioned 
in natural texts with those appearing in a knowledge base and extract 
novel information about entities from a document collection and add it 
to a new or existing knowledge base.

Consult the LDC TAC-KBP 
<https://www.ldc.upenn.edu/collaborations/current-projects/tac-kbp> 
project page for further information about LDC's resource development 
for the TAC-KBP program.

The source data, Wikipedia infoboxes and articles, was taken from an 
October 2008 snapshot of Wikipedia.

TAC KBP Reference Knowledge Base contains a set of entities, each with a 
canonical name and title for the Wikipedia page, an entity type, an 
automatically parsed version of the data from the infobox in the 
entity's Wikipedia article, and a stripped version of the text of the 
Wikipedia article. Each entity is assigned one of four types: PER (person), 
ORG (organization), GPE (geo-political entity) and UKN (unknown). All 
data files are presented as UTF-8 encoded XML.
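
To give a feel for working with entity records of this kind, the sketch below parses a small invented XML sample with Python's standard library and counts entities by type. Only the four entity types and the general contents of an entity (name, page title, type, infobox data, article text) come from the description above. The element and attribute names (knowledge_base, entity, id, name, type, wiki_title, facts, fact, wiki_text) and the sample values are assumptions for illustration and may not match the released files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The four entity types listed in the corpus description.
ENTITY_TYPES = {"PER", "ORG", "GPE", "UKN"}

# Invented sample; element and attribute names are hypothetical.
SAMPLE_XML = """<knowledge_base>
  <entity id="E1" name="Example Person" type="PER" wiki_title="Example_Person">
    <facts>
      <fact name="occupation">linguist</fact>
    </facts>
    <wiki_text>Example Person is a linguist.</wiki_text>
  </entity>
  <entity id="E2" name="Example City" type="GPE" wiki_title="Example_City">
    <facts>
      <fact name="country">Exampleland</fact>
    </facts>
    <wiki_text>Example City is a city.</wiki_text>
  </entity>
</knowledge_base>"""


def parse_entities(xml_text):
    """Return a list of dicts, one per <entity> element (assumed schema)."""
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.iter("entity"):
        entity_type = node.get("type")
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unexpected entity type: {entity_type!r}")
        entities.append({
            "id": node.get("id"),
            "name": node.get("name"),
            "title": node.get("wiki_title"),
            "type": entity_type,
            # Parsed infobox data as a name -> value mapping.
            "facts": {f.get("name"): (f.text or "") for f in node.iter("fact")},
            # Stripped article text.
            "text": node.findtext("wiki_text", default=""),
        })
    return entities


entities = parse_entities(SAMPLE_XML)
type_counts = Counter(e["type"] for e in entities)
print(type_counts)
```

With over 800,000 entities, a streaming approach such as ET.iterparse would likely be a better fit than loading whole files into memory; the in-memory version is used here only to keep the sketch short.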


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                  ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
