[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Apr 23 19:12:37 UTC 2013


New publications:

- GALE Phase 2 Chinese Broadcast Conversation Speech
- GALE Phase 2 Chinese Broadcast Conversation Transcripts
- NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

------------------------------------------------------------------------

*New publications*


(1) GALE Phase 2 Chinese Broadcast Conversation Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013S04> 
(LDC2013S04) was developed by LDC and comprises approximately 120 
hours of Chinese broadcast conversation speech collected in 2006 and 
2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong 
Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language 
Exploitation) Program.

Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast 
Conversation Transcripts (LDC2013T08 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013T08>).

Broadcast audio for the GALE program was collected at the Philadelphia, 
PA, USA facilities of LDC and at three remote collection sites: HKUST 
(Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco 
(Arabic). The combined local and outsourced broadcast collection 
supported GALE at a rate of approximately 300 hours per week of 
programming from more than 50 broadcast sources for a total of over 
30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature 
interviews, call-in programs and roundtable discussions focusing 
principally on current events from the following sources: Anhui TV, a 
regional television station in Mainland China, Anhui Province; China 
Central TV (CCTV), a national and international broadcaster in Mainland 
China; Hubei TV, a regional broadcaster in Mainland China, Hubei 
Province; and Phoenix TV, a Hong Kong-based satellite television 
station. A table showing the number of programs and hours recorded from 
each source is contained in the readme file.

This release contains 202 audio files presented in Waveform Audio File 
format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited 
by a native Chinese speaker following Audit Procedure Specification 
Version 2.0, which is included in this release. The broadcast auditing 
process served three principal goals: as a check on the operation of the 
broadcast collection system equipment by identifying failed, incomplete 
or faulty recordings; as an indicator of broadcast schedule changes by 
identifying instances when the incorrect program was recorded; and as a 
guide for data selection by retaining information about the genre, data 
type and topic of a program.
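
As a rough illustration of the audio format described above, the following Python sketch checks whether a .wav file is 16000 Hz, single-channel, 16-bit PCM. It is not part of the release; the function name check_wav_format and the returned dictionary keys are hypothetical, and the sketch uses only the Python standard library.

```python
import wave

# Expected format as stated in the release description.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def check_wav_format(path):
    """Return (ok, details) for a WAV file compared against the expected format.

    Hypothetical helper for illustration; not shipped with the corpus.
    """
    with wave.open(path, "rb") as wav:
        details = {
            "rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    ok = (
        details["rate_hz"] == EXPECTED_RATE_HZ
        and details["channels"] == EXPECTED_CHANNELS
        and details["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
    )
    return ok, details
```

Summing duration_s over all files would give a quick sanity check against the approximate number of hours quoted for the release.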



(2) GALE Phase 2 Chinese Broadcast Conversation Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T08> 
(LDC2013T08) was developed by LDC and contains transcriptions of 
approximately 120 hours of Chinese broadcast conversation speech 
collected in 2006 and 2007 by LDC and Hong Kong University of Science and 
Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global 
Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 2 Chinese Broadcast 
Conversation Speech (LDC2013S04 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2013S04>).

The source broadcast conversation recordings feature interviews, call-in 
programs and roundtable discussions focusing principally on current 
events from the following sources: Anhui TV, a regional television 
station in Mainland China, Anhui Province; China Central TV (CCTV), a 
national and international broadcaster in Mainland China; Hubei TV, a 
regional broadcaster in Mainland China, Hubei Province; and Phoenix TV, 
a Hong Kong-based satellite television station.

The transcript files are in plain-text, tab-delimited format (TDF) with 
UTF-8 encoding, and the transcribed data totals 1,523,373 tokens. The 
transcripts were created with the LDC-developed transcription tool, 
XTrans <http://www.ldc.upenn.edu/tools/XTrans/downloads/>, a 
multi-platform, multilingual, multi-channel transcription tool that 
supports manual transcription and annotation of audio recordings.
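
Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with standard tools. The Python sketch below splits each non-empty line into fields. It deliberately assumes nothing about the column layout, which is described in the documentation shipped with the release; the function names and the sample field values are hypothetical placeholders.

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited text stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks in the transcript text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # csv.reader returns an empty list for blank lines
            yield row


def read_tdf_file(path):
    """Read a UTF-8, tab-delimited file into a list of rows (illustrative helper)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(read_tab_delimited(handle))


# Usage with made-up placeholder content rather than real corpus data.
sample = io.StringIO("field_a\tfield_b\n\nvalue_1\tvalue_2\n")
rows = list(read_tab_delimited(sample))
```

Any header or comment lines in the real files would need to be handled according to the format documentation included with the corpus.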

The files in this corpus were transcribed by LDC staff and/or by 
transcription vendors under contract to LDC. Transcribers followed LDC's 
quick transcription guidelines (QTR) and quick rich transcription 
specification (QRTR), both of which are included in the documentation 
with this release. QTR transcription consists of quick (near-)verbatim, 
time-aligned transcripts plus speaker identification with minimal 
additional mark-up. It does not include sentence unit annotation. QRTR 
annotation adds structural information such as topic boundaries and 
manual sentence unit annotation to the core components of a quick 
transcript. Files with QTR as part of the filename were developed using 
QTR transcription. Files with QRTR in the filename indicate QRTR 
transcription.
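
The filename convention above lends itself to a simple check. The following Python sketch sorts filenames by transcription style; the function name and the example filenames are hypothetical, and the case-insensitive match is an assumption rather than something the release documentation is known to specify.

```python
def transcription_style(filename):
    """Return "QRTR", "QTR", or None based on a marker in the filename.

    Assumption: the marker may appear in either upper or lower case.
    """
    name = filename.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


# Made-up example filenames, not actual corpus filenames.
examples = ["program_a.qtr.tdf", "program_b.qrtr.tdf", "readme.txt"]
styles = {name: transcription_style(name) for name in examples}
```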



(3) NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T07> 
(LDC2013T07) was developed by NIST Multimodal Information Group 
<http://nist.gov/itl/iad/mig/>. This release contains the evaluation 
sets (source data and human reference translations), DTD, scoring 
software, and evaluation plans for the Arabic-to-English and 
Chinese-to-English progress test sets for the NIST OpenMT 2008, 2009, 
and 2012 evaluations. The test data remained unseen between evaluations 
and was reused unchanged each time. The package was compiled, and 
scoring software was developed, at NIST, making use of Chinese and 
Arabic newswire and web data and reference translations collected and 
developed by LDC.

The objective of the OpenMT evaluation series is to support research in, 
and help advance the state of the art of, machine translation (MT) 
technologies -- technologies that translate text between human 
languages. Input may include all forms of text. The goal is for the 
output to be an adequate and fluent translation of the original.

The MT evaluation series started in 2001 as part of the DARPA TIDES 
(Translingual Information Detection, Extraction and Summarization) program. Beginning with 
the 2006 evaluation, the evaluations have been driven and coordinated by 
NIST as NIST OpenMT. These evaluations provide an important contribution 
to the direction of research efforts and the calibration of technical 
capabilities in MT. The OpenMT evaluations are intended to be of 
interest to all researchers working on the general problem of automatic 
translation between human languages. To this end, they are designed to 
be simple, to focus on core technology issues and to be fully supported. 
For more general information about the NIST OpenMT evaluations, please 
refer to the NIST OpenMT website 
<http://www.nist.gov/itl/iad/mig/openmt.cfm>.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that 
may be used to produce a translation quality score for one (or more) MT 
systems. The script works by comparing the system output translation 
with a set of (expert) reference translations of the same source text. 
Comparison is based on finding sequences of words in the reference 
translations that match word sequences in the system output translation.
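
The idea of matching word sequences can be illustrated with a short Python sketch. This is not the mteval-v13a.pl implementation: it assumes simple whitespace tokenization, counts only n-grams of one length at a time, and omits the tokenization rules, weighting and length penalties a full metric would apply. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    An n-gram in the system output counts as matched at most as many times
    as it occurs in the single reference where it is most frequent.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    max_ref_counts = Counter()
    for reference in references:
        ref_counts = Counter(ngrams(reference.split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total


# Toy English sentences invented for illustration.
system_output = "the cat sat on the mat"
reference_set = ["the cat is on the mat", "there is a cat on the mat"]
unigram_result = clipped_ngram_matches(system_output, reference_set, 1)  # (5, 6)
bigram_result = clipped_ngram_matches(system_output, reference_set, 2)   # (3, 5)
```

Using several references, as in this release, gives a system credit for any of the acceptable wordings rather than for a single translator's choice.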

This release contains 257 documents (2,748 segments) with corresponding 
source and reference files, the latter of which contain four independent 
human reference translations of the source data. The source data is comprised 
of Arabic and Chinese newswire and web data collected by LDC in 2007. 
The table below displays statistics by source, genre, documents, 
segments and source tokens.

Source    Genre      Documents   Segments   Source Tokens
-------   --------   ---------   --------   -------------
Arabic    Newswire          84        784           20039
Arabic    Web Data          51        594           14793
Chinese   Newswire          82        688           26923
Chinese   Web Data          40        682           19112




------------------------------------------------------------------------


-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu

