[Corpora-List] New from the LDC

Wed Jun 25 18:08:42 UTC 2008

LDC2008S05 - 2005 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>

LDC2008T09 - GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09>

The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.

------------------------------------------------------------------------

New Publications

(1) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09> 
is the second part of the three-part GALE Phase 1 Arabic Broadcast News 
Parallel Text, which, along with other corpora, was used as training 
data in year 1 (Phase 1) of the DARPA-funded GALE program. The corpus 
contains transcripts and English translations of 10.7 hours of Arabic 
broadcast news programming selected from various sources. This corpus 
does not contain the audio files from which the transcripts and 
translations were generated.

The Arabic broadcast news recordings were selected from four sources and 
four different programs.   A manual selection procedure was used to 
choose data appropriate for the GALE program, namely, news and 
conversation programs focusing on current events. Stories on topics such 
as sports, entertainment news, and stock market reports were excluded 
from the data set. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich
Transcription specification. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU.

After transcription and SU annotation, the files were reformatted into a
human-readable translation format and then assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names
and speech disfluencies), and quality control procedures applied to
completed translations.

***

(2) The 2005 NIST Language Recognition Evaluation 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05> 
corpus contains the evaluation data, portions of the training data, the 
evaluation plan, answer keys and scoring script for the 2005 NIST 
(National Institute of Standards and Technology) Language Recognition 
Evaluation (LRE). The goal of the LRE is to establish the baseline of 
current performance capability for language recognition of 
conversational telephone speech and to lay the groundwork for further 
research efforts in the field. NIST conducted two previous evaluations 
in 1996 <http://www.nist.gov/speech/tests/lang/1996/LRE96EvalPlan.pdf> 
and 2003 
<http://www.nist.gov/speech/tests/lang/2003/LRE03EvalPlan-v1.pdf>. For 
the 2005 NIST LRE, the emphasis was on research directed toward a 
general base of technology to be ported to various language recognition 
tasks with minimum effort and the development of the ability to make 
more difficult discriminations between similar languages and dialects of 
the same language.

The task evaluated was the detection of a given target language or 
dialect. From a test segment of speech and a target language or dialect, 
the system to be evaluated determined whether the speech was from the 
target language or dialect. The evaluation consisted of speech from the 
following languages and dialects:

    * English (American)
    * English (Indian)
    * Hindi
    * Japanese
    * Korean
    * Mandarin (Mainland)
    * Mandarin (Taiwan)
    * Spanish (Mexican)
    * Tamil
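
As an illustration of the detection task described above, the following is a minimal sketch of how a set of detection trials might be tallied against an answer key. It is not the scoring script shipped with the corpus; the `Trial` class, the `error_rates` function, the segment identifiers, and the choice of miss and false-alarm rates as summary figures are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One detection trial (hypothetical structure, not the NIST format)."""
    segment_id: str   # assumed identifier for a test segment
    target: str       # target language or dialect, e.g. "Hindi"
    decision: bool    # system output: True means "segment is the target"


def error_rates(trials, answer_key):
    """Return (miss_rate, false_alarm_rate) over a list of trials.

    answer_key maps segment_id -> true language/dialect label.
    """
    misses = false_alarms = n_target = n_nontarget = 0
    for t in trials:
        is_target = answer_key[t.segment_id] == t.target
        if is_target:
            n_target += 1
            if not t.decision:
                misses += 1        # target segment was rejected
        else:
            n_nontarget += 1
            if t.decision:
                false_alarms += 1  # non-target segment was accepted
    miss_rate = misses / n_target if n_target else 0.0
    fa_rate = false_alarms / n_nontarget if n_nontarget else 0.0
    return miss_rate, fa_rate
```

Each segment can be paired with several target languages, so one segment typically yields one target trial and several non-target trials.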

The 2005 NIST Language Recognition Evaluation Plan, which includes a 
description of the evaluation tasks, is included with this release. 
Further information regarding this evaluation is also available at the 
NIST Language Recognition Evaluation 
<http://www.nist.gov/speech/tests/lang/> website.

Each speech file is one side of a telephone conversation. There are
11,106 speech files in SPHERE (.sph) format for a total of 44.2 hours of
speech. The speech data was compiled from LDC's CALLFRIEND corpora and 
from data collected by Oregon Health and Science University.
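
For a rough sense of scale, the figures above imply a mean file length of about 14 seconds. The short calculation below only restates the two numbers from this announcement; the variable names are illustrative, and the result is an average over all files, not a statement about any individual file.

```python
# Figures taken from the announcement text.
total_hours = 44.2
num_files = 11_106

# Mean duration per speech file, in seconds.
mean_seconds = total_hours * 3600 / num_files
print(f"{mean_seconds:.1f} seconds per file on average")
```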

Each test segment was prepared using an automatic speech activity 
detection algorithm to identify areas and durations of speech. Segments 
were chosen to contain a specified approximate duration of actual 
speech. The test segments contain three nominal durations of speech: 3 
seconds, 10 seconds, and 30 seconds. Performance was evaluated 
separately for test segments of each duration. Auxiliary information was 
included in the SPHERE headers to document the source file, start time, 
and duration of all excerpts that were used to construct the segment.
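
Because the auxiliary information lives in the SPHERE headers, it can be read without decoding any audio. The following is a minimal sketch of a header reader, assuming the usual SPHERE layout: a `NIST_1A` line, a header-size line, `name -type value` fields, and a closing `end_head` line. The function name `parse_sphere_header` is hypothetical, and the field names in the usage example (`source_file`, `start_time`) are placeholders; the actual field names used in this corpus are documented with the release.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the text header of a SPHERE file into a dict.

    `raw` should hold at least the header bytes from the start of the file.
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE header")
    fields = {"_header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue                 # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":            # integer field
            fields[name] = int(value)
        elif ftype == "-r":          # real-valued field
            fields[name] = float(value)
        else:                        # string field, written as -sN
            fields[name] = value
    return fields


# Usage with a synthetic header (field names below are placeholders).
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 1\n"
    b"source_file -s8 demo.sph\n"
    b"start_time -r 12.5\n"
    b"end_head\n"
)
header = parse_sphere_header(sample)
```

For a real file, reading the first `_header_size` bytes (commonly 1024) is enough to recover every header field.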

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
