[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Jun 25 18:08:42 UTC 2008
LDC2008S05 - 2005 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>
LDC2008T09 - GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09>

The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.
------------------------------------------------------------------------
New Publications

(1) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T09>
is the second part of the three-part GALE Phase 1 Arabic Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. The corpus
contains transcripts and English translations of 10.7 hours of Arabic
broadcast news programming selected from various sources. This corpus
does not contain the audio files from which the transcripts and
translations were generated.
The Arabic broadcast news recordings were selected from four sources and
four different programs. A manual selection procedure was used to
choose data appropriate for the GALE program, namely, news and
conversation programs focusing on current events. Stories on topics such
as sports, entertainment news, and stock market reports were excluded
from the data set. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich
Transcription specification. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU.
After transcription and SU annotation, the files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names and
speech disfluencies), and quality control procedures applied to
completed translations.
***
(2) The 2005 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>
corpus contains the evaluation data, portions of the training data, the
evaluation plan, answer keys and scoring script for the 2005 NIST
(National Institute of Standards and Technology) Language Recognition
Evaluation (LRE). The goal of the LRE is to establish the baseline of
current performance capability for language recognition of
conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted two previous evaluations
in 1996 <http://www.nist.gov/speech/tests/lang/1996/LRE96EvalPlan.pdf>
and 2003
<http://www.nist.gov/speech/tests/lang/2003/LRE03EvalPlan-v1.pdf>. For
the 2005 NIST LRE, the emphasis was on research toward a general base of
technology that could be ported to various language recognition tasks
with minimum effort, and on developing the ability to make more
difficult discriminations between similar languages and between dialects
of the same language.
The task evaluated was the detection of a given target language or
dialect. Given a test segment of speech and a target language or
dialect, the system under evaluation determined whether the speech was
from the target language or dialect. The evaluation consisted of speech
from the following languages and dialects:
* English (American)
* English (Indian)
* Hindi
* Japanese
* Korean
* Mandarin (Mainland)
* Mandarin (Taiwan)
* Spanish (Mexican)
* Tamil
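As an illustration of the detection task described above, the following
is a minimal sketch of how yes/no decisions could be tallied against an
answer key. It is not the scoring script distributed with the corpus;
the Trial record, its field names, and the error_rates function are
hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # All field names here are hypothetical, not the corpus's key format.
    segment_id: str      # identifier of the test segment
    target: str          # language or dialect being tested for
    true_language: str   # answer-key label for the segment
    decision: bool       # system output: True means "this is the target"


def error_rates(trials):
    """Return (miss_rate, false_alarm_rate) over a list of trials."""
    target_trials = [t for t in trials if t.true_language == t.target]
    nontarget_trials = [t for t in trials if t.true_language != t.target]
    # A miss is a target trial the system rejected.
    misses = sum(1 for t in target_trials if not t.decision)
    # A false alarm is a non-target trial the system accepted.
    false_alarms = sum(1 for t in nontarget_trials if t.decision)
    miss_rate = misses / len(target_trials) if target_trials else 0.0
    fa_rate = false_alarms / len(nontarget_trials) if nontarget_trials else 0.0
    return miss_rate, fa_rate


# Toy data: two segments, each tested against two target languages.
trials = [
    Trial("seg1", "Hindi", "Hindi", True),   # correct acceptance
    Trial("seg1", "Tamil", "Hindi", False),  # correct rejection
    Trial("seg2", "Hindi", "Tamil", True),   # false alarm
    Trial("seg2", "Tamil", "Tamil", False),  # miss
]
miss_rate, fa_rate = error_rates(trials)
```

Each segment is paired with each candidate target, so one segment
produces several independent detection trials rather than a single
identification decision.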
The 2005 NIST Language Recognition Evaluation Plan, which includes a
description of the evaluation tasks, is included with this release.
Further information regarding this evaluation is also available at the
NIST Language Recognition Evaluation
<http://www.nist.gov/speech/tests/lang/> website.
Each speech file is one side of a telephone conversation. There are
11,106 speech files in SPHERE (.sph) format for a total of 44.2 hours of
speech. The speech data was compiled from LDC's CALLFRIEND corpora and
from data collected by Oregon Health and Science University.
Each test segment was prepared using an automatic speech activity
detection algorithm to identify areas and durations of speech. Segments
were chosen to contain a specified approximate duration of actual
speech. The test segments contain three nominal durations of speech: 3
seconds, 10 seconds, and 30 seconds. Performance was evaluated
separately for test segments of each duration. Auxiliary information was
included in the SPHERE headers to document the source file, start time,
and duration of all excerpts that were used to construct the segment.
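For readers who want to inspect that header information, here is a
minimal sketch of a reader for the plain-text NIST SPHERE header: a
"NIST_1A" line, a header-size line, then "name -type value" fields
terminated by "end_head". The names of the auxiliary fields used in this
corpus are not given in this announcement, so the sketch returns
whatever fields it finds. The function name read_sphere_header and the
demo header are assumptions made for this example.

```python
import io


def read_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file into a dict.

    `stream` must be a seekable binary stream positioned at offset 0.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line gives the total header size in bytes.
    header_size = int(stream.readline().strip())
    stream.seek(0)
    text = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN" marks a string of length N
    return fields


# Synthetic header for demonstration only; not taken from the corpus.
demo = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")
header = read_sphere_header(io.BytesIO(demo))
```

With a real file, the same function could be called on
open(path, "rb"), and the returned dictionary inspected for the source
file, start time, and duration entries.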
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu