[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Jul 22 13:37:46 UTC 2011
- *LDC Sponsors a Student Group at 2011 International Linguistics
Olympiad <#olympiad>* -
- *LDC Receives META Prize from META-NET <#meta>* -
/New publications:/
*- 2005 NIST Speaker Recognition Evaluation Test Data <#sre> -*
*- 2006 NIST Spoken Term Detection Evaluation Set <#std> -*
*- NIST/USF Evaluation Resources for the VACE Program - Meeting Data
Test Set Part 2 <#vace> -*
------------------------------------------------------------------------
*LDC Sponsors a Student Group at 2011 International Linguistics Olympiad*
LDC is happy to support the 2011 International Linguistics Olympiad by
sponsoring a student team. The IOL is one of the twelve International
Science Olympiads <http://olympiads.win.tue.nl/> and is an annual event
that brings together students from around the world to compete in
linguistically-based challenges. This year's competition takes place
from July 24-30 at Carnegie Mellon University, Pittsburgh, PA, USA.
Students do not need a background in linguistics to participate; the
competition problems are typically solved through analysis and
deductive reasoning.
Please visit the 2011 IOL website <http://www.ioling.org/2011/> for
additional details. We wish good luck to all of the participants!
*LDC Receives META Prize from META-NET*
LDC was awarded a '2nd META Prize' from META-NET 'for outstanding
long term commitment to the preparation and distribution of language
resources and technologies.'
The META Prize is awarded by META-NET to those who provide outstanding
products or services that support the European Multilingual Information
Society. META-NET <http://www.meta-net.eu/mission> is a Network of
Excellence dedicated to fostering the technological foundations of a
multilingual European information society. Several organizations were
recognized at this year's META Forum in Budapest; LDC and ELRA
<http://www.elra.info/> were both honored for supporting and developing
language resources.
*New Publications*
(1) 2005 NIST Speaker Recognition Evaluation Test Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S04>
was developed at LDC and NIST (National Institute of Standards and
Technology). It consists of 525 hours of conversational telephone speech
in English, Arabic, Mandarin Chinese, Russian and Spanish and associated
English transcripts used as test data in the NIST-sponsored 2005 Speaker
Recognition Evaluation
<http://www.itl.nist.gov/iad/mig/tests/spk/2005/index.html> (SRE). The
ongoing series of yearly SRE evaluations conducted by NIST is intended
to be of interest to researchers working on the general problem of
text-independent speaker recognition. To that end, the evaluations are
designed to be simple, to focus on core technology issues, and to be
fully supported and accessible.
The task of the 2005 SRE evaluation was speaker detection, that is, to
determine whether a specified speaker is speaking during a given segment
of conversational speech. The task was divided into 20 distinct tests,
each combining one of five training conditions with one of four test
conditions. Further information about the task conditions is
contained in The NIST Year 2005 Speaker Recognition Evaluation Plan
<http://www.itl.nist.gov/iad/mig/tests/sre/2005/sre-05_evalplan-v6.pdf>.
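As an illustration of the detection framing (a toy sketch, not part of the
LDC release or NIST's scoring tools), a system's output for each trial can
be reduced to a score plus a threshold decision; miss and false-alarm rates
are then computed over the target and non-target trials. The scores below
are made up:

```python
def detection_error_rates(scores, labels, threshold):
    """Compute miss and false-alarm rates for a speaker-detection task.

    scores: per-trial system scores (higher = more likely the target speaker).
    labels: True for target trials, False for non-target trials.
    A trial is accepted when its score is at or above the threshold.
    """
    misses = sum(1 for s, is_tgt in zip(scores, labels)
                 if is_tgt and s < threshold)
    false_alarms = sum(1 for s, is_tgt in zip(scores, labels)
                       if not is_tgt and s >= threshold)
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    return misses / n_target, false_alarms / n_nontarget

# Four hypothetical trials: two target, two non-target.
miss_rate, fa_rate = detection_error_rates(
    [0.9, 0.2, 0.8, 0.4], [True, True, False, False], threshold=0.5)
```

Sweeping the threshold over such scores traces out the detection
error tradeoff (DET) curve used to compare SRE systems.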
The speech data consists of conversational telephone speech with
"multi-channel" data collected by LDC simultaneously from a number of
auxiliary microphones. The files are organized into two segment types:
10-second two-channel excerpts (continuous segments from single
conversations estimated to contain approximately 10 seconds of actual
speech in the channel of interest) and 5-minute two-channel
conversations.
The data are stored as 8-bit u-law speech signals in NIST SPHERE format.
In addition to the standard header fields, the SPHERE header for each
file contains some auxiliary information that includes the language of
the conversation and whether the data was recorded over a telephone
line. English-language word transcripts in .cmt format were produced
using an automatic speech recognition (ASR) system, with word error
rates in the range of 15-30%.
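Because the SPHERE header is plain ASCII, its fields can be inspected
without special tools. Below is a minimal, illustrative parser; the
field names in the usage example (sample_rate, channel_count,
conversation_language) are assumptions for illustration, and the exact
auxiliary field names in this corpus should be checked against its
documentation:

```python
def parse_sphere_header(path):
    """Parse the plain-ASCII header of a NIST SPHERE file into a dict.

    A SPHERE file starts with the magic line "NIST_1A", then a line
    giving the total header size in bytes, then "name -type value"
    records terminated by "end_head".  Type codes: -i integer,
    -r real, -sN string of N characters.
    """
    fields = {}
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError("not a SPHERE file")
        f.readline()  # declared header size in bytes (not needed here)
        for raw in f:
            line = raw.decode("ascii", "replace").strip()
            if line == "end_head":
                break
            name, ftype, value = line.split(None, 2)
            if ftype == "-i":
                fields[name] = int(value)
            elif ftype == "-r":
                fields[name] = float(value)
            else:  # -sN: string value
                fields[name] = value
    return fields
```

For example, `parse_sphere_header("some_file.sph")["sample_rate"]`
would return the sampling rate recorded in the header.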
(2) 2006 NIST Spoken Term Detection Evaluation Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S03>
was compiled by researchers at NIST (National Institute of Standards and
Technology) and contains approximately eighteen hours of Arabic,
Chinese and English broadcast news, English conversational telephone
speech and English meeting room speech used in NIST's 2006 Spoken Term
Detection (STD) evaluation
<http://www.itl.nist.gov/iad/mig/tests/std/2006/index.html>. The STD
initiative is designed to facilitate research and development of
technology for retrieving information from archives of speech data with
the goals of exploring promising new ideas in spoken term detection,
developing advanced technology incorporating these ideas, measuring the
performance of this technology and establishing a community for the
exchange of research results and technical insights.
The 2006 STD task was to find all of the occurrences of a specified
"term" (a sequence of one or more words) in a given corpus of speech
data. The evaluation was intended to develop technology for rapidly
searching very large quantities of audio data. Although the evaluation
used modest amounts of data, it was structured to simulate the very
large data situation and to make it possible to extrapolate the speed
measurements to much larger data sets. Therefore, systems were
implemented in two phases: indexing and searching. In the indexing
phase, the system processes the speech data without knowledge of the
terms. In the searching phase, the system uses the terms, the index, and
optionally the audio to detect term occurrences.
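The two-phase structure can be illustrated with a toy sketch (this is
not NIST's evaluation software): index time-stamped word hypotheses
once, without knowledge of the terms, then answer term queries against
the index alone, never re-touching the audio:

```python
from collections import defaultdict

def build_index(transcripts):
    """Indexing phase: process the transcripts once, term-blind.

    transcripts: {file_id: [(start_time, word), ...]}, time-ordered.
    Returns an inverted index word -> [(file_id, position, start_time)].
    """
    index = defaultdict(list)
    for file_id, tokens in transcripts.items():
        for pos, (start, word) in enumerate(tokens):
            index[word.lower()].append((file_id, pos, start))
    return index

def search(index, transcripts, term):
    """Searching phase: locate a (possibly multi-word) term using only
    the index and the tokenized transcripts, not the audio."""
    words = term.lower().split()
    hits = []
    for file_id, pos, start in index.get(words[0], []):
        window = [w.lower()
                  for _, w in transcripts[file_id][pos:pos + len(words)]]
        if window == words:
            hits.append((file_id, start))
    return hits
```

Because the index is built once and reused for every query, the cost of
the searching phase stays small even as the archive grows, which is the
property the STD evaluation's speed measurements were designed to probe.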
The evaluation corpus consists of three data genres: broadcast news
(BNews), conversational telephone speech (CTS) and conference room
meetings (CONFMTG). The broadcast news material was collected in 2003
and 2004 by LDC's broadcast collection system
<http://www.ldc.upenn.edu/DataSheets/Broadcast_Collection_System_DS.pdf>
from the following sources: ABC (English), Aljazeera (Arabic), China
Central TV (Chinese), CNN (English), CNBC (English), Dubai TV (Arabic),
New Tang Dynasty TV (Chinese), Public Radio International (English) and
Radio Free Asia (Chinese). The CTS data was taken from the Switchboard
data sets (e.g., Switchboard-2 Phase 1 LDC98S75
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S75>,
Switchboard-2 Phase 2 LDC99S79
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79>)
and the Fisher corpora (e.g., Fisher English Training Speech Part 1
LDC2004S13
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>),
also collected by LDC. The conference room meeting material consists of
goal-oriented, small group round table meetings and was collected in
2004 and 2005 by NIST, the International Computer Science Institute
(Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO
(The Netherlands) and Virginia Polytechnic Institute and State
University (Blacksburg, VA) as part of the AMI corpus project
<http://corpus.amiproject.org/>. The corpus also includes scoring
software, which uses the inputs described in the STD evaluation plan
to score a system's output.
Each BNews recording is a single-channel, PCM-encoded, 16 kHz file in
SPHERE format. CTS recordings are 2-channel, u-law encoded, 8 kHz
SPHERE files. The CONFMTG files contain a single recorded channel.
(3) NIST/USF Evaluation Resources for the VACE Program - Meeting Data
Test Set Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V04>
was developed by researchers at the Department of Computer Science and
Engineering <http://www.cse.usf.edu/>, University of South Florida
(USF), Tampa, Florida and the Multimodal Information Group
<http://nist.gov/itl/iad/mig/> at the National Institute of Standards
and Technology (NIST). It contains approximately thirteen hours of
meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and used in the VACE (Video Analysis and
Content Extraction) 2005 evaluation.
The VACE program was established to develop novel algorithms for
automatic video content extraction, multi-modal fusion, and event
understanding. During VACE Phases I and II, the program made significant
progress in the automated detection and tracking of moving objects
including faces, hands, people, vehicles and text in four primary video
domains: broadcast news, meetings, street surveillance, and unmanned
aerial vehicle motion imagery. Initial results were also obtained on
automatic analysis of human activities and understanding of video
sequences.
Three performance evaluations were conducted under the auspices of the
VACE program between 2004 and 2007. The 2005 evaluation was
administered by USF in collaboration with NIST and guided by an advisory
forum including the evaluation participants.
LDC has previously released NIST/USF Evaluation Resources for the VACE
Program -- Meeting Data Training Set Part 1 LDC2011V01
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V01>, NIST/USF
Evaluation Resources for the VACE Program -- Meeting Data Training Set
Part 2 LDC2011V02
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V02>
and NIST/USF Evaluation Resources for the VACE Program -- Meeting Data
Test Set Part 1 LDC2011V03
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V03>.
NIST's Meeting Data Collection Laboratory is designed to collect corpora
to support research, development and evaluation in meeting recognition
technologies. It is equipped to look and sound like a conventional
meeting space. The data collection facility includes five Sony EV1-D30
video cameras, four of which have stationary views of a center
conference table (one view from each surrounding wall) with a fixed
focus and viewing angle, and an additional "floating" camera which is
used to focus on particular participants, whiteboard or conference table
depending on the meeting forum. The data is captured in a NIST-internal
file format. The video data was extracted from the NIST format and
encoded using the MPEG-2 standard in NTSC format. Further information
concerning the video data parameters can be found in the documentation
included with this corpus.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium               Phone: 1 (215) 573-1275
University of Pennsylvania               Fax: 1 (215) 573-2175
3600 Market St., Suite 810               ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA               http://www.ldc.upenn.edu