[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Jul 22 13:37:46 UTC 2011


- *LDC Sponsors a Student Group at 2011 International Linguistics Olympiad* -

- *LDC Receives META Prize from META-NET* -

/New publications:/

- *2005 NIST Speaker Recognition Evaluation Test Data* -

- *2006 NIST Spoken Term Detection Evaluation Set* -

- *NIST/USF Evaluation Resources for the VACE Program - Meeting Data
Test Set Part 2* -

------------------------------------------------------------------------


*LDC Sponsors a Student Group at 2011 International Linguistics Olympiad*

LDC is happy to support the 2011 International Linguistics Olympiad by
sponsoring a student team. The IOL is one of the twelve International
Science Olympiads <http://olympiads.win.tue.nl/>, an annual event
that brings together students from around the world to compete in
linguistics-based challenges. This year's competition takes place
July 24-30 at Carnegie Mellon University, Pittsburgh, PA, USA.
Students do not need a background in linguistics to participate,
since the problems are typically solved through analysis and deductive
reasoning.

Please visit the 2011 IOL website <http://www.ioling.org/2011/> for 
additional details. We wish good luck to all of the participants!

*LDC Receives META Prize from META-NET*

LDC was awarded a '2nd META Prize' from META-NET 'for outstanding
long term commitment to the preparation and distribution of language
resources and technologies.'

The META Prize is awarded by META-NET to those who provide outstanding
products or services that support the European Multilingual Information
Society. META-NET <http://www.meta-net.eu/mission> is a Network of
Excellence dedicated to fostering the technological foundations of a
multilingual European information society. Several organizations were
recognized at this year's META Forum in Budapest; LDC and ELRA
<http://www.elra.info/> were both honored for supporting and developing
language resources.

*New Publications*

(1) 2005 NIST Speaker Recognition Evaluation Test Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S04>
was developed at LDC and NIST (National Institute of Standards and
Technology). It consists of 525 hours of conversational telephone speech
in English, Arabic, Mandarin Chinese, Russian and Spanish and associated
English transcripts used as test data in the NIST-sponsored 2005 Speaker
Recognition Evaluation
<http://www.itl.nist.gov/iad/mig/tests/spk/2005/index.html> (SRE). The
ongoing series of yearly SRE evaluations conducted by NIST is intended
to be of interest to researchers working on the general problem of
text-independent speaker recognition. To that end, the evaluations are
designed to be simple, to focus on core technology issues, and to be
fully supported and accessible.

The task of the 2005 SRE evaluation was speaker detection, that is, to
determine whether a specified speaker is speaking during a given segment
of conversational speech. The task was divided into 20 distinct tests,
each combining one of five training conditions with one of four test
conditions. Further information about the task conditions is contained
in The NIST Year 2005 Speaker Recognition Evaluation Plan
<http://www.itl.nist.gov/iad/mig/tests/sre/2005/sre-05_evalplan-v6.pdf>.
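
As a rough illustration of the detection framing only (not of the systems
evaluated in 2005, which used far more sophisticated speaker modeling),
the sketch below scores a test segment against a claimed speaker's
enrollment data by comparing averaged feature vectors and thresholding the
similarity. The feature extraction, the threshold value and the function
names are placeholders invented for this example.

import numpy as np

def embed(features: np.ndarray) -> np.ndarray:
    """Collapse a (frames x dims) feature matrix into one vector.
    Placeholder for a real speaker-modeling front end."""
    return features.mean(axis=0)

def detect(enroll_feats: np.ndarray, test_feats: np.ndarray,
           threshold: float = 0.8) -> bool:
    """Return True if the test segment is judged to contain the enrolled
    speaker: cosine similarity of averaged features against a threshold."""
    a, b = embed(enroll_feats), embed(test_feats)
    score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return score >= threshold

# Toy usage with random "features" standing in for real acoustic features.
rng = np.random.default_rng(0)
enroll = rng.normal(size=(500, 20))   # enrollment segment
test = rng.normal(size=(300, 20))     # test segment
print(detect(enroll, test))

Each of the 20 evaluation tests simply varies how much enrollment and test
material a decision like this one may draw on.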

The speech data consists of conversational telephone speech with
"multi-channel" data collected by LDC simultaneously from a number of
auxiliary microphones. The files are organized into two segment types:
10-second two-channel excerpts (continuous segments from single
conversations that are estimated to contain approximately 10 seconds of
actual speech in the channel of interest) and 5-minute two-channel
conversations.

The data are stored as 8-bit u-law speech signals in NIST SPHERE format.
In addition to the standard header fields, the SPHERE header for each
file contains auxiliary information that includes the language of the
conversation and whether the data was recorded over a telephone line.
English-language word transcripts in .cmt format were produced using an
automatic speech recognition (ASR) system with error rates in the range
of 15-30%.
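
SPHERE files carry this auxiliary information in a plain-ASCII header that
precedes the audio, so it can be inspected without a dedicated toolkit.
Below is a minimal sketch of a header reader, assuming the common SPHERE
layout (a "NIST_1A" magic line, a header-size line, then "field -type value"
triples ending at "end_head"); the exact field names present depend on the
corpus, and the example file name is hypothetical.

def read_sphere_header(path):
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    with open(path, "rb") as f:
        magic = f.readline().strip()
        if magic != b"NIST_1A":
            raise ValueError("not a SPHERE file: %r" % magic)
        header_size = int(f.readline().strip())   # e.g. 1024
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in header.splitlines()[2:]:
        if line.strip() == "end_head":
            break
        parts = line.split(None, 2)               # name, -type, value
        if len(parts) == 3:
            name, ftype, value = parts
            if ftype == "-i":
                fields[name] = int(value)
            elif ftype == "-r":
                fields[name] = float(value)
            else:                                  # -sN string fields
                fields[name] = value
    return fields

# Example (hypothetical file name):
# print(read_sphere_header("xyz12345.sph").get("sample_rate"))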



(2) 2006 NIST Spoken Term Detection Evaluation Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011S03>
was compiled by researchers at NIST (National Institute of Standards and
Technology) and contains approximately eighteen hours of Arabic,
Chinese and English broadcast news, English conversational telephone
speech and English meeting room speech used in NIST's 2006 Spoken Term
Detection (STD) evaluation
<http://www.itl.nist.gov/iad/mig/tests/std/2006/index.html>. The STD
initiative is designed to facilitate research and development of
technology for retrieving information from archives of speech data. Its
goals are to explore promising new ideas in spoken term detection, to
develop advanced technology incorporating those ideas, to measure the
performance of that technology and to establish a community for the
exchange of research results and technical insights.

The 2006 STD task was to find all of the occurrences of a specified 
"term" (a sequence of one or more words) in a given corpus of speech 
data. The evaluation was intended to develop technology for rapidly 
searching very large quantities of audio data. Although the evaluation 
used modest amounts of data, it was structured to simulate the very 
large data situation and to make it possible to extrapolate the speed 
measurements to much larger data sets. Therefore, systems were 
implemented in two phases: indexing and searching. In the indexing 
phase, the system processes the speech data without knowledge of the 
terms. In the searching phase, the system uses the terms, the index, and 
optionally the audio to detect term occurrences.
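
The two-phase structure can be illustrated with a toy sketch (not the
evaluated systems, which index recognizer lattices rather than clean word
lists): an index is built once from time-stamped word hypotheses, and term
searches then run against the index alone. All names, the tuple layout and
the 0.5-second word gap are assumptions made for this example.

from collections import defaultdict

def build_index(hypotheses):
    """Indexing phase: map each word to its (file, start_time) occurrences.
    `hypotheses` is an iterable of (file_id, start_time, word) tuples,
    e.g. recognizer output over the whole corpus."""
    index = defaultdict(list)
    for file_id, start, word in hypotheses:
        index[word.lower()].append((file_id, start))
    return index

def search(index, term, max_gap=0.5):
    """Searching phase: find occurrences of a multi-word term using only
    the index. Consecutive words must occur in the same file within
    `max_gap` seconds of the previous word."""
    words = term.lower().split()
    hits = []
    for file_id, start in index.get(words[0], []):
        t, ok = start, True
        for w in words[1:]:
            nxt = [s for f, s in index.get(w, [])
                   if f == file_id and t < s <= t + max_gap]
            if not nxt:
                ok = False
                break
            t = min(nxt)
        if ok:
            hits.append((file_id, start, t))
    return hits

# Toy usage with made-up hypotheses:
hyps = [("bn01", 1.0, "spoken"), ("bn01", 1.3, "term"), ("bn01", 1.6, "detection")]
print(search(build_index(hyps), "spoken term detection"))

The point of the split is that only `build_index` touches the audio-scale
data; the per-term cost of `search` depends on the index size, which is
what allows speed measurements to be extrapolated to much larger archives.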

The evaluation corpus consists of three data genres: broadcast news
(BNews), conversational telephone speech (CTS) and conference room
meetings (CONFMTG). The broadcast news material was collected in 2003
and 2004 by LDC's broadcast collection system
<http://www.ldc.upenn.edu/DataSheets/Broadcast_Collection_System_DS.pdf>
from the following sources: ABC (English), Aljazeera (Arabic), China
Central TV (Chinese), CNN (English), CNBC (English), Dubai TV (Arabic),
New Tang Dynasty TV (Chinese), Public Radio International (English) and
Radio Free Asia (Chinese). The CTS data was taken from the Switchboard
data sets (e.g., Switchboard-2 Phase 1 LDC98S75
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S75>,
Switchboard-2 Phase 2 LDC99S79
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79>)
and the Fisher corpora (e.g., Fisher English Training Speech Part 1
LDC2004S13
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S13>),
also collected by LDC. The conference room meeting material consists of
goal-oriented, small group round table meetings and was collected in
2004 and 2005 by NIST, the International Computer Science Institute
(Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO
(The Netherlands) and Virginia Polytechnic Institute and State
University (Blacksburg, VA) as part of the AMI corpus project
<http://corpus.amiproject.org/>. The evaluation corpus also includes
scoring software, which uses the inputs described in the STD evaluation
plan to score a system's output.

Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted
file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted
files. The CONFMTG files contain a single recorded channel.




(3) NIST/USF Evaluation Resources for the VACE Program - Meeting Data
Test Set Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V04>
was developed by researchers at the Department of Computer Science and
Engineering <http://www.cse.usf.edu/>, University of South Florida
(USF), Tampa, Florida, and the Multimodal Information Group
<http://nist.gov/itl/iad/mig/> at the National Institute of Standards
and Technology (NIST). It contains approximately thirteen hours of
meeting room video data collected in 2001 and 2002 at NIST's Meeting
Data Collection Laboratory and used in the VACE (Video Analysis and
Content Extraction) 2005 evaluation.

The VACE program was established to develop novel algorithms for 
automatic video content extraction, multi-modal fusion, and event 
understanding. During VACE Phases I and II, the program made significant 
progress in the automated detection and tracking of moving objects 
including faces, hands, people, vehicles and text in four primary video 
domains: broadcast news, meetings, street surveillance, and unmanned 
aerial vehicle motion imagery. Initial results were also obtained on 
automatic analysis of human activities and understanding of video 
sequences.

Three performance evaluations were conducted under the auspices of the 
VACE program between 2004 and 2007.  The 2005 evaluation was 
administered by USF in collaboration with NIST and guided by an advisory 
forum including the evaluation participants.

LDC has previously released NIST/USF Evaluation Resources for the VACE 
Program -- Meeting Data Training Set Part 1 LDC2011V01 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V01>, NIST/USF 
Evaluation Resources for the VACE Program -- Meeting Data Training Set 
Part 2 LDC2011V02 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V02> 
and NIST/USF Evaluation Resources for the VACE Program -- Meeting Data 
Test Set Part 1 LDC2011V03 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011V03>.

NIST's Meeting Data Collection Laboratory is designed to collect corpora
to support research, development and evaluation in meeting recognition
technologies. It is equipped to look and sound like a conventional
meeting space. The data collection facility includes five Sony EVI-D30
video cameras, four of which have stationary views of a center
conference table (one view from each surrounding wall) with a fixed
focus and viewing angle, and an additional "floating" camera which is
used to focus on particular participants, the whiteboard or the
conference table, depending on the meeting forum. The data is captured
in a NIST-internal file format. The video data was extracted from the
NIST format and encoded using the MPEG-2 standard in NTSC format.
Further information concerning the video data parameters can be found in
the documentation included with this corpus.



------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                  ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
