[Corpora-List] New Releases from the LDC

Fri Mar 23 16:13:57 UTC 2007

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

LDC2007S02
*Fisher Levantine Arabic Conversational Telephone Speech* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02>

LDC2007T04
*Fisher Levantine Arabic Conversational Telephone Speech, Transcripts* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04>

LDC2007V01
*TRECVID 2005 Keyframes & Transcripts* 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V01>

------------------------------------------------------------------------

(1) Fisher Levantine Arabic Conversational Telephone Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02> 
contains 279 conversations totaling 45 hours of speech.  Levantine 
Arabic is spoken along the eastern Mediterranean coast from Anatolia to 
the Sinai Peninsula and encompasses the local dialects of Lebanon, Syria, 
Jordan and Palestine. There are two distinct varieties: Northern, centered 
around Syria and Lebanon; and Southern, spoken in Jordan and Palestine.  
The majority of speakers in Fisher Levantine Arabic Conversational 
Telephone Speech are from Jordan, Lebanon, and Palestine.

The conversations in this corpus are a subset of the conversations in 
Levantine Arabic QT Training Data Set 5, Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S29>, 
LDC2006S29. The individual audio files are in NIST SPHERE format. The 
corresponding transcripts may be found in Fisher Levantine Arabic 
Conversational Telephone Speech, Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04>, 
LDC2007T04. 

(2) Fisher Levantine Arabic Conversational Telephone Speech, Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04> 
contains the transcripts for the 279 telephone conversations in Fisher 
Levantine Arabic Conversational Telephone Speech 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02>, 
LDC2007S02.  The transcripts were created with "green" and "yellow" 
layers using LDC's Arabic Multi-Dialectal Transcription Tool (AMADAT). 
The green layer seeks to anchor dialectal forms to similar or related 
Modern Standard Arabic orthography-based forms. The yellow layer is a more 
careful and detailed transcription that adds functionally necessary 
vowels and marks important sociolinguistic variations and morphophonemic 
features.

The green layer transcripts in this corpus are a subset of the 
transcripts contained in Levantine Arabic QT Training Data Set 5, 
Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T07>, 
LDC2006T07. The yellow layer transcription was added in this release. 


(3) TREC Video Retrieval Evaluation (TRECVID) is sponsored by the 
National Institute of Standards and Technology (NIST) to promote 
progress in content-based retrieval from digital video via open, 
metrics-based evaluation. The keyframes in TRECVID 2005 Keyframes & 
Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V01> 
were extracted for use in the NIST TRECVID 2005 Evaluation.  The source 
data used were Arabic, Chinese and English language broadcast 
programming collected in November 2004.

TRECVID is a laboratory-style evaluation that attempts to model real 
world situations or significant component tasks involved in such 
situations. In 2005 there were four main tasks with associated tests:

    * shot boundary determination

    * low-level feature extraction

    * high-level feature extraction

    * search (interactive, manual, and automatic)

Shots are fundamental units of video, useful for higher-level 
processing. To create the master list of shots, the video was segmented. 
The results of this pass are called subshots. Because the master shot 
reference is designed for use in manual assessment, a second pass over 
the segmentation was made to create the master shots of at least 2 
seconds in length. These master shots are the ones used in submitting 
results for the feature and search tasks in the evaluation. In the 
second pass, starting at the beginning of each file, the subshots were 
aggregated, if necessary, until the current shot was at least 2 seconds 
in duration, at which point the aggregation began anew with the next 
subshot.
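
The second pass described above can be sketched roughly as follows. This 
is only an illustration in Python, not the code used for the evaluation: 
the function name aggregate_subshots, the representation of a subshot as a 
(start, end) pair in seconds, and the handling of a short remainder at the 
end of a file (merged into the last master shot) are all assumptions that 
the description above does not specify.

```python
from typing import List, Tuple

# A shot is represented here as (start_sec, end_sec); this layout is an assumption.
Shot = Tuple[float, float]


def aggregate_subshots(subshots: List[Shot], min_len: float = 2.0) -> List[Shot]:
    """Merge consecutive subshots until each master shot lasts at least min_len."""
    master: List[Shot] = []
    cur_start = None  # start of the master shot currently being built
    cur_end = None    # end of the most recent subshot seen

    for start, end in subshots:
        if cur_start is None:
            cur_start = start          # begin a new master shot
        cur_end = end                  # extend it to cover this subshot
        if cur_end - cur_start >= min_len:
            master.append((cur_start, cur_end))
            cur_start = None           # aggregation begins anew with the next subshot

    # Assumption: a trailing remainder shorter than min_len is merged into
    # the last master shot (or kept alone if it is the only material).
    if cur_start is not None:
        if master:
            last_start, _ = master[-1]
            master[-1] = (last_start, cur_end)
        else:
            master.append((cur_start, cur_end))
    return master


example = [(0.0, 0.5), (0.5, 1.2), (1.2, 2.4), (2.4, 5.0), (5.0, 5.8)]
result = aggregate_subshots(example)
print(result)
```

In this example the first three subshots are combined into one master 
shot, the fourth is already long enough on its own, and the short final 
subshot is folded into it under the remainder assumption stated above.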

The keyframes were selected by going to the middle frame of the shot, 
then searching left and right of that frame to locate the 
nearest I-Frame. This then became the keyframe and was extracted. 
Keyframes have been provided at both the subshot (NRKF) and master shot 
(RKF) levels.
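
As a rough illustration of the keyframe selection, the Python sketch 
below looks outward from the middle frame until it meets an I-frame. It 
is not the tool used to build the corpus. The function name 
pick_keyframe, the per-frame type list ("I", "P", "B"), the reading of 
"middle frame" as the midpoint of the shot, the restriction of the search 
to frames inside the shot, and the preference for the earlier frame on a 
tie are all assumptions made for the example.

```python
from typing import Optional, Sequence


def pick_keyframe(frame_types: Sequence[str], first: int, last: int) -> Optional[int]:
    """Return the index of the I-frame nearest the middle of frames first..last.

    frame_types[i] is the picture type of frame i ("I", "P" or "B").
    Returns None if the shot contains no I-frame (search is limited to
    the shot itself, which is an assumption of this sketch).
    """
    mid = (first + last) // 2
    max_offset = max(mid - first, last - mid)
    for offset in range(max_offset + 1):
        # Check the left candidate before the right one, so ties go left.
        for idx in (mid - offset, mid + offset):
            if first <= idx <= last and frame_types[idx] == "I":
                return idx
    return None


# Hypothetical frame-type sequence for an 18-frame stretch of video.
types = list("IBBPBBPBBPBBIBBPBB")
keyframe = pick_keyframe(types, 0, 17)
print(keyframe)
```

Here the middle frame is frame 8 and the I-frames sit at frames 0 and 12, 
so frame 12 is chosen as the nearest one.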

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
