[Corpora-List] New Releases from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Mar 23 16:13:57 UTC 2007
The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.
LDC2007S02
*Fisher Levantine Arabic Conversational Telephone Speech*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02>
LDC2007T04
*Fisher Levantine Arabic Conversational Telephone Speech, Transcripts*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04>
LDC2007V01
*TRECVID 2005 Keyframes & Transcripts*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V01>
------------------------------------------------------------------------
(1) Fisher Levantine Arabic Conversational Telephone Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02>
contains 279 conversations totaling 45 hours of speech. Levantine
Arabic is spoken along the eastern Mediterranean coast from Anatolia to
the Sinai Peninsula and encompasses the local dialects of Lebanon, Syria,
Jordan and Palestine. There are two distinct varieties: Northern, centered
around Syria and Lebanon; and Southern, spoken in Jordan and Palestine.
The majority of speakers in Fisher Levantine Arabic Conversational
Telephone Speech are from Jordan, Lebanon, and Palestine.
The conversations in this corpus are a subset of the conversations in
Levantine Arabic QT Training Data Set 5, Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S29>,
LDC2006S29. The individual audio files are in NIST SPHERE format. The
corresponding transcripts may be found in Fisher Levantine Arabic
Conversational Telephone Speech, Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04>,
LDC2007T04.
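Since the audio is distributed as NIST SPHERE files, it may help to see how the plain-text header of such a file can be inspected. The following is a minimal sketch, not LDC tooling: the function names are hypothetical, and the demo header is synthetic (its field values are made up for illustration and are not taken from this corpus). It assumes the usual SPHERE layout, in which the file starts with the line "NIST_1A", followed by the header size in bytes, then "name -type value" lines terminated by "end_head".

```python
import io


def parse_sphere_header(stream):
    """Parse the ASCII header of a NIST SPHERE file from a seekable binary stream.

    Returns (fields, header_size). Audio samples start at byte offset header_size.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().strip())

    stream.seek(0)
    header = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blank lines and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N bytes
            fields[name] = value
    return fields, header_size


def make_demo_header():
    """Build a synthetic 1024-byte header (illustrative values only)."""
    body = (
        b"NIST_1A\n   1024\n"
        b"channel_count -i 2\n"
        b"sample_rate -i 8000\n"
        b"sample_coding -s4 ulaw\n"
        b"end_head\n"
    )
    return body.ljust(1024, b" ")


fields, size = parse_sphere_header(io.BytesIO(make_demo_header()))
print(size, fields)
```

For a real file, the same function could be called on open(path, "rb"). Decoding the samples themselves depends on the sample_coding field and is outside the scope of this sketch.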
(2) Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T04>
contains the transcripts for the 279 telephone conversations in Fisher
Levantine Arabic Conversational Telephone Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S02>,
LDC2007S02. The transcripts were created with "green" and "yellow"
layers using LDC's Arabic Multi-Dialectal Transcription Tool (AMADAT). The
green layer seeks to anchor dialectal forms to similar or related Modern
Standard Arabic orthography-based forms. The yellow layer is a more
careful and detailed transcription that adds functionally necessary
vowels and marks important sociolinguistic variations and morphophonemic
features.
The green layer transcripts in this corpus are a subset of the
transcripts contained in Levantine Arabic QT Training Data Set 5,
Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T07>,
LDC2006T07. The yellow layer transcription was added in this release.
(3) TREC Video Retrieval Evaluation (TRECVID) is sponsored by the
National Institute of Standards and Technology (NIST) to promote
progress in content-based retrieval from digital video via open,
metrics-based evaluation. The keyframes in TRECVID 2005 Keyframes &
Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007V01>
were extracted for use in the NIST TRECVID 2005 Evaluation. The source
data used were Arabic, Chinese and English language broadcast
programming collected in November 2004.
TRECVID is a laboratory-style evaluation that attempts to model real
world situations or significant component tasks involved in such
situations. In 2005 there were four main tasks with associated tests:
* shot boundary determination
* low-level feature extraction
* high-level feature extraction
* search (interactive, manual, and automatic)
Shots are fundamental units of video, useful for higher-level
processing. To create the master list of shots, the video was segmented.
The results of this pass are called subshots. Because the master shot
reference is designed for use in manual assessment, a second pass over
the segmentation was made to create the master shots of at least 2
seconds in length. These master shots are the ones used in submitting
results for the feature and search tasks in the evaluation. In the
second pass, starting at the beginning of each file, the subshots were
aggregated, if necessary, until the current shot was at least 2 seconds
in duration, at which point the aggregation began anew with the next
subshot.
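The second pass described above can be sketched roughly as follows. This is an illustration of the aggregation rule only, not the code used by NIST: the function name and the (start, end) representation in seconds are assumptions, and the announcement does not say what happens to a trailing run of subshots shorter than 2 seconds, so this sketch simply folds it into the last master shot.

```python
def aggregate_subshots(subshots, min_duration=2.0):
    """Greedily merge consecutive (start, end) subshots into master shots.

    Subshots are accumulated until the current shot lasts at least
    min_duration seconds; aggregation then restarts with the next subshot.
    """
    masters = []
    current_start = None
    current_end = None
    for start, end in subshots:
        if current_start is None:
            current_start = start  # begin a new master shot
        current_end = end
        if current_end - current_start >= min_duration:
            masters.append((current_start, current_end))
            current_start = None

    if current_start is not None:
        # Assumption: a short trailing remainder is merged into the
        # previous master shot, or kept as-is if there is none.
        if masters:
            masters[-1] = (masters[-1][0], current_end)
        else:
            masters.append((current_start, current_end))
    return masters


subshots = [(0.0, 0.8), (0.8, 1.5), (1.5, 3.0), (3.0, 6.0), (6.0, 6.5)]
print(aggregate_subshots(subshots))  # [(0.0, 3.0), (3.0, 6.5)]
```

Here the first three subshots are merged because the first two together last only 1.5 seconds, while the fourth already exceeds the threshold on its own.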
The keyframes were selected by going to the middle frame of the shot,
then scanning left and right of that frame to locate the nearest
I-frame. That I-frame became the keyframe and was extracted.
Keyframes have been provided at both the subshot (NRKF) and master shot
(RKF) levels.
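The keyframe selection rule can be sketched in a few lines. This is a hedged illustration rather than the extraction tool actually used: the function name, the use of inclusive frame indices, and the tie-breaking rule (prefer the earlier I-frame when two are equally close) are all assumptions, and the sketch does not restrict the search to I-frames inside the shot because the announcement does not say whether that restriction applied.

```python
def select_keyframe(shot_start, shot_end, iframes):
    """Return the I-frame index nearest to the middle frame of a shot.

    shot_start and shot_end are inclusive frame indices; iframes is an
    iterable of frame indices that are I-frames in the encoded video.
    """
    iframes = list(iframes)
    if not iframes:
        raise ValueError("no I-frames available")
    middle = (shot_start + shot_end) // 2
    # Sort key: distance from the middle frame first, then the frame index
    # itself, so that ties resolve to the earlier I-frame (an assumption).
    return min(iframes, key=lambda f: (abs(f - middle), f))


iframes = [96, 108, 120, 132, 144, 156]
print(select_keyframe(100, 160, iframes))  # 132
```

In the example the middle frame is 130, and the closest I-frame is 132, two frames to its right.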
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA
Phone: (215) 573-1275
Fax: (215) 573-2175
ldc at ldc.upenn.edu
http://www.ldc.upenn.edu