Arabic-L:LING:new from LDC

Thu Sep 28 23:08:08 UTC 2006

------------------------------------------------------------------------
Arabic-L: Thu 28 Aug 2006
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:new from LDC

-------------------------Messages-----------------------------------
1)
Date: 28 Aug 2006
From:ldc at ldc.upenn.edu
Subject:new from LDC

LDC2006S43
Gulf Arabic Conversational Telephone Speech

LDC2006T15
Gulf Arabic Conversational Telephone Speech, Transcripts

LDC2006T13
Web 1T 5-gram Version 1

The Linguistic Data Consortium (LDC) is pleased to announce the  
availability of three new publications.

New Publications

(1)  Gulf Arabic Conversational Telephone Speech contains 975 Gulf  
Arabic speakers taking part in spontaneous telephone conversations in  
Colloquial Gulf Arabic. A total of 976 conversation sides are  
provided (one speaker appears on two distinct calls). The average  
duration per side is about 5.7 minutes.  This corpus was collected  
and transcribed in 2004 by Appen Pty Ltd. (Appen), Sydney, Australia,  
working under a U.S. Government contract.
Each single-channel file represents just one side of a normal  
conversation. The "devtest" set is a relatively balanced  
(representative) sample drawn from the total pool of collected calls.  
It was chosen through a test-set selection process applied by the  
National Institute of Standards and Technology (NIST), using the  
demographic, phone and audit information provided by Appen.

(2)  Gulf Arabic Conversational Telephone Speech, Transcripts  
contains transcripts of 975 Gulf Arabic speakers taking part in  
spontaneous telephone conversations in Colloquial Gulf Arabic. A  
total of 976 conversation sides are provided (one speaker appears on  
two distinct calls).  The data was collected and transcribed in 2004  
by Appen Pty Ltd., Sydney, Australia, working under a U.S. Government  
contract.

Each transcript file is a tab-delimited flat table, where each line  
contains information and text for a single contiguous utterance,  
presented via the following fields:

beginning time stamp in seconds, in square brackets ("[5.7189]")
ending time stamp in seconds, in square brackets
channel/speaker-ID ("A:" or "B:")
"consonant skeleton" orthography for the utterance, in UTF-8
"diacritized" orthography for the utterance, in ASCII

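As a rough illustration of the field layout above, one transcript line
could be read as sketched below. This assumes exactly five tab-separated
fields in the order listed. The names Utterance and parse_line and the
sample line are made up for this example and do not come from the corpus
documentation.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    """One utterance row (hypothetical container, not an LDC type)."""
    start: float       # beginning time stamp, in seconds
    end: float         # ending time stamp, in seconds
    channel: str       # "A" or "B", trailing colon removed
    skeleton: str      # "consonant skeleton" orthography (UTF-8)
    diacritized: str   # "diacritized" orthography (ASCII)


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    start_raw, end_raw, channel_raw, skeleton, diacritized = fields
    # Time stamps are wrapped in square brackets, e.g. "[5.7189]".
    start = float(start_raw.strip().strip("[]"))
    end = float(end_raw.strip().strip("[]"))
    # The channel field looks like "A:" or "B:".
    channel = channel_raw.strip().rstrip(":")
    return Utterance(start, end, channel, skeleton, diacritized)


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "[5.7189]\t[7.25]\tA:\tمرحبا\tmarHabA\n"
    print(parse_line(sample))
```

Files would need to be opened with encoding="utf-8" so that the
consonant-skeleton field is decoded correctly.
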
(3)  Web 1T 5-gram Version 1 contains English word n-grams and their  
observed frequency counts. The length of the n-grams ranges from  
unigrams (single words) to five-grams. This data will be useful for  
statistical language modeling, e.g., for machine translation or  
speech recognition, as well as for other uses.  The n-gram counts  
were generated from approximately 1 trillion word tokens of text from  
publicly accessible web pages.
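
To make "n-grams and their observed frequency counts" concrete, here is
a minimal sketch that counts unigrams through five-grams over a short
token list. It only illustrates the idea and is not the procedure used
to build the corpus. The function name count_ngrams and the sample
sentence are invented for this example.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_ngrams(tokens: Iterable[str], max_n: int = 5) -> Counter:
    """Count every n-gram of length 1..max_n in a token sequence."""
    tokens = list(tokens)
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            ngram: Tuple[str, ...] = tuple(tokens[i:i + n])
            counts[ngram] += 1
    return counts


if __name__ == "__main__":
    counts = count_ngrams("the cat sat on the mat".split())
    for ngram, freq in counts.most_common(3):
        print(" ".join(ngram), freq)
```

A language model would typically turn such counts into conditional
probabilities, for example by dividing the count of a five-gram by the
count of its leading four-gram.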

The input encoding of documents was automatically detected, and all  
text was converted to UTF-8.  The data was tokenized in a manner  
similar to the tokenization of the Wall Street Journal portion of the  
Penn Treebank. Notable exceptions include the following:

Hyphenated words are usually separated, and hyphenated numbers usually  
form one token.
Sequences of numbers separated by slashes (e.g. in dates) form one  
token.
Sequences that look like URLs or email addresses form one token.
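
The exceptions above can be approximated with a small regular-expression
tokenizer. This is only a sketch under stated assumptions and not the
tokenizer used for the corpus. In particular, it assumes that a
separated hyphen becomes a token of its own, and its URL and email
patterns are deliberately crude. The names TOKEN_RE and tokenize are
hypothetical.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                      # URL-like sequence: one token
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email-like sequence: one token
    | \d+(?:[/-]\d+)+                   # numbers joined by "/" or "-": one token
    | \w+                               # ordinary word
    | [^\w\s]                           # any other symbol, including "-"
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list:
    """Return the tokens of text under the simplified rules above."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("state-of-the-art 9/28/2006 1-800 http://www.ldc.upenn.edu"))
```

Because the announcement says these rules apply "usually", real corpus
tokens may differ from what this sketch produces.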

If you need further information, or would like to inquire about  
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215  
573 1275.

--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu

--------------------------------------------------------------------------
End of Arabic-L:  28 Aug 2006