<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center">LDC2006S43<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S43"><b>Gulf

Arabic Conversational Telephone Speech</b></a><br>

<br>

LDC2006T15<br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T15"><b>Gulf

Arabic Conversational Telephone Speech, Transcripts</b></a><br>

<br>

LDC2006T13  <br>

<a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13"><b>Web

1T 5-gram Version 1</b></a>

<br>

<br>

<br>

The Linguistic Data Consortium (LDC) is pleased to announce the

availability

of

three new publications.<br>

</div>

 

<br>

<div align="center">

<div align="left"><br>

<hr size="2" width="100%"><br>

<br>

</div>

<b>New Publications</b><br>

</div>

<b><br>

<br>

</b>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S43">Gulf

Arabic Conversational Telephone Speech</a> contains recordings of 975 Gulf Arabic

speakers taking part in spontaneous telephone conversations in

Colloquial Gulf Arabic. A total of 976 conversation sides are provided

(one speaker appears on two distinct calls). The average duration per

side is about 5.7 minutes.  This corpus was collected and transcribed

in 2004 by Appen Pty Ltd. (Appen), Sydney, Australia, working under a

U.S. Government contract.

<p>The single-channel files represent just one side of a normal

conversation. The "devtest" set represents a relatively balanced

(representative) sample drawn from the total pool of collected calls,

through a test-set selection process applied by the National Institute

of Standards and Technology (NIST), using demographic, phone and

audit information provided by Appen.  <br>

</p>

<p align="center">*<br>

</p>

<p>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T15">Gulf

Arabic Conversational Telephone Speech, Transcripts</a> contains

transcripts of 975 Gulf Arabic speakers taking part in spontaneous

telephone conversations in Colloquial Gulf Arabic. A total of 976

conversation sides are provided (one speaker appears on two distinct

calls).  The data

was collected and transcribed in 2004 by Appen Pty Ltd., Sydney,

Australia, working under a U.S. Government contract.</p>

<p>Each transcript file is a tab-delimited flat table, where each line

contains information and text for a single contiguous utterance,

presented via the following fields:</p>

<ol>

  <li>beginning time stamp in seconds, in square brackets ("[5.7189]") </li>

  <li>ending time stamp in seconds, in square brackets </li>

  <li>channel/speaker-ID ("A:" or "B:") </li>

  <li>"consonant skeleton" orthography for the utterance, in UTF-8 </li>

  <li>"diacritized" orthography for the utterance, in ASCII </li>

</ol>
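<p>As a rough illustration of the five-field layout listed above, the
following Python sketch parses one transcript line. It assumes the
fields appear in exactly the order given and are separated by single
tab characters. The names <code>Utterance</code> and
<code>parse_line</code> and the sample line are hypothetical and are
not taken from the corpus or its documentation.</p>

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    """One transcript line (hypothetical container, not part of the corpus)."""
    start: float       # beginning time stamp, in seconds
    end: float         # ending time stamp, in seconds
    channel: str       # "A" or "B"
    skeleton: str      # "consonant skeleton" orthography (UTF-8)
    diacritized: str   # "diacritized" orthography (ASCII)


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    start, end, channel, skeleton, diacritized = fields
    return Utterance(
        start=float(start.strip("[] ")),      # "[5.7189]" becomes 5.7189
        end=float(end.strip("[] ")),
        channel=channel.strip().rstrip(":"),  # "A:" becomes "A"
        skeleton=skeleton,
        diacritized=diacritized,
    )


# Made-up sample line, for illustration only.
sample_line = "[5.7189]\t[7.0312]\tA:\t\u0645\u0631\u062d\u0628\u0627\tmarHabA\n"
utterance = parse_line(sample_line)
```

<p>When reading a real transcript file, opening it with
<code>encoding="utf-8"</code> would keep the consonant-skeleton field
intact.</p>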

<br>

<div align="center">*<br>

</div>

<p>(3)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">Web

1T 5-gram Version 1</a> contains English word n-grams and their

observed frequency counts. The length of the n-grams ranges from

unigrams (single words) to five-grams. This data will be useful for

statistical language modeling, e.g., for machine translation or speech

recognition, as well as for other uses.  The n-gram counts were

generated from approximately 1 trillion word tokens of text from

publicly accessible web pages. <br>

</p>
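<p>To show how n-gram counts of this kind can feed a statistical
language model, the sketch below computes a plain maximum-likelihood
estimate of a word's probability given the words before it: the count
of the full n-gram divided by the count of its history. The function
name and the counts are invented for illustration. They are not drawn
from the corpus, and a practical model would normally add smoothing.</p>

```python
from typing import Dict, Tuple

Ngram = Tuple[str, ...]


def conditional_probability(counts: Dict[Ngram, int], ngram: Ngram) -> float:
    """Maximum-likelihood estimate of P(last word | preceding words)."""
    history = ngram[:-1]
    if not history:
        raise ValueError("need at least a bigram to condition on a history")
    history_count = counts.get(history, 0)
    if history_count == 0:
        return 0.0  # unseen history: no estimate without smoothing
    return counts.get(ngram, 0) / history_count


# Invented counts, for illustration only.
toy_counts = {
    ("speech",): 1000,
    ("speech", "recognition"): 250,
    ("speech", "synthesis"): 50,
}
p_recognition = conditional_probability(toy_counts, ("speech", "recognition"))
```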

<p>The input encoding of documents was automatically detected, and all

text was converted to UTF-8.  The data was tokenized in a manner similar

to the tokenization of the Wall Street Journal portion of the Penn

Treebank. Notable exceptions include the following:</p>

<ul>

  <li>Hyphenated words are usually separated, and hyphenated numbers

usually form one token. </li>

  <li>Sequences of numbers separated by slashes (e.g., in dates) form

one token. </li>

  <li>Sequences that look like URLs or email addresses form one token. </li>

</ul>
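<p>The three exceptions above can be approximated with a small regular
expression, shown here as a toy Python sketch. It is not the tokenizer
used to build the corpus. The patterns for URLs and email addresses are
deliberately loose, and emitting the hyphen of a hyphenated word as its
own token is an assumption about what "separated" means.</p>

```python
import re

# Alternatives are tried in order, so the "one token" cases come first.
TOKEN_PATTERN = re.compile(
    r"""
    (?:https?://|www\.)\S+            # URL-like sequence
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email-like sequence
  | \d+(?:/\d+)+                      # numbers joined by slashes
  | \d+(?:-\d+)+                      # numbers joined by hyphens
  | \w+                               # ordinary word
  | [^\w\s]                           # any other single symbol
    """,
    re.VERBOSE,
)


def toy_tokenize(text: str) -> list:
    """Return the tokens of `text` under the toy rules above."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("state-of-the-art data from 12/25/2006")
```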

<br>

<hr size="2" width="100%"><br>

<div align="center"><font face="Courier New"><small><big><font

 face="Times New Roman"><br>

If

you need further

information, or would like to inquire about

membership in the LDC, please email <a class="moz-txt-link-abbreviated"

 href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> or call +1 215

573 1275.</font></big></small></font><br>

</div>

<p><font face="Courier New"><small><br>

<br>

</small></font>

</p>

<div align="center">--------------------------------------------------------------------<br>

</div>

<div align="center">

<pre class="moz-signature" cols="72">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                  <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

</div>

</body>

</html>