[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Mar 27 21:36:17 UTC 2008


LDC2008S02
-  CSLU: National Cellular Telephone Speech Release 2.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S02>

LDC2008T02
-  GALE Phase 1 Arabic Blog Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T02>

LDC2008S03
-  STC-TIMIT 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S03>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of three new publications.

------------------------------------------------------------------------

*New Publications*

(1)  CSLU: National Cellular Telephone Speech Release 2.3 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S02> 
was created by the Center for Spoken Language Understanding (CSLU) at 
OGI School of Science and Engineering, Oregon Health and Science 
University, Beaverton, Oregon. It consists of cellular telephone speech 
and corresponding transcripts, specifically, approximately one minute of 
speech from 2336 speakers calling from locations throughout the United 
States. 

Speakers called the CSLU data collection system on cellular telephones, 
and they were asked a series of questions. Two prompt protocols were 
used: an In Vehicle Protocol for speakers calling from inside a vehicle 
and a Not in Vehicle Protocol for those calling from outside a vehicle. 
The protocols shared several questions, but each protocol contained 
distinct queries designed to probe the conditions of the caller's in 
vehicle/not in vehicle surroundings.

The text transcriptions in this corpus were produced using the 
non-time-aligned, word-level conventions described in The CSLU Labeling 
Guide, which is included in the documentation for this release. CSLU: 
National Cellular Telephone Speech Release 2.3 contains orthographic and 
phonetic transcriptions of corresponding speech files.


***

(2)  GALE Phase 1 Arabic Blog Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T02> 
was prepared by the LDC and consists of 102K words (222 files) of Arabic 
blog text and its English translation from thirty-three sources. This 
release was used as training data in Phase 1 of the DARPA-funded GALE 
program.

The task of preparing this corpus involved four stages of work: data 
scouting, data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable blog 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest 
(sites, threads and posts) to a database. A nightly process queried the 
annotation database and harvested all designated URLs. Whenever 
possible, the entire site was downloaded, not just the individual thread 
or post located by the data scout.

Once the text was downloaded, its format was standardized so that the 
data could be more easily integrated into downstream annotation 
processes. Typically a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.

The selected documents were then reviewed for content suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general. 

After files were selected, they were reformatted into a human-readable 
translation format, and the files were then assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features (such as names and 
speech disfluencies), and quality control procedures applied to 
completed translations.

All final data are in Tab Delimited Format (TDF). TDF is compatible with 
other transcription formats, such as the Transcriber format and AG 
format, and it is easy to process. Each line of a TDF file corresponds 
to a speech segment and contains 13 tab-delimited fields. A source TDF 
file and its translation are identical except that the transcript in the 
source TDF is replaced by its English translation. 
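The segment-per-line layout described above makes TDF straightforward to read with a plain tab split. The sketch below illustrates this; the announcement does not enumerate the 13 field names, so the column labels here are assumptions for illustration only, and the corpus documentation should be consulted for the authoritative list.

```python
# Minimal sketch of reading a GALE-style Tab Delimited Format (TDF) file.
# Each line is one speech segment with 13 tab-delimited fields.
# NOTE: these field names are assumed for illustration, not taken from
# the release documentation.
ASSUMED_FIELDS = [
    "file", "channel", "start", "end", "speaker", "speaker_type",
    "speaker_dialect", "transcript", "section", "turn", "segment",
    "section_type", "su_type",
]

def read_tdf(path):
    """Yield one dict per speech segment (one per line of the TDF file)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = line.rstrip("\n").split("\t")
            if len(row) != len(ASSUMED_FIELDS):
                continue  # skip header, comment, or malformed lines
            yield dict(zip(ASSUMED_FIELDS, row))
```

Because a source TDF and its translation differ only in the transcript field, pairing the two readers line by line recovers aligned Arabic/English segments.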

***

(3)  STC-TIMIT 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S03> 
is a telephone version of TIMIT Acoustic Phonetic Continuous Speech 
Corpus, LDC93S1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> 
(TIMIT). TIMIT contains broadband recordings of 630 speakers of eight 
major dialects of American English reading ten phonetically rich 
sentences. Created in 1993, TIMIT was designed to provide speech data 
for acoustic-phonetic studies and for the development and evaluation of 
automatic speech recognition systems. In this TIMIT-derived corpus, the 
entire TIMIT database was passed through an actual telephone channel in 
a single call. Thus, a single type of channel distortion and noise 
affect the whole database.

The process was managed using a Dialogic switchboard for the calling and 
receiving ends. No transducer (microphone) was employed; the original 
digital signal was converted to analog using the switchboard's D/A 
converter, transmitted through a telephone channel and converted back to 
digital format before recording. As a result, the only distortion 
introduced is that of the telephone channel itself.

The STC-TIMIT 1.0 database is organized in the same manner as in the 
original TIMIT corpus: 4620 files belonging to the training partition 
and 1680 files belonging to the test partition. Utterances in STC-TIMIT 
1.0 are time-aligned with those of TIMIT with an average precision of 
0.125 ms (1 sample), by maximizing the cross-correlation between pairs 
of files from each corpus. Thus, labels from TIMIT may be used for 
STC-TIMIT 1.0, and the effects of telephone channels may be studied on a 
frame-by-frame basis.

Two telephone lines within the same building were connected to a 
Dialogic(R) card. One of the lines was used as the calling-end and 
played the speech file, while the other line was used as the 
receiving-end and recorded the new signal. The whole recording process 
was conducted in a single call.

After recording, the file was pre-cut according to the length of the 
corresponding TIMIT database file. Each resulting file was then aligned 
to its corresponding file in TIMIT using the xcorr routine in Matlab(R). 
Based on these results, the recorded file was sliced again from the 
original recorded file using the newly-generated alignments. Thus, each 
file in STC-TIMIT 1.0 is aligned to its equivalent in TIMIT and has the 
same length. 
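The alignment step above can be sketched in a few lines: find the lag that maximizes the cross-correlation between a TIMIT reference signal and its recorded telephone version, then slice the recording to the reference length. The corpus itself was produced with Matlab's xcorr routine; this NumPy version is an illustrative equivalent, and the function name and zero-padding fallback are the author's assumptions, not part of the release.

```python
# Sketch of cross-correlation alignment, as used to align STC-TIMIT 1.0
# files to their TIMIT equivalents (here with NumPy rather than Matlab's
# xcorr). At TIMIT's 16 kHz or a telephone channel's 8 kHz, one sample
# corresponds to the sub-millisecond precision quoted in the release.
import numpy as np

def align_to_reference(reference, recorded):
    """Return the slice of `recorded` best aligned with `reference`."""
    # Full cross-correlation; the peak index gives the relative lag of
    # `recorded` with respect to `reference`.
    xcorr = np.correlate(recorded, reference, mode="full")
    lag = int(np.argmax(xcorr)) - (len(reference) - 1)
    start = max(lag, 0)
    segment = recorded[start:start + len(reference)]
    # Zero-pad if the recording is too short at the chosen offset, so the
    # output always has the same length as the reference.
    if len(segment) < len(reference):
        segment = np.pad(segment, (0, len(reference) - len(segment)))
    return segment
```

Because the aligned file has the same length as its TIMIT counterpart, TIMIT's time-aligned labels can be reused directly, which is what enables the frame-by-frame channel studies mentioned above.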

------------------------------------------------------------------------

-- 


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
