[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Jul 30 15:47:01 UTC 2008


-  Collaboration between LDC and Georgetown University Press  -

LDC2008S06
-  CSLU: Alphadigit Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06>  -

LDC2008T08
-  GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08>  -

The Linguistic Data Consortium (LDC) would like to report on recent
developments and announce the availability of two new publications.

------------------------------------------------------------------------

*Collaboration between LDC and Georgetown University Press*

LDC is pleased to announce that the U.S. Department of Education 
<http://www.ed.gov/index.jhtml>, International Education Programs 
Service <http://www.ed.gov/about/offices/list/ope/iegps/index.html>, has 
funded a collaboration between LDC and Georgetown University Press 
<http://www.press.georgetown.edu/> (GUP) to create up-to-date lexical 
databases, with translations to and from English, for three dialects of 
colloquial Arabic. The databases will be used for interactive computer 
access and for new print publications of dictionaries in Iraqi, 
Syrian/Levantine and Moroccan dialects. 

The databases will be based on three GUP source dictionaries: /A
Dictionary of Iraqi Arabic, English-Arabic, Arabic-English/ (Clarity et
al., 2003), /A Dictionary of Syrian Arabic, English-Arabic/ (Stowasser
and Ani, 2004) and /A Dictionary of Moroccan Arabic, Arabic-English,
English-Arabic/ (Harrell and Sobelman, 2004). The work will draw on
contemporary principles of computational linguistics and current
pedagogical requirements to reflect current vocabulary and usage. It
will provide a standardized system of transcription and use Arabic
script, both vocalized and unvocalized, to show vowel pronunciation as
well as standard orthography. A searchable version on CD-ROM will
accompany each print reference. The project has been funded for three
years. Work will commence in Year 1 with the Iraqi Arabic dictionary,
proceed to the Syrian/Levantine dictionary and conclude with the
Moroccan Arabic dictionary.

The proposed dictionaries and databases aim to provide U.S. students and 
teachers of Arabic with current dialectal Arabic lexical information to 
enable them to communicate orally with native and non-native Arabic 
speakers. The scholarship used to create a modernized transcription 
system and to provide existing and new terms in Arabic script (including 
diacritics) may also help integrate instruction in dialect and Modern 
Standard Arabic by providing tools for curriculum developers.

*New Publications*


(1) CSLU: Alphadigit Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06>
is a collection of 78,044 utterances from 3,025 speakers saying
six-character strings of letters and digits over the telephone, for a
total of approximately 82 hours of speech. Each speech file has
corresponding orthographic and phonemic transcriptions. This corpus was
created by the Center for Spoken Language Understanding (CSLU), Oregon
Health & Science University, Beaverton, Oregon.

Participants received a list of 18-29 six-character strings (e.g., "a 2
b 4 5 g"); 1,102 different strings were used over the course of the
data collection. The lists were designed to balance phonetic context
across all letter and digit pairs. The data were recorded directly from
a digital phone line, without digital-to-analog or analog-to-digital
conversion at the recording end, using the CSLU T1 digital data
collection system. The sampling rate was 8 kHz and the files were
originally stored in 8-bit mu-law format on a UNIX file system. The
files have since been converted to RIFF standard file format with
16-bit linear encoding.
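
For readers unfamiliar with the two encodings mentioned above, the
following Python sketch shows how 8-bit mu-law samples can be expanded
to 16-bit linear PCM and written to a RIFF (WAV) file. It illustrates
the standard G.711 mu-law expansion only; it is not the tool CSLU or
LDC used. The function names are hypothetical, and the input is assumed
to be headerless, single-channel mu-law bytes sampled at 8 kHz.

```python
import io
import struct
import wave

BIAS = 0x84  # offset (132) used by G.711 mu-law companding


def mulaw_byte_to_pcm16(code: int) -> int:
    """Expand one 8-bit mu-law code word to a signed 16-bit linear sample."""
    code = ~code & 0xFF                      # code words are stored bit-inverted
    magnitude = ((code & 0x0F) << 3) + BIAS  # 4-bit mantissa plus bias
    magnitude <<= (code & 0x70) >> 4         # scale by the 3-bit exponent
    # The top bit of the inverted code word carries the sign.
    return BIAS - magnitude if code & 0x80 else magnitude - BIAS


def mulaw_to_wav(mulaw_bytes: bytes, out, sample_rate: int = 8000) -> None:
    """Write raw mu-law bytes as a mono, 16-bit linear WAV file.

    `out` may be a filename or a writable, seekable binary file object.
    """
    samples = [mulaw_byte_to_pcm16(b) for b in mulaw_bytes]
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)            # assumption: single channel
        wav.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)  # 8 kHz, as stated above
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Since the corpus is already distributed as 16-bit linear RIFF files,
this step is not needed to use it; the sketch only makes the conversion
described above concrete.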

All of the files included in this corpus have corresponding
non-time-aligned word-level transcriptions and time-aligned
phoneme-level transcriptions (automatic forced alignment) that comply
with the conventions in the CSLU Labeling Guide.

***

(2) GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08>
contains transcripts and English translations of 22.9 hours of Chinese
broadcast news programming from China Central TV (CCTV) and Phoenix TV.
It does not contain the audio files from which the transcripts and
translations were generated. This release is the second part of the
three-part GALE Phase 1 Chinese Broadcast News Parallel Text, which,
along with other corpora, was used as training data in Year 1 (Phase 1)
of the DARPA-funded GALE program.

The 22.9 hours of Chinese broadcast news recordings were selected from
two sources: CCTV (a broadcaster from Mainland China) and Phoenix TV (a
Hong Kong-based satellite TV station). The transcripts and translations
represent recordings of five different programs.

A manual selection procedure was used to choose data appropriate for the
GALE program, namely, news programs focusing on current events. Stories
on topics such as sports, entertainment and stock markets were excluded
from the data set. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich
Transcription specification. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, the files were reformatted into a
human-readable translation format and then assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names
and speech disfluencies), and quality control procedures applied to
completed translations.


------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

