[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Jul 30 15:47:01 UTC 2008
- Collaboration between LDC and Georgetown University Press -
LDC2008S06
- CSLU: Alphadigit Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06> -
LDC2008T08
- GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08> -
The Linguistic Data Consortium (LDC) would like to report on recent
developments and announce the availability of two new publications.
------------------------------------------------------------------------
Collaboration between LDC and Georgetown University Press
LDC is pleased to announce that the U.S. Department of Education
<http://www.ed.gov/index.jhtml>, International Education Programs
Service <http://www.ed.gov/about/offices/list/ope/iegps/index.html>, has
funded a collaboration between LDC and Georgetown University Press
<http://www.press.georgetown.edu/> (GUP) to create up-to-date lexical
databases, with translations to and from English, for three dialects of
colloquial Arabic. The databases will be used for interactive computer
access and for new print publications of dictionaries in Iraqi,
Syrian/Levantine and Moroccan dialects.
The databases will be based on three GUP source dictionaries: /A
Dictionary of Iraqi Arabic, English-Arabic, Arabic-English/ (Clarity et
al., 2003), /A Dictionary of Syrian Arabic, English-Arabic/ (Stowasser
and Ani, 2004) and /A Dictionary of Moroccan Arabic, Arabic-English,
English-Arabic/ (Harrell and Sobelman, 2004). Drawing on contemporary
principles of computational linguistics and current pedagogical
requirements, the work will reflect current vocabulary and usage,
provide a standardized system of transcription and use the Arabic
script, both vocalized and unvocalized, to show vowel pronunciation as
well as standard orthography. A searchable version on CD-ROM will
accompany each print reference. The project has been funded for three
years. Work will commence in Year 1 with the Iraqi Arabic dictionary,
proceed to the Syrian/Levantine dictionary and conclude with the
Moroccan Arabic dictionary.
The proposed dictionaries and databases aim to provide U.S. students and
teachers of Arabic with current dialectal Arabic lexical information to
enable them to communicate orally with native and non-native Arabic
speakers. The scholarship used to create a modernized transcription
system and to provide existing and new terms in Arabic script (including
diacritics) may also help integrate instruction in dialect and Modern
Standard Arabic by providing tools for curriculum developers.
New Publications
(1) CSLU: Alphadigit Version 1.3
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S06>
is a collection of 78,044 utterances from 3,025 speakers saying
six-character strings of letters and digits over the telephone, for a
total of approximately 82 hours of speech. Each speech file has corresponding
orthographic and phonemic transcriptions. This corpus was created by the
Center for Spoken Language Understanding (CSLU), Oregon Health & Science
University, Beaverton, Oregon.
Participants received a list of 18-29 six-character strings (e.g., "a 2 b 4
5 g"); 1,102 different strings were used over the course of the
data collection. The lists were designed to balance phonetic context
across all letter and digit pairs. The data were recorded directly from
a digital phone line, without digital-to-analog or analog-to-digital
conversion at the recording end, using the CSLU T1 digital data
collection system. The sampling rate was 8 kHz and the files were stored
in 8-bit mu-law format on a UNIX file system. The files have since been
converted to RIFF standard file format with 16-bit linear encoding.
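For readers unfamiliar with the encoding step mentioned above, the following is a minimal sketch of how 8-bit mu-law samples can be expanded to 16-bit linear samples and wrapped in a RIFF/WAVE container. It is not the tool CSLU or LDC used. It assumes the standard G.711 mu-law expansion, a single channel, and the 8 kHz sampling rate given above; the function names are hypothetical and chosen only for illustration.

```python
import io
import struct
import wave

BIAS = 0x84  # bias used by the standard G.711 mu-law expansion (assumed variant)


def ulaw_byte_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_to_wav_bytes(ulaw_data: bytes, sample_rate: int = 8000) -> bytes:
    """Return a mono 16-bit RIFF/WAVE file (as bytes) for raw mu-law input."""
    samples = [ulaw_byte_to_linear(b) for b in ulaw_data]
    pcm = struct.pack("<%dh" % len(samples), *samples)  # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # telephone speech: one channel (assumption)
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit linear
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Because each 8-bit code becomes a 2-byte sample, the converted audio data is twice the size of the mu-law original, plus a short file header.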
All of the files included in this corpus have corresponding
non-time-aligned word-level transcriptions and time-aligned
phoneme-level transcriptions (automatic forced alignment) that comply
with the conventions in the CSLU Labeling Guide.
(2) GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T08>
contains transcripts and English translations of 22.9 hours of Chinese
broadcast news programming from China Central TV (CCTV) and Phoenix TV.
It does not contain the audio files from which the transcripts and
translations were generated. GALE Phase 1 Chinese Broadcast News
Parallel Text - Part 2 is the second part of the three-part GALE Phase 1
Chinese Broadcast News Parallel Text, which, along with other corpora,
was used as training data in Year 1 (Phase 1) of the DARPA-funded GALE
program.
A total of 22.9 hours of Chinese broadcast news recordings were selected
from two sources, CCTV (a broadcaster from Mainland China) and Phoenix
TV (a Hong Kong-based satellite TV station). The transcripts and
translations represent recordings of five different programs.
A manual selection procedure was used to choose data appropriate for the
GALE program, namely, news programs focusing on current events. Stories
on topics such as sports, entertainment and stock markets were excluded
from the data set. Manual sentence unit/segment (SU) annotation was
also performed on a subset of files following LDC's Quick Rich
Transcription specification. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, the files were reformatted into a
human-readable translation format and then assigned to
professional translators for careful translation. Translators followed
LDC's GALE Translation guidelines, which describe the makeup of the
translation team, the source data format, the translation data format,
best practices for translating certain linguistic features (such as
names and speech disfluencies), and quality control procedures applied
to completed translations.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu