Arabic-L:LING:LDC New Arabic Resources
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Tue Jan 23 19:46:59 UTC 2007
------------------------------------------------------------------------
Arabic-L: Tue 23 Jan 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:LDC New Arabic Resources
-------------------------Messages-----------------------------------
1)
Date: 23 Jan 2007
From:ldc at ldc.upenn.edu
Subject:LDC New Arabic Resources
Preview of Membership Year 2007
LDC Collaborating with IRCAM
LDC2007T02
English Chinese Translation Treebank v 1.0
LDC2007S01
Levantine Arabic Conversational Telephone Speech
LDC2007T01
Levantine Arabic Conversational Telephone Speech, Transcripts
In this month's newsletter, the Linguistic Data Consortium (LDC)
would like to provide a preview of Membership Year 2007, note a
recent collaboration, and announce the availability of three new
publications.
Preview of Membership Year 2007
Membership Year (MY) 2007 is shaping up to be an exciting one for the
LDC. First and foremost, MY 2007 marks the LDC's 15th Anniversary!
As we reflect on the past fifteen years, it is essential to note how
greatly the LDC has evolved while still adhering to our goal to share
language-technology resources. A quick review of our online catalog
underscores the LDC's growing role in data collection and creation.
In 1993, all corpora the LDC distributed were externally provided,
while last year almost 40% of our publications were produced in-house
and authored by LDC Staff. By creating data that we distribute, the
LDC remains responsive to the changing needs of the research
community that it has supported for fifteen years.
As in previous years, MY 2007 will offer a substantial selection of
corpora. A few of the corpora in the pipeline are updates to our
Gigaword corpora and data used in the GALE evaluation including
OntoNotes and parallel web text. Brief descriptions of our proposed
releases will be provided in our February newsletter.
Additionally, to ensure that the processing of our customers' credit
card information is as speedy and secure as possible, we will
transition to online credit card processing this year. Stay tuned
for future announcements regarding our online payment center.
Why not help us celebrate our 15th anniversary and sustain our
operations by becoming a member of the LDC? It's easier than
generative syntax! Click here for further information. Members of the
LDC are more popular, funnier and taller than their fellow
non-members, so what are you waiting for?
LDC Collaborating with IRCAM
LDC is pleased to announce that it has entered into a collaboration
with Institut Royal de la Culture Amazighe (IRCAM), Rabat, Morocco, an
organization devoted to the preservation and promotion of the Amazigh
language and culture. Two Amazigh scholars from IRCAM, Aïcha Bouhjar
and Rachid Laabdelaoui, just completed a month-long stay at LDC
during which they worked with LDC’s team on the Less Commonly Taught
Languages (LCTL) project to develop language resources for Amazigh.
LDC looks forward to future joint projects and to a long and
successful collaboration with IRCAM.
New Publications
English Chinese Translation Treebank v 1.0 consists of 146,300 words
in 325 files of individual news stories from Xinhua News Agency
(corresponding to the Xinhua data in the Chinese Treebank 5.0, LDC
Catalog No.: LDC2005T01) that have been translated into English,
part-of-speech tagged, and treebanked. The files were compressed using
gzip.
The source files for the treebank annotation contain the final
updated translation of these files. Translation errors that prevented
complete treebank annotation have been corrected. This translation
and annotation were completed in October 2004 and supersede
any earlier translation. English Chinese Translation Treebank v 1.0
is distributed via web download.
2007 Subscription Members will automatically receive two copies of
this corpus on disc. 2007 Standard Members may request a copy as part
of their 16 free membership corpora. Nonmembers may license this data
for US$500.
*
Levantine Arabic Conversational Telephone Speech contains 982
Levantine Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Levantine Arabic. A total of 985
conversation sides are provided (there are three speakers who each
appear in two distinct conversations). The average duration per side
is between 5 and 6 minutes. Levantine Arabic Conversational
Telephone Speech is distributed on one DVD-ROM.
2007 Subscription Members will automatically receive two copies of
this corpus on disc. 2007 Standard Members may request a copy as part
of their 16 free membership corpora. Nonmembers may license this data
for US$400.
*
Levantine Arabic Conversational Telephone Speech, Transcripts
contains 982 Levantine Arabic speakers taking part in spontaneous
telephone conversations in Colloquial Levantine Arabic. A total of
985 conversation sides are provided (there are three speakers who
each appear in two distinct conversations). The average duration per
side is between 5 and 6 minutes.
Each transcript file is a flat, plain-text table, where each line
contains information for a single contiguous utterance, presented via
the following tab-delimited fields:
1. beginning and ending time stamps, in seconds; each time stamp is
in square brackets, and the two values are separated by a space (e.g.
"[5.7189] [9.2135]" -- here, duration is about 3.5 sec)
2. channel/speaker-ID ("A:" or "B:")
3. MSA-based "consonant skeleton" orthography for the utterance,
using Arabic script characters in UTF-8 encoding
4. Fully "diacritized" orthography for the utterance, reflecting the
actual pronunciation, using Arabic characters in Buckwalter (ASCII)
transliteration
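The four fields described above could be read with a short script along
the following lines. This is only a minimal sketch based on the
description in this announcement, not LDC tooling: the names Utterance
and parse_line are hypothetical, the sample line is made up for
illustration (it is not taken from the corpus), and the assumption that
every line has exactly four tab-separated fields should be checked
against the actual files.

```python
import re
from typing import NamedTuple


class Utterance(NamedTuple):
    """One transcript line (hypothetical container, not an LDC type)."""
    start: float       # beginning time stamp, in seconds
    end: float         # ending time stamp, in seconds
    channel: str       # "A" or "B"
    skeleton: str      # Arabic-script consonant skeleton (UTF-8)
    diacritized: str   # Buckwalter (ASCII) transliteration


# Field 1 looks like "[5.7189] [9.2135]": two bracketed numbers, space-separated.
_TIMES = re.compile(r"\[(\d+(?:\.\d+)?)\]\s+\[(\d+(?:\.\d+)?)\]")


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its four fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        # Assumption: exactly four fields per line, as the announcement describes.
        raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}")
    times, channel, skeleton, diacritized = fields
    match = _TIMES.fullmatch(times.strip())
    if match is None:
        raise ValueError(f"unrecognized time stamp field: {times!r}")
    return Utterance(
        start=float(match.group(1)),
        end=float(match.group(2)),
        channel=channel.strip().rstrip(":"),  # "A:" -> "A"
        skeleton=skeleton,
        diacritized=diacritized,
    )


# Made-up sample line using the time stamps quoted in the announcement.
sample = "[5.7189] [9.2135]\tA:\tمرحبا\tmaroHabA"
utt = parse_line(sample)
print(utt.channel, round(utt.end - utt.start, 4))
```

When reading a whole file, opening it with encoding="utf-8" and passing
each line to parse_line keeps the Arabic-script field intact.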
Levantine Arabic Conversational Telephone Speech, Transcripts is
distributed via web download.
2007 Subscription Members will automatically receive two copies of
this corpus on disc. 2007 Standard Members may request a copy as part
of their 16 free membership corpora. Nonmembers may license this data
for US$200.
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: (215) 573-1275
University of Pennsylvania       Fax: (215) 573-2175
3600 Market St., Suite 810       ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA       http://www.ldc.upenn.edu
------------------------------------------------------------------------
--
End of Arabic-L: 23 Jan 2007