Arabic-L:LING:LDC New Arabic Resources

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Tue Jan 23 19:46:59 UTC 2007


------------------------------------------------------------------------
Arabic-L: Tue 23 Jan 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC New Arabic Resources

-------------------------Messages-----------------------------------
1)
Date: 23 Jan 2007
From:ldc at ldc.upenn.edu
Subject:LDC New Arabic Resources

Preview of Membership Year 2007

LDC Collaborating with IRCAM

LDC2007T02
English Chinese Translation Treebank v 1.0

LDC2007S01
Levantine Arabic Conversational Telephone Speech

LDC2007T01
Levantine Arabic Conversational Telephone Speech, Transcripts

In this month's newsletter, the Linguistic Data Consortium (LDC)  
would like to provide a preview of Membership Year 2007, note a  
recent collaboration, and announce the availability of three new  
publications.




Preview of Membership Year 2007

Membership Year (MY) 2007 is gearing up to be an exciting one for the  
LDC.   First and foremost, MY 2007 marks the LDC's 15th Anniversary!   
As we reflect on the past fifteen years, it is essential to note how  
greatly the LDC has evolved while still adhering to our goal to share  
language-technology resources.  A quick review of our online catalog  
underscores the LDC's growing role in data collection and creation.   
In 1993, all corpora the LDC distributed were externally provided,  
while last year almost 40% of our publications were produced in-house  
and authored by LDC Staff.  By creating data that we distribute, the  
LDC remains responsive to the changing needs of the research  
community that it has supported for fifteen years.

As in previous years, MY 2007 will offer a substantial selection of  
corpora.  A few of the corpora in the pipeline are updates to our  
Gigaword corpora and data used in the GALE evaluation including  
OntoNotes and parallel web text.  Brief descriptions of our proposed  
releases will be provided in our February newsletter.

Additionally, to ensure that the processing of our customers' credit  
card information is as speedy and secure as possible, we will  
transition to online credit card processing this year.  Stay tuned  
for future announcements regarding our online payment center.

Why not help us celebrate our 15th anniversary and sustain our  
operations by becoming a member of the LDC?  It's easier than  
generative syntax! Click here for further information. Members of the  
LDC are more popular, funnier and taller than their fellow  
non-members, so what are you waiting for?


LDC Collaborating with IRCAM

LDC is pleased to announce that it has entered into a collaboration  
with Institut Royal de la Culture Amazighe (IRCAM), Rabat, Morocco, an  
organization devoted to the preservation and promotion of the Amazigh  
language and culture. Two Amazigh scholars from IRCAM, Aïcha Bouhjar  
and Rachid Laabdelaoui, just completed a month-long stay at LDC  
during which they worked with LDC’s team on the Less Commonly Taught  
Languages (LCTL) project to develop language resources for Amazigh.  
LDC looks forward to future joint projects and to a long and  
successful collaboration with IRCAM.


New Publications

English Chinese Translation Treebank v 1.0 consists of 146,300 words  
in 325 files of individual news stories from Xinhua News Agency  
(corresponding to the Xinhua data in the Chinese Treebank 5.0, LDC  
Catalog No.: LDC2005T01) that are translated into English,  
part-of-speech tagged and treebanked. The files were compressed using gzip.

The source files for the treebank annotation contain the final  
updated translation of these files. Translation errors that prevented  
complete treebank annotation have been corrected. The translation  
and annotation were completed in October 2004, and this release supersedes  
any earlier translation.  English Chinese Translation Treebank v 1.0  
is distributed via web download.

2007 Subscription Members will automatically receive two copies of  
this corpus on disc. 2007 Standard Members may request a copy as part  
of their 16 free membership corpora. Nonmembers may license this data  
for US$500.

*

Levantine Arabic Conversational Telephone Speech contains 982  
Levantine Arabic speakers taking part in spontaneous telephone  
conversations in Colloquial Levantine Arabic. A total of 985  
conversation sides are provided (there are three speakers who each  
appear in two distinct conversations). The average duration per side  
is between 5 and 6 minutes.  Levantine Arabic Conversational  
Telephone Speech is distributed on one DVD-ROM.

2007 Subscription Members will automatically receive two copies of  
this corpus on disc. 2007 Standard Members may request a copy as part  
of their 16 free membership corpora. Nonmembers may license this data  
for US$400.

*

Levantine Arabic Conversational Telephone Speech, Transcripts  
contains transcripts for 982 Levantine Arabic speakers taking part in spontaneous  
telephone conversations in Colloquial Levantine Arabic. A total of  
985 conversation sides are provided (there are three speakers who  
each appear in two distinct conversations). The average duration per  
side is between 5 and 6 minutes.

Each transcript file is a flat, plain-text table, where each line  
contains information for a single contiguous utterance, presented via  
the following tab-delimited fields:

1. beginning and ending time stamps, in seconds; each time stamp is  
in square brackets, and the two values are separated by a space (e.g.  
"[5.7189] [9.2135]" -- here, duration is about 3.5 sec)

2. channel/speaker-ID ("A:" or "B:")

3. MSA-based "consonant skeleton" orthography for the utterance,  
using Arabic script characters in UTF-8 encoding

4. Fully "diacritized" orthography for the utterance, reflecting the  
actual pronunciation, using Arabic characters in Buckwalter (ASCII)  
transliteration
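
As a rough illustration of the four-field layout described above, the sketch below parses one transcript line in Python. It is not part of the LDC release: the names Utterance and parse_line are hypothetical, the sample utterance is invented rather than taken from the corpus, and the code assumes every line has exactly four tab-delimited fields with time stamps written as plain decimal numbers.

```python
import re
from typing import NamedTuple

# Field 1 as described above, e.g. "[5.7189] [9.2135]".
# Assumption: time stamps are plain non-negative decimals.
TIME_FIELD = re.compile(r"^\[(\d+(?:\.\d+)?)\] \[(\d+(?:\.\d+)?)\]$")


class Utterance(NamedTuple):
    """One transcript line (hypothetical container, not an LDC type)."""
    start: float       # beginning time stamp, in seconds
    end: float         # ending time stamp, in seconds
    speaker: str       # "A" or "B", trailing colon removed
    arabic: str        # consonant-skeleton orthography, Arabic script
    buckwalter: str    # diacritized form, Buckwalter transliteration

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its four fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}")
    times, channel, arabic, buckwalter = fields

    match = TIME_FIELD.match(times)
    if match is None:
        raise ValueError(f"unrecognized time stamps: {times!r}")
    if channel not in ("A:", "B:"):
        raise ValueError(f"unrecognized channel/speaker-ID: {channel!r}")

    return Utterance(
        start=float(match.group(1)),
        end=float(match.group(2)),
        speaker=channel[:-1],
        arabic=arabic,
        buckwalter=buckwalter,
    )


# Invented sample line, using the time stamps from the example above.
sample = "[5.7189] [9.2135]\tA:\tمرحبا\tmarHabA\n"
utt = parse_line(sample)
print(utt.speaker, round(utt.duration, 1))
```

To process a whole transcript, one could open the file with encoding="utf-8" (field 3 is UTF-8 Arabic script) and call parse_line on each line; whether real files contain header or comment lines that need skipping is not stated here, so that would need checking against the corpus documentation.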

Levantine Arabic Conversational Telephone Speech, Transcripts is  
distributed via web download.

2007 Subscription Members will automatically receive two copies of  
this corpus on disc. 2007 Standard Members may request a copy as part  
of their 16 free membership corpora. Nonmembers may license this data  
for US$200.




Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu

--------------------------------------------------------------------------
End of Arabic-L:  23 Jan 2007