Arabic-L:LING:New Releases from the LDC

Dilworth Parkinson dilworth_parkinson at
Mon Mar 7 22:48:21 UTC 2005

Arabic-L: Mon 07 Mar  2005
Moderator: Dilworth Parkinson <dilworth_parkinson at>
[To post messages to the list, send them to arabic-l at]
[To unsubscribe, send message from same address you subscribed from to
listserv at with first line reading:
            unsubscribe arabic-l                                      ]


1) Subject:New Releases from the LDC

Date: 07 Mar  2005
From:ldc at (from CORPORA LIST)
Subject:New Releases from the LDC

(1)  ACE Time Normalization (TERN) 2004 English Training Data contains
the English training data prepared for the 2004 Time Expression  
Recognition and Normalization (TERN) Evaluation.  The purpose of this  
corpus and the TERN evaluation is to advance the state of the art in  
the automatic recognition and normalization of natural language  
temporal expressions. In most language contexts such expressions are  
indexical. For example, with "Monday", "last week", or "three months  
starting October 1", one must know the narrative reference time in  
order to pinpoint the time interval being conveyed by the expression.

In addition, for data exchange purposes, it is essential that the  
identified interval be rendered according to an established standard,  
i.e., normalized. Accurate identification and normalization of temporal  
expressions is in turn essential for the temporal reasoning being  
demanded by advanced NLP applications such as question answering,  
information extraction, and summarization. 
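
As a rough illustration of why the reference time matters, the following Python sketch resolves two of the indexical expressions mentioned above against a given reference date and renders the result as ISO 8601 dates. It is only a minimal sketch under stated assumptions: the function name normalize, the Monday-to-Sunday week, and the reading of a bare weekday as the most recent such day are all choices made here for illustration. It does not reproduce the annotation standard used in the TERN evaluation.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression, reference):
    """Return an ISO 8601 (start, end) date pair for a few expressions.

    Hypothetical helper for illustration only; not the TERN standard.
    """
    text = expression.strip().lower()
    if text in WEEKDAYS:
        # Assumption: a bare weekday means the most recent such day
        # on or before the reference date.
        delta = (reference.weekday() - WEEKDAYS.index(text)) % 7
        day = reference - timedelta(days=delta)
        return day.isoformat(), day.isoformat()
    if text == "last week":
        # Assumption: weeks run Monday through Sunday.
        this_monday = reference - timedelta(days=reference.weekday())
        start = this_monday - timedelta(days=7)
        end = this_monday - timedelta(days=1)
        return start.isoformat(), end.isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


reference = date(2005, 3, 7)  # a Monday
print(normalize("last week", reference))  # ('2005-02-28', '2005-03-06')
print(normalize("last week", date(2005, 3, 14)))  # ('2005-03-07', '2005-03-13')
```

The same string "last week" maps to two different intervals depending on the reference date, which is the indexicality described above.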

(2)  Arabic Treebank: Part 1 v 3.0 (POS with full vocalization and
syntactic analysis) is a re-release of the LDC corpus Arabic Treebank:
Part 1 v 2.0, with improved morphological/part-of-speech annotation,
including full vocalization and case endings.  The corpus supports the
development of data-driven
approaches to natural language processing (NLP), human language  
technologies, automatic content extraction, cross-lingual information  
retrieval, information detection, and other forms of linguistic  
research on Modern Standard Arabic.

The project targets the description of a written Modern Standard Arabic  
corpus from the Agence France Presse (AFP) newswire archives for  
July-November 2000. This corpus includes 734 stories representing 145K  

(3) Multiple Translation Arabic (MTA) Part 2 supports the development
of automatic means for evaluating translation quality. The corpus
contains 4 sets of human translations and 2 sets of commercial
off-the-shelf (COTS) system outputs for a single set of Arabic source
materials.  Additionally, there is one output set from a TIDES 2003 MT
Evaluation participant, which is representative of state-of-the-art
research systems.

To see if automatic evaluation metrics such as BLEU track human
assessment, the LDC performed human assessment on the two COTS outputs
and the TIDES research system. The corpus includes the assessment
results for one of the two COTS systems, the assessment results for the
TIDES research system, and the specifications used for conducting the  
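
To give a sense of how several human translations can serve as references for an automatic metric, here is a minimal Python sketch of clipped (modified) n-gram precision, the core ingredient of BLEU. It is an illustrative sketch, not full BLEU: it omits the brevity penalty and the combination across n-gram orders. The function names and the example sentences are made up here and are not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references.

    Each candidate n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Made-up example: one system output scored against two references.
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the cat sat on the mat", references, 1))
print(modified_precision("the cat sat on the mat", references, 2))
```

Scores of this kind, computed per system, are what one would compare against the human assessment results to see whether the automatic metric ranks systems the same way.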

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at or call +1 215
573 1275.


Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                                          ldc at
Philadelphia, PA 19104            

End of Arabic-L:  07 Mar  2005
