Arabic-L:LING:New Releases from the LDC
Dilworth Parkinson
dilworth_parkinson at byu.edu
Mon Mar 7 22:48:21 UTC 2005
-------------------------------------------------------------------------
Arabic-L: Mon 07 Mar 2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New Releases from the LDC
-------------------------Messages-----------------------------------
1)
Date: 07 Mar 2005
From:ldc at ldc.upenn.edu (from CORPORA LIST)
Subject:New Releases from the LDC
(1) ACE Time Normalization (TERN) 2004 English Training Data contains
the English training data prepared for the 2004 Time Expression
Recognition and Normalization (TERN) Evaluation. The purpose of this
corpus and the TERN evaluation is to advance the state of the art in
the automatic recognition and normalization of natural language
temporal expressions. In most language contexts such expressions are
indexical. For example, with "Monday", "last week", or "three months
starting October 1", one must know the narrative reference time in
order to pinpoint the time interval being conveyed by the expression.
In addition, for data exchange purposes, it is essential that the
identified interval be rendered according to an established standard,
i.e., normalized. Accurate identification and normalization of temporal
expressions are in turn essential for the temporal reasoning
demanded by advanced NLP applications such as question answering,
information extraction, and summarization.
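
To make the point about indexical expressions concrete, here is a minimal
Python sketch of normalizing an expression against a reference date. It is
an illustration only, not the TERN annotation scheme or any LDC tool: the
function name normalize, the two rules it covers, and the reading of a bare
weekday as "the most recent such day on or before the reference date" are
all assumptions made for this example. Output uses ISO 8601 date and week
notation.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression: str, reference: date) -> str:
    """Map a temporal expression to an ISO 8601 string (hypothetical helper).

    Only two toy rules are implemented; a real system needs far more.
    """
    text = expression.strip().lower()

    if text in WEEKDAYS:
        # Assumption: a bare weekday name refers to the most recent such
        # day on or before the reference date.
        offset = (reference.weekday() - WEEKDAYS.index(text)) % 7
        return (reference - timedelta(days=offset)).isoformat()

    if text == "last week":
        # ISO week containing the day seven days before the reference date.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"

    raise ValueError(f"no rule for expression: {expression!r}")


if __name__ == "__main__":
    # The same surface string resolves differently per reference date.
    print(normalize("Monday", date(2005, 3, 7)))
    print(normalize("Monday", date(2004, 10, 15)))
    print(normalize("last week", date(2005, 3, 7)))
```

The same string "Monday" yields a different date for each reference date,
which is why the narrative reference time has to be known before the
expression can be normalized.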
(2) Arabic Treebank: Part 1 v 3.0 (POS with full vocalization and
syntactic analysis) is a re-release of the LDC corpus Arabic Treebank:
Part 1 v 2.0, with the addition of improved
morphological/part-of-speech annotation including full vocalization and
case endings. The corpus supports the development of data-driven
approaches to natural language processing (NLP), human language
technologies, automatic content extraction, cross-lingual information
retrieval, information detection, and other forms of linguistic
research on Modern Standard Arabic.
The project targets the description of a written Modern Standard Arabic
corpus from the Agence France Presse (AFP) newswire archives for
July-November 2000. This corpus includes 734 stories representing 145K
words.
(3) Multiple Translation Arabic (MTA) Part 2 supports the development
of automatic means for evaluating translation quality. The corpus
contains 4 sets of human translations and 2 sets of outputs from
commercial off-the-shelf (COTS) systems for a single set of
Arabic source materials. Additionally, there is one output set from a
TIDES 2003 MT Evaluation participant, which is representative of
state-of-the-art research systems.
To see if automatic evaluation systems, such as BLEU, track human
assessment, the LDC performed human assessment on the two COTS outputs
and the TIDES research system. The corpus includes the assessment
results for one of the two COTS systems, the assessment result for the
TIDES research system, and the specifications used for conducting the
assessments.
If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215
573 1275.
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax: (215) 573-2175
Suite 810                           ldc at ldc.upenn.edu
Philadelphia, PA 19104              http://www.ldc.upenn.edu
--------------------------------------------------------------------------
End of Arabic-L: 07 Mar 2005