Arabic-L:LING:Cross Language Information Retrieval Eval Campaign

Dilworth B. Parkinson Dilworth_Parkinson at byu.edu
Mon May 7 14:40:36 UTC 2001


----------------------------------------------------------------------
Arabic-L: Mon 07 May 2001
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message to listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory-------------------------------------

1) Subject: Cross Language Information Retrieval Eval Campaign

-------------------------Messages--------------------------------------
1)
Date: 07 May 2001
From:  Doug Oard <oard at glue.umd.edu>
Subject: Cross Language Information Retrieval Eval Campaign

Members of this list might find this evaluation campaign to be of
interest.  We would be happy to answer any questions about how to get
involved.  We're also interested in learning about Arabic resources that
participants in the evaluation might find useful.

Fred Gey and Doug Oard

 =========================================================================
TREC-2001 Cross Language Information Retrieval (CLIR) Track Guidelines


The U.S. National Institute of Standards and Technology (NIST) will
conduct an evaluation of Cross-Language Information Retrieval (CLIR)
technology in conjunction with the Text Retrieval Conference
(TREC-2001).  The focus this year will be retrieval of Arabic language
newswire documents from topics in English or French.  Participation is
open to all TREC participants (information on joining TREC is
available at http://trec.nist.gov).

Corpus: 383,872 Arabic documents (896 MB), AFP newswire, in Unicode
(encoded as UTF-8), with SGML markup; a reading sketch appears after
the licensing options below.  The corpus is available now
from the Linguistic Data Consortium (LDC) Catalog Number LDC2001T55
(see http://www.ldc.upenn.edu/Catalog/LDC2001T55.html) using one of
three arrangements:

(1) Organizations with membership in the Linguistic Data Consortium
(for 2001) may order the corpus at no additional charge.  If your
research group is not a member, the LDC can check and tell you if
another part of your organization already has a membership for this
year.  If so (and if you are geographically colocated), it may be
possible for that group to order the corpus without additional charge
through their membership.  Membership in the Linguistic Data
Consortium costs $2,000 per year for nonprofit organizations
(profit-making organizations that are not currently members will
likely prefer the next option) and provides rights to research use
(that do not expire) for all materials released by the LDC during that
year.

(2) Non-members may purchase rights to use the corpus for research
purposes for $800.  These rights do not expire, and are described in
more detail at http://www.ldc.upenn.edu/Membership/FAQ_NonMembers.html.

(3) The Linguistic Data Consortium can negotiate an evaluation-only
license at no cost for research groups that are unable to pay the $800
fee.  An evaluation-only license permits use of the data only for the
duration of the TREC-2001 CLIR evaluation.  Please contact
ldc at ldc.upenn.edu if you need further information on evaluation-only
licenses.
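
For teams getting started with the collection, here is a minimal
Python sketch of reading one corpus file.  It assumes TREC-style
markup in which each article sits inside <DOC> tags with <DOCNO> and
<TEXT> children; the actual tag inventory should be checked against
the LDC2001T55 documentation.

   # Sketch only: iterate over (docno, text) pairs in one UTF-8 corpus file.
   # The <DOC>/<DOCNO>/<TEXT> tag names are assumed, not taken from the
   # LDC documentation.
   import re

   DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.S)
   DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.S)
   TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.S)

   def iter_documents(path):
       """Yield (docno, text) pairs from a single corpus file."""
       with open(path, encoding="utf-8") as f:
           data = f.read()
       for doc in DOC_RE.finditer(data):
           body = doc.group(1)
           docno = DOCNO_RE.search(body)
           text = TEXT_RE.search(body)
           if docno and text:
               yield docno.group(1), text.group(1).strip()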

Topics: Twenty-five topics are being developed in English by NIST, in
the same format as typical TREC topics (title, description, and
narrative).  Translations of the topics into French will be available
for use by teams that prefer French/Arabic CLIR.  Arabic translations
of the topics will also be available for use in monolingual
runs.
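
As a starting point, the sketch below parses a topic file laid out in
the usual TREC style (<top> records with <num>, <title>, <desc>, and
<narr> fields); the exact tags and field labels are assumptions to be
checked against the released topic file.  A title-plus-description
query, as required for one of the automatic runs described below, can
be built by simply ignoring the narrative field.

   # Sketch only: parse TREC-style topics into dictionaries.  Tag names and
   # field order (<num>, <title>, <desc>, <narr>) are assumed.
   import re

   TOPIC_RE = re.compile(
       r"<num>(?P<num>.*?)<title>(?P<title>.*?)"
       r"<desc>(?P<desc>.*?)<narr>(?P<narr>.*?)</top>",
       re.S)

   def parse_topics(path):
       """Return a list of topics, each a dict with num/title/desc/narr keys."""
       with open(path, encoding="utf-8") as f:
           data = f.read()
       return [{k: v.strip() for k, v in m.groupdict().items()}
               for m in TOPIC_RE.finditer(data)]

   def title_desc_query(topic):
       """Build a query string from the title and description fields only."""
       return " ".join([topic["title"], topic["desc"]])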

Result submission: Results will be submitted to NIST for pooling,
relevance assessment, and scoring in the standard TREC format (top
1000 documents in rank order for each query).  Participants may submit
up to 5 runs, and may score additional runs locally using the
relevance judgments that will be provided after relevance assessment
is completed.  It may not be possible to include all submitted runs in
the document pools that serve as a basis for relevance assessment, so
participants submitting more than one run should specify the order of
preference for scoring that would result in the most diverse possible
pools.
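
For reference, the standard TREC submission format is one line per
retrieved document, giving the topic identifier, the literal string
Q0, the document identifier, the rank, the score, and a run tag.  The
sketch below writes such a file; the exact run-tag conventions are
given in the formal NIST submission instructions and are not restated
here.

   def write_run(results, run_tag, path, max_rank=1000):
       """Write results in the standard TREC format: one line per document,
       'topic Q0 docno rank score run_tag', at most max_rank lines per topic.

       results maps each topic id to a list of (docno, score) pairs already
       sorted by decreasing score."""
       with open(path, "w", encoding="utf-8") as out:
           for topic_id, ranked in sorted(results.items()):
               for rank, (docno, score) in enumerate(ranked[:max_rank], 1):
                   out.write(f"{topic_id} Q0 {docno} {rank} {score:.4f} {run_tag}\n")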

Categories of runs: Participants will submit results for runs in one
or more of the following categories.  The principal focus of CLIR
track discussions at TREC-2001 will be on results in the Automatic
CLIR and Manual CLIR categories, but submissions of results in the
Monolingual category are also welcome, since they both enrich the
relevance assessment pools and provide a basis for comparison with
CLIR approaches.

   Automatic CLIR: Automatic CLIR systems formulate queries from the
   English or French topic content (Title, Description, Narrative fields)
   with no human intervention, and produce ranked lists of documents
   completely automatically based on those queries.  In general, any
   portion of the topic description may be used by automatic systems, but
   participants that submit any automatic run are required to submit one
   automatic run that uses only terms from the title and description
   fields, in order to facilitate cross-system comparison under similar
   conditions.

   Manual CLIR: Manual CLIR runs are any runs in which a user who has
   no practical knowledge of Arabic intervenes in any way in the
   process of query formulation and/or production of the ranked list
   for one or more topics.  The intervention might be as simple as
   manual removal of stop structure ("a relevant document will
   contain...") or as complex as manual query reformulation after
   examining translations of retrieved documents using an initial
   query.  A "practical knowledge of Arabic" is defined for this
   purpose as the ability to understand the gist of an Arabic news
   story or to carry on a simple conversation in Arabic.  Knowledge of
   a few Arabic words or an understanding of Arabic linguistic
   characteristics such as morphology or grammar does not constitute a
   "practical knowledge of Arabic" for this purpose.

   Monolingual Arabic: Monolingual runs are any runs in which use is
   made of the Arabic version of the topic description or in which a user
   who has a practical knowledge of Arabic intervenes in the process
   of query formulation and/or production of the ranked list.
   Monolingual runs can be either automatic (no human intervention
   in the process of query development and no changing of system
   structure or parameters after examining the topics) or manual
   (any other human intervention) and should be appropriately
   tagged as such upon submission.

Resources: Links to Web-accessible resources for Arabic information
retrieval and natural language processing are available at
http://www.clis.umd.edu/dlrg/clir/arabic.html.  Participants are
invited to submit additional resources to this list (by email to
oard at glue.umd.edu).

Communications: All communication among participants is conducted
by email.  The track mailing list (xlingual at nist.gov) is open to
anyone with an interest in the track, regardless of whether they plan
to participate in 2001.  To join the list, send email to
listproc at nist.gov with the single line in the body (not the subject)
"subscribe xlingual <FirstName> <LastName>" (note: please send this to
listproc, not to xlingual!).  The track coordinators can help out if
you have trouble subscribing.

Track Meeting: Track results will be discussed at four sessions
during the TREC-2001 meeting in Gaithersburg, MD:

   Track breakout session: (Tuesday, November 13, afternoon) This will
   provide an opportunity for track participants to make brief
   presentations, followed by a panel discussion of lessons learned.

   Plenary session: (time TBA) Presentation of a track summary by the
   organizers and a few presentations by track participants that are
   selected for their potential interest to all conference attendees.

   Poster Session: (time TBA) An opportunity for all track participants
   to present their work in poster form.  A "boaster session"
   will provide an opportunity to introduce the subject of your poster
   to the conference attendees.

   Track Planning Session: (time TBA, near the end of the conference)
   This will provide an opportunity to discuss what has been learned
   and to plan for future CLIR evaluations.

Schedule:

Now            Documents available from the LDC
ASAP           Sign up for TREC-2001 at http://trec.nist.gov
ASAP           Join the xlingual at nist.gov mailing list
June 5         English and Arabic Topics available from NIST
June 15        French Topics available from NIST (earlier if possible)
August  5      Results due to NIST
October 1      Relevance judgments available from NIST
October 1      Scored results returned to participants
November 13-16 TREC-2001 Meeting, Gaithersburg, MD

Track Coordinators:
Fred Gey  (gey at ucdata.berkeley.edu)
Doug Oard (oard at glue.umd.edu)

Date last modified: April 20, 2001
--------------------------------------------------------------------------
End of Arabic-L: 07 May 2001


