[Corpora-List] Recent LDC Corpora

Thu Aug 7 19:44:58 UTC 2003

                              LDC2003T12

                       *   Arabic Gigaword   *

                          *   LDC2003V01   *

                   *   FORM2 Kinematic Gesture   *

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of two new releases.

1.  Arabic Gigaword is a comprehensive archive of newswire text data
that has been acquired from Arabic news sources by the LDC.  The
newswire texts are drawn from four sources:

   Agence France Presse (afp)
   Al Hayat News Agency (alh)
   An Nahar News Agency (ann)
   Xinhua News Agency (xin)

Much of the Agence France Presse content in this collection has been
published previously by the LDC in Arabic Newswire Part 1 (LDC2001T55).
The entire Al Hayat, An Nahar and Xinhua Arabic content, as well as
AFP content for 2001-2002, is previously unreleased material.

Arabic Gigaword consists of 319 files, totaling approximately 1.1 GB in
compressed form (4348 MB uncompressed, and 391619 Kwords).  All text
files in the corpus have been converted to UTF-8 character encoding.  Arabic
Gigaword is distributed on DVD.
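
As a rough sanity check on the figures above, the short Python sketch
below derives a few per-file and per-word averages.  It is only an
illustration: the variable names are hypothetical, and it assumes
binary units (1 GB = 1024 MB, 1 MB = 1024 * 1024 bytes), which the
announcement does not specify.

```python
# Figures quoted in the announcement.
NUM_FILES = 319
COMPRESSED_GB = 1.1       # approximate, as stated
UNCOMPRESSED_MB = 4348
KWORDS = 391619           # thousands of words

# Assumption (not stated in the announcement): binary units.
MB_PER_GB = 1024
BYTES_PER_MB = 1024 * 1024

# Ratio of uncompressed to compressed size.
compression_ratio = UNCOMPRESSED_MB / (COMPRESSED_GB * MB_PER_GB)

# Average uncompressed size of one file, in MB.
mb_per_file = UNCOMPRESSED_MB / NUM_FILES

# Average uncompressed bytes per word, counting everything the files
# contain (text, whitespace, and any markup).
bytes_per_word = (UNCOMPRESSED_MB * BYTES_PER_MB) / (KWORDS * 1000)

print(f"compression ratio: {compression_ratio:.2f}")
print(f"MB per file:       {mb_per_file:.2f}")
print(f"bytes per word:    {bytes_per_word:.2f}")
```

The bytes-per-word average is larger than it would be for a comparable
ASCII corpus partly because each Arabic letter occupies two bytes in
UTF-8.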

For further information, including a link to online documentation,
please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T12

Institutions that have membership in the LDC during the 2003
Membership Year will be able to receive this corpus free of charge.
Nonmembers may license this publication for $2,500.

                                 *

2.  FORM is a gesture annotation scheme designed to capture the
kinematic information in gesture from videos of speakers. FORM2
Kinematic Gesture is a detailed database of gesture-annotated videos
stored in the Anvil and FORM file formats. FORM encodes the "phonetics"
of gesture by giving geometric descriptions of location and movement of
the right and left arms. Other kinematic information, such as effort and
shape, is also recorded.

FORM2 Kinematic Gesture contains a total of 24 data files: 8 movie
files, 8 Anvil files, and 8 FORM files.  The movie files represent 12
minutes of audio and video recordings excerpted from a lecture given by
Brian MacWhinney on January 24, 2000 at Carnegie Mellon University.
These video recordings were chosen because they are part of the
NSF-funded TalkBank project.

For further information, including a link to the FORM website and online
documentation, please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01

The cost of the first 50 copies of this publication (not including the
copies distributed to LDC members) is covered by sponsoring grants.
These copies are, therefore, free of charge to qualified researchers;
a $30 shipping and handling fee applies. After these first 50 copies
are distributed, additional copies will be available for the production
cost of $500 per CD-ROM.

                                 *

If you need additional information before placing your order, or
would like to inquire about membership in the LDC, please send email to
<ldc at ldc.upenn.edu> or call (215) 573-1275.

---------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax:   (215) 573-2175
Suite 810                           email: ldc at ldc.upenn.edu
Philadelphia, PA 19104-2653         www: http://www.ldc.upenn.edu