[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Nov 24 16:56:53 UTC 2009
- LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010
- 2007 NIST Language Recognition Evaluation Supplemental Training Set
- French Gigaword Second Edition
- NXT Switchboard Annotations
------------------------------------------------------------------------
LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010
We would like to invite all current and previous members of LDC to
renew, as well as new members to join, for Membership Year (MY) 2010.
For MY2010, LDC is pleased to maintain membership fees at last year's
rates -- membership fees will not increase. Additionally, in last
month's newsletter, we announced an LDC Incentives Package which will
include a host of incentives to help lower the cost of LDC membership
and data licensing fees. As part of this package, LDC will extend
discounts to members who keep their membership current and who join
early in the year.
The details of our *Early Renewal Discounts*
<http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1>
for MY2010 are as follows:
  * Organizations that joined for MY2009 will receive a 5% discount
    when renewing. This discount will apply throughout 2010,
    regardless of the time of renewal. MY2009 members renewing before
    March 1, 2010 will receive an additional 5% discount, for a total
    10% discount off the membership fee.
  * New members, as well as organizations that did not join for MY2009
    but held membership in any previous membership year (1993-2008),
    will also be eligible for a 5% discount, provided that they
    join or renew before March 1, 2010.
The Membership Fee Table below provides exact pricing information; a
short arithmetic sketch follows the table.
                      MY2010 Fee    MY2010 Fee          MY2010 Fee
                                    with 5% Discount    with 10% Discount
Not-for-Profit
  Standard            US$2400       US$2280             US$2160
  Subscription        US$3850       US$3657.50          US$3465
For-Profit
  Standard            US$24000      US$22800            US$21600
  Subscription        US$27500      US$26125            US$24750
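As a quick, purely illustrative check of the discount arithmetic (the
snippet below is not an LDC tool; the labels and layout are invented for
this sketch), note that the early-renewal discount is a flat 10% off the
base fee rather than two compounded 5% reductions:

    # Illustrative check of the MY2010 discount arithmetic described above.
    # The fee figures come from the table; everything else is just for this sketch.
    base_fees = {
        "Not-for-Profit Standard": 2400,
        "Not-for-Profit Subscription": 3850,
        "For-Profit Standard": 24000,
        "For-Profit Subscription": 27500,
    }

    for name, fee in base_fees.items():
        five_pct = fee * 0.95   # renewal at any point in 2010
        ten_pct = fee * 0.90    # MY2009 member renewing before March 1, 2010
        print(f"{name}: US${fee} -> 5%: US${five_pct:g}, 10%: US${ten_pct:g}")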
Publications for MY2010 are still being planned, but it will be another
productive year with a broad selection of releases. The working titles
of data sets we intend to provide include:
  * Arabic Treebank: Part 2 v 4.0
  * Fisher Spanish
  * Chinese Treebank 7.0
  * LCTL Bengali
  * Chinese Web N-gram Version 1.0
  * NPS Chat Corpus
In addition to receiving new publications, current year members of the
LDC also enjoy the benefit of licensing older data at reduced costs;
current year for-profit members may use most data for commercial
applications.
This past year, nearly 100 organizations that renewed membership or
joined early received a discount on membership fees for MY2009. Taken
together, these members saved over US$50,000! Be sure to keep an eye on
your mail: all LDC members have been sent an invitation letter and a
renewal invoice for MY2010. Renew early for MY2010 and save today!
New Publications
(1) 2007 NIST Language Recognition Evaluation Supplemental Training Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S05>
consists of 118 hours of conversational telephone speech segments in the
following languages and dialects: Arabic (Egyptian colloquial), Bengali,
Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian,
Mexican Spanish, Thai, Urdu and Tamil.
The goal of the NIST (National Institute of Standards and Technology)
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE)
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the
baseline of current performance capability for language recognition of
conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted three previous language
recognition evaluations, in 1996
<http://www.itl.nist.gov/iad/mig/tests/lre/1996/>, 2003
<http://www.itl.nist.gov/iad/mig/tests/lre/2003/> and 2005
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/>. The most significant
differences between those evaluations and the 2007 task were the
increased number of languages and dialects, the greater emphasis on a
basic detection task for evaluation and the variety of evaluation
conditions. Thus, in 2007, given a segment of speech and a language of
interest to be detected (i.e., a target language), the task was to
decide whether that target language was in fact spoken in the given
telephone speech segment (yes or no), based on an automated analysis of
the data contained in the segment.
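As a minimal sketch of what a single detection trial looks like (this is
not NIST's scoring software; the function name, score and threshold
below are hypothetical), the yes/no decision can be modeled as
thresholding an automated detector's score for the target language:

    def detect_language(segment_score: float, threshold: float = 0.0) -> bool:
        """Toy yes/no decision for one trial: given an automated detector's
        score for the target language on a speech segment (e.g., a
        log-likelihood ratio), decide whether the target language was
        spoken.  Score and threshold are hypothetical stand-ins."""
        return segment_score > threshold

    # Hypothetical detector scores for one target language on two segments.
    print(detect_language(1.7))    # True  -> "yes", target language detected
    print(detect_language(-0.4))   # False -> "no"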
The supplemental training material in this release consists of the
following:
* Approximately 53 hours of conversational telephone speech segments
in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan
Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken
from LDC's CALLHOME, CALLFRIEND and Mixer collections.
* Approximately 65 hours of full telephone conversations in Mandarin
Chinese (Taiwan), Spanish (Mexican) and Tamil. This material was
collected by Oregon Health and Science University (OHSU),
Beaverton, Oregon. The test segments used in the 2005 NIST
Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>
were derived from these full conversations.
In addition to the supplemental material contained in this release, the
training data for the 2007 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S04>
consisted of data from previous LRE evaluation test sets, namely, 2003
NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>
and 2005 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>.
(2) French Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T28>
is a comprehensive archive of newswire text data that has been acquired
over several years by LDC. This second edition updates French Gigaword
First Edition (LDC2006T17)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17>
and adds material collected from August 1, 2006 through December 31, 2008.
The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse (afp_fre) May 1994 - Dec 2008
* Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008
The seven-letter codes in parentheses include the three-character source
name abbreviations and the three-character language code ("fre")
separated by an underscore ("_") character. The three-letter language
code conforms to LDC's internal convention based on the ISO 639-3
standard. These codes are used in the directory names where the data
files are found and in the prefix that appears at the beginning of every
data file name. They are also used (in all UPPER CASE) as the initial
portion of the DOC "id" strings that uniquely identify each news story.
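A minimal sketch of how such a code could be handled follows; the helper
function and its return format are invented for illustration and are not
part of the corpus:

    def parse_source_code(code: str) -> dict:
        """Split a seven-letter code such as 'afp_fre' into its three-character
        source abbreviation and three-character language code.  The function
        and its output layout are illustrative only."""
        source, language = code.split("_")
        return {
            "source": source,               # e.g. 'afp'
            "language": language,           # e.g. 'fre' (LDC code based on ISO 639-3)
            "doc_id_prefix": code.upper(),  # initial portion of DOC "id" strings
        }

    print(parse_source_code("afp_fre"))
    # {'source': 'afp', 'language': 'fre', 'doc_id_prefix': 'AFP_FRE'}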
The overall totals for each source are summarized below. The "Totl-MB"
numbers show the amount of data obtained when the files are uncompressed
(i.e., approximately 5.8 gigabytes, total); the "Gzip-MB" column shows
totals for compressed file sizes as stored on the DVD-ROM; and the
"K-wrds" numbers are the number of whitespace-separated tokens (of all
types) after all SGML tags are eliminated (a small token-counting sketch
follows the table).
Source     #Files   Gzip-MB   Totl-MB    K-wrds     #DOCs
AFP_FRE       172      2408      4079    560000   2060803
APW_FRE       171      2280      1719    241324    872573
TOTAL         343      4688      5789    801324   2933376
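As a rough sketch of the counting convention just described (not LDC's
own tooling; the tag-stripping regex and the sample document are
illustrative), "K-wrds" amounts to removing SGML tags and counting
whitespace-separated tokens, in thousands:

    import re

    def count_kwords(sgml_text: str) -> float:
        """Count whitespace-separated tokens (of all types) after removing
        SGML tags, reported in thousands of words ('K-wrds').  A sketch of
        the convention described above, not LDC's own script."""
        text = re.sub(r"<[^>]*>", " ", sgml_text)  # drop all SGML tags
        return len(text.split()) / 1000.0          # whitespace tokenization

    # Hypothetical miniature document, for illustration only.
    sample = '<DOC id="AFP_FRE_19940501.0001"><TEXT><P>Bonjour le monde .</P></TEXT></DOC>'
    print(count_kwords(sample))  # 0.004 (4 tokens)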
The data has undergone a consistent level of quality control to
eliminate out-of-band content and other obvious forms of corruption.
Since the source data is generated manually on a daily basis, a small
percentage of human errors common to all sources remains: missing
whitespace, incorrect or variant spellings, badly formed sentences, and
so on, as is normally seen in newspapers. No attempt has been made to
address this property of the data.
(3) NXT Switchboard Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T26>
brings together, in NITE XML <http://groups.inf.ed.ac.uk/nxt/>, a single
XML format, the multiple layers of annotation performed on a transcript
subset of Switchboard-1 Release 2 (LDC97S62)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62>.
NXT Switchboard Annotations was developed in a collaboration among
researchers from the University of Edinburgh, Stanford University and
the University of Washington.
The original Switchboard corpus is a collection of spontaneous telephone
conversations between previously unacquainted speakers of American
English on a variety of topics chosen from a pre-determined list. A
subset of one million words from those conversations was annotated for
syntactic structure and disfluencies as part of the Penn Treebank
project <http://www.cis.upenn.edu/%7Etreebank/>. Phonetic transcripts
were generated by the International Computer Science Institute
<http://www.icsi.berkeley.edu/>, University of California Berkeley and
later corrected by the Institute for Signal and Information Processing,
Mississippi State University. The Penn Treebank transcripts provided the
basis for the NXT Switchboard corpus, and the noun phrases from that
subset were annotated for animacy. The Treebank transcript was then
aligned with the corresponding subset from the corrected Mississippi
State (MS-State) transcript
<http://www.isip.piconepress.com/projects/switchboard/> in order to
provide word timing information. Focus/contrast and prosodic
annotations, as well as phone/syllable alignments, were added next. The
previous annotations of dialog acts and prosody were
converted to NITE XML. Lastly, hand annotations for markables were added
to provide information about their animacy and information structure,
including coreferential links.
NXT is an open source toolkit that enables multiple linguistic
annotations to be assembled into a unified database. It uses a stand-off
XML data format that consists of several XML files that point to each
other. The NXT format provides a data model that describes how the
various annotations for a corpus relate to one another. For that reason,
it does not impose any particular linguistic theory or any particular
markup structure. Instead, users define their annotations in a
"metadata" file that expresses their contents and how they relate to
each other in terms of the graph structure for the corpus annotations
overall. The relationships that can be defined in the data model draw
annotations together into a set of intersecting trees, but also allow
arbitrary links between annotations over the top of this structure,
giving a representation that is highly expressive, easier to process
than arbitrary graphs and structured in a way that helps data users.
NXT's other core component is a query language designed specifically for
working with data conforming to this data model. Together, the data
model and query language allow annotations to be treated as one coherent
set containing both structural and timing information.
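A purely illustrative sketch of the stand-off idea follows: one XML file
holds timed words, and a second file points into it by id. The element
and attribute names are invented for this example and do not reproduce
the actual NXT Switchboard schema or its query language:

    import xml.etree.ElementTree as ET

    # Invented stand-off layers: a word layer with timings, and a syntax
    # layer that references words by id instead of containing them.
    words_xml = """<words>
      <word id="w1" start="0.10" end="0.32">so</word>
      <word id="w2" start="0.32" end="0.55">anyway</word>
    </words>"""

    syntax_xml = """<phrases>
      <phrase id="p1" type="INTJ" children="w1 w2"/>
    </phrases>"""

    words = {w.get("id"): w for w in ET.fromstring(words_xml)}

    # Resolve each phrase's pointers back to the word layer, recovering
    # both structural and timing information from the combined layers.
    for phrase in ET.fromstring(syntax_xml):
        children = [words[i] for i in phrase.get("children").split()]
        text = " ".join(w.text for w in children)
        start, end = children[0].get("start"), children[-1].get("end")
        print(phrase.get("type"), text, start, end)  # INTJ so anyway 0.10 0.55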
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu