[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Nov 24 16:56:53 UTC 2009
- LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010
- 2007 NIST Language Recognition Evaluation Supplemental Training Set
- French Gigaword Second Edition
- NXT Switchboard Annotations
------------------------------------------------------------------------
LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010
We would like to invite all current and previous members of LDC to
renew, as well as new members to join, for Membership Year (MY) 2010.
For MY2010, LDC is pleased to maintain membership fees at last year's
rates -- membership fees will not increase. Additionally, in last
month's newsletter, we announced an LDC Incentives Package which will
include a host of incentives to help lower the cost of LDC membership
and data licensing fees. As part of this package, LDC will extend
discounts to members who keep their membership current and who join
early in the year.
The details of our *Early Renewal Discounts*
<http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1>
for MY2010 are as follows:
  * Organizations that joined for MY2009 will receive a 5% discount
    when renewing. This discount will apply throughout 2010,
    regardless of the time of renewal. MY2009 members renewing before
    March 1, 2010 will receive an additional 5% discount, for a total
    10% discount off the membership fee.
  * New members, as well as organizations that did not join for MY2009
    but held membership in any previous membership year (1993-2008),
    will also be eligible for a 5% discount, provided that they
    join or renew before March 1, 2010.
The Membership Fee Table below provides exact pricing information; a
short arithmetic sketch follows the table.
                      MY2010 Fee    MY2010 Fee          MY2010 Fee
                                    with 5% Discount    with 10% Discount
Not-for-Profit
  Standard            US$2400       US$2280             US$2160
  Subscription        US$3850       US$3657.50          US$3465
For-Profit
  Standard            US$24000      US$22800            US$21600
  Subscription        US$27500      US$26125            US$24750
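As a quick, purely illustrative check of the discount arithmetic (the
snippet below is not an LDC tool; the labels and layout are invented for
this sketch), note that the early-renewal discount is a flat 10% off the
base fee rather than two compounded 5% reductions:

    # Illustrative check of the MY2010 discount arithmetic described above.
    # The fee figures come from the table; everything else is just for this sketch.
    base_fees = {
        "Not-for-Profit Standard": 2400,
        "Not-for-Profit Subscription": 3850,
        "For-Profit Standard": 24000,
        "For-Profit Subscription": 27500,
    }

    for name, fee in base_fees.items():
        five_pct = fee * 0.95   # renewal at any point in 2010
        ten_pct = fee * 0.90    # MY2009 member renewing before March 1, 2010
        print(f"{name}: US${fee} -> 5%: US${five_pct:g}, 10%: US${ten_pct:g}")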
Publications for MY2010 are still being planned, but it will be another
productive year with a broad selection of releases. The working titles
of data sets we intend to provide include:
  * Arabic Treebank: Part 2 v 4.0
  * Fisher Spanish
  * Chinese Treebank 7.0
  * LCTL Bengali
  * Chinese Web N-gram Version 1.0
  * NPS Chat Corpus
In addition to receiving new publications, current year members of the
LDC also enjoy the benefit of licensing older data at reduced costs;
current year for-profit members may use most data for commercial
applications.
This past year, nearly 100 organizations that renewed membership or
joined early received a discount on membership fees for MY2009. Taken
together, these members saved over US$50,000! Be sure to keep an eye on
your mail: all LDC members have been sent an invitation letter and a
renewal invoice for MY2010. Renew early for MY2010 and save today!
New Publications
(1) 2007 NIST Language Recognition Evaluation Supplemental Training Set
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S05>
consists of 118 hours of conversational telephone speech segments in the
following languages and dialects: Arabic (Egyptian colloquial), Bengali,
Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian,
Mexican Spanish, Thai, Urdu and Tamil.
The goal of the NIST (National Institute of Standards and Technology)
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE)
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the
baseline of current performance capability for language recognition of
conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted three previous language
recognition evaluations, in 1996
<http://www.itl.nist.gov/iad/mig/tests/lre/1996/>, 2003
<http://www.itl.nist.gov/iad/mig/tests/lre/2003/> and 2005
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/>. The most significant
differences between those evaluations and the 2007 task were the
increased number of languages and dialects, the greater emphasis on a
basic detection task for evaluation and the variety of evaluation
conditions. Thus, in 2007, given a segment of speech and a language of
interest to be detected (i.e., a target language), the task was to
decide whether that target language was in fact spoken in the given
telephone speech segment (yes or no), based on an automated analysis of
the data contained in the segment.
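As a minimal sketch of what a single detection trial looks like (this is
not NIST's scoring software; the function name, score and threshold
below are hypothetical), the yes/no decision can be modeled as
thresholding an automated detector's score for the target language:

    def detect_language(segment_score: float, threshold: float = 0.0) -> bool:
        """Toy yes/no decision for one trial: given an automated detector's
        score for the target language on a speech segment (e.g., a
        log-likelihood ratio), decide whether the target language was
        spoken.  Score and threshold are hypothetical stand-ins."""
        return segment_score > threshold

    # Hypothetical detector scores for one target language on two segments.
    print(detect_language(1.7))    # True  -> "yes", target language detected
    print(detect_language(-0.4))   # False -> "no"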
The supplemental training material in this release consists of the
following:
* Approximately 53 hours of conversational telephone speech segments
in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan
Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken
from LDC's CALLHOME, CALLFRIEND and Mixer collections.
* Approximately 65 hours of full telephone conversations in Mandarin
Chinese (Taiwan), Spanish (Mexican) and Tamil. This material was
collected by Oregon Health and Science University (OHSU),
Beaverton, Oregon. The test segments used in the 2005 NIST
Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>
were derived from these full conversations.
In addition to the supplemental material contained in this release, the
training data for the 2007 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S04>
consisted of data from previous LRE evaluation test sets, namely, 2003
NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>
and 2005 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S05>.
(2) French Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T28>
is a comprehensive archive of newswire text data that has been acquired
over several years by LDC. This second edition updates French Gigaword
First Edition (LDC2006T17)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T17>
and adds material collected from August 1, 2006 through December 31, 2008.
The two distinct international sources of French newswire in this
edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse (afp_fre) May 1994 - Dec 2008
* Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008
The seven-letter codes in parentheses include the three-character source
name abbreviations and the three-character language code ("fre")
separated by an underscore ("_") character. The three-letter language
code conforms to LDC's internal convention based on the ISO 639-3
standard. These codes are used in the directory names where the data
files are found and in the prefix that appears at the beginning of every
data file name. They are also used (in all UPPER CASE) as the initial
portion of the DOC "id" strings that uniquely identify each news story.
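A minimal sketch of how such a code could be handled follows; the helper
function and its return format are invented for illustration and are not
part of the corpus:

    def parse_source_code(code: str) -> dict:
        """Split a seven-letter code such as 'afp_fre' into its three-character
        source abbreviation and three-character language code.  The function
        and its output layout are illustrative only."""
        source, language = code.split("_")
        return {
            "source": source,               # e.g. 'afp'
            "language": language,           # e.g. 'fre' (LDC code based on ISO 639-3)
            "doc_id_prefix": code.upper(),  # initial portion of DOC "id" strings
        }

    print(parse_source_code("afp_fre"))
    # {'source': 'afp', 'language': 'fre', 'doc_id_prefix': 'AFP_FRE'}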
The overall totals for each source are summarized below. The "Totl-MB"
numbers show the amount of data obtained when the files are uncompressed
(i.e., approximately 5.8 gigabytes, total); the "Gzip-MB" column shows
totals for compressed file sizes as stored on the DVD-ROM; and the
"K-wrds" numbers are the number of whitespace-separated tokens (of all
types) after all SGML tags are eliminated (a small token-counting sketch
follows the table).
Source     #Files   Gzip-MB   Totl-MB    K-wrds     #DOCs
AFP_FRE       172      2408      4079    560000   2060803
APW_FRE       171      2280      1719    241324    872573
TOTAL         343      4688      5789    801324   2933376
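As a rough sketch of the counting convention just described (not LDC's
own tooling; the tag-stripping regex and the sample document are
illustrative), "K-wrds" amounts to removing SGML tags and counting
whitespace-separated tokens, in thousands:

    import re

    def count_kwords(sgml_text: str) -> float:
        """Count whitespace-separated tokens (of all types) after removing
        SGML tags, reported in thousands of words ('K-wrds').  A sketch of
        the convention described above, not LDC's own script."""
        text = re.sub(r"<[^>]*>", " ", sgml_text)  # drop all SGML tags
        return len(text.split()) / 1000.0          # whitespace tokenization

    # Hypothetical miniature document, for illustration only.
    sample = '<DOC id="AFP_FRE_19940501.0001"><TEXT><P>Bonjour le monde .</P></TEXT></DOC>'
    print(count_kwords(sample))  # 0.004 (4 tokens)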
The data has undergone a consistent level of quality control to
eliminate out-of-band content and other obvious forms of corruption.
Since the source data is generated manually on a daily basis, a small
percentage of human errors common to all sources remains: missing
whitespace, incorrect or variant spellings, badly formed sentences, and
so on, as is normally seen in newspapers. No attempt has been made to
address this property of the data.
(3) NXT Switchboard Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T26>
brings together, in NITE XML <http://groups.inf.ed.ac.uk/nxt/>, a single
XML format, the multiple layers of annotation performed on a transcript
subset of Switchboard-1 Release 2 (LDC97S62)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62>.
NXT Switchboard Annotations was developed in a collaboration among
researchers from the University of Edinburgh, Stanford University and
the University of Washington.
The original Switchboard corpus is a collection of spontaneous telephone
conversations between previously unacquainted speakers of American
English on a variety of topics chosen from a pre-determined list. A
subset of one million words from those conversations was annotated for
syntactic structure and disfluencies as part of the Penn Treebank
project <http://www.cis.upenn.edu/%7Etreebank/>. Phonetic transcripts
were generated by the International Computer Science Institute
<http://www.icsi.berkeley.edu/>, University of California Berkeley and
later corrected by the Institute for Signal and Information Processing,
Mississippi State University. The Penn Treebank transcripts provided the
basis for the NXT Switchboard corpus, and the noun phrases from that
subset were annotated for animacy. The Treebank transcript was then
aligned with the corresponding subset from the corrected Mississippi
State (MS-State) transcript
<http://www.isip.piconepress.com/projects/switchboard/> in order to
provide word timing information. Focus/contrast and prosodic
annotations, as well as phone/syllable alignments, were added next. The
previous annotations of dialog acts and prosody were
converted to NITE XML. Lastly, hand annotations for markables were added
to provide information about their animacy and information structure,
including coreferential links.
NXT is an open source toolkit that enables multiple linguistic
annotations to be assembled into a unified database. It uses a stand-off
XML data format that consists of several XML files that point to each
other. The NXT format provides a data model that describes how the
various annotations for a corpus relate to one another. For that reason,
it does not impose any particular linguistic theory or any particular
markup structure. Instead, users define their annotations in a
"metadata" file that expresses their contents and how they relate to
each other in terms of the graph structure for the corpus annotations
overall. The relationships that can be defined in the data model draw
annotations together into a set of intersecting trees, but also allow
arbitrary links between annotations over the top of this structure,
giving a representation that is highly expressive, easier to process
than arbitrary graphs and structured in a way that helps data users.
NXT's other core component is a query language designed specifically for
working with data conforming to this data model. Together, the data
model and query language allow annotations to be treated as one coherent
set containing both structural and timing information.
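A purely illustrative sketch of the stand-off idea follows: one XML file
holds timed words, and a second file points into it by id. The element
and attribute names are invented for this example and do not reproduce
the actual NXT Switchboard schema or its query language:

    import xml.etree.ElementTree as ET

    # Invented stand-off layers: a word layer with timings, and a syntax
    # layer that references words by id instead of containing them.
    words_xml = """<words>
      <word id="w1" start="0.10" end="0.32">so</word>
      <word id="w2" start="0.32" end="0.55">anyway</word>
    </words>"""

    syntax_xml = """<phrases>
      <phrase id="p1" type="INTJ" children="w1 w2"/>
    </phrases>"""

    words = {w.get("id"): w for w in ET.fromstring(words_xml)}

    # Resolve each phrase's pointers back to the word layer, recovering
    # both structural and timing information from the combined layers.
    for phrase in ET.fromstring(syntax_xml):
        children = [words[i] for i in phrase.get("children").split()]
        text = " ".join(w.text for w in children)
        start, end = children[0].get("start"), children[-1].get("end")
        print(phrase.get("type"), text, start, end)  # INTJ so anyway 0.10 0.55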
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu