[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Sep 22 15:18:50 UTC 2009
/In this newsletter:/
- *Chinese Gigaword Fourth Edition* (LDC2009T27)
- *CSLU: S4X Release 1.2* (LDC2009S03)
- *FactBank 1.0* (LDC2009T23)
- *LDC's Free Resources*
- *Release of XTrans*
------------------------------------------------------------------------
*New Publications*
(1) Chinese Gigaword Fourth Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T27>
is a comprehensive archive of newswire text data that has been acquired
over several years by the LDC. This edition includes all of the contents
in Chinese Gigaword Third Edition (LDC2007T38)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38>
as well as newly collected data. In addition, four entirely new sources
have been added in the fourth edition: Central News Service, Guangming
Daily, People's Liberation Army Daily, and People's Daily.
The eight distinct international sources of Chinese newswire included in
this edition are the following:
* Agence France Presse
* Central News Agency, Taiwan
* Central News Service
* Guangming Daily
* People's Daily
* People's Liberation Army Daily
* Xinhua News Agency
* Zaobao Newspaper
The original data received by the LDC from Agence France Presse (AFP),
People's Liberation Army Daily, Xinhua, and Zaobao were encoded in
GB-2312, those from Central News Agency (CNA) were in Big-5, and those
from Guangming Daily (GMW), Central News Service (CNS), and People's
Daily were in a combination of GB-2312 and GB-18030. To avoid the
problems and confusion
that could result from differences in character-set specifications, all
text files in this corpus have been converted to UTF-8 character encoding.
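As an illustration of the kind of conversion described above, here is a minimal Python sketch that re-encodes legacy Chinese text as UTF-8. It is not the procedure LDC used; the function name `to_utf8` and the sample strings are assumptions made for the example. Because GB-18030 is backward compatible with GB-2312, decoding with the `gb18030` codec is one way to handle files that mix the two.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    source_encoding is a Python codec name such as "gb2312", "gb18030"
    or "big5". Undecodable bytes raise UnicodeDecodeError rather than
    being silently replaced.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample data, not taken from the corpus.
gb_sample = "新华社".encode("gb2312")
big5_sample = "中央社".encode("big5")

# GB-2312 bytes decode correctly under the gb18030 codec.
gb_as_utf8 = to_utf8(gb_sample, "gb18030")
big5_as_utf8 = to_utf8(big5_sample, "big5")

print(gb_as_utf8.decode("utf-8"))
print(big5_as_utf8.decode("utf-8"))
```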
New in the Fourth Edition:
* Two years' worth of new articles (January 2007 through December
2008) have been added to the Xinhua, Agence France Presse, and CNA
data sets.
* Four new data sources have been added - Guangming Daily, Central
News Service, People's Daily, and People's Liberation Army Daily,
covering a timespan from November 2006 through December 2008.
(2) CSLU: S4X Release 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S03>
was created by the Center for Spoken Language Understanding, Oregon
Health and Science University (CSLU). The corpus consists of 36 speakers
(22 male, 14 female) uttering 11 specified words. The speakers repeated
the following words six times on each of four channels: startrek,
supernova, tektronix, generation, nebula, processing, singularity,
71523, abracadabra, sungeeta and computer. The four channels used were
office phone, home phone, carbon microphone telephone and speaker phone.
Each speech file has a corresponding time-aligned phoneme-level
transcription (achieved using automatic forced alignment) and an
automatically generated word-level transcription. Humans reviewed each
utterance in two passes and classified it as good, bad, noisy or different.
The data was recorded with the CSLU T1 digital data collection system.
Each utterance is recorded as a separate file. These files were sampled
at 8 kHz, 8-bit and stored as ulaw files. All of the data use the RIFF
standard file format. This file format is 16-bit linearly encoded.
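For readers unfamiliar with how 8-bit ulaw samples relate to 16-bit linear samples, the following Python sketch shows the standard G.711 mu-law expansion of a single byte. It is a generic illustration, not CSLU's tooling, and it makes no assumption about how the files in this release are actually laid out; the function name `ulaw_to_linear` is hypothetical.

```python
BIAS = 0x84  # bias constant used by G.711 mu-law companding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code (0..255) to a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list:
    """Expand a whole buffer of mu-law bytes to a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


samples = decode_ulaw(bytes([0xFF, 0x80, 0x00]))
print(samples)
```

Each input byte yields one sample in the signed 16-bit range, so one second of 8 kHz audio (8000 bytes) expands to 8000 linear samples.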
(3) FactBank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T23>
consists of 208 documents (over 77,000 tokens) from newswire and
broadcast news reports in which event mentions are annotated with their
degree of factuality, that is, the degree to which they correspond to
those events. FactBank 1.0 was built on top of TimeBank 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>
and a fragment of the AQUAINT TimeML Corpus
<http://www.timeml.org/site/timebank/timebank.html>, both of which used
the TimeML specification language. This resulted in a double-layered
annotation of event factuality. TimeBank 1.2 and AQUAINT TimeML encode
most of the basic structural elements expressing factuality information
while FactBank 1.0 represents the resulting factuality interpretation.
The combination of the factuality values in FactBank with the structural
information in TimeML-annotated corpora facilitates the development of
tools aimed at automatically identifying the factuality values of
events, a component fundamental in tasks requiring some degree of text
understanding, such as Textual Entailment, Question Answering, or
Narrative Understanding.
FactBank annotations indicate whether the event mention describes actual
situations in the world, situations that have not happened, or
situations of uncertain interpretation. Event factuality is not an
inherent feature of events but a matter of perspective. Different
discourse participants may present divergent views about the factuality
of the very same event. Consequently, in FactBank, the factuality degree
of events is assigned relative to the relevant sources at play. In this
way, it can adequately reflect the divergence of opinions regarding the
factual status of events, as is common in news reports.
All FactBank markup is standoff and is represented through a set of 20
tables which can be easily loaded into a database. Each table resides in
an independent text file, where fields are separated by three
consecutive bars (i.e., |||). The data in fields of string type are
presented between single quotation marks ('). Because FactBank 1.0 was built
on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up
with inline XML-based annotation, this release contains the TimeBank 1.2
and AQUAINT TimeML annotation in standoff, table-based format as well.
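A row in the table format described above (fields separated by |||, string fields wrapped in single quotation marks) could be read with a sketch like the one below. The sample row and its column meanings are invented for illustration and are not taken from the FactBank release. The sketch also assumes that no string field contains the ||| separator itself, and it leaves non-string fields as raw text.

```python
SEPARATOR = "|||"


def parse_row(line: str) -> list:
    """Split one table row on '|||' and strip quotes from string fields."""
    fields = []
    for raw in line.rstrip("\r\n").split(SEPARATOR):
        if len(raw) >= 2 and raw.startswith("'") and raw.endswith("'"):
            fields.append(raw[1:-1])   # quoted string field: drop the quotes
        else:
            fields.append(raw)         # unquoted field (e.g. a number): keep as text
    return fields


# Hypothetical row: a file name, a sentence number, and a token.
example = parse_row("'example_file.tml'|||3|||'said'\n")
print(example)
```

Rows parsed this way can be passed to a database driver's bulk-insert call once each column has been converted to its intended type.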
Non-members may license this data by completing the LDC User Agreement
for Non-members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this address. The collection is being made available at no charge.
*LDC's Free Resources*
LDC is pleased to distribute FactBank 1.0, which is available at no
cost. To license a copy of this data, non-members should complete the
LDC User Agreement for Non-members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>
and fax to +1 215 573 2175 or scan and email to this address. FactBank
joins a host of LDC resources which are available for free. These
resources include tools and corpora developed at LDC as well as corpora
made available through LDC's strong network of data providers.
Since LDC's founding, we have distributed over 1300 copies of corpora at
no cost including:
* over 700 non-member downloads of Buckwalter Arabic Morphological
Analyzer 1.0
* 400 copies of TalkBank-sponsored data including popular releases
such as the American National Corpus and the Santa Barbara Corpora
of Spoken American English
* nearly 200 copies of Web 1T 5-gram Version 1, sponsored by Google Inc.
* over 30 copies of TimeBank 1.2
* over a dozen copies of the corpora developed for the Unified
Linguistic Annotation (ULA) project
For further information, visit our What's New! What's Free! Archive
<http://www.ldc.upenn.edu/About/whatsnew.shtml>.
*Release of XTrans*
At InterSpeech 2009 <http://www.interspeech2009.org/>, LDC introduced
XTrans <http://www.ldc.upenn.edu/tools/XTrans/>, a new tool for manual
transcription and annotation of audio recordings. XTrans is a
next-generation transcription tool that is designed to support transcription
tasks in multiple languages on multiple platforms. XTrans provides a
flexible and intuitive graphical user interface for a multitude of
speech annotation tasks including (virtual) segmentation of audio into
smaller units like turns and sentences; speaker identification;
orthographic transcription in any language; and labeling of structural
elements of the transcript like topics. Its versatile and powerful
waveform display/playback component can load multiple audio files of
different file formats and sampling rates at the same time. LDC and its
partners have used XTrans to generate over 3500 hours of time-aligned
verbatim transcripts in a variety of genres and languages.
With an intuitive interface, user configurability and embedded QC
functions, XTrans is optimized for high-quality, high-volume
transcription tasks involving real-world data. XTrans successfully
addresses the challenges of real-world data, including transcribing
multiple speakers in a single channel through Virtual Speaker Channel,
which enables an unlimited number of distinct speakers to be associated
with the same audio channel. Furthermore, XTrans allows transcribers to
open an effectively unlimited number of audio files for simultaneous
transcription. Transcribers can switch focus between one, two or
multiple speakers as needed. XTrans also provides strong multilingual
support, with bidirectional text input for languages like Arabic, Farsi,
Urdu, and Hebrew.
Realtime transcription rates have improved dramatically in LDC projects
using XTrans, with rates for some tasks cut by as much as half. XTrans
also brings key quality control functions directly into the interface,
giving transcribers the power to improve the quality of their own work.
XTrans components are written in Python and C++, utilizing LDC's QWave
waveform display module. Even with very large files or multiple
recordings, XTrans provides users with fast display and playback
capabilities. A range of audio formats is supported, including .sph,
.wav, .aiff, .flac, and .ogg. Transcripts are output in a Tab Delimited
Format (TDF), which is easily converted to other common formats and is
readily usable by downstream manual and automatic annotation tasks.
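Because the output is tab-delimited, downstream scripts can read it with very little code. The Python sketch below is one possible reader. The column names are supplied by the caller and the ones shown are hypothetical, not the actual TDF column set. Skipping lines that begin with ";;" is likewise an assumption about how comment lines might be marked, not something stated in this announcement.

```python
def read_tdf(text: str, columns: list) -> list:
    """Turn tab-delimited transcript text into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and (assumed) ';;' comment lines.
        if not line.strip() or line.startswith(";;"):
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


# Hypothetical column names and sample content, for illustration only.
COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]
sample = (
    ";; hypothetical comment line\n"
    "a.wav\t0\t1.25\t3.50\tspk1\thello there\n"
)

rows = read_tdf(sample, COLUMNS)
duration = float(rows[0]["end"]) - float(rows[0]["start"])
print(rows[0]["speaker"], duration)
```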
Availability:
XTrans for Linux and Windows platforms is available from the LDC at no
cost under GPLv3 and can be downloaded here
<http://www.ldc.upenn.edu/tools/XTrans/downloads/>.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu