[Corpora-List] News from LDC

Tue Sep 22 15:18:50 UTC 2009

/In this newsletter:/

LDC2009T27
- *Chinese Gigaword Fourth Edition* <#ChGig4thEd> -

LDC2009S03
- *CSLU: S4X Release 1.2 <#CSLUS4X>* -

LDC2009T23
- *FactBank 1.0 <#Fact>* -

- *LDC's Free Resources <#Free>* - 

- *Release of XTrans <#XTrans>* -

------------------------------------------------------------------------

*New Publications*

(1) Chinese Gigaword Fourth Edition 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T27> 
is a comprehensive archive of newswire text data that has been acquired 
over several years by the LDC. This edition includes all of the contents 
in Chinese Gigaword Third Edition (LDC2007T38) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38> 
as well as newly collected data. In addition, four entirely new sources 
have been added in the fourth edition: Central News Service, Guangming 
Daily, People's Liberation Army Daily, and People's Daily.

The eight distinct international sources of Chinese newswire included in 
this edition are the following:

    * Agence France Presse 
    * Central News Agency, Taiwan
    * Central News Service
    * Guangming Daily
    * People's Daily
    * People's Liberation Army Daily
    * Xinhua News Agency
    * Zaobao Newspaper

The original data received by the LDC from AFP, People's Liberation Army 
Daily, Xinhua, and Zaobao were encoded in GB-2312, those from CNA were 
in Big-5, and those from GMW, CNS, and People's Daily were in a 
combination of GB-2312 and GB-18030. To avoid the problems and confusion 
that could result from differences in character-set specifications, all 
text files in this corpus have been converted to UTF-8 character encoding.
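
The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure LDC used: the source labels and the function name are hypothetical, and treating the mixed GB-2312/GB-18030 sources as GB-18030 is an assumption that relies on GB-18030 being backward compatible with GB-2312.

```python
# Hypothetical mapping from source label to the codec of the original data.
# The labels are illustrative; they are not taken from the corpus itself.
SOURCE_CODECS = {
    "AFP": "gb2312",
    "PLA Daily": "gb2312",
    "Xinhua": "gb2312",
    "Zaobao": "gb2312",
    "CNA": "big5",
    # Mixed GB-2312 / GB-18030 sources: GB-18030 decodes both (assumption).
    "GMW": "gb18030",
    "CNS": "gb18030",
    "People's Daily": "gb18030",
}


def to_utf8(raw: bytes, source: str) -> bytes:
    """Decode bytes with the codec assumed for `source` and re-encode as UTF-8."""
    codec = SOURCE_CODECS[source]
    return raw.decode(codec).encode("utf-8")
```

For example, `to_utf8(data, "CNA")` would turn Big-5 bytes into UTF-8 bytes; a byte sequence that is invalid in the assumed codec raises `UnicodeDecodeError` rather than being silently altered.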

New in the Fourth Edition:

    * Two years' worth of new articles (January 2007 through December
      2008) have been added to the Xinhua, Agence France Presse, and CNA
      data sets.
    * Four new data sources have been added - Guangming Daily, Central
      News Service, People's Daily, and People's Liberation Army Daily,
      covering a timespan from November 2006 through December 2008.

[ Return to top <#top>]

(2) CSLU: S4X Release 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S03> 
was created by the Center for Spoken Language Understanding, Oregon 
Health and Science University (CSLU). The corpus consists of 36 speakers 
(22 male, 14 female) uttering 11 specified words.  The speakers repeated 
the following words six times on each of four channels: startrek, 
supernova, tektronix, generation, nebula, processing, singularity, 
71523, abracadabra, sungeeta and computer. The four channels used were 
office phone, home phone, carbon microphone telephone and speaker phone. 
Each speech file has a corresponding time-aligned phoneme-level 
transcription (achieved using automatic forced alignment) and an 
automatically generated word-level transcription.  Humans reviewed each 
utterance in two passes and classified it as good, bad, noisy or different. 

The data was recorded with the CSLU T1 digital data collection system. 
Each utterance is recorded as a separate file. These files were sampled 
at 8 kHz 8-bit and stored as ulaw files. All of the data use the RIFF 
standard file format. This file format is 16-bit linearly encoded.
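
For readers who need to move between the two sample representations mentioned above, here is a minimal sketch of the standard G.711 mu-law expansion from 8-bit codes to 16-bit linear samples. The function names are hypothetical, and the sketch assumes you already have the raw sample bytes; it does not parse RIFF headers, and it makes no claim about which encoding a given file in the release actually carries.

```python
def ulaw_byte_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample (G.711)."""
    inverted = ~code & 0xFF            # mu-law codes are stored bit-inverted
    sign = inverted & 0x80             # top bit: 1 means negative
    exponent = (inverted >> 4) & 0x07  # 3-bit segment number
    mantissa = inverted & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> list[int]:
    """Decode a run of mu-law bytes into a list of linear samples."""
    return [ulaw_byte_to_linear(b) for b in payload]
```

At 8 kHz, each second of audio corresponds to 8000 mu-law bytes, so `decode_ulaw` returns 8000 integers per second of speech.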

[ Return to top <#top>]

(3) FactBank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T23> 
consists of 208 documents (over 77,000 tokens) from newswire and 
broadcast news reports in which event mentions are annotated with their 
degree of factuality, that is, the degree to which they correspond to 
those events. FactBank 1.0 was built on top of TimeBank 1.2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08> 
and a fragment of the AQUAINT TimeML Corpus 
<http://www.timeml.org/site/timebank/timebank.html>, both of which used 
the TimeML specification language. This resulted in a double-layered 
annotation of event factuality. TimeBank 1.2 and AQUAINT TimeML encode 
most of the basic structural elements expressing factuality information 
while FactBank 1.0 represents the resulting factuality interpretation. 
The combination of the factuality values in FactBank with the structural 
information in TimeML-annotated corpora facilitates the development of 
tools aimed at automatically identifying the factuality values of 
events, a component fundamental in tasks requiring some degree of text 
understanding, such as Textual Entailment, Question Answering, or 
Narrative Understanding.

FactBank annotations indicate whether the event mention describes actual 
situations in the world, situations that have not happened, or 
situations of uncertain interpretation. Event factuality is not an 
inherent feature of events but a matter of perspective. Different 
discourse participants may present divergent views about the factuality 
of the very same event. Consequently, in FactBank, the factuality degree 
of events is assigned relative to the relevant sources at play. In this 
way, it can adequately reflect the divergence of opinions regarding the 
factual status of events, as is common in news reports.

All FactBank markup is standoff and is represented through a set of 20 
tables which can be easily loaded into a database. Each table resides in 
an independent text file, where fields are separated by three 
consecutive bars (i.e., |||). The data in fields of string type are 
enclosed in single quotation marks (').  Because FactBank 1.0 was built 
on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up 
with inline XML-based annotation, this release contains the TimeBank 1.2 
and AQUAINT TimeML annotation in standoff, table-based format as well.
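
As a rough illustration of the table format described above, the following Python sketch splits one line on the three-bar separator and strips the single quotes from string fields. The function name and the sample row are hypothetical, and the sketch assumes that no string field itself contains ||| and that unquoted fields are integers where they parse as such; consult the corpus documentation for the actual column definitions.

```python
def parse_row(line: str) -> list:
    """Split one table line on '|||' and unquote string fields."""
    fields = []
    for raw in line.rstrip("\n").split("|||"):
        value = raw.strip()
        if len(value) >= 2 and value[0] == "'" and value[-1] == "'":
            fields.append(value[1:-1])        # string field: drop the quotes
        else:
            try:
                fields.append(int(value))     # assumed numeric field
            except ValueError:
                fields.append(value)          # leave anything else untouched
    return fields


# Hypothetical row, for illustration only (not copied from FactBank):
example = "'wsj_0001.tml'|||3|||'e12'|||'said'\n"
print(parse_row(example))
```

Rows parsed this way can be passed to a database driver's bulk-insert call, which is one way to load the tables into a database as the release suggests.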

Non-members may license this data by completing the LDC User Agreement 
for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.  The collection is being made available at no charge.

[ Return to top <#top>]

*LDC's Free Resources*

LDC is pleased to distribute FactBank 1.0, which is available at no 
cost.  To license a copy of this data, non-members should complete the 
LDC User Agreement for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf> 
and fax to +1 215 573 2175 or scan and email to this address. FactBank 
joins a host of LDC resources which are available for free.  These 
resources include tools and corpora developed at LDC as well as corpora 
made available through LDC's strong network of data providers.   

Since LDC's founding, we have distributed over 1300 copies of corpora at 
no cost including:

    * over 700 non-member downloads of Buckwalter Arabic Morphological
      Analyzer 1.0
    * 400 copies of Talkbank-sponsored data including popular releases
      such as the American National Corpus and the Santa Barbara Corpora
      of Spoken American English
    * nearly 200 copies of Web 1T 5-gram Version 1, sponsored by Google Inc.
    * over 30 copies of TimeBank 1.2
    * over a dozen copies of the corpora developed for the Unified
      Linguistic Annotation (ULA) project

For further information, visit our What's New! What's Free! Archive 
<http://www.ldc.upenn.edu/About/whatsnew.shtml>.

[ Return to top <#top>]

*Release of XTrans*

At InterSpeech 2009 <http://www.interspeech2009.org/>, LDC introduced 
XTrans <http://www.ldc.upenn.edu/tools/XTrans/>, a new tool for manual 
transcription and annotation of audio recordings.  XTrans is a 
next-generation transcription tool that is designed to support transcription 
tasks in multiple languages on multiple platforms.   XTrans provides a 
flexible and intuitive graphical user interface for a multitude of 
speech annotation tasks including (virtual) segmentation of audio into 
smaller units like turns and sentences; speaker identification; 
orthographic transcription in any language; and labeling of structural 
elements of the transcript like topics.  Its versatile and powerful 
waveform display/playback component can load multiple audio files of 
different file formats and sampling rates at the same time. LDC and its 
partners have used XTrans to generate over 3500 hours of time-aligned 
verbatim transcripts in a variety of genres and languages. 

With an intuitive interface, user configurability and embedded QC 
functions, XTrans is optimized for high-quality, high-volume 
transcription tasks involving real world data. XTrans successfully 
addresses the challenges of real world data including transcribing 
multiple speakers in a single channel through Virtual Speaker Channel, 
which enables an unlimited number of distinct speakers to be associated 
with the same audio channel.  Furthermore, XTrans allows transcribers to 
open an effectively unlimited number of audio files for simultaneous 
transcription. Transcribers can switch focus between one, two or 
multiple speakers as needed.  XTrans also provides strong multilingual 
support, with bidirectional text input for languages like Arabic, Farsi, 
Urdu, and Hebrew.

Realtime transcription rates have improved dramatically in LDC projects 
using XTrans, with rates for some tasks cut by as much as half.   XTrans 
also brings key quality control functions directly into the interface, 
giving transcribers the power to improve the quality of their own work.  
XTrans components are written in Python and C++, utilizing LDC's QWave 
waveform display module. Even with very large files or multiple 
recordings, XTrans provides users with fast display and playback 
capabilities.  A range of audio formats is supported, including .sph, 
.wav, .aiff, .flac, and .ogg. Transcripts are output in a Tab Delimited 
Format (TDF), which is easily converted to other common formats and is 
readily usable by downstream manual and automatic annotation tasks.
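
Because the output is tab-delimited, it can be read with standard tools. The sketch below uses Python's csv module to load such a file into a list of dictionaries. The column names passed in and the optional comment prefix are assumptions made for illustration; they are not a statement of the actual TDF column layout, which is documented with XTrans.

```python
import csv
import io


def read_tab_delimited(stream, columns, comment_prefix=None):
    """Read tab-separated records into dicts keyed by the given column names.

    `columns` and `comment_prefix` are supplied by the caller; the values
    used in the example below are hypothetical.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if not record:
            continue                                   # skip blank lines
        if comment_prefix and record[0].startswith(comment_prefix):
            continue                                   # skip comment lines
        rows.append(dict(zip(columns, record)))
    return rows


# Hypothetical two-line input: one comment line, one segment.
sample = ";; example comment\nrec1.wav\t0\t0.00\t1.50\tspk1\thello there\n"
columns = ["file", "channel", "start", "end", "speaker", "transcript"]
segments = read_tab_delimited(io.StringIO(sample), columns, comment_prefix=";;")
```

From a list like `segments`, writing out another format is a matter of iterating over the rows and formatting the start time, end time, speaker, and transcript as the target format requires.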

Availability:

XTrans for Linux and Windows platforms is available from the LDC at no 
cost under GPLv3 and can be downloaded here 
<http://www.ldc.upenn.edu/tools/XTrans/downloads/>.

[ Return to top <#top>]

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora