<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p style="text-align: center;" align="center"><i><a name="top">In this

newsletter:</a></i><br>

<br>

LDC2009T27<br>

- <a href="#ChGig4thEd"><b>Chinese

Gigaword Fourth Edition</b></a> -<br>

<br>

LDC2009S03<br>

- <b><a href="#CSLUS4X">CSLU:

S4X Release 1.2</a></b> -<br>

<br>

LDC2009T23<br>

- <b><a href="#Fact">FactBank

1.0</a></b> -<br>

<br>

- <b><a href="#Free">LDC's

Free Resources</a></b>

-<br>

<br>

- <b><a href="#XTrans">Release

of XTrans</a> </b> -<o:p></o:p><br>

</p>

<div class="MsoNormal" style="text-align: center;" align="center">

<hr size="2" width="100%"><o:p> </o:p></div>

<p class="MsoNormal" style="text-align: center;" align="center"><b>New

Publications</b><o:p></o:p></p>

<p class="MsoNormal"><b><br>

</b><a name="ChGig4thEd">(1)</a> <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T27">Chinese

Gigaword Fourth Edition</a> is a comprehensive archive of newswire text

data

that has been acquired over several years by the LDC. This edition

includes all

of the contents in <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38">Chinese

Gigaword Third Edition (LDC2007T38)</a> as well as newly collected

data. In

addition, four entirely new sources have been added in the fourth

edition:

Central News Service, Guangming Daily, People's Liberation Army Daily,

and

People's Daily. <o:p></o:p></p>

<p>The eight distinct international sources of Chinese newswire

included in

this edition are the following: <o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">Agence France Presse</li>

  <li class="MsoNormal" style="">Central News Agency, Taiwan</li>

  <li class="MsoNormal" style="">Central News Service <o:p></o:p></li>

  <li class="MsoNormal" style="">Guangming Daily <o:p></o:p></li>

  <li class="MsoNormal" style="">People's Daily <o:p></o:p></li>

  <li class="MsoNormal" style="">People's Liberation Army Daily <o:p></o:p></li>

  <li class="MsoNormal" style="">Xinhua News Agency <o:p></o:p></li>

  <li class="MsoNormal" style="">Zaobao Newspaper <o:p></o:p></li>

</ul>

<p class="MsoNormal">The original data received by the LDC from AFP,

People's

Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312,

those from

CNA were in Big-5, and those from Guangming Daily (GMW), Central News Service (CNS), and People's Daily were in

a

combination of GB-2312 and GB-18030. To avoid the problems and

confusion that

could result from differences in character-set specifications, all text

files

in this corpus have been converted to UTF-8 character encoding.<br>

<br>

New in the Fourth Edition:<o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style="">Two years' worth of new articles

(January 2007 through December 2008) have been added to the Xinhua,

Agence France Presse, and CNA data sets.<o:p></o:p></li>

  <li class="MsoNormal" style="">Four new data sources have been added

- Guangming Daily, Central News Service, People's Daily, and People's

Liberation Army Daily, covering a timespan from November 2006 through

December 2008.<o:p></o:p></li>

</ul>
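<p>The conversion to UTF-8 described above can be illustrated with a
minimal Python sketch. It is not the procedure LDC used; the function
name <code>to_utf8</code> and the sample strings are assumptions made
for illustration only. Because GB-18030 is a superset of GB-2312, a
single GB-18030 decoder can read sources that mix the two.</p>

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Illustrative text, written with escapes so the example stays ASCII-only.
simplified = "\u65b0\u534e\u793e"    # three simplified Chinese characters
traditional = "\u4e2d\u592e\u793e"   # three traditional Chinese characters

gb_bytes = simplified.encode("gb2312")    # stand-in for a GB-2312 source file
big5_bytes = traditional.encode("big5")   # stand-in for a Big-5 source file

# GB-18030 is a superset of GB-2312, so one decoder covers mixed sources.
print(to_utf8(gb_bytes, "gb18030").decode("utf-8") == simplified)
print(to_utf8(big5_bytes, "big5").decode("utf-8") == traditional)
```

<p>Users of the corpus do not need this step, since every file in the
release is already UTF-8.</p>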

<p class="MsoNormal" style="margin-bottom: 12pt;"><br>

</p>

<center>[<a href="#top">

Return to top </a>]</center>

<p><br>

</p>

<p class="MsoNormal" style="text-align: center;" align="center"><b>*</b><o:p></o:p></p>

<p class="MsoNormal"><br>

<a name="CSLUS4X">(2)</a><b>  </b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S03">CSLU:

S4X Release 1.2</a> was created by the Center for Spoken Language

Understanding, Oregon Health and Science University (CSLU). The corpus

consists

of 36 speakers (22 male, 14 female) uttering 11 specified words.  The

speakers repeated the following words six times on each of four

channels:

startrek, supernova, tektronix, generation, nebula, processing,

singularity,

71523, abracadabra, sungeeta and computer. The four channels used were

office

phone, home phone, carbon microphone telephone and speaker phone. Each

speech

file has a corresponding time-aligned phoneme-level transcription

(achieved

using automatic forced alignment) and an automatically-generated

word-level

transcription.  Humans reviewed each utterance in two passes and

classified it as good, bad, noisy or different. <o:p></o:p></p>

<p>The data was recorded with the CSLU T1 digital data collection

system. Each

utterance is recorded as a separate file. These files were sampled at 8

kHz

8-bit and stored as ulaw files. All of the data use the RIFF standard

file

format. This file format is 16-bit linearly encoded.<o:p></o:p></p>
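<p>For readers unfamiliar with ulaw audio, the following Python sketch
shows how a single 8-bit sample expands to a 16-bit linear value. It
assumes the standard G.711 mu-law companding that "ulaw" usually
denotes; the release documentation remains the authority on the actual
file contents, and the function names are illustrative.</p>

```python
def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte (0..255) to a signed 16-bit linear sample."""
    u = 255 - byte             # mu-law bytes are stored bit-inverted
    sign = u // 128            # top bit: 1 means a negative sample
    exponent = (u // 16) % 8   # next three bits: segment number
    mantissa = u % 16          # low four bits: step within the segment
    magnitude = (mantissa * 8 + 132) * 2 ** exponent - 132
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list:
    """Decode a run of mu-law bytes into a list of linear samples."""
    return [ulaw_byte_to_pcm16(b) for b in data]


print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```

<p>The three sample bytes correspond to silence and to the largest
positive and negative values the encoding can represent.</p>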

<br>

<center>[<a href="#top">

Return to top </a>]<br>

<br>

</center>

<div align="center">*<br>

</div>


<p><a name="Fact">(3)</a>  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T23">FactBank

1.0</a> consists of 208 documents (over 77,000 tokens) from newswire

and

broadcast news reports in which event mentions are annotated with their

degree

of factuality, that is, the degree to which they correspond to those

events.

FactBank 1.0 was built on top of <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank

1.2</a> and a fragment of the <a

 href="http://www.timeml.org/site/timebank/timebank.html">AQUAINT

TimeML Corpus</a>,

both of which used the TimeML specification language. This resulted in

a

double-layered annotation of event factuality. TimeBank 1.2 and AQUAINT

TimeML

encode most of the basic structural elements expressing factuality

information

while FactBank 1.0 represents the resulting factuality interpretation.

The

combination of the factuality values in FactBank with the structural

information in TimeML-annotated corpora facilitates the development of

tools

aimed at automatically identifying the factuality values of events, a

component

fundamental in tasks requiring some degree of text understanding, such

as

Textual Entailment, Question Answering, or Narrative Understanding. <o:p></o:p></p>

<p>FactBank annotations indicate whether the event mention describes

actual

situations in the world, situations that have not happened, or

situations of

uncertain interpretation. Event factuality is not an inherent feature

of events

but a matter of perspective. Different discourse participants may

present

divergent views about the factuality of the very same event.

Consequently, in

FactBank, the factuality degree of events is assigned relative to the

relevant

sources at play. In this way, it can adequately reflect the divergence

of

opinions regarding the factual status of events, as is common in news

reports. <o:p></o:p></p>

<p>All FactBank markup is standoff and is represented through a set of

20

tables which can be easily loaded into a database. Each table resides

in an

independent text file, where fields are separated by three consecutive

bars

(i.e., |||). String-valued fields are enclosed in single quotation

marks (').  Because FactBank 1.0 was built on top of TimeBank 1.2

and

AQUAINT TimeML, both of which are marked up with inline XML-based

annotation,

this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in

standoff, table-based format as well.<o:p></o:p></p>
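<p>As a rough illustration of the table format described above, the
Python sketch below splits one row on the three-bar separator and
strips the quotation marks from string fields. The sample row, its
column order, and the function name <code>parse_row</code> are
hypothetical; the sketch also assumes that unquoted fields are
integers and that the separator never occurs inside a quoted string,
neither of which is confirmed here.</p>

```python
def parse_row(line: str) -> list:
    """Split one table row on '|||' and strip quotes from string fields."""
    fields = []
    for raw in line.rstrip("\r\n").split("|||"):
        if len(raw) >= 2 and raw[0] == "'" and raw[-1] == "'":
            fields.append(raw[1:-1])      # quoted field: keep as text
        else:
            try:
                fields.append(int(raw))   # unquoted field: assume an integer
            except ValueError:
                fields.append(raw)        # anything else is left untouched
    return fields


# Hypothetical row; real FactBank tables may use different columns.
row = "'wsj_0006.tml'|||3|||'e12'|||'said'"
print(parse_row(row))
```

<p>Rows parsed this way can be passed to a database driver's bulk
insert call, with one database table per text file.</p>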

<p style="margin-bottom: 12pt;">Non-members

may license this data by completing the <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User Agreement for Non-members</a>.  The agreement can be faxed to +1

215

573 2175 or scanned and emailed to this address.  The collection is

being

made available at no charge.<br>

</p>

<center>[<a href="#top">

Return to top </a>]<br>

<br>

</center>

<p style="text-align: center;" align="center"> <b><a name="Free">LDC's

Free Resources</a> <br>

<br>

</b><o:p></o:p></p>

<p>LDC is pleased to distribute FactBank 1.0, which is available at no

cost.  To license a copy of this data, non-members should complete the <a

 href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User Agreement for Non-members</a> and fax it to +1 215 573 2175 or scan

and email it

to this address. <span class="moz-txt-star">FactBank joins a host of

LDC

resources which are available for free.  These resources include tools

and

corpora developed at LDC as well as corpora made available through

LDC's strong

network of data providers.</span></p>

<p><span class="moz-txt-star">Since LDC's founding, we have distributed

over 1300

copies of corpora at no cost, including:</span></p>

<ul type="disc">

  <li class="MsoNormal" style=""><span class="moz-txt-star">over 700

non-member downloads of Buckwalter Arabic Morphological Analyzer 1.0</span><o:p></o:p></li>

  <li class="MsoNormal" style=""><span class="moz-txt-star">400 copies

of Talkbank-sponsored data including popular releases such as the

American National Corpus and the Santa Barbara Corpora of Spoken

American English</span><o:p></o:p></li>

  <li class="MsoNormal" style=""><span class="moz-txt-star">nearly 200

copies of Web 1T 5-gram Version 1, sponsored by Google Inc.</span><o:p></o:p></li>

  <li class="MsoNormal" style=""><span class="moz-txt-star">over 30

copies of TimeBank 1.2</span><o:p></o:p></li>

  <li class="MsoNormal" style=""><span class="moz-txt-star">over a

dozen copies of the corpora developed for the Unified Linguistic

Annotation </span>(ULA) project <o:p></o:p></li>

</ul>

<p><span class="moz-txt-star">For further information, visit our <a

 href="http://www.ldc.upenn.edu/About/whatsnew.shtml">What's New!

What's Free!

Archive</a>.<br>

</span></p>

<p><br>

</p>

<center>[<a href="#top">

Return to top </a>]</center>


<p style="text-align: center;" align="center"><b><a name="XTrans">Release

of XTrans<br>

</a></b> <o:p></o:p></p>

<p>At <a href="http://www.interspeech2009.org/">InterSpeech 2009</a>,

LDC

introduced <a href="http://www.ldc.upenn.edu/tools/XTrans/">XTrans</a>,

a new

tool for manual transcription and annotation of audio recordings. 

XTrans

is a next-generation transcription tool that is designed to support

transcription tasks in multiple languages on multiple platforms.  

XTrans provides a flexible and intuitive graphical user interface for a

multitude of speech annotation tasks including (virtual) segmentation

of audio

into smaller units like turns and sentences; speaker identification;

orthographic transcription in any language; and labeling of structural

elements

of the transcript like topics.  Its versatile and powerful waveform

display/playback component can load multiple audio files of different

file

formats and sampling rates at the same time. LDC and its partners have

used

XTrans to generate over 3500 hours of time-aligned verbatim transcripts

in a

variety of genres and languages.  <o:p></o:p></p>

<p>With an intuitive interface, user configurability and embedded QC

functions,

XTrans is optimized for high-quality, high-volume transcription tasks

involving

real-world data. XTrans addresses the challenges of real-world

data, including transcribing multiple speakers in a single channel

through

Virtual Speaker Channel, which enables an unlimited number of distinct

speakers

to be associated with the same audio channel.  Furthermore, XTrans

allows

transcribers to open an effectively unlimited number of audio files for

simultaneous transcription. Transcribers can switch focus between one,

two or

multiple speakers as needed.  XTrans also provides strong multilingual

support, with bidirectional text input for languages like Arabic,

Farsi, Urdu,

and Hebrew.<o:p></o:p></p>

<p>Real-time transcription rates have improved dramatically in LDC

projects

using XTrans, with rates for some tasks cut by as much as half.  

XTrans also brings key quality control functions directly into the

interface,

giving transcribers the power to improve the quality of their own

work. 

XTrans components are written in Python and C++, utilizing LDC's QWave

waveform

display module. Even with very large files or multiple recordings,

XTrans

provides users with fast display and playback capabilities.  A range of

audio formats is supported, including .sph, .wav, .aiff, .flac, and

.ogg.

Transcripts are output in a Tab Delimited Format (TDF), which is easily

converted to other common formats and is readily usable by downstream

manual

and automatic annotation tasks.<o:p></o:p></p>
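<p>Because the output is tab-delimited, a few lines of Python are
enough to load a transcript for downstream processing. The sketch
below makes no claim about the actual TDF column layout: the sample
rows and the function name <code>read_tab_delimited</code> are
hypothetical and serve only to show the approach.</p>

```python
import csv
import io


def read_tab_delimited(text: str) -> list:
    """Return the non-empty rows of tab-delimited text as lists of strings."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical rows: file, start time, end time, speaker, transcript.
sample = (
    "rec01.wav\t0.00\t2.35\tspeaker1\thello there\n"
    "rec01.wav\t2.35\t4.10\tspeaker2\tgood morning\n"
)
for row in read_tab_delimited(sample):
    print(row)
```

<p>Quoting is disabled so that quotation marks in a transcript are kept
as ordinary characters.</p>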

<p>Availability:<o:p></o:p></p>

<p>XTrans for Linux and Windows platforms is available from the LDC at

no cost under

GPLv3 and can be downloaded <a

 href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">here</a>.</p>

<center>[<a href="#top">

Return to top </a>]</center>

<br>

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>

Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

</body>

</html>