<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p style="text-align: center;" align="center"><i><a name="top">In this
newsletter:</a></i><br>
<br>
LDC2009T27<br>
- <a href="#ChGig4thEd"><b>Chinese
Gigaword Fourth Edition</b></a> -<br>
<br>
LDC2009S03<br>
- <b><a href="#CSLUS4X">CSLU:
S4X Release 1.2</a></b> -<br>
<br>
LDC2009T23<br>
- <b><a href="#Fact">FactBank
1.0</a></b> -<br>
<br>
- <b><a href="#Free">LDC's
Free Resources</a></b>
-<br>
<br>
- <b><a href="#XTrans">Release
of XTrans</a> </b> -<o:p></o:p><br>
</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr size="2" width="100%"><o:p> </o:p></div>
<p class="MsoNormal" style="text-align: center;" align="center"><b>New
Publications</b><o:p></o:p></p>
<p class="MsoNormal"><b><br>
</b><a name="ChGig4thEd">(1)</a> <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T27">Chinese
Gigaword Fourth Edition</a> is a comprehensive archive of newswire text
data
that has been acquired over several years by the LDC. This edition
includes all
of the contents in <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38">Chinese
Gigaword Third Edition (LDC2007T38)</a> as well as newly collected
data. In
addition, four entirely new sources have been added in the fourth
edition:
Central News Service, Guangming Daily, People's Liberation Army Daily,
and
People's Daily.</p>
<p>The eight distinct international sources of Chinese newswire
included in
this edition are the following: <o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">Agence France Presse</li>
<li class="MsoNormal" style="">Central News Agency, Taiwan</li>
<li class="MsoNormal" style="">Central News Service <o:p></o:p></li>
<li class="MsoNormal" style="">Guangming Daily <o:p></o:p></li>
<li class="MsoNormal" style="">People's Daily <o:p></o:p></li>
<li class="MsoNormal" style="">People's Liberation Army Daily <o:p></o:p></li>
<li class="MsoNormal" style="">Xinhua News Agency <o:p></o:p></li>
<li class="MsoNormal" style="">Zaobao Newspaper <o:p></o:p></li>
</ul>
<p class="MsoNormal">The original data received by the LDC from AFP,
People's
Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312,
those from
CNA were in Big-5, and those from GMW, CNS, and People's Daily were in
a
combination of GB-2312 and GB-18030. To avoid the problems and
confusion that
could result from differences in character-set specifications, all text
files
in this corpus have been converted to UTF-8 character encoding.<br>
<br>
New in the Fourth Edition:<o:p></o:p></p>
<ul type="disc">
<li class="MsoNormal" style="">Two years' worth of new articles
(January 2007 through December 2008) have been added to the Xinhua,
Agence France Presse, and CNA data sets.</li>
<li class="MsoNormal" style="">Four new data sources have been added:
Guangming Daily, Central News Service, People's Daily, and People's
Liberation Army Daily, covering a time span from November 2006 through
December 2008.</li>
</ul>
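<p class="MsoNormal">The character-set conversion described above can be
pictured with a minimal Python sketch. It is an illustration only, not the
procedure LDC actually used: the source labels and the function name to_utf8
are hypothetical, and the sources that mixed GB-2312 and GB-18030 are decoded
as GB-18030 on the basis that GB-18030 is backward compatible with GB-2312.</p>
<pre>
```python
# Original encodings per source, as listed in the paragraph above.
# The dictionary keys are hypothetical labels chosen for this sketch.
SOURCE_ENCODINGS = {
    "afp": "gb2312",
    "pla_daily": "gb2312",
    "xinhua": "gb2312",
    "zaobao": "gb2312",
    "cna": "big5",
    # Mixed GB-2312 / GB-18030 sources: GB-18030 also decodes GB-2312 bytes.
    "gmw": "gb18030",
    "cns": "gb18030",
    "peoples_daily": "gb18030",
}


def to_utf8(raw_bytes, source):
    """Decode bytes from the source's original encoding and re-encode as UTF-8."""
    text = raw_bytes.decode(SOURCE_ENCODINGS[source])
    return text.encode("utf-8")
```
</pre>
<p class="MsoNormal">Under these assumptions, to_utf8(data, "cna") would turn
Big-5 bytes into UTF-8. Bytes that are not valid in the listed encoding raise
UnicodeDecodeError rather than being silently replaced.</p>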
<p class="MsoNormal" style="margin-bottom: 12pt;"><br>
</p>
<center>[<a href="#top">
Return to top </a>]</center>
<p><br>
</p>
<p class="MsoNormal" style="text-align: center;" align="center"><b>*</b><o:p></o:p></p>
<p class="MsoNormal"><br>
<a name="CSLUS4X">(2)</a><b> </b><a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009S03">CSLU:
S4X Release 1.2</a> was created by the Center for Spoken Language
Understanding, Oregon Health and Science University (CSLU). The corpus
consists
of 36 speakers (22 male, 14 female) uttering 11 specified words. The
speakers repeated the following words six times on each of four
channels:
startrek, supernova, tektronix, generation, nebula, processing,
singularity,
71523, abracadabra, sungeeta and computer. The four channels used were
office
phone, home phone, carbon microphone telephone and speaker phone. Each
speech
file has a corresponding time-aligned phoneme-level transcription
(produced
using automatic forced alignment) and an automatically generated
word-level
transcription. Human reviewers checked each utterance in two passes and
classified it as good, bad, noisy or different. </p>
<p>The data was recorded with the CSLU T1 digital data collection
system. Each
utterance is recorded as a separate file. These files were sampled at 8
kHz,
8-bit, and stored as ulaw files. All of the data use the RIFF standard
file
format. This file format is 16-bit linearly encoded.<o:p></o:p></p>
<br>
<center>[<a href="#top">
Return to top </a>]<br>
<br>
</center>
<div align="center">*<br>
</div>
<p><a name="Fact">(3)</a> <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T23">FactBank
1.0</a> consists of 208 documents (over 77,000 tokens) from newswire
and
broadcast news reports in which event mentions are annotated with their
degree
of factuality, that is, the degree to which they correspond to actual
situations in the world.
FactBank 1.0 was built on top of <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08">TimeBank
1.2</a> and a fragment of the <a
href="http://www.timeml.org/site/timebank/timebank.html">AQUAINT
TimeML Corpus</a>,
both of which used the TimeML specification language. This resulted in
a
double-layered annotation of event factuality. TimeBank 1.2 and AQUAINT
TimeML
encode most of the basic structural elements expressing factuality
information
while FactBank 1.0 represents the resulting factuality interpretation.
The
combination of the factuality values in FactBank with the structural
information in TimeML-annotated corpora facilitates the development of
tools
aimed at automatically identifying the factuality values of events, a
component
fundamental in tasks requiring some degree of text understanding, such
as
Textual Entailment, Question Answering, or Narrative Understanding. <o:p></o:p></p>
<p>FactBank annotations indicate whether an event mention describes an
actual
situation in the world, a situation that has not happened, or a
situation of
uncertain interpretation. Event factuality is not an inherent feature
of events
but a matter of perspective. Different discourse participants may
present
divergent views about the factuality of the very same event.
Consequently, in
FactBank, the factuality degree of events is assigned relative to the
relevant
sources at play. In this way, it can adequately reflect the divergence
of
opinions regarding the factual status of events, as is common in news
reports. <o:p></o:p></p>
<p>All FactBank markup is standoff and is represented through a set of
20
tables which can be easily loaded into a database. Each table resides
in an
independent text file, where fields are separated by three consecutive
bars
(i.e., |||). The data in fields of string type are enclosed in
single
quotation marks ('). Because FactBank 1.0 was built on top of TimeBank 1.2
and
AQUAINT TimeML, both of which are marked up with inline XML-based
annotation,
this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in
standoff, table-based format as well.<o:p></o:p></p>
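<p>The table layout described above can be read with a few lines of Python.
The following is a minimal sketch, not an official loader: the function names
parse_row and read_table are hypothetical, the file encoding is assumed to be
UTF-8, and the sketch does not handle any escaping of quotation marks or bars
inside string fields, since the description above does not say how such cases
are represented.</p>
<pre>
```python
def parse_row(line):
    """Split one table line on '|||' and strip the quotes around string fields."""
    fields = []
    for raw in line.rstrip("\r\n").split("|||"):
        value = raw.strip()
        is_quoted = value.startswith("'") and value.endswith("'") and value != "'"
        if is_quoted:
            fields.append(value[1:-1])  # string field: drop the enclosing quotes
        else:
            fields.append(value)  # unquoted field: kept as text, type unknown here
    return fields


def read_table(path):
    """Read every non-empty line of one table file into a list of field lists."""
    # UTF-8 is an assumption; the announcement does not state the file encoding.
    with open(path, encoding="utf-8") as handle:
        return [parse_row(line) for line in handle if line.strip()]
```
</pre>
<p>With a made-up row such as 'doc.tml'|||3|||'said', parse_row returns the
three fields doc.tml, 3 and said. Unquoted fields are left as text so that the
caller can decide how to convert them when loading a database.</p>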
<p style="margin-bottom: 12pt;">Non-members
may license this data by completing the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User Agreement for Non-members</a>. The agreement can be faxed to +1
215
573 2175 or scanned and emailed to this address. The collection is
being
made available at no charge.<br>
</p>
<center>[<a href="#top">
Return to top </a>]<br>
<br>
</center>
<p style="text-align: center;" align="center"> <b><a name="Free">LDC's
Free Resources</a> <br>
<br>
</b><o:p></o:p></p>
<p>LDC is pleased to distribute FactBank 1.0, which is available at no
cost. To license a copy of this data, non-members should complete the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User Agreement for Non-members</a> and fax it to +1 215 573 2175 or scan
and email it
to this address. <span class="moz-txt-star">FactBank joins a host of
LDC
resources that are available for free. These resources include tools
and
corpora developed at LDC as well as corpora made available through
LDC's strong
network of data providers. <span style=""> </span></span><o:p></o:p></p>
<p><span class="moz-txt-star">Since LDC's founding, we have distributed
over 1,300
copies of corpora at no cost, including:</span></p>
<ul type="disc">
<li class="MsoNormal" style=""><span class="moz-txt-star">over 700
non-member downloads of Buckwalter Arabic Morphological Analyzer 1.0</span></li>
<li class="MsoNormal" style=""><span class="moz-txt-star">400 copies
of Talkbank-sponsored data including popular releases such as the
American National Corpus and the Santa Barbara Corpora of Spoken
American English</span><o:p></o:p></li>
<li class="MsoNormal" style=""><span class="moz-txt-star">nearly 200
copies of Web 1T 5-gram Version 1, sponsored by Google Inc.</span><o:p></o:p></li>
<li class="MsoNormal" style=""><span class="moz-txt-star">over 30
copies of TimeBank 1.2</span><o:p></o:p></li>
<li class="MsoNormal" style=""><span class="moz-txt-star">over a
dozen copies of the corpora developed for the Unified Linguistic
Annotation </span>(ULA) project <o:p></o:p></li>
</ul>
<p><span class="moz-txt-star">For further information, visit our <a
href="http://www.ldc.upenn.edu/About/whatsnew.shtml">What's New!
What's Free!
Archive</a>.<br>
</span></p>
<p><br>
</p>
<center>[<a href="#top">
Return to top </a>]</center>
<p style="text-align: center;" align="center"><b><a name="XTrans">Release
of XTrans<br>
</a></b> <o:p></o:p></p>
<p>At <a href="http://www.interspeech2009.org/">InterSpeech 2009</a>,
LDC
introduced <a href="http://www.ldc.upenn.edu/tools/XTrans/">XTrans</a>,
a new
tool for manual transcription and annotation of audio recordings.
XTrans
is a next-generation transcription tool that is designed to support
transcription tasks in multiple languages on multiple platforms.
XTrans provides a flexible and intuitive graphical user interface for a
multitude of speech annotation tasks including (virtual) segmentation
of audio
into smaller units like turns and sentences; speaker identification;
orthographic transcription in any language; and labeling of structural
elements
of the transcript like topics. Its versatile and powerful waveform
display/playback component can load multiple audio files of different
file
formats and sampling rates at the same time. LDC and its partners have
used
XTrans to generate over 3500 hours of time-aligned verbatim transcripts
in a
variety of genres and languages. <o:p></o:p></p>
<p>With an intuitive interface, user configurability and embedded QC
functions,
XTrans is optimized for high-quality, high-volume transcription tasks
involving
real-world data. XTrans successfully addresses the challenges of
real-world
data, including transcribing multiple speakers in a single channel
through
Virtual Speaker Channel, which enables an unlimited number of distinct
speakers
to be associated with the same audio channel. Furthermore, XTrans
allows
transcribers to open an effectively unlimited number of audio files for
simultaneous transcription. Transcribers can switch focus between one,
two or
multiple speakers as needed. XTrans also provides strong multilingual
support, with bidirectional text input for languages like Arabic,
Farsi, Urdu,
and Hebrew.<o:p></o:p></p>
<p>Real-time transcription rates have improved dramatically in LDC
projects
using XTrans, with rates for some tasks cut by as much as half.
XTrans also brings key quality control functions directly into the
interface,
giving transcribers the power to improve the quality of their own
work.
XTrans components are written in Python and C++, utilizing LDC's QWave
waveform
display module. Even with very large files or multiple recordings,
XTrans
provides users with fast display and playback capabilities. A range of
audio formats is supported, including .sph, .wav, .aiff, .flac, and
.ogg.
Transcripts are output in a Tab Delimited Format (TDF), which is easily
converted to other common formats and is readily usable by downstream
manual
and automatic annotation tasks.<o:p></o:p></p>
<p>Availability:<o:p></o:p></p>
<p>XTrans for Linux and Windows platforms is available from the LDC at
no cost under
GPLv3 and can be downloaded <a
href="http://www.ldc.upenn.edu/tools/XTrans/downloads/">here</a>.</p>
<center>[<a href="#top">
Return to top </a>]</center>
<br>
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big><br>
Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>