<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#ffffff">
<div class="moz-text-html" lang="x-western">
<div class="moz-text-html" lang="x-western">
<p style="text-align: center;" align="center"><font face="Times
New Roman, Times, serif"><i>In this newsletter:</i><b><br>
</b></font></p>
<p style="text-align: center;" align="center"><font face="Times
New Roman, Times, serif"><b>- </b></font><b><font
face="Times New Roman, Times, serif"><a href="#pipeline">Publications
Pipeline for 2011</a></font></b><font face="Times New
Roman, Times, serif"><b> -</b></font><font face="Times New
Roman, Times, serif"><b> </b></font></p>
<p style="text-align: center;" align="center"><font face="Times
New Roman, Times, serif"><i>New free publications:</i><b><br>
</b></font></p>
<p style="text-align: center;" align="center"><font face="Times
New Roman, Times, serif"><b>- </b></font><font
face="Times New Roman, Times, serif"><b><a href="#pos">Indian
Language Part-of-Speech Tagset: Sanskrit</a></b></font><font
face="Times New Roman, Times, serif"><b> -</b></font><font
face="Times New Roman, Times, serif"><b> </b></font></p>
<p style="text-align: center;" align="center"><font face="Times
New Roman, Times, serif"><b>- </b><b><a href="#onto">OntoNotes
4.0</a></b></font><font face="Times New Roman, Times,
serif"><b> </b></font><font face="Times New Roman, Times,
serif"><b>-</b></font></p>
<hr width="100%" size="2">
<p style="text-align: center;" align="center"><a name="pipeline"></a><font
face="Times New Roman"><b>Publications Pipeline for 2011</b></font></p>
<p><font face="Times New Roman">LDC is pleased to provide the
following information on our planned releases for Membership
Year 2011 (MY2011). There is still time to save on membership
fees for MY2011, but time is quickly running out! Any
organization that joins or renews membership for 2011 through
Tuesday, March 1, 2011, is entitled to a 5% discount on
membership fees. Organizations that held membership for MY2010
can receive a 10% discount on fees provided they renew before
March 1, 2011.</font></p>
<p style="margin-bottom: 12pt;"><font face="Times New Roman">Many
publications for MY2011 are still in development, but we
plan to release updates to some of our popular Gigaword
corpora as well as new speech corpora. Please note that the
list is tentative and subject to modifications. Our planned
publications for this year include:</font></p>
<blockquote>
<p class="MsoNormal"><font face="Times New Roman"><i>2005 NIST
Speaker Recognition Evaluation</i> ~ the 2005 data from
the ongoing series of yearly evaluations conducted by NIST
(National Institute of Standards and Technology). These
evaluations provide an important contribution to the
direction of research efforts and the calibration of
technical capabilities. They are intended to be of
interest to all researchers working on the general problem
of text-independent speaker recognition. </font></p>
<p><font face="Times New Roman"><i>Arabic Gigaword Fifth
Edition</i> ~ LDC’s Arabic newswire collection from 2009
and 2010 as well as the contents of Arabic Gigaword Fourth
Edition (LDC2009T30). The news sources represented
include Agence France Presse, An Nahar, Al Hayat, Al-Quds
Al-Arabi, Asharq Al-Awsat, Assabah, Al-Ahram, Ummah Press
and Xinhua News Agency.</font></p>
<p class="MsoNormal"><font face="Times New Roman"><i>Chinese
Gigaword Fifth Edition </i>~ LDC’s Chinese newswire
collection from 2009 and 2010 as well as the contents of
Chinese Gigaword Fourth Edition
(LDC2009T27). The news sources represented include Agence
France Presse, Central News Agency (Taiwan), Xinhua News
Agency, Zaobao, People's Liberation Army Daily, People’s
Daily, Guangming Daily and China News Service.<br>
<br>
<i>Digital Archive of Southern Speech</i></font> <font
face="Times New Roman"> ~ a geographical sampling of
colloquial speech in the Southern United States. Samples
of speech were collected through interviews of single
subjects speaking on a variety of common topics like
family, the weather, household articles and activities,
agriculture, and social connections. Speakers range in age
from 15 to 90, with an average age of 61.<br>
<br>
<i>English Gigaword Fifth Edition</i></font> <font
face="Times New Roman"> ~ LDC’s English newswire
collection from 2009 and 2010 as well
as the contents of English
Gigaword Fourth Edition (LDC2009T13). The news sources
represented include Agence France Presse, Associated
Press, Central News Agency (Taiwan), NY Times, Washington
Post, Los Angeles Times and Xinhua News Agency.<br>
<br>
<i>MALACH English</i></font> <font face="Times New Roman">
~ over 300 hours of English audio recordings of
interviews conducted under the auspices of the
USC Shoah Foundation Institute for Visual History and
Education and associated transcripts produced as part of
the Multilingual Access to Large Spoken ArCHives (MALACH)
project. The data was collected using table microphones.
Recordings are 2-channel, 128 kbps, 44.1 kHz mp2 files,
with a different speaker generally predominant in each
channel. </font></p>
</blockquote>
<p class="MsoNormal" style="margin-bottom: 12pt;"><font
face="Times New Roman"><br>
2011 Subscription Members are automatically sent all MY2011
data as it is released. 2011 Standard Members are entitled
to request 16 corpora for free from MY2011. Non-members
may license most data for research use.<br>
<br>
</font></p>
<div align="center"><font face="Times New Roman"><b>New Free
Publications</b></font></div>
<p class="MsoNormal"><font face="Times New Roman"> </font></p>
<p><a name="pos"> </a><font face="Times New Roman">(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T04">Indian
Language
Part-of-Speech Tagset: Sanskrit</a> was developed by
Microsoft Research (MSR) India to support
Part-of-Speech (POS) tagging and other data-driven
linguistic research on Indian Languages in general. It was
created as part of the <a
href="http://research.microsoft.com/en-us/groups/mls/default.aspx">Indian
Language
Part-of-Speech Tagset (IL-POST)</a> project, a
collaborative effort among linguists and computer scientists
from MSR India, AU-KBC (Anna University, Chennai), Delhi
University, IIT Bombay, Jawaharlal Nehru University (Delhi)
and Tamil University (Tamilnadu). </font></p>
<p><font face="Times New Roman">The goal of the IL-POST project
is to provide a common tagset framework for Indian Languages
that offers flexibility, cross-linguistic compatibility and
reusability across those languages. It supports a
three-level hierarchy of Categories, Types and Attributes.
The corpus therefore consists mainly of two levels
of information for each lexical token: (a) its lexical Category
and Type, and (b) a set of morphological attributes and their
associated values in context. </font></p>
<p class="MsoNormal" style="margin-bottom: 12pt;"><font
face="Times New Roman">This corpus contains 3,703 sentences
(57,218 words) of manually annotated Sanskrit text selected
from the <a
href="http://en.wikipedia.org/wiki/Panchatantra">Panchatantra</a>
stories, a collection of animal fables in verse and prose
dating from the third century BCE. All annotated data is
provided in both xml and text files. The xml files are
contained in the "XML_files" folder and the text files in
the "text_files" folder. Each data file contains between
12,000 and 45,000 words. The XML files contain metadata about
the material, such as language, encoding and data size.<br>
<br>
</font><font
face="Times New Roman">Non-members may license this data by
submitting a completed copy of the <a
href="http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Sanskrit_License_Agreement.htm">Microsoft
Research
India License Agreement</a>. The agreement can be faxed to
+1 215 573 2175 or scanned and emailed to this address.
This data is available at no charge.</font></p>
<p class="MsoNormal" style="text-align: center;" align="center"><font
face="Times New Roman">*</font></p>
<p><a name="onto"></a><font face="Times New Roman">(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T03">OntoNotes
Release
4.0</a> was developed as part of the OntoNotes project, a
collaborative effort between BBN Technologies, the
University of Colorado, the University of Pennsylvania and
the University of California's Information Sciences
Institute. The goal of the project is to annotate a large
corpus comprising various genres of text (news,
conversational telephone speech, weblogs, Usenet newsgroups,
broadcast, talk shows) in three languages (English, Chinese,
and Arabic) with structural information (syntax and
predicate argument structure) and shallow semantics (word
sense linked to an ontology and coreference).</font></p>
<p><font face="Times New Roman">OntoNotes Release 4.0 contains
the content of earlier releases -- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21">OntoNotes
Release
1.0 LDC2007T21</a>, <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04">OntoNotes
Release 2.0 LDC2008T04</a> and <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T24">OntoNotes
Release
3.0 LDC2009T24</a> -- and adds newswire, broadcast news,
broadcast conversation and web data in English and Chinese
and newswire data in Arabic. This cumulative publication
consists of 2.4 million words as follows: 300k words of
Arabic newswire; 250k words of Chinese newswire, 250k words
of Chinese broadcast news, 150k words of Chinese broadcast
conversation and 150k words of Chinese web text; and 600k
words of English newswire, 200k words of English broadcast
news, 200k words of English broadcast conversation and 300k
words of English web text. </font></p>
<p><font face="Times New Roman">The OntoNotes project builds on
two time-tested resources, following the <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Penn
Treebank</a> for syntax and the <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">Penn
PropBank</a> for predicate-argument structure. Its
semantic representation will include word sense
disambiguation for nouns and verbs, with each word sense
connected to an ontology, and coreference. </font></p>
<p class="MsoNormal"><font face="Times New Roman">Documents
describing the annotation guidelines and the routines for
deriving various views of the data from the database are
included in the documentation directory of this release. The
annotation is provided both in separate text files for each
annotation layer (Treebank, PropBank, word sense, etc.) and
in the form of an integrated relational database with a
Python API to provide convenient cross-layer access.<br>
<br>
</font><font face="Times New Roman">Non-members may request
this data by completing a copy of the </font> <font
face="Times New Roman"><a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User
Agreement for Non-Members</a>. The agreement can be faxed to
+1 215 573 2175 or scanned and emailed to this address.
This data is available at no charge, but is subject to
non-member shipping and handling fees.<br>
</font></p>
<hr width="100%" size="2"><br>
<div align="center">
<pre class="moz-signature" cols="72">Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
<p class="MsoNormal"><font face="Times New Roman"><br>
</font></p>
</div>
</div>
</body>
</html>