<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    <div class="moz-text-html" lang="x-western">

      <div class="moz-text-html" lang="x-western">

        <p style="text-align: center;" align="center"><font face="Times

            New Roman, Times, serif"><i>In this newsletter:</i><b><br>

            </b></font></p>

        <p style="text-align: center;" align="center"><font face="Times

            New Roman, Times, serif"><b>-   </b></font><b><font

              face="Times New Roman, Times, serif"><a href="#pipeline">Publications

                Pipeline for 2011</a></font></b><font face="Times New

            Roman, Times, serif"><b>   -</b></font><font face="Times New

            Roman, Times, serif"><b> </b></font></p>

        <p style="text-align: center;" align="center"><font face="Times

            New Roman, Times, serif"><i>New free publications:</i><b><br>

            </b></font></p>

        <p style="text-align: center;" align="center"><font face="Times

            New Roman, Times, serif"><b>-   </b></font><font
              face="Times New Roman, Times, serif"><b><a href="#pos">Indian
                  Language Part-of-Speech Tagset: Sanskrit</a></b></font><font

            face="Times New Roman, Times, serif"><b>   -</b></font><font

            face="Times New Roman, Times, serif"><b> </b></font></p>

        <p style="text-align: center;" align="center"><font face="Times

            New Roman, Times, serif"><b>-   </b><b><a href="#onto">OntoNotes

                4.0</a></b></font><font face="Times New Roman, Times,

            serif"><b>   </b></font><font face="Times New Roman, Times,

            serif"><b>-</b></font></p>

        <hr width="100%" size="2">

        <p style="text-align: center;" align="center"><a name="pipeline"></a><font

            face="Times New Roman"><b>Publications Pipeline for 2011</b></font></p>

        <p><font face="Times New Roman">LDC is pleased to provide the
            following information on our planned releases for Membership
            Year 2011 (MY2011). We would also like to remind our data
            users that there is still time to save on membership fees
            for MY2011, but time is quickly running out! Any organization

            which joins or renews membership for 2011 through Tuesday,

            March 1, 2011, is entitled to a 5% discount on membership

            fees.  Organizations which held membership for MY2010 can

            receive a 10% discount on fees provided they renew prior to

            March 1, 2011.</font></p>

        <p style="margin-bottom: 12pt;"><font face="Times New Roman">Many

            publications for MY2011 are still in development, but we

            plan to release updates to some of our popular Gigaword

            corpora as well as new speech corpora.  Please note that the

            list is tentative and subject to modifications.  Our planned

            publications for this year include:</font></p>

        <blockquote>

          <p class="MsoNormal"><font face="Times New Roman"><i>2005 NIST

                Speaker Recognition Evaluation</i> ~ the 2005 data from

              the ongoing series of yearly evaluations conducted by NIST

              (National Institute of Standards and Technology). These

              evaluations provide an important contribution to the

              direction of research efforts and the calibration of

              technical capabilities. They are intended to be of

              interest to all researchers working on the general problem

              of text-independent speaker recognition. </font></p>

          <p><font face="Times New Roman"><i>Arabic Gigaword Fifth

                Edition</i> ~ LDC’s Arabic newswire collection from 2009

              and 2010 as well as the contents of Arabic Gigaword Fourth

              Edition (LDC2009T30).  The news sources represented

              include Agence France Presse, An Nahar, Al Hayat, Al-Quds

              Al-Arabi, Asharq Al-Awsat, Assabah, Al-Ahram, Ummah Press
              and Xinhua News Agency.</font></p>

          <p class="MsoNormal"><font face="Times New Roman"><i>Chinese

                Gigaword Fifth Edition </i>~ LDC’s Chinese newswire

              collection from 2009 and 2010 as well as the contents of
              Chinese Gigaword Fourth Edition

              (LDC2009T27).  The news sources represented include Agence

              France Presse, Central News Agency (Taiwan), Xinhua News

              Agency, Zaobao, People's Liberation Army Daily, People’s

              Daily, Guangming Daily and China News Service.<br>

              <br>

              <i>Digital Archive of Southern Speech</i></font> <font

              face="Times New Roman"> ~ a geographical sampling of

              colloquial speech in the Southern United States. Samples

              of speech were collected through interviews of single

              subjects speaking on a variety of common topics like

              family, the weather, household articles and activities,

              agriculture, and social connections. Speakers range in age

              from 15 to 90, with an average age of 61.<br>

              <br>

              <i>English Gigaword Fifth Edition</i></font> <font

              face="Times New Roman"> ~ LDC’s English newswire
              collection from 2009 and 2010 as well
              as the contents of English

              Gigaword Fourth Edition (LDC2009T13).  The news sources

              represented include Agence France Presse, Associated

              Press, Central News Agency (Taiwan), New York Times, Washington

              Post, Los Angeles Times and Xinhua News Agency.<br>

              <br>

              <i>MALACH English</i></font> <font face="Times New Roman">

              ~ over 300 hours of English audio recordings of
              interviews conducted under the auspices of the

              USC Shoah Foundation Institute for Visual History and

              Education and associated transcripts produced as part of

              the Multilingual Access to Large Spoken ArCHives (MALACH)

              project.  The data was collected using table microphones. 

              Recordings are 2-channel, 128 kbps, 44.1 kHz mp2 files,

              with a different speaker generally predominant in each

              channel.  </font></p>

        </blockquote>

        <p class="MsoNormal" style="margin-bottom: 12pt;"><font

            face="Times New Roman"><br>

            2011 Subscription Members are automatically sent all MY2011

            data as it is released.  2011 Standard Members are entitled

            to request 16 corpora for free from MY2011.   Non-members

            may license most data for research use.<br>

            <br>

          </font></p>

        <div align="center"><font face="Times New Roman"><b>New Free

              Publications</b></font></div>

        <p class="MsoNormal"><font face="Times New Roman"> </font></p>

        <p><a name="pos"> </a><font face="Times New Roman">(1) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T04">Indian

Language

              Part-of-Speech Tagset: Sanskrit</a> was developed by

            Microsoft Research (MSR) India to support
            Part-of-Speech (POS) tagging and other data-driven
            linguistic research on Indian Languages in general. It was
            created as part of the <a

              href="http://research.microsoft.com/en-us/groups/mls/default.aspx">Indian

Language

              Part-of-Speech Tagset (IL-POST)</a> project, a

            collaborative effort among linguists and computer scientists

            from MSR India, AU-KBC (Anna University, Chennai), Delhi

            University, IIT Bombay, Jawaharlal Nehru University (Delhi)

            and Tamil University (Tamilnadu). </font></p>

        <p><font face="Times New Roman">The goal of the IL-POST project

            is to provide a common tagset framework for Indian Languages

            that offers flexibility, cross-linguistic compatibility and
            reusability across those languages. It supports a
            three-level hierarchy of Categories, Types and Attributes.
            The corpus therefore consists mainly of two levels of
            information for each lexical token: (a) the lexical Category
            and Type, and (b) a set of morphological attributes and their
            associated values in context. </font></p>

        <p class="MsoNormal" style="margin-bottom: 12pt;"><font

            face="Times New Roman">This corpus contains 3,703 sentences

            (57,218 words) of manually annotated Sanskrit text selected

            from the <a

              href="http://en.wikipedia.org/wiki/Panchatantra">Panchatantra</a>
            stories, a collection of animal fables in verse and prose
            dating from the third century BCE. All annotated data is
            provided in both XML and text files. The XML files are
            contained in the "XML_files" folder and the text files in
            the "text_files" folder. Each data file contains between
            12,000 and 45,000 words. The XML files contain metadata about
            the material, such as language, encoding and data size.<br>

            <br>

            </font><font
            face="Times New Roman">Non-members may license this data by

            submitting a completed copy of the <a

href="http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Sanskrit_License_Agreement.htm">Microsoft

Research

              India License Agreement</a>. The agreement can be faxed to

            +1 215 573 2175 or scanned and emailed to this address. 

            This data is available at no charge.</font></p>

        <p class="MsoNormal" style="text-align: center;" align="center"><font

            face="Times New Roman">*</font></p>

        <p><a name="onto"></a><font face="Times New Roman">(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T03">OntoNotes

Release

              4.0</a> was developed as part of the OntoNotes project, a

            collaborative effort between BBN Technologies, the

            University of Colorado, the University of Pennsylvania and

            the University of Southern California's Information Sciences

            Institute. The goal of the project is to annotate a large

            corpus comprising various genres of text (news,

            conversational telephone speech, weblogs, usenet newsgroups,

            broadcast, talk shows) in three languages (English, Chinese,

            and Arabic) with structural information (syntax and

            predicate argument structure) and shallow semantics (word

            sense linked to an ontology and coreference).</font></p>

        <p><font face="Times New Roman">OntoNotes Release 4.0 contains

            the content of earlier releases -- <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21">OntoNotes

Release

              1.0 LDC2007T21</a>, <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04">OntoNotes

              Release 2.0 LDC2008T04</a> and <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T24">OntoNotes

Release

              3.0 LDC2009T24</a> -- and adds newswire, broadcast news,

            broadcast conversation and web data in English and Chinese

            and newswire data in Arabic. This cumulative publication

            consists of 2.4 million words as follows: 300k words of

            Arabic newswire; 250k words of Chinese newswire, 250k words

            of Chinese broadcast news, 150k words of Chinese broadcast

            conversation and 150k words of Chinese web text; and 600k

            words of English newswire, 200k words of English broadcast

            news, 200k words of English broadcast conversation and 300k

            words of English web text. </font></p>

        <p><font face="Times New Roman">The OntoNotes project builds on

            two time-tested resources, following the <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42">Penn

              Treebank</a> for syntax and the <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T14">Penn

              PropBank</a> for predicate-argument structure. Its

            semantic representation will include word sense

            disambiguation for nouns and verbs, with each word sense

            connected to an ontology, and coreference. </font></p>

        <p class="MsoNormal"><font face="Times New Roman">Documents

            describing the annotation guidelines and the routines for

            deriving various views of the data from the database are

            included in the documentation directory of this release. The

            annotation is provided both in separate text files for each

            annotation layer (Treebank, PropBank, word sense, etc.) and

            in the form of an integrated relational database with a

            Python API to provide convenient cross-layer access.<br>

            <br>

          </font><font face="Times New Roman">Non-members may request

            this data by completing a copy of the </font> <font

            face="Times New Roman"><a

href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User

              Agreement for Non-Members</a>.  The agreement can be faxed to

            +1 215 573 2175 or scanned and emailed to this address. 

            This data is available at no charge, but is subject to

            non-member shipping and handling fees.<br>

          </font></p>

        <hr width="100%" size="2"><br>

        <div align="center">

          <pre class="moz-signature" cols="72">Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

        </div>


      </div>

    </div>


  </body>

</html>