<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    <div class="moz-text-html" lang="x-western">

      <p style="text-align: center;" align="center"><i>In this

          newsletter:</i></p>

      <p style="text-align: center;" align="center"><b>-  <a>Spring

            2011 LDC Data Scholarship

            Program</a></b><b>  -</b></p>

      <p class="MsoNormal" style="margin-bottom: 12pt; text-align:

        center;" align="center"><i>New

          publications:</i><br>

        <br>

        LDC2010T24<br>

        <b>-  <a>Indian

            Language Part-of-Speech Tagset: Hindi </a></b><b>

          -</b></p>

      <p class="MsoNormal" style="margin-bottom: 12pt; text-align:

        center;" align="center"><span style="">LDC2010T22</span><br>

        <b>-  <a>Manually

            Annotated Sub-Corpus First Release</a></b><b>  -</b></p>

      <p class="MsoNormal" style="margin-bottom: 12pt; text-align:

        center;" align="center">LDC2010T23<br>

        <b>-  </b><a>

          <b>NIST 2009 Open Machine Translation

            (OpenMT) Evaluation</b></a><b>  -</b></p>

      <div class="MsoNormal" style="text-align: center;" align="center">

        <hr width="100%" align="center" size="2"></div>

      <p style="text-align: center;" align="center"> <br>

        <a name="data"></a><b>Spring 2011 LDC Data Scholarship

          Program</b></p>

      <p class="MsoNormal">Applications are now being accepted through

        January 31, 2011 for the

        Spring 2011 LDC Data Scholarship

        program!  The LDC Data Scholarship program provides university

        students

        with access to LDC data at no cost.  LDC offered data

        scholarships for

        the

        first time earlier this year.  We received many strong

        applications

        from

        students with a range of research interests.  Our student

        winners

        received

        no-cost copies of LDC data valued at over US$10,000.  <br>

        <br>

        This program is open to students pursuing both undergraduate and

        graduate

        studies in an accredited college or university. LDC Data

        Scholarships

        are not

        restricted to any particular field of study; however, students

        must

        demonstrate

        a well-developed research agenda and a bona fide inability to

        pay.  <br>

        <br>

        The application consists of two parts: </p>

      <blockquote>

        <p class="MsoNormal" style="">(1) <em><b>Data Use Proposal</b></em>.

          Applicants must submit a proposal

          describing

          their intended use of the data. The proposal must contain the

          applicant's name,

          university, and field of study. The proposal should state

          which data

          the

          student plans to use and contain a description of their

          research

          project. 

          Students are advised to consult the <a

            href="http://www.ldc.upenn.edu/Catalog/index.jsp">LDC Corpus

            Catalog</a>

          for a

          complete list of data distributed by LDC. Due to certain

          restrictions,

          a

          handful of LDC corpora are available only to members of the

          Consortium. </p>

        <p>(2) <em><b>Letter of Support</b></em>. Applicants must

          submit one

          letter of

          support from their thesis adviser or department chair. The

          letter must

          confirm

          that the department or university lacks the funding to pay the

          full

          Non-member

          Fee for the data and verify the student's need for data.</p>

      </blockquote>

      <p>For further information on application materials and program

        rules,

        please

        visit the <a

          href="http://www.ldc.upenn.edu/About/scholarships.html">LDC

          Data

          Scholarship</a> page.  </p>

      <p>Students can email their applications to the <a

          href="mailto:datascholarships@ldc.upenn.edu">LDC Data

          Scholarship

          program</a>.

        Decisions will be sent by email from the same address.</p>

      <p>The deadline for the Spring 2011 program cycle is January 31,

        2011.<br>

      </p>

      <br>

      <p style="text-align: center;" align="center"><b>New Publications</b></p>

      <p><a name="hindi"></a>(1)  <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T24">Indian

Language

          Part-of-Speech Tagset: Hindi</a> is a

        corpus developed by Microsoft Research (MSR) India

        to support the task of Part-of-Speech Tagging (POS) and other

        data-driven

        linguistic research on Indian Languages in general. It was

        created as a

        part of

        the <a

          href="http://research.microsoft.com/en-us/groups/mls/default.aspx">Indian

Language

          Part-of-Speech Tagset (IL-POST)</a> project, a collaborative

        effort

        among linguists and computer scientists from MSR India, AU-KBC

        (Anna

        University, Chennai), Delhi University, IIT Bombay, Jawaharlal

        Nehru

        University

        (Delhi) and Tamil University (Tamilnadu). </p>

      <p>The goal of the IL-POST project is to provide a common tagset

        framework for

        Indian Languages that offers flexibility, cross-linguistic

        compatibility and

        reusability across those languages. It supports a three-level

        hierarchy

        of

        Categories, Types and Attributes. The corpus therefore consists

        mainly of two

        different levels of information for each lexical token: (a)

        lexical

        Category

        and Types, and (b) a set of morphological attributes and their

        associated

        values in

        the context. </p>

      <p class="MsoNormal">This corpus contains 4859 sentences (98,450

        words)

        of

        manually annotated Hindi text randomly collected from the

        Microsoft

        Hindi

        Research Corpus, sourced from the publisher <a

          href="http://www.webdunia.com/">WebDunia</a>.

        All annotated data is provided in both XML and text files. The

        XML

        files are

        contained in the "XML_files" folder and the text files in the

        "text_files" folder. Each data file contains between 900 and 5,000

        words.

        The XML file contains metadata about the material, such as

        language,

        encoding

        and data size. </p>

      <p>The Annotation Guidelines for Hindi, included in this release,

        contain a

        detailed description of the annotation methodology. The

        Annotation Tool

        Guideline 1.0, also included in this publication, describes the

        annotation

        interface developed for the IL-POST framework; the tool is not

        included

        in this

        corpus.</p>

      <p>Non-members may license this data by submitting a

        completed

        copy of the <a

href="http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Hindi_License_Agreement.htm">Microsoft

Research

          India License Agreement</a>. The agreement can be faxed to +1

        215 573

        2175 or scanned and emailed to this address.  This data is

        available at

        no

        charge.</p>

      <p align="center"> *</p>

      <p class="MsoNormal" style=""><a name="masc"></a><span style="">(2) <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T22">

            Manually Annotated Sub-Corpus First Release (MASC I)</a> is

          the first

          of three

          releases of 500,000 words of MASC data developed as part of

          the <a href="http://www.americannationalcorpus.org/">American

            National Corpus</a>

          (ANC) project. MASC I consists of approximately 80,000 words

          of

          contemporary

          spoken and written American English annotated for a variety of

          linguistic

          phenomena. The <a

            href="http://www.americannationalcorpus.org/MASC/Home.html">MASC</a>

          project is sponsored by the National Science Foundation and

          was

          established to

          address, to the extent possible, many of the obstacles to the

          creation

          of

          large-scale, robust, multiply-annotated corpora of English

          covering a

          wide

          range of genres of written and spoken language data.

          Researchers from </span><span style="">Vassar</span><span

          style=""> </span><span style="">College</span><span style="">,

        </span><span style="">Columbia</span><span style=""> </span><span

          style="">University</span><span style="">

          and the International Computer Science Institute, </span><span

          style="">University</span><span style=""> of </span><span

          style="">California</span><span style="">

          at </span><span style="">Berkeley</span><span style="">

          are the principal participants; the <a

            href="http://wordnet.princeton.edu/">WordNet</a>

          project provides consulting.</span></p>

      <p class="MsoNormal" style=""><span style="">The

          source texts in MASC I are drawn from the open portion of the

          <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">American

National

            Corpus (ANC) Second Release LDC2005T35</a>, which includes

          written

          texts and spoken transcripts of American English from a <span

            style=""> </span>broad

          range of genres produced since 1990; and

          from the <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10">Language

Understanding

            Annotation Corpus LDC2009T10</a> (LU Corpus), a

          collection of

          various genres including broadcast, newswire, email and

          telephone

          speech

          annotated for committed belief, event and entity coreference,

          dialog

          acts and

          temporal relations. All of the words of data in MASC I have

          validated

          annotations for token, part of speech, sentence boundary, noun

          chunks,

          verb

          chunks, named entities and <a

            href="http://www.cis.upenn.edu/%7Etreebank/">Penn

            Treebank</a> syntax. Full-text <a

            href="http://framenet.icsi.berkeley.edu/">FrameNet</a>

          annotations are available for seventeen texts and WordNet word

          sense

          annotations are available for 1000 occurrences of each of

          fifty-three

          words.

          Annotations of all or portions of the sub-corpus for a wide

          variety of

          other

          linguistic phenomena have been contributed by other projects.

          Software

          and

          services available from the <a

            href="http://www.anc.org/MASC/Home.html">ANC

            project website</a> enable transduction of MASC into a wide

          variety of

          physical

          formats.</span></p>

      <p class="MsoNormal" style=""><span style="">The

          MASC directory contains two folders: "masc-1.0.3" and

          "masc_wordsense". masc-1.0.3 contains the actual MASC corpus

          and

          consists of two folders, "spoken" and "written". The spoken

          folder contains data and annotations for spoken material, and

          the

          written

          folder contains the same for written texts. The files in each

          of the

          respective

          folders have naming conventions that describe the contents of

          the

          file. 

          masc_wordsense contains the MASC sentence samples with word

          sense

          annotations

          using WordNet sense numbers as the annotation values. </span></p>

      Non-members may request this data by completing a copy of the <a

href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User

        Agreement for Non-Members</a>.<span style="">  </span>The

      agreement can be faxed to +1 215 573 2175 or scanned and emailed to

      this

      address.<span style="">  </span>This data is available at no

      charge.

      <p style="text-align: center;" align="center"> <br>

        <big>*</big></p>

      <p><a name="mt09"></a>(3)  <a

href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T23">NIST

2009

          Open Machine Translation (OpenMT) Evaluation</a> is a

        package containing source data, reference translations and

        scoring

        software

        used in the NIST 2009 OpenMT evaluation. It is designed to help

        evaluate the

        effectiveness of machine translation systems. The package was

        compiled

        and

        scoring software was developed by researchers at NIST, making

        use of

        broadcast,

        newswire and web data and reference translations collected and

        developed by

        LDC. The 2009 task was to evaluate translation from Arabic to

        English

        and Urdu

        to English.</p>

      <p>This release contains<span style="">  </span>373 documents

        with corresponding sets of four separate human expert reference

        translations.

        The source data consists of Arabic and Urdu broadcast,

        newswire and

        weblog

        data collected by LDC in 2007 and 2009. The newswire and

        broadcast

        material is

        from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic),

        Al-Ahram

        (Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar

        (Arabic),

        Al-Quds

        Al-Arabi (Arabic), Xinhua News Agency (Arabic), British

        Broadcasting

        Corporation (Urdu), Deutsche Welle (Urdu), Mehr News Agency

        (Urdu) and

        Voice of

        America (Urdu). </p>

      <p>For each language, the test set consists of two files: a source

        and

        a

        reference file. Each reference file contains four independent translations

        of the

        data

        set. The evaluation year, source language, test set (which, by

        default,

        is

        "evalset"), version of the data, and source vs. reference file

        (with

        the latter being indicated by "-ref") are reflected in the file

        name.

        A reference file contains four independent reference

        translations

        unless noted

        otherwise in the accompanying README.txt. </p>

      <p>This evaluation kit includes scoring software. The data is

        provided

        in both

        SGML and XML formats.<br>

      </p>

      Non-members may request this data by completing a copy of the <a

href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC

User

        Agreement for Non-Members</a>.<span style="">  </span>The

      agreement can be faxed to +1 215 573 2175 or scanned and emailed to

      this

      address.<span style="">  </span>This data is available for

      US$150.<br>

      <br>

      <hr width="100%" size="2">

      <br>

      <div align="center">

        <pre class="moz-signature" cols="72">Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>

      </div>

    </div>

  </body>

</html>