<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p class="MsoNormal" align="left"><b><b><a href="#fall">Fall 2014

            Data Scholarship Recipients</a></b></b><br>

      <b><br>

      </b><b><a href="#spring">Spring 2015 Data Scholarship Program</a><br>

      </b><br>

      <b><a href="#twitter">LDC is now on Twitter </a><br>

      </b></p>

    <i>New publications:</i>

    <p class="MsoNormal" align="left"><b><a href="#lies">Boulder Lies

          and Truth</a></b><b><br>

      </b><b><br>

      </b><b><a href="#galece">GALE Chinese-English Word Alignment and

          Tagging -- Broadcast Training Part 2</a></b><b><br>

      </b><b><br>

      </b><b><a href="#galep2">GALE Phase 2 Chinese Web Parallel Text</a></b></p>

    <hr size="2" width="100%">


    <p class="MsoNormal"><a name="fall"></a><b>Fall 2014 Data

        Scholarship Recipients</b><o:p></o:p></p>

    <p class="MsoNormal">LDC is pleased to announce the student

      recipients of the Fall 2014 <a

href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships">LDC

Data

        Scholarship program</a>. The following

      students have received no-cost copies of LDC data:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">Mohammed Abumatar ~ University of Jordan

        (Jordan), BSc candidate, Computer Engineering.  Mohammed has

        been awarded copies of MADCAT Phase 1-3 Training Data for his

        work in handwriting recognition.<br>

        <br>

        Ramy Baly ~ American University of Beirut (Lebanon), PhD

        candidate, Electrical and Computer Engineering.  Ramy has been

        awarded copies of Arabic Treebank Parts 1-3 for his work in

        opinion mining.<br>

        <br>

        Abbas Khosravanai ~ Amirkabir University of Technology (Iran),

        PhD candidate, Computer Engineering.  Abbas has been awarded a

        copy of 2008 NIST Speaker Recognition for his work in robust

        speaker recognition.<br>

        <br>

        Phuc Nguyen ~ University of North Texas (USA), PhD candidate,

        Computer Science and Engineering.  Phuc has been awarded a copy

        of Message Understanding Conference (MUC) 7 for his work in

        named entity recognition.<o:p></o:p></p>

    </blockquote>


    <p class="MsoNormal"><br>

      <a name="spring"></a><b>Spring 2015 Data Scholarship Program</b><o:p></o:p></p>

    <p class="MsoNormal">Applications are now being accepted through

      Thursday, January 15, 2015, 11:59PM EST for the Spring 2015 LDC

      Data Scholarship program. The LDC Data Scholarship program

      provides university students with access to LDC data at no cost.

      During previous program cycles, LDC has awarded no-cost copies of

      LDC data to over 40 individual students and student research

      groups. This program is open to students pursuing both

      undergraduate and graduate studies in an accredited college or

      university. LDC Data Scholarships are not restricted to any

      particular field of study; however, students must demonstrate a

      well-developed research agenda and a bona fide inability to pay. <o:p></o:p></p>

    <p class="MsoNormal"><br>

      The application consists of two parts: <br>

      <br>

      (1) Data Use Proposal. Applicants must submit a proposal

      describing their intended use of the data. The proposal should

      state which data the student plans to use and how the data will

      benefit their research project as well as information on the

      proposed methodology or algorithm.<br>

      <br>

      (2) Letter of Support. Applicants must submit one letter of

      support from their thesis adviser or department chair. The letter

      must verify the student's need for data and confirm that the

      department or university lacks the funding to pay the full

      non-member fee for the data or to join the Consortium. <br>

      <br>

      For further information on application materials and program

      rules, please visit the <a

href="https://www.ldc.upenn.edu/language-resources/data/data-scholarships"

        target="_blank">LDC Data Scholarship</a> page. <br>

      <br>

      Students can email their applications to the <a

        href="mailto:datascholarships@ldc.upenn.edu">LDC Data

        Scholarship program</a>. Decisions will be sent by email from

      the same address.<br>

      <br>

      The deadline for the Spring 2015 program cycle is January 15,

      2015, 11:59PM EST.<o:p></o:p></p>

    <p class="MsoNormal"><br>

      <a name="twitter"></a><b>LDC is now on Twitter </b><br>

      <br>

      LDC now has a Twitter <a href="https://twitter.com/LDCupenn">feed</a>.

      Start following us today for updates on new corpora releases and

      the latest LDC news.<o:p></o:p></p>

    <p class="MsoNormal"><br>

      <br>

      <b>New publications</b><br>

      <br style="mso-special-character:line-break">

      <a name="lies"></a>(1) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T24">Boulder Lies and

        Truth</a> was developed at the University of Colorado Boulder

      and contains approximately 1,500 elicited English reviews of

      hotels and electronics for the purpose of studying deception in

      written language. Reviews were collected by crowd-sourcing with

      Amazon Mechanical Turk.<o:p></o:p></p>

    <p class="MsoNormal">Each review was required to be original and was

      checked for plagiarism against the web. Reviews were annotated

      with respect to the following three dimensions:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">Domain: Electronics (e.g., iPhone) or Hotels<o:p></o:p></p>

      <p class="MsoNormal">Sentiment: Positive or Negative<o:p></o:p></p>

    </blockquote>

    <p class="MsoNormal">Truth Value:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">a) Truthful: a review about an object known

        by the writer reflecting the real sentiment of the writer toward

        the object of the review<o:p></o:p></p>

      <p class="MsoNormal">b) Opposition: a review about

        an object known by the writer reflecting the opposite sentiment

        of the writer toward the object of the review (i.e., if the

        writer liked the object they were asked to write a negative

        review; if the writer did not like the object, they were asked

        to write a positive review)<o:p></o:p></p>

      <p class="MsoNormal">c) Deceptive (i.e., fabricated): a review

        written about an object not known by the writer either positive

        or negative in sentiment; the objects reviewed were provided via

        a URL from the tasks in (a) and (b)<o:p></o:p></p>

      <p class="MsoNormal">Each review was judged a total of 30 times:

        (1) 10 times to evaluate its perceived quality (on a scale from

        1 to 5); (2) 10 times to judge its perceived

        truthfulness (e.g., truthful or somehow deceptive, a lie or a

        fabrication); and (3) 10 times for its perceived sentiment

        (i.e., star rating).<o:p></o:p></p>

    </blockquote>

    <p class="MsoNormal">This data is available at no cost under this <a

href="https://catalog.ldc.upenn.edu/license/boulder-lies-and-truth.pdf">user

license

        agreement</a>.<br>

      <o:p></o:p></p>

    <p class="MsoNormal" align="center">*</p>

    <p class="MsoNormal"><a name="galece"></a>(2) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T25">GALE

        Chinese-English Word Alignment and Tagging -- Broadcast Training

        Part 2</a> was developed by LDC and contains 65,069 tokens of

      word-aligned Chinese and English parallel text enriched with

      linguistic tags. This material was used as training data in the

      DARPA GALE (Global Autonomous Language Exploitation) program.<o:p></o:p></p>

    <p class="MsoNormal">Some approaches to statistical machine

      translation include the incorporation of linguistic knowledge in

      word-aligned text as a means to improve automatic word alignment

      and machine translation quality. This is accomplished with two

      annotation schemes: alignment and tagging. Alignment identifies

      minimum translation units and translation relations by using

      minimum-match and attachment annotation approaches. A set of word

      tags and alignment link tags are designed in the tagging scheme to

      describe these translation units and relations. Tagging adds

      contextual, syntactic and language-specific features to the

      alignment annotation.<o:p></o:p></p>

    <p class="MsoNormal">This release consists of Chinese source

      broadcast conversation (BC) programming collected by LDC in 2008.

      <o:p></o:p></p>

    <p class="MsoNormal">The Chinese word alignment tasks consisted of

      the following components:<o:p></o:p></p>

    <blockquote>

      <p class="MsoNormal">Identifying, aligning, and tagging eight

        different types of links<o:p></o:p></p>

      <p class="MsoNormal">Identifying, attaching, and tagging

        local-level unmatched words<o:p></o:p></p>

      <p class="MsoNormal">Identifying and tagging

        sentence/discourse-level unmatched words<o:p></o:p></p>

      <p class="MsoNormal">Identifying and tagging all instances of

        Chinese 的(DE) except when they were a part of a semantic link<o:p></o:p></p>

    </blockquote>

    <p class="MsoNormal" align="center">*<br>

    </p>

    <p class="MsoNormal"><a name="galep2"></a>(3) <a

        href="https://catalog.ldc.upenn.edu/LDC2014T26">GALE Phase 2

        Chinese Web Parallel Text</a> was developed by LDC. Along with

      other corpora, the parallel text in this release comprised

      training data for Phase 2 of the DARPA GALE (Global Autonomous

      Language Exploitation) Program. This corpus contains Chinese

      source text and corresponding English translations selected from

      weblog and newsgroup data collected by LDC and translated by LDC

      or under its direction.<o:p></o:p></p>

    <p class="MsoNormal">This release includes 46 source-translation

      document pairs, comprising 66,779 tokens of translated data. Data

      is drawn from four Chinese weblog and newsgroup sources.<o:p></o:p></p>

    <p class="MsoNormal">Data was manually selected for translation

      according to several criteria, including linguistic features and

      topic features. The files were formatted into a human-readable

      translation format and assigned to translation vendors.

      Translators followed LDC's Chinese to English translation

      guidelines. Bilingual LDC staff performed quality control

      procedures on the completed translations.<o:p></o:p></p>

    <br>

    <hr size="2" width="100%"><br>

    <pre class="moz-signature" cols="72">-- 

--

Ilya Ahtaridis

Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                  Phone: 1 (215) 573-1275

University of Pennsylvania                    Fax: 1 (215) 573-2175

3600 Market St., Suite 810                        <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

Philadelphia, PA 19104 USA                 <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

  </body>

</html>