<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#ffffff">
<div class="moz-text-html" lang="x-western">
<p style="text-align: center;" align="center"><i>In this
newsletter:</i></p>
<p style="text-align: center;" align="center"><b>- <a>Spring
2011 LDC Data Scholarship
Program</a></b><b> -</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align:
center;" align="center"><i>New
publications:</i><br>
<br>
LDC2010T24<br>
<b>- <a>Indian
Language Part-of-Speech Tagset: Hindi </a></b><b>
-</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align:
center;" align="center"><span style="">LDC2010T22</span><br>
<b>- <a>Manually
Annotated Sub-Corpus First Release</a></b><b> -</b></p>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align:
center;" align="center">LDC2010T23<br>
<b>- </b><a>
<b>NIST 2009 Open Machine Translation
(OpenMT) Evaluation</b></a><b> -</b></p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr width="100%" align="center" size="2"></div>
<p style="text-align: center;" align="center"> <br>
<a name="data"></a><b>Spring 2011 LDC Data Scholarship
Program</b></p>
<p class="MsoNormal">Applications are now being accepted through
January 31, 2011 for the
Spring 2011 LDC Data Scholarship
program! The LDC Data Scholarship program provides university
students
with access to LDC data at no cost. LDC offered data
scholarships for
the
first time earlier this year. We received many strong
applications
from
students with a range of research interests. Our student
winners
received
no-cost copies of LDC data valued at over US$10,000. <br>
<br>
This program is open to students pursuing both undergraduate and
graduate
studies in an accredited college or university. LDC Data
Scholarships
are not
restricted to any particular field of study; however, students
must
demonstrate
a well-developed research agenda and a bona fide inability to
pay. <br>
<br>
The application consists of two parts: </p>
<blockquote>
<p class="MsoNormal" style="">(1) <em><b>Data Use Proposal</b></em>.
Applicants must submit a proposal
describing
their intended use of the data. The proposal must contain the
applicant's name,
university, and field of study. The proposal should state
which data
the
student plans to use and contain a description of their
research
project.
Students are advised to consult the <a
href="http://www.ldc.upenn.edu/Catalog/index.jsp">LDC Corpus
Catalog</a>
for a
complete list of data distributed by LDC. Because of certain
restrictions,
a
handful of LDC corpora are available only to members of the
Consortium. </p>
<p>(2) <em><b>Letter of Support</b></em>. Applicants must
submit one
letter of
support from their thesis adviser or department chair. The
letter must
confirm
that the department or university lacks the funding to pay the
full
Non-member
Fee for the data and verify the student's need for data.</p>
</blockquote>
<p>For further information on application materials and program
rules,
please
visit the <a
href="http://www.ldc.upenn.edu/About/scholarships.html">LDC
Data
Scholarship</a> page. </p>
<p>Students can email their applications to the <a
href="mailto:datascholarships@ldc.upenn.edu">LDC Data
Scholarship
program</a>.
Decisions will be sent by email from the same address.</p>
<p>The deadline for the Spring 2011 program cycle is January 31,
2011.<br>
</p>
<br>
<p style="text-align: center;" align="center"><b>New Publications</b></p>
<p><a name="hindi"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T24">Indian
Language
Part-of-Speech Tagset: Hindi</a> is a
corpus developed by Microsoft Research (MSR) India
to support the task of part-of-speech (POS) tagging and other
data-driven
linguistic research on Indian Languages in general. It was
created as
part of
the <a
href="http://research.microsoft.com/en-us/groups/mls/default.aspx">Indian
Language
Part-of-Speech Tagset (IL-POST)</a> project, a collaborative
effort
among linguists and computer scientists from MSR India, AU-KBC
(Anna
University, Chennai), Delhi University, IIT Bombay, Jawaharlal
Nehru
University
(Delhi) and Tamil University (Tamilnadu). </p>
<p>The goal of the IL-POST project is to provide a common tagset
framework for
Indian Languages that offers flexibility, cross-linguistic
compatibility and
reusability across those languages. It supports a three-level
hierarchy
of
Categories, Types and Attributes. The corpus therefore consists
mainly
of two
different levels of information for each lexical token: (a)
lexical
Category
and Types, and (b) a set of morphological attributes and their
associated
values in
context. </p>
<p class="MsoNormal">This corpus contains 4,859 sentences (98,450
words)
of
manually annotated Hindi text randomly collected from the
Microsoft
Hindi
Research Corpus, sourced from the publisher <a
href="http://www.webdunia.com/">WebDunia</a>.
All annotated data is provided in both XML and text files. The
XML
files are
contained in the "XML_files" folder and the text files in the
"text_files" folder. Each data file contains between 900 and 5,000
words.
The XML files contain metadata about the material, such as
language,
encoding
and data size. </p>
<p>The Annotation Guidelines for Hindi, included in this release,
contain a
detailed description of the annotation methodology. The
Annotation Tool
Guideline 1.0, also included in this publication, describes the
annotation
interface developed for the IL-POST framework; the tool is not
included
in this
corpus.</p>
<p>Non-members may license this data by submitting a
completed
copy of the <a
href="http://www.ldc.upenn.edu/Catalog/nonmem_agree/Indian_Language_POS_Tagset_Hindi_License_Agreement.htm">Microsoft
Research
India License Agreement</a>. The agreement can be faxed to +1
215 573
2175 or scanned and emailed to this address. This data is
available at
no
charge.</p>
<p align="center"> *</p>
<p class="MsoNormal" style=""><a name="masc"></a><span style="">(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T22">
Manually Annotated Sub-Corpus First Release (MASC I)</a> is
the first
of three
releases of 500,000 words of MASC data developed as part of
the <a href="http://www.americannationalcorpus.org/">American
National Corpus</a>
(ANC) project. MASC I consists of approximately 80,000 words
of
contemporary
spoken and written American English annotated for a variety of
linguistic
phenomena. The <a
href="http://www.americannationalcorpus.org/MASC/Home.html">MASC</a>
project is sponsored by the National Science Foundation and
was
established to
address, to the extent possible, many of the obstacles to the
creation
of
large-scale, robust, multiply-annotated corpora of English
covering a
wide
range of genres of written and spoken language data.
Researchers from Vassar College, Columbia University
and the International Computer Science Institute, University of California
at Berkeley
are the principal participants; the <a
href="http://wordnet.princeton.edu/">WordNet</a>
project provides consulting.</span></p>
<p class="MsoNormal" style=""><span style="">The
source texts in MASC I are drawn from the open portion of the
<a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T35">American
National
Corpus (ANC) Second Release LDC2005T35</a>, which includes
written
texts and spoken transcripts of American English from a <span
style=""> </span>broad
range of genres produced since 1990; and
from the <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T10">Language
Understanding
Annotation Corpus LDC2009T10</a> (LU Corpus), a
collection of
various genres including broadcast, newswire, email and
telephone
speech
annotated for committed belief, event and entity coreference,
dialog
acts and
temporal relations. All of the words of data in MASC I have
validated
annotations for token, part of speech, sentence boundary, noun
chunks,
verb
chunks, named entities and <a
href="http://www.cis.upenn.edu/%7Etreebank/">Penn
Treebank</a> syntax. Full-text <a
href="http://framenet.icsi.berkeley.edu/">FrameNet</a>
annotations are available for seventeen texts and WordNet word
sense
annotations are available for 1000 occurrences of each of
fifty-three
words.
Annotations of all or portions of the sub-corpus for a wide
variety of
other
linguistic phenomena have been contributed by other projects.
Software
and
services available from the <a
href="http://www.anc.org/MASC/Home.html">ANC
project website</a> enable transduction of MASC into a wide
variety of
physical
formats.</span></p>
<p class="MsoNormal" style=""><span style="">The
MASC directory contains two folders: "masc-1.0.3" and
"masc_wordsense". masc-1.0.3 contains the actual MASC corpus
and
consists of two folders, "spoken" and "written". The spoken
folder contains data and annotations for spoken material, and
the
written
folder contains the same for written texts. The files in each
of the
respective
folders have naming conventions that describe the contents of
the
file.
masc_wordsense contains the MASC sentence samples with word
sense
annotations
using WordNet sense numbers as the annotation values. </span></p>
<p>Non-members may request this data by completing a copy of the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User
Agreement for Non-Members</a>. The
agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this
address. This data is available at no
charge.</p>
<p style="text-align: center;" align="center"> <br>
<big>*</big></p>
<p><a name="mt09"></a>(3) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T23">NIST
2009
Open Machine Translation (OpenMT) Evaluation</a> is a
package containing source data, reference translations and
scoring
software
used in the NIST 2009 OpenMT evaluation. It is designed to help
evaluate the
effectiveness of machine translation systems. The package was
compiled
and
scoring software was developed by researchers at NIST, making
use of
broadcast,
newswire and web data and reference translations collected and
developed by
LDC. The 2009 task was to evaluate translation from Arabic to
English
and Urdu
to English.</p>
<p>This release contains<span style=""> </span>373 documents
with corresponding sets of four separate human expert reference
translations.
The source data consists of Arabic and Urdu broadcast,
newswire and
weblog
data collected by LDC in 2007 and 2009. The newswire and
broadcast
material is
from Asharq Al-Awsat (Arabic), Agence France-Presse (Arabic),
Al-Ahram
(Arabic), Al Hayat (Arabic), Assabah (Arabic), An Nahar
(Arabic),
Al-Quds
Al-Arabi (Arabic), Xinhua News Agency (Arabic), British
Broadcasting
Corporation (Urdu), Deutsche Welle (Urdu), Mehr News Agency
(Urdu) and
Voice of
America (Urdu). </p>
<p>For each language, the test set consists of two files: a source
file and
a
reference file. Each reference file contains four independent
reference translations
of the
data
set unless noted
otherwise in the accompanying README.txt. The evaluation year,
source language, test set (which, by
default,
is
"evalset"), version of the data, and source vs. reference file
(with
the latter being indicated by "-ref") are reflected in the file
name. </p>
<p>This evaluation kit includes scoring software. The data is
provided
in both
SGML and XML formats.<br>
</p>
<p>Non-members may request this data by completing a copy of the <a
href="http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf">LDC
User
Agreement for Non-Members</a>. The
agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this
address. This data is available for
US$150.</p>
<br>
<hr width="100%" size="2">
<br>
<div align="center">
<pre class="moz-signature" cols="72">Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></pre>
</div>
</div>
</body>
</html>