<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#ffffff">
<div class="moz-text-html" lang="x-western">
<div align="center"><i>New publications:</i></div>
<p class="MsoNormal" style="text-align: center;" align="center">LDC2012T02<br>
<b> - <a href="#tb">English Translation Treebank: An Nahar
Newswire</a> -</b></p>
<p class="MsoNormal" style="text-align: center;" align="center">LDC2012S04<br>
<b> - <a href="#malto">Malto Speech and Transcripts</a> -</b></p>
<hr width="100%" size="2"><br>
<p class="MsoNormal" style="margin-bottom: 12pt; text-align:
center;" align="center"><b>New Publications</b></p>
<p class="MsoNormal" style=""><a name="tb"></a>(1) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T02">English
Translation Treebank: An Nahar Newswire</a> was developed by
LDC and consists of 599 distinct newswire stories from the
Lebanese publication An Nahar translated from Arabic to English
and annotated for part-of-speech and syntactic structure. </p>
<p class="MsoNormal" style="">This corpus is part of an ongoing
effort at LDC to produce parallel Arabic and English treebanks.
The guidelines followed for both part-of-speech and syntactic
annotation are Penn Treebank II style, with three sets of changes:
revised tokenization of hyphenated words; the part-of-speech and
tree changes necessitated by that tokenization; and revisions to
the syntactic annotation to comply with the updated annotation
guidelines (including the "Treebank-PropBank merge" or
"Treebank IIa" and "treebank c" changes). The original Penn
Treebank II guidelines, addenda describing changes to the
guidelines and the tokenization specifications can be found on
LDC's <a
href="http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/">website</a>.</p>
<p class="MsoNormal" style="">The data consists of 461,489 tokens
in 599 individual files. The news stories in this release were
published in An Nahar in 2002.</p>
<p class="MsoNormal" style="">The English source files
(translated from the Arabic) were automatically tokenized,
part-of-speech tagged and parsed; the tokens, tags and parses
were manually corrected. The quality control process consisted
of a series of specific searches for over 100 types of potential
inconsistency and parse or annotation error. Any errors found in
those searches were manually corrected. </p>
<p class="MsoNormal" style="">Annotations are in the following two
formats:</p>
<ul type="disc">
<li class="MsoNormal" style="line-height: normal;">Penn Style
Trees </li>
<ul type="circle">
<li class="MsoNormal" style="line-height: normal;">Bracketed
tree files following the basic form (NODE (TAG token)). Each
sentence is surrounded by a pair of empty parentheses.</li>
</ul>
<li class="MsoNormal" style="line-height: normal;">AG xml </li>
<ul type="circle">
<li class="MsoNormal" style="line-height: normal;">TreeEditor
.xml stand-off annotation files. These files contain the POS
and Treebank annotation and reference the source files by
character offset. The DTD files for the AG xml files have been
moved from the original location indicated in the readme, for
consistency with other LDC publications.</li>
</ul>
</ul>
<div align="center">*<br>
</div>
<p class="MsoNormal" style=""><a name="malto"></a>(2) <a
href="http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012S04">Malto
Speech and Transcripts</a> was developed by Masato Kobayashi,
Associate Professor in Linguistics at the University of Tokyo
(Japan), and Bablu Tirkey, research scholar at the Tribal and
Regional Languages Department, Ranchi University (India). It
contains approximately 8 hours of Malto speech data collected
between 2005 and 2009 from 27 speakers (22 male, 5 female).
Also included are accompanying transcripts, English translations
and glosses for 6 hours of the collection. Speakers were asked
to talk about themselves, their lives, rituals and folklore;
elicitation interviews were then conducted. The goal of the work
was to present the current state and dialectal variation of
Malto.</p>
<p class="MsoNormal" style="">Malto is a Dravidian language spoken
in northeastern India (principally the states of Bihar,
Jharkhand and West Bengal) and Bangladesh by people called the
Pahariyas. Indian census data places the number of Malto
speakers between 100,000 and 200,000.
Most Malto speakers live in the three northeastern districts of
Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that
resulted in this corpus was conducted in those districts. Of the
Pahariyas in that area, three subtribes, the Sawriya Pahariyas,
the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak
Malto. </p>
<p class="MsoNormal" style="">The transcribed data accounts for 6
hours of the collection and contains 21 speakers (17 male, 4
female). The untranscribed data accounts for 2 hours of the
collection and contains 10 speakers (9 male, 1 female). Four of
the male speakers are present in both groups.</p>
<p class="MsoNormal" style="">All audio is presented in .wav
format. Each audio file name includes a subject number, village
name, speaker name and the topic discussed. The transcripts and
glossary are UTF-8 text files. Because of ambiguities that occur
when writing Malto in Devanagari script, the transcripts were
developed using Roman script with symbols adapted from the
International Phonetic Alphabet (IPA) but are not considered
phonetic transcripts.</p>
<p class="MsoNormal" style="">The first 100 copies distributed to
non-LDC member organizations are available free of charge.
Shipping and handling fees apply.</p>
<hr width="100%" size="2"><br>
<pre class="moz-signature" cols="72">--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: 1 (215) 573-1275
University of Pennsylvania Fax: 1 (215) 573-2175
3600 Market St., Suite 810 <a class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>
</pre>
</div>
</body>
</html>