[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Mar 26 20:17:23 UTC 2012


/New publications:/

LDC2012T02
*- English Translation Treebank: An Nahar Newswire -*

LDC2012S04
*- Malto Speech and Transcripts -*

------------------------------------------------------------------------

*New Publications*

(1) English Translation Treebank: An Nahar Newswire 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012T02> was 
developed by LDC and consists of 599 distinct newswire stories from the 
Lebanese publication An Nahar translated from Arabic to English and 
annotated for part-of-speech and syntactic structure.

This corpus is part of an ongoing effort at LDC to produce parallel 
Arabic and English treebanks. The guidelines followed for both 
part-of-speech and syntactic annotation are Penn Treebank II style, with 
changes in the tokenization of hyphenated words, part-of-speech and tree 
changes necessitated by those tokenization changes, and revisions to the 
syntactic annotation to comply with the updated annotation guidelines 
(including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank 
c" changes). The original Penn Treebank II guidelines, addenda 
describing changes to the guidelines and the tokenization specifications 
can be found on LDC's website 
<http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/>.

The data consists of 461,489 tokens in 599 individual files. The news 
stories in this release were published in An Nahar in 2002.

The English source files (translated from the Arabic) were 
automatically tokenized, part-of-speech tagged and parsed; the tokens, 
tags and parses were manually corrected. The quality control process 
consisted of a series of specific searches for over 100 types of 
potential inconsistency and parse or annotation error. Any errors found 
in those searches were manually corrected.

Annotations are in the following two formats:

    * Penn Style Trees
          o Bracketed tree files following the basic form (NODE (TAG
            token)). Each sentence is surrounded by a pair of empty
            parentheses.
    * AG xml
          o TreeEditor .xml stand-off annotation files. These files
            contain the POS and Treebank annotation and reference the
            source files by character offset. DTD files for the AG xml
            files were moved from their original location indicated in
            the readme to be more consistent with LDC publications.
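
As an illustration of the Penn-style format described above, here is a
minimal Python sketch that reads bracketed text of the form
(NODE (TAG token)), with each sentence wrapped in a pair of empty
parentheses. The sample sentence and the function names parse_trees and
leaves are made up for this example and are not taken from the corpus
or its tools. Actual files may also contain empty elements and escaped
brackets, which this sketch simply returns as ordinary leaves.

```python
import re

def parse_trees(text):
    """Parse Penn-style bracketed text into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "each node must open with a parenthesis"
        pos += 1
        label = ""  # the empty wrapper around a sentence has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a leaf token
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    trees = []
    while pos < len(tokens):
        trees.append(parse_node())
    return trees

def leaves(node):
    """Return the tokens of a parsed tree in left-to-right order."""
    _, children = node
    out = []
    for child in children:
        if isinstance(child, tuple):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out

# Hypothetical sample sentence, not drawn from the corpus.
sample = "( (S (NP (NNP Beirut)) (VP (VBD hosted) (NP (DT the) (NN summit))) (. .)) )"
trees = parse_trees(sample)
print(leaves(trees[0]))
```

A production reader would more likely rely on an existing treebank
library; the sketch is only meant to show how the empty outer
parentheses and the (TAG token) leaves nest.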


(2) Malto Speech and Transcripts 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2012S04> 
was developed by Masato Kobayashi, Associate Professor in Linguistics at 
the University of Tokyo (Japan), and Bablu Tirkey, research scholar at 
the Tribal and Regional Languages Department, Ranchi University (India). 
It contains approximately 8 hours of Malto speech data collected between 
2005 and 2009 from 27 speakers (22 males, 5 females). Also included are 
accompanying transcripts, English translations and glosses for 6 hours 
of the collection. Speakers were asked to talk about themselves, their 
lives, rituals and folklore; elicitation interviews were then conducted. 
The goal of the work was to present the current state and dialectal 
variation of Malto.

Malto is a Dravidian language spoken in northeastern India (principally 
the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people 
called the Pahariyas. Indian census data place the number of Malto 
speakers between 100,000 and 200,000. Most 
Malto speakers live in the three northeastern districts of Jharkhand, 
i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted in this 
corpus was conducted in those districts. Of the Pahariyas in that area, 
three subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the 
Kumarbhag Pahariyas, primarily speak Malto.

The transcribed data accounts for 6 hours of the collection and contains 
21 speakers (17 male, 4 female). The untranscribed data accounts for 2 
hours of the collection and contains 10 speakers (9 male, 1 female). 
Four of the male speakers are present in both groups.
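
The speaker counts above can be reconciled with the overall figure of
27 speakers (22 male, 5 female) by counting the overlapping speakers
only once. A short Python check, assuming that the four male speakers
are the only ones who appear in both groups:

```python
# Speaker counts as stated in the announcement.
transcribed = {"male": 17, "female": 4}
untranscribed = {"male": 9, "female": 1}
# Assumption: only the four male speakers appear in both groups.
in_both = {"male": 4, "female": 0}

# Inclusion-exclusion: count speakers in both groups only once.
distinct = {
    gender: transcribed[gender] + untranscribed[gender] - in_both[gender]
    for gender in transcribed
}
total = sum(distinct.values())
print(distinct, total)
```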

All audio is presented in .wav format. Each audio file name includes a 
subject number, village name, speaker name and the topic discussed. The 
transcripts and glossary are UTF-8 text files. Because of ambiguities 
that occur when writing Malto in Devanagari script, the transcripts were 
developed using Roman script with symbols adapted from the International 
Phonetic Alphabet (IPA) but are not considered phonetic transcripts.

The first 100 copies distributed to non-LDC member organizations are 
available free of charge. Shipping and handling fees apply.
------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu


