[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed May 23 15:39:02 UTC 2012
New publications:
- LDC2012T05: Chinese Dependency Treebank 1.0
- LDC2012T06: GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
- LDC2012S06: Turkish Broadcast News Speech and Transcripts
------------------------------------------------------------------------
*New Publications*
(1) Chinese Dependency Treebank 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T05>
was developed by the Harbin Institute of Technology's
<http://en.hit.edu.cn/> Research Center for Social Computing and
Information Retrieval <http://ir.hit.edu.cn/english/> (HIT-SCIR). It
contains 49,996 Chinese sentences (902,191 words) randomly selected from
People's Daily newswire stories published between 1992 and 1996 and
annotated with syntactic dependency structures. Ill-formed or short
sentences were removed from the random sample prior to annotation. The
data was segmented and annotated for part of speech
(POS), syntactic structures, verb subclasses and noun compounds. Word
segmentation and POS tagging were accomplished automatically using
statistical models trained on a larger, annotated corpus of People's
Daily newswire stories. Humans manually annotated the syntactic
structures and corrected word segmentation errors. POS tags were not
corrected.
The data is provided in the format of CoNLL-X and in UTF-8.
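For readers unfamiliar with the CoNLL-X layout, the following is a minimal reading sketch, not a tool shipped with the corpus. It assumes the standard ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The names Token and read_conllx, the sample sentence, and its tags are invented for illustration and are not taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The subset of CoNLL-X columns needed to rebuild a dependency tree."""
    id: int      # column 1: token index within the sentence, starting at 1
    form: str    # column 2: word form
    postag: str  # column 5: fine-grained part-of-speech tag
    head: int    # column 7: index of the head token, 0 for the root
    deprel: str  # column 8: dependency relation to the head


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7])
        )
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence


# Invented two-token sample; the words and labels are illustrative only.
SAMPLE = (
    "1\t我\t_\tr\tr\t_\t2\tSBV\t_\t_\n"
    "2\t来\t_\tv\tv\t_\t0\tHED\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
```

With a real file, the same function can be fed an open handle, for example open(path, encoding="utf-8"), where path is whatever location the release was unpacked to.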
(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T06>
was developed by LDC. Along with other corpora, the parallel text in
this release comprised machine translation training data for Phase 2 of
the DARPA GALE (Global Autonomous Language Exploitation) Program. This
corpus contains Modern Standard Arabic source text and corresponding
English translations selected from broadcast conversation (BC) data
collected by LDC between 2004 and 2007 and transcribed by LDC or under
its direction.
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes
36 source-translation document pairs, comprising 169,109 words of Arabic
source text and its English translation. Data is drawn from thirteen
distinct Arabic programs broadcast between 2004 and 2007 from the
following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV,
and Radio Sawa. Broadcast conversation programming is generally more
interactive than traditional news broadcasts and includes talk shows,
interviews, call-in programs and roundtable discussions. The programs in
this release focus on current events topics.
The files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with Quick
Rich Transcription
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf>
guidelines developed by LDC. Transcribers indicated sentence boundaries
in addition to transcribing the text. Data was manually selected for
translation according to several criteria, including linguistic
features, transcription features and topic features. The transcribed and
segmented files were then reformatted into a human-readable translation
format and assigned to translation vendors. Translators followed LDC's
Arabic-to-English translation guidelines, which are included with this
release. Bilingual LDC staff performed quality control procedures on the
completed translations.
Source data and translations are distributed in TDF format. All data are
encoded in UTF-8.
(3) Turkish Broadcast News Speech and Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S06>
was developed by Boğaziçi University
<http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx>, Istanbul,
Turkey and contains approximately 130 hours of Voice of America (VOA)
Turkish radio broadcasts and corresponding transcripts. This is part of
a larger corpus of Turkish broadcast news data collected and transcribed
with the goal of facilitating research in Turkish automatic speech
recognition and its applications, such as speech retrieval.
The VOA material was collected between December 2006 and June 2009 using
a PC and TV/radio card setup. The data collected during the period
2006-2008 was recorded from analog FM radio; the 2009 broadcasts were
recorded from digital satellite transmissions. A quick manual
segmentation and transcription approach was followed.
The data was recorded at 32 kHz and re-sampled at 16 kHz. After
screening for recording quality, the files were segmented, transcribed,
and verified. The segmentation occurred in two steps: an initial
automatic segmentation, followed by manual correction and annotation,
which included information such as background conditions and speaker
boundaries.
The transcription guidelines were adapted from the LDC HUB4 and quick
transcription guidelines. An English version of the adapted guidelines
is provided with the data. Manual segmentation and transcripts were
created by native Turkish speakers at Boğaziçi University using
Transcriber <http://trans.sourceforge.net/en/presentation.php>. The
transcriptions are provided in the ISO-8859-9 (Latin5) character set.
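Because the transcripts use ISO-8859-9 rather than UTF-8, they need to be decoded with the matching codec before being combined with UTF-8 data. The sketch below shows one way to do that with Python's built-in "iso-8859-9" codec. The function name latin5_to_utf8 is hypothetical, and the byte string is a hand-built example, not text from the corpus.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin5) bytes as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


# Hand-built example: in ISO-8859-9, byte 0xF0 is g-breve and 0xE7 is c-cedilla.
raw = b"Bo\xf0azi\xe7i"

converted = latin5_to_utf8(raw).decode("utf-8")

# Decoding the same bytes as ISO-8859-1 (Latin1) silently yields a wrong letter,
# because 0xF0 is the letter eth in that character set.
misread = raw.decode("iso-8859-1")
```

Here converted holds the correctly spelled "Boğaziçi", while misread holds "Boðaziçi", which is why the codec should be named explicitly when opening the transcript files.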
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: 1 (215) 573-1275
University of Pennsylvania          Fax: 1 (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu