[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed May 23 15:39:02 UTC 2012


New publications:

LDC2012T05 - Chinese Dependency Treebank 1.0

LDC2012T06 - GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1

LDC2012S06 - Turkish Broadcast News Speech and Transcripts

------------------------------------------------------------------------
*New Publications*


(1) Chinese Dependency Treebank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T05> 
was developed by the Harbin Institute of Technology's 
<http://en.hit.edu.cn/> Research Center for Social Computing and 
Information Retrieval <http://ir.hit.edu.cn/english/> (HIT-SCIR). It 
contains 49,996 Chinese sentences (902,191 words) randomly selected from 
People's Daily newswire stories published between 1992 and 1996 and 
annotated with syntactic dependency structures. Ill-formed or short 
sentences were eliminated from the randomly selected sentences prior to 
annotation. The data was segmented and annotated for part of speech 
(POS), syntactic structures, verb subclasses and noun compounds. Word 
segmentation and POS tagging were accomplished automatically using 
statistical models trained on a larger, annotated corpus of People's 
Daily newswire stories. Humans manually annotated the syntactic 
structures and corrected word segmentation errors. POS tags were not 
corrected.

The data is provided in the format of CoNLL-X and in UTF-8.
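
As a rough illustration of what reading this format involves, the sketch below parses CoNLL-X text in Python. It relies only on the general CoNLL-X conventions (ten tab-separated columns per token: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL, with a blank line between sentences). The names Token and read_conllx, the sample rows, and the labels "subj" and "root" are invented for this example and are not taken from the corpus or its documentation.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """Subset of the ten CoNLL-X columns (hypothetical helper type)."""
    id: int
    form: str
    cpostag: str
    postag: str
    head: int      # 0 means the token attaches to the artificial root
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[3], cols[4], int(cols[6]), cols[7])
        )
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Invented two-token sample; the tags and labels are placeholders only.
SAMPLE = (
    "1\t我\t_\tr\tr\t_\t2\tsubj\t_\t_\n"
    "2\t来\t_\tv\tv\t_\t0\troot\t_\t_\n"
    "\n"
)

for sent in read_conllx(SAMPLE.splitlines()):
    for tok in sent:
        print(tok.id, tok.form, tok.head, tok.deprel)
```

For a file on disk, the same function can be given an open file object, for example open(path, encoding="utf-8"), where path is whatever location the release was unpacked to.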




(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T06> 
was developed by LDC. Along with other corpora, the parallel text in 
this release comprised machine translation training data for Phase 2 of 
the DARPA GALE (Global Autonomous Language Exploitation) Program. This 
corpus contains Modern Standard Arabic source text and corresponding 
English translations selected from broadcast conversation (BC) data 
collected by LDC between 2004 and 2007 and transcribed by LDC or under 
its direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 
36 source-translation document pairs, comprising 169,109 words of Arabic 
source text and its English translation. Data is drawn from thirteen 
distinct Arabic programs broadcast between 2004 and 2007 from the 
following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, 
and Radio Sawa. Broadcast conversation programming is generally more 
interactive than traditional news broadcasts and includes talk shows, 
interviews, call-in programs and roundtable discussions. The programs in 
this release focus on current events topics.

The files in this release were transcribed by LDC staff and/or 
transcription vendors under contract to LDC in accordance with Quick 
Rich Transcription 
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf> 
guidelines developed by LDC. Transcribers indicated sentence boundaries 
in addition to transcribing the text. Data was manually selected for 
translation according to several criteria, including linguistic 
features, transcription features and topic features. The transcribed and 
segmented files were then reformatted into a human-readable translation 
format and assigned to translation vendors. Translators followed LDC's 
Arabic-to-English translation guidelines, which are included with this 
release. Bilingual LDC staff performed quality control procedures on the 
completed translations.

Source data and translations are distributed in TDF format. All data are 
encoded in UTF-8.





(3) Turkish Broadcast News Speech and Transcripts 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S06> 
was developed by Boğaziçi University 
<http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx>, Istanbul, 
Turkey and contains approximately 130 hours of Voice of America (VOA) 
Turkish radio broadcasts and corresponding transcripts. This is part of 
a larger corpus of Turkish broadcast news data collected and transcribed 
with the goal of facilitating research in Turkish automatic speech 
recognition and its applications, such as speech retrieval.

The VOA material was collected between December 2006 and June 2009 using 
a PC and TV/radio card setup. The data collected during the period 
2006-2008 was recorded from analog FM radio; the 2009 broadcasts were 
recorded from digital satellite transmissions. A quick manual 
segmentation and transcription approach was followed.

The data was recorded at 32 kHz and resampled to 16 kHz. After 
screening for recording quality, the files were segmented, transcribed, 
and verified. The segmentation occurred in two steps: an initial 
automatic segmentation, followed by manual correction and annotation 
that included information such as background conditions and speaker 
boundaries.

The transcription guidelines were adapted from the LDC HUB4 and quick 
transcription guidelines. An English version of the adapted guidelines 
is provided with the data. Manual segmentation and transcripts were 
created by native Turkish speakers at Boğaziçi University using 
Transcriber <http://trans.sourceforge.net/en/presentation.php>. The 
transcriptions are provided in the ISO-8859-9 (Latin5) character set.
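
Because the transcripts use ISO-8859-9 rather than UTF-8, they need to be decoded with the right codec before being combined with UTF-8 tools. The Python sketch below shows one way to do that with the standard "iso8859_9" codec. The function name latin5_to_utf8 and the sample string are made up for illustration, and nothing here reflects the actual file layout of the release.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin5) bytes as UTF-8 (hypothetical helper)."""
    return data.decode("iso8859_9").encode("utf-8")


# Illustrative input: Turkish text encoded as ISO-8859-9.
original = "Boğaziçi Üniversitesi"
latin5_bytes = original.encode("iso8859_9")

utf8_bytes = latin5_to_utf8(latin5_bytes)
print(utf8_bytes.decode("utf-8"))

# Decoding the same bytes as Latin-1 silently yields the wrong letters
# for the Turkish-specific characters, so the codec must be given explicitly.
print(latin5_bytes.decode("latin-1"))
```

When reading a transcript file directly, passing encoding="iso8859_9" to open() has the same effect as the explicit decode step.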


------------------------------------------------------------------------


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu



