[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Feb 26 18:26:00 UTC 2007


The Linguistic Data Consortium (LDC) would like to announce the 
availability of two new publications and provide information regarding 
forthcoming publications.

LDC2007S03
ARL Urdu Speech Database, Training Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03>

LDC2007T08
ISI Arabic-English Automatically Extracted Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T08>

TRECVID Data Update

2007 Publications Pipeline

------------------------------------------------------------------------
New Publications

(1)  ARL Urdu Speech Database, Training Data 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03>, 
is a collection of recorded speech from 200 adult native Urdu speakers 
from Pakistan and Northern India. The complete database is divided into
two parts: a training set containing approximately 80% of the data and a
test set comprising the remaining 20%. This release contains the
training set only, approximately 80% of the complete dataset. The
recordings in this release were collected by Appen Pty Ltd, Sydney, 
Australia in 2006.

Each speaker was presented with 400 prompts to read: sentences, place 
names, and person names. Two microphones set at different distances from
the speaker were used for the recordings. The recorded speech was stored
in raw format files with headers stored in separate directories.

Each utterance is transcribed in the label file that corresponds to its
recording. The transcriptions were encoded in UTF-8. Punctuation was
omitted and numbers were written out in full. 



(2)  ISI Arabic-English Automatically Extracted Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T08> 
consists of Arabic-English parallel sentences which were extracted 
automatically from two monolingual corpora: Arabic Gigaword Second 
Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12). 
The data was extracted from news articles published by Xinhua News 
Agency and Agence France Presse.  The corpus contains 1,124,609 sentence 
pairs; the word count on the English side is approximately 31M words. 
The sentences in the parallel corpus preserve the form and encoding of 
the texts in the original Gigaword corpora.

For each sentence pair in the corpus we provide the names of the 
documents from which the two sentences were extracted, as well as a 
confidence score (between 0.5 and 1.0), which is indicative of their 
degree of parallelism. The parallel sentence identification approach is 
designed to judge sentence pairs in isolation from their contexts, and 
can therefore find parallel sentences within document pairs which are 
not parallel. 
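
Because every pair carries a confidence score, users may want to keep
only the pairs above some threshold. The following is a minimal sketch
of that idea in Python, not a description of the corpus tooling. The
SentencePair record, its field names, and the default threshold of 0.9
are assumptions made for illustration; the announcement does not specify
the file layout of the release.

```python
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    """Hypothetical in-memory record; the real file layout is not given here."""
    arabic_doc: str    # name of the document the Arabic sentence came from
    english_doc: str   # name of the document the English sentence came from
    confidence: float  # parallelism score, between 0.5 and 1.0
    arabic: str
    english: str


def filter_by_confidence(
    pairs: Iterable[SentencePair], threshold: float = 0.9
) -> Iterator[SentencePair]:
    """Yield only the pairs whose confidence is at or above the threshold."""
    # Scores in the corpus are stated to lie between 0.5 and 1.0,
    # so a threshold outside that range is almost certainly a mistake.
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0.5 and 1.0")
    for pair in pairs:
        if pair.confidence >= threshold:
            yield pair
```

A higher threshold trades corpus size for cleaner pairs; a suitable
value depends on the downstream task and would need to be chosen
empirically.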

In order to make this resource useful for research in Machine 
Translation (MT), we made efforts to detect potential overlaps between 
this data and the standard test and development data sets used by the MT 
community. 
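
The announcement does not describe how the overlaps were detected.
As one simple illustration, users who want to run a check of their own
against a particular test set could compare sentences after light
normalization. The sketch below assumes exact matching on lowercased,
punctuation-free text; the function names are hypothetical, and this is
not presented as the method used for the corpus.

```python
import re
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", sentence.lower())
    return " ".join(no_punct.split())


def find_overlaps(
    corpus_sentences: Iterable[str], test_sentences: Iterable[str]
) -> List[str]:
    """Return corpus sentences whose normalized form occurs in the test set."""
    test_keys = {normalize(s) for s in test_sentences}
    return [s for s in corpus_sentences if normalize(s) in test_keys]
```

Exact matching of this kind misses near-duplicates such as sentences
that differ by one word, so it gives a lower bound on the overlap.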


TRECVID Data Update

We've received many queries about the TRECVID data and are working on a 
plan to make all of this data available in the LDC catalog. We 
anticipate releasing the keyframes for TRECVID 2003 and 2005 later this 
year. Please watch our website for future announcements.

2007 Publications Pipeline

Membership Year (MY) 2007 is projected to be another productive one for 
the LDC.  In addition to the aforementioned TRECVID keyframes data, we 
anticipate releasing a diverse and exciting selection of publications.  
Here is a glimpse of what is in the pipeline for MY 2007. (Disclaimer:
unforeseen circumstances may lead to modifications of our plans.  Please
regard this list as tentative.)

    * GALE Year 1 - Chinese Broadcast Audio, Part 1 - first portion of
      Mandarin Chinese audio collected for the DARPA GALE Program,
      including broadcast news plus talk shows, roundtable discussions
      and other conversational news genres.

    * GALE Year 1 - Chinese Broadcast Transcripts, Part 1 -  manual and
      web-harvested transcripts corresponding to the audio included in
      the GALE Year 1 Chinese Broadcast Audio corpus.  A subset of the
      transcripts includes both verbatim transcription and manual SU
      (sentence-unit) identification plus other rich markup.

    * ISI Chinese-English Automatically Extracted Parallel Text -
      Chinese-English parallel sentences, which were extracted
      automatically from two monolingual corpora: Chinese Gigaword
      Second Edition (LDC2005T14) and English Gigaword Second Edition
      (LDC2005T12).  The corpus contains 558,567 sentence pairs; the
      word count on the English side is approximately 16M words. The
      sentences in the parallel corpus preserve the form and encoding of
      the texts in the original Gigaword corpora.

    * OntoNotes V 1.0 - English and Chinese broadcast news transcripts
      annotated for Treebank, PropBank, coreference and related information.

    * Spoken Levantine Arabic Treebank -  experimental pilot annotation
      developed for the Johns Hopkins University Center for Language and
      Speech Processing Summer Workshop (WS'05). The corpus covers
      morphological and syntactic annotations of approximately 26,000
      words of Levantine Arabic conversational telephone speech and was
      developed under severe time constraints.  Issues of morphological
      definitions of dialectal words, phrases and collocations were
      central to the whole linguistic description. Syntactic annotation
      focused on annotation of disfluencies and on new verbal paradigms
      and new structures (e.g., the use of present/active participles).

    * Tagged Chinese Gigaword - fully segmented and POS-tagged version
      of Chinese Gigaword Second Edition (LDC2005T14). The CKIP
      Segmentation and POS tags were applied uniformly to all texts
      regardless of their origin. The size of this tagged corpus, after
      compression, is about 1.53 GB.

As a reminder, MY 2006 will remain open for joining through December 31, 
2007 and MY 2007 through December 31, 2008.  Organizations may join for 
a future MY at any time.

------------------------------------------------------------------------

 
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu
