[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Feb 26 18:26:00 UTC 2007
The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications and provide information regarding
forthcoming publications.
LDC2007S03
ARL Urdu Speech Database, Training Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03>
LDC2007T08
ISI Arabic-English Automatically Extracted Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T08>
TRECVID Data Update
2007 Publications Pipeline
------------------------------------------------------------------------
New Publications
(1) ARL Urdu Speech Database, Training Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03>,
is a collection of recorded speech from 200 adult native Urdu speakers
from Pakistan and Northern India. The database is divided into two
parts: a training set containing approximately 80% of the data and a
test set comprising the remaining 20%. This release contains the
training set, that is, approximately 80% of the complete dataset. The
recordings in this release were collected by Appen Pty Ltd, Sydney,
Australia in 2006.
Each speaker was presented with 400 prompts to read: sentences, place
names, and person names. Two microphones set at different distances from
the speaker were used for the recordings. The recorded speech was stored
in raw format files with headers stored in separate directories.
Each utterance is transcribed in the corresponding label file for each
recording. The transcriptions were encoded in UTF-8. Punctuation was
omitted and numbers were written out in full.
(2) ISI Arabic-English Automatically Extracted Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T08>
consists of Arabic-English parallel sentences which were extracted
automatically from two monolingual corpora: Arabic Gigaword Second
Edition (LDC2006T02) and English Gigaword Second Edition (LDC2005T12).
The data was extracted from news articles published by Xinhua News
Agency and Agence France Presse. The corpus contains 1,124,609 sentence
pairs; the word count on the English side is approximately 31M words.
The sentences in the parallel corpus preserve the form and encoding of
the texts in the original Gigaword corpora.
For each sentence pair in the corpus we provide the names of the
documents from which the two sentences were extracted, as well as a
confidence score (between 0.5 and 1.0), which is indicative of their
degree of parallelism. The parallel sentence identification approach is
designed to judge sentence pairs in isolation from their contexts, and
can therefore find parallel sentences within document pairs which are
not parallel.
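As an illustration of how the confidence score might be used, the
following Python sketch keeps only those sentence pairs whose score
meets a chosen threshold. The record layout (SentencePair and its field
names), the function name and the default threshold are assumptions made
for this example; they do not describe the actual file format of the
release, which is documented with the corpus.

```python
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    """Hypothetical in-memory record for one extracted sentence pair."""
    arabic_doc: str   # name of the source Arabic Gigaword document
    english_doc: str  # name of the source English Gigaword document
    score: float      # confidence score, between 0.5 and 1.0
    arabic: str       # Arabic sentence
    english: str      # English sentence


def filter_by_confidence(pairs: Iterable[SentencePair],
                         threshold: float = 0.8) -> List[SentencePair]:
    """Return the pairs whose confidence score is at least `threshold`.

    The threshold must lie in the score range stated in the
    announcement (0.5 to 1.0); the default of 0.8 is arbitrary.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.5 and 1.0")
    return [pair for pair in pairs if pair.score >= threshold]
```

A higher threshold trades corpus size for a larger share of truly
parallel pairs; a suitable value depends on the application.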
In order to make this resource useful for research in Machine
Translation (MT), we made efforts to detect potential overlaps between
this data and the standard test and development data sets used by the MT
community.
TRECVID Data Update
We've received many queries about the TRECVID data and are working on a
plan to make all of this data available in the LDC catalog. We
anticipate releasing the keyframes for TRECVID 2003 and 2005 later this
year. Please watch our website for future announcements.
2007 Publications Pipeline
Membership Year (MY) 2007 is projected to be another productive one for
the LDC. In addition to the aforementioned TRECVID keyframes data, we
anticipate releasing a diverse and exciting selection of publications.
Here is a glimpse of what is in the pipeline for MY 2007. (Disclaimer:
unforeseen circumstances may lead to modifications of our plans. Please
regard this list as tentative).
* GALE Year 1 - Chinese Broadcast Audio, Part 1 - first portion of
Mandarin Chinese audio collected for the DARPA GALE Program,
including broadcast news plus talk shows, roundtable discussions
and other conversational news genres.
* GALE Year 1 - Chinese Broadcast Transcripts, Part 1 - manual and
web-harvested transcripts corresponding to the audio included in
the GALE Year 1 Chinese Broadcast Audio corpus. A subset of the
transcripts includes both verbatim transcription and manual SU
(sentence-unit) identification plus other rich markup.
* ISI Chinese-English Automatically Extracted Parallel Text -
Chinese-English parallel sentences, which were extracted
automatically from two monolingual corpora: Chinese Gigaword
Second Edition (LDC2005T14) and English Gigaword Second Edition
(LDC2005T12). The corpus contains 558,567 sentence pairs; the
word count on the English side is approximately 16M words. The
sentences in the parallel corpus preserve the form and encoding of
the texts in the original Gigaword corpora.
* OntoNotes V 1.0 - English and Chinese broadcast news transcripts
annotated for Treebank, PropBank, coreference and related information.
* Spoken Levantine Arabic Treebank - experimental pilot annotation
developed for the Johns Hopkins University Center for Language and
Speech Processing Summer Workshop (WS'05). The corpus covers
morphological and syntactic annotations of approximately 26,000
words of Levantine Arabic conversational telephone speech and was
developed under severe time constraints. Issues of morphological
definitions of dialectal words, phrases and collocations were
central to the whole linguistic description. Syntactic annotation
focused on annotation of disfluencies and on new verbal paradigms
and new structures (e.g., the use of present/active participles).
* Tagged Chinese Gigaword - fully segmented and POS-tagged version
of Chinese Gigaword Second Edition (LDC2005T14). The CKIP
Segmentation and POS tags were applied uniformly to all texts
regardless of their origin. The size of this tagged corpus, after
compression, is about 1.53 GB.
As a reminder, MY 2006 will remain open for joining through December 31,
2007 and MY 2007 through December 31, 2008. Organizations may join for
a future MY at any time.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu