[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Feb 23 22:28:53 UTC 2009
- LDC2009V01 *Audiovisual Database of Spoken American English*
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01>
- LDC2009T03 *GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1*
  <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03>
- *LDC's Corpus Catalog Receives Top OLAC Rating*
  <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1>
- *2009 Publications Pipeline*
  <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#2>
------------------------------------------------------------------------
*New Publications*
(1) The Audiovisual Database of Spoken American English
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01>
was developed at Butler University, Indianapolis, IN in 2007 for use by
a variety of researchers to evaluate speech production and speech
recognition. It contains approximately seven hours of audiovisual
recordings of fourteen American English speakers producing syllables,
word lists and sentences used in both academic and clinical settings.
All talkers were from the North Midland dialect region -- roughly
defined as Indianapolis and north within the state of Indiana -- and had
lived in that region for the majority of the time from birth to 18 years
of age. Each participant read 238 different words and 166 different
sentences. The sentences spoken were drawn from the following sources:
* Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
* Northwestern University Auditory Test No. 6 (Lists I-IV)
* Vowels in /hVd/ context (separate words)
* Texas Instruments/Massachusetts Institute of Technology (TIMIT)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>
sentences
The Audiovisual Database of Spoken American English will be of interest
in various disciplines: to linguists for studies of phonetics,
phonology, and prosody of American English; to speech scientists for
investigations of motor speech production and auditory-visual speech
perception; to engineers and computer scientists for investigations of
machine audio-visual speech recognition (AVSR); and to speech and
hearing scientists for clinical purposes, such as the examination and
improvement of speech perception by listeners with hearing loss.
Participants were recorded individually during a single session with a
Panasonic DVC-80 digital video camera to miniDV digital video cassette
tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid
lapel microphone throughout the recordings. Each speaker produced a
total of 94 segmented files, which were exported from Final Cut Express
as QuickTime (.mov) files.
***
(2) GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03>
was prepared by LDC and contains a total of 178,000 words (264 files) of
Arabic newsgroup text and its English translation selected from thirty-five
sources. Newsgroups consist of posts to electronic bulletin boards,
Usenet newsgroups, discussion groups and similar forums. This release
was used as training data in Phase 1 (year 1) of the DARPA-funded GALE
program. Preparing the source data involved four stages of work: data
scouting, data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup
text. Data scouts were assigned particular topics and genres along with
a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potentially useful text in a
database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so that
the data could be more easily integrated into downstream annotation
processes. Typically, a new script was required for each new domain name
that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems.
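As a rough sketch of what such a nightly pass might look like
(illustration only; the announcement does not describe LDC's actual
toolkit, so the SQLite schema and file layout below are invented for
the example):

    import os
    import sqlite3
    import urllib.request

    def nightly_harvest(db_path="annotation.db", out_dir="harvest"):
        """Fetch every URL the data scouts have flagged but not yet harvested."""
        os.makedirs(out_dir, exist_ok=True)
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, url FROM urls WHERE harvested = 0").fetchall()
        for row_id, url in rows:
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    data = resp.read()
                with open(os.path.join(out_dir, "%d.html" % row_id), "wb") as f:
                    f.write(data)
                conn.execute("UPDATE urls SET harvested = 1 WHERE id = ?",
                             (row_id,))
            except Exception as exc:
                # Leave failures unmarked so the next nightly pass retries them.
                print("skipped %s: %s" % (url, exc))
        conn.commit()
        conn.close()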
The selected documents were then reviewed for content-suitability using
a semi-automatic process. A statistical approach was used to rank a
document's relevance to a set of already-selected documents labeled as
"good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular
annotation task or for annotation in general. These newly-judged
documents in turn provided additional input for the generation of new
ranked lists.
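The announcement does not name the statistical method, but a
centroid-plus-cosine-similarity ranker is a minimal example of the
general idea: score each candidate against the pool already labeled
"good" and review the highest-scoring documents first.

    import math
    from collections import Counter

    def bow(text):
        """Crude whitespace bag-of-words; real Arabic text needs a proper tokenizer."""
        return Counter(text.split())

    def cosine(a, b):
        dot = sum(count * b[term] for term, count in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def rank(good_docs, candidates):
        """Order candidates by similarity to the centroid of the 'good' pool."""
        centroid = Counter()
        for doc in good_docs:
            centroid.update(bow(doc))
        return sorted(candidates, key=lambda c: cosine(centroid, bow(c)),
                      reverse=True)

Each batch of newly approved documents would then be folded back into
the "good" pool before the next ranking pass, as described above.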
Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features and quality
control procedures applied to completed translations.
All final data are presented in Tab Delimited Format (TDF). TDF is
compatible with other transcription formats, such as the Transcriber
format and the AG format, making it easy to process.
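Since TDF is tab-delimited plain text with one segment per line, it is
easy to read with standard tools. A minimal reader, with an assumed
field layout for illustration (the authoritative column inventory
ships with the corpus documentation):

    import csv

    # Assumed columns for illustration only; consult the corpus
    # documentation for the actual TDF field inventory.
    FIELDS = ["file", "channel", "start", "end", "speaker",
              "speakerType", "speakerDialect", "transcript", "section"]

    def read_tdf(path):
        """Yield one dict per segment line of a tab-delimited TDF file."""
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row or row[0].startswith(";;"):  # skip comment lines
                    continue
                yield dict(zip(FIELDS, row))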
*LDC's Corpus Catalog Receives Top OLAC Rating*
LDC is pleased to announce that the LDC Corpus Catalog
<http://www.ldc.upenn.edu/Catalog/> has been awarded a five-star quality
rating, the highest rating available, by the Open Language Archives
Community (OLAC) <http://www.language-archives.org/>. OLAC is an
international partnership of institutions and individuals who are
creating a worldwide virtual library of language resources by: (i)
developing consensus on best current practice for the digital archiving
of language resources, and (ii) developing a network of interoperating
repositories and services for housing and accessing such resources. LDC
supports OLAC and is among the 37 participating archives who have
contributed over 36,000 records to the combined catalog of language
resources. OLAC seeks to refine the quality of the metadata in catalog
records in order to improve the searches that users can run over the
combined catalog. When resources are described following the best
practice guidelines established by OLAC, it increases the likelihood
that all the resources returned by a query are relevant (precision) and
that all relevant resources are returned (recall).
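Concretely, for a single query (a small worked example, not part of
OLAC's own documentation):

    def precision_recall(returned, relevant):
        """Precision: how much of what came back is relevant.
        Recall: how much of what is relevant came back."""
        returned, relevant = set(returned), set(relevant)
        hits = len(returned & relevant)
        precision = hits / len(returned) if returned else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # A query returning 4 records, 3 of them relevant, out of 6 relevant
    # records overall: precision_recall({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6})
    # gives (0.75, 0.5).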
For several fields, metadata in the LDC catalog was missing,
inaccurate, or non-compliant with OLAC standards. Over a period of
a few months, a team at LDC took several steps to make that metadata
OLAC-compliant. Most significantly, the language name and the language
ID for over 400 corpora were reviewed and changed when required to
conform to the new standard for language identification, ISO 639-3
<http://www.sil.org/iso639-3/>. Additional efforts focused on providing
author information for all corpora and fixing dead links. Finally, the
team added a new metadata field to consistently document the "type" of
each resource, using a standard vocabulary from the digital libraries
community called DCMI-Type, reliably distinguishing text and sound
resources. The benefits of these revisions include improving LDC's
management of resources in the catalog as well as assisting LDC users to
quickly identify all corpora which are relevant to their research.
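As a hypothetical before-and-after of a single record (field names
simplified; this is not LDC's actual catalog schema):

    # Before cleanup: free-text language name, no resource type.
    record_before = {
        "title": "Some Speech Corpus",
        "language": "Mandarin",   # ambiguous free-text name
        "type": None,             # resource type undocumented
    }

    # After cleanup: ISO 639-3 code plus a DCMI-Type term.
    record_after = {
        "title": "Some Speech Corpus",
        "language": "cmn",        # ISO 639-3 code for Mandarin Chinese
        "type": "Sound",          # DCMI-Type term distinguishing Sound/Text
    }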
*2009 Publications Pipeline*
For Membership Year 2009 (MY2009), we anticipate releasing a varied
selection of publications. Many publications are still in development,
but here is a glimpse of what is in the pipeline for MY2009. Please
note that this list is tentative and subject to modification. Our
planned publications include:
/Arabic Gigaword Fourth Edition/ ~ this edition includes our recent
newswire collections as well as the contents of Arabic Gigaword
Third Edition (LDC2007T40). In addition to sources found in
previous releases, such as Xinhua, Agence France Presse, An Nahar,
and Al Hayat, this release includes data from several new sources,
such as Al Quds, Asharq Al-Awsat, and Al Ahram.
/Chinese Gigaword Fourth Edition/ ~ this edition includes our recent
newswire collections as well as the contents of the Chinese Gigaword
Third Edition (LDC2007T38). In addition to sources found in previous
releases, such as Agence France Presse, Central News Agency (Taiwan),
Xinhua, and Zaobao, this release includes data from several new
sources, such as People's Liberation Army Daily, Guangming Daily,
and China News Service.
/Chinese Web 5-gram Corpus Version 1/ ~ contains n-grams (unigrams
to five-grams) and their observed counts in 880 billion tokens of
Chinese web data collected in March 2008. All text was converted to
UTF-8. A simple segmenter using the same algorithm used to generate
the data is included. The set contains 3.9 billion n-grams in total;
a minimal counting sketch appears after this list.
/CoNLL 2008 Shared Task Corpus/ ~ includes syntactic and semantic
dependencies for Treebank-3 (LDC99T42) data. This corpus was
developed for the 2008 shared task of the Conference on Natural
Language Learning (CoNLL 2008). The syntactic information was
created by converting constituent trees from Treebank-3 to
dependencies using a set of head percolation rules and a series of
other transformations, e.g., named entity boundaries are included
from the BBN Pronoun Coreference and Entity Type Corpus
(LDC2005T33). The semantic dependencies were created by converting
semantic propositions to a dependency representation. The corpus
includes propositions centered around both verbal predicates - from
Proposition Bank I (LDC2004T14) - and around nominal predicates -
from NomBank 1.0 (LDC2008T24).
/English Gigaword Fourth Edition/ ~ this edition includes our recent
collections as well as the contents of the English Gigaword Third
Edition (LDC2007T07). The sources of text data include Agence
France Presse, Associated Press, Central News Agency (Taiwan), New
York Times, Xinhua, and Salon.com.
/GALE Phase 1 Arabic Newsgroup Parallel Text Part 2/ ~ contains 145K
words (263 files) of Arabic newsgroup text and its English translation
selected from thirty sources. Newsgroups consist of posts to
electronic bulletin boards, Usenet newsgroups, discussion groups and
similar forums. This release was used as training data in Phase 1 of
the DARPA-funded GALE program.
/GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2/ ~
contains a total of 24 hours of Chinese broadcast conversation selected
from three sources: China Central TV (CCTV), Phoenix TV, and Voice of
America. This release was used as training data in Phase 1 of the
DARPA-funded GALE program.
/GALE Phase 1 Chinese Newsgroup Parallel Text Part 1/ ~ contains 240K
characters (112 files) of Chinese newsgroup text and its English
translation selected from twenty-five sources. Newsgroups consist
of posts to electronic bulletin boards, Usenet newsgroups,
discussion groups and similar forums. This release was used as
training data in Phase 1 of the DARPA-funded GALE program.
/Japanese Web N-gram Corpus Version 1/ ~ contains n-grams (unigrams
to seven-grams) and their observed counts in 250 billion tokens of
Japanese web data collected in July 2007. All text was converted to
UTF-8 and segmented using the publicly available segmenter MeCab.
The set contains 3.2 billion n-grams total.
/NIST MetricsMATR08 Development Data/ ~ contains sample data
extracted from the NIST Open Machine Translation (MT) 2006
evaluation. Data includes the English machine translations from 8
systems and the human reference translations for 25 Arabic source
language newswire documents, along with corresponding human
assessments of adequacy and preference. This data set was
originally provided to NIST MetricsMATR08 participants for the
purpose of MT metric development.
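For the two web n-gram corpora above, the deliverable is essentially
n-grams paired with observed counts. A minimal counting sketch over
pre-segmented text (illustration only; it reproduces neither corpus's
actual segmenter nor its count thresholds):

    from collections import Counter

    def count_ngrams(tokens, max_n=5):
        """Count all n-grams, from unigrams up to max_n, in a token sequence."""
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    # count_ngrams("the cat sat on the mat".split(), max_n=2) yields
    # ('the',): 2, ('the', 'cat'): 1, ('cat', 'sat'): 1, and so on.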
2009 Subscription Members are automatically sent all MY2009 data as it
is released. 2009 Standard Members are entitled to request 16 corpora
for free from MY2009. Non-members may license most data for research use.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu