[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Feb 23 22:28:53 UTC 2009


-  LDC2009V01: Audiovisual Database of Spoken American English
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01>

-  LDC2009T03: GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03>

-  LDC's Corpus Catalog Receives Top OLAC Rating
   <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1>

-  2009 Publications Pipeline
   <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#2>

------------------------------------------------------------------------

*New Publications*

(1) The Audiovisual Database of Spoken American English 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01> 
was developed at Butler University, Indianapolis, IN, in 2007 for use
by a variety of researchers to evaluate speech production and speech
recognition. It contains approximately seven hours of audiovisual 
recordings of fourteen American English speakers producing syllables, 
word lists and sentences used in both academic and clinical settings.

All talkers were from the North Midland dialect region -- roughly 
defined as Indianapolis and north within the state of Indiana -- and had 
lived in that region for the majority of the time from birth to 18 years 
of age. Each participant read 238 different words and 166 different 
sentences. The sentences spoken were drawn from the following sources:

    * Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
    * Northwestern University Auditory Test No. 6 (Lists I-IV)
    * Vowels in /hVd/ context (separate words)
    * Texas Instruments/Massachusetts Institute of Technology (TIMIT)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1>
      sentences

The Audiovisual Database of Spoken American English will be of interest 
to researchers in several disciplines: to linguists for studies of phonetics, 
phonology, and prosody of American English; to speech scientists for 
investigations of motor speech production and auditory-visual speech 
perception; to engineers and computer scientists for investigations of 
machine audio-visual speech recognition (AVSR); and to speech and 
hearing scientists for clinical purposes, such as the examination and 
improvement of speech perception by listeners with hearing loss.

Participants were recorded individually during a single session with a 
Panasonic DVC-80 digital video camera to miniDV digital video cassette 
tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid 
lapel microphone throughout the recordings.  Each speaker produced a 
total of 94 segmented files, which were exported from Final Cut Express 
to QuickTime (.mov) format.

***

(2) GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03> 
was prepared by LDC and contains a total of 178,000 words (264 files) of 
Arabic newsgroup text and its English translation selected from thirty-five 
sources. Newsgroups consist of posts to electronic bulletin boards, 
Usenet newsgroups, discussion groups and similar forums. This release 
was used as training data in Phase 1 (year 1) of the DARPA-funded GALE 
program. Preparing the source data involved four stages of work: data 
scouting, data harvesting, formatting and data selection.

Data scouting involved manually searching the web for suitable newsgroup 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest in 
a database. A nightly process queried the annotation database and 
harvested all designated URLs. Whenever possible, the entire site was 
downloaded, not just the individual thread or post located by the data 
scout. Once the text was downloaded, its format was standardized so that 
the data could be more easily integrated into downstream annotation 
processes. Typically, a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.
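
The nightly harvest can be pictured with a minimal sketch along the
following lines. The table and column names are hypothetical (this
announcement does not document LDC's internal toolkit), and a real
harvester would crawl the whole site rather than a single page, as
noted above:

    import sqlite3
    import urllib.request

    def nightly_harvest(db_path, out_dir):
        """Download every URL that data scouts flagged for harvest.
        Schema, table, and column names are hypothetical."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, url FROM scouted_pages WHERE status = 'designated'"
        ).fetchall()
        for page_id, url in rows:
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    raw = resp.read()
            except OSError:
                continue  # unreachable page; revisit in a later pass
            with open("%s/%s.html" % (out_dir, page_id), "wb") as f:
                f.write(raw)
        conn.close()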

The selected documents were then reviewed for content-suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general. These newly-judged 
documents in turn provided additional input for the generation of new 
ranked lists.
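
The announcement does not name the statistical model used; one
plausible instantiation is to score each candidate by cosine
similarity to the centroid of the already-selected "good" documents,
as in this sketch (whitespace tokenization stands in for proper
Arabic tokenization):

    import math
    from collections import Counter

    def tf_vector(text):
        # Bag-of-words term frequencies; a real system would use a
        # proper Arabic tokenizer rather than whitespace splitting.
        return Counter(text.split())

    def cosine(a, b):
        shared = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in shared)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rank_candidates(candidates, good_docs):
        """Rank candidate texts by similarity to the centroid of the
        documents already judged 'good'."""
        centroid = Counter()
        for doc in good_docs:
            centroid.update(tf_vector(doc))
        return sorted(candidates,
                      key=lambda d: cosine(tf_vector(d), centroid),
                      reverse=True)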

Manual sentence unit/segment (SU) annotation was also performed as 
part of the transcription task. Three types of sentence-ending SUs were 
identified: statement SUs, question SUs, and incomplete SUs. After 
transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features, and quality 
control procedures applied to completed translations. 

All final data are presented in Tab Delimited Format (TDF). TDF is 
compatible with other transcription formats, such as the Transcriber 
and AG formats, making it easy to process.
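
Because TDF records are plain tab-separated fields, very little code
is needed to consume them. A minimal sketch, with illustrative column
names rather than the official TDF column inventory:

    import csv

    def read_tdf(path, columns):
        """Yield one dict per tab-delimited record. The caller supplies
        the column names; the exact TDF inventory is defined in the
        corpus documentation, not reproduced here."""
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                yield dict(zip(columns, row))

    # Illustrative usage with hypothetical column names:
    # for rec in read_tdf("file.tdf", ["file", "channel", "start",
    #                                  "end", "speaker", "transcript"]):
    #     print(rec["transcript"])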


*LDC's Corpus Catalog Receives Top OLAC Rating*

LDC is pleased to announce that the LDC Corpus Catalog 
<http://www.ldc.upenn.edu/Catalog/> has been awarded a five-star quality 
rating, the highest rating available, by the Open Language Archives 
Community (OLAC) <http://www.language-archives.org/>. OLAC is an 
international partnership of institutions and individuals who are 
creating a worldwide virtual library of language resources by: (i) 
developing consensus on best current practice for the digital archiving 
of language resources, and (ii) developing a network of interoperating 
repositories and services for housing and accessing such resources.  LDC 
supports OLAC and is among the 37 participating archives that have 
contributed over 36,000 records to the combined catalog of language 
resources. OLAC seeks to refine the quality of the metadata in catalog 
records in order to improve the searches users can perform over that 
catalog. When resources are described following the best 
practice guidelines established by OLAC, it increases the likelihood 
that all the resources returned by a query are relevant (precision) and 
that all relevant resources are returned (recall).
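
For concreteness, writing R for the set of relevant resources and Q
for the set a query returns, the two measures are defined as:

    \text{precision} = \frac{|R \cap Q|}{|Q|}, \qquad
    \text{recall} = \frac{|R \cap Q|}{|R|}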

For several fields, metadata in the LDC catalog was missing, 
inaccurate, or non-compliant with OLAC standards.  Over a period of 
a few months, a team at LDC took several steps to make that metadata 
OLAC-compliant.  Most significantly, the language name and the language 
ID for over 400 corpora were reviewed and changed when required to 
conform to the new standard for language identification, ISO 639-3 
<http://www.sil.org/iso639-3/>.  Additional efforts focused on providing 
author information for all corpora and fixing dead links.  Finally, the 
team added a new metadata field to consistently document the "type" of 
each resource, using a standard vocabulary from the digital libraries 
community called DCMI-Type, reliably distinguishing text and sound 
resources.  The benefits of these revisions include improving LDC's 
management of resources in the catalog and helping LDC users quickly 
identify all corpora relevant to their research.
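
As an illustration of the end result, a cleaned-up catalog entry now
carries data along these lines. The field names are hypothetical
(the catalog's internal record layout is not shown here), but "arb"
is a genuine ISO 639-3 code for Standard Arabic and "Text" is a
genuine DCMI-Type vocabulary term:

    # Hypothetical catalog record after the OLAC cleanup.
    record = {
        "catalog_id": "LDC2009T03",
        "title": "GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1",
        "language": "Arabic",
        "language_code": "arb",  # ISO 639-3
        "dcmi_type": "Text",     # distinguishes text from sound resources
    }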

*2009 Publications Pipeline*

For Membership Year 2009 (MY2009), we anticipate releasing a varied 
selection of publications. Many publications are still in development, 
but here is a glimpse of what is in the pipeline for MY2009.  Please 
note that this list is tentative and subject to modification.  Our 
planned publications include:

    /Arabic Gigaword Fourth Edition/ ~ this edition includes our recent
    newswire collections as well as the contents of Arabic Gigaword
    Third Edition (LDC2007T40).  In addition to sources found in
    previous releases, such as Xinhua, Agence France Presse, An Nahar,
    and Al Hayat, this release includes data from several new sources,
    such as Al Quds, Asharq Al-Awsat, and Al Ahram.

    /Chinese Gigaword Fourth Edition/ ~ this edition includes our recent
    newswire collections as well as the contents of the Chinese Gigaword
    Third Edition (LDC2007T38). In addition to sources found in previous
    releases, such as Agence France Presse, Central News Agency (Taiwan),
    Xinhua, and Zaobao, this release includes data from several new
    sources, such as People's Liberation Army Daily, Guangming Daily,
    and China News Service.

    /Chinese Web 5-gram Corpus Version 1/ ~ contains n-grams (unigrams
    to five-grams) and their observed counts in 880 billion tokens of
    Chinese web data collected in March 2008. All text was converted to
    UTF-8. A simple segmenter using the same algorithm used to generate
    the data is included. The set contains 3.9 billion n-grams total
    (a toy counting sketch appears after this list).

    /CoNLL 2008 Shared Task Corpus/ ~ includes syntactic and semantic
    dependencies for Treebank-3 (LDC99T42) data. This corpus was
    developed for the 2008 shared task of the Conference on Natural
    Language Learning (CoNLL 2008). The syntactic information was
    created by converting constituent trees from Treebank-3 to
    dependencies using a set of head percolation rules and a series of
    other transformations; for example, named entity boundaries are
    included from the BBN Pronoun Coreference and Entity Type Corpus
    (LDC2005T33). The semantic dependencies were created by converting
    semantic propositions to a dependency representation. The corpus
    includes propositions centered around both verbal predicates (from
    Proposition Bank I, LDC2004T14) and nominal predicates (from
    NomBank 1.0, LDC2008T24). A sketch of head-percolation-based
    conversion appears after this list.

    /English Gigaword Fourth Edition/ ~ this edition includes our recent
    collections as well as the contents of the English Gigaword Third
    Edition (LDC2007T07).  The sources of text data include Agence
    France Presse, Associated Press, Central News Agency (Taiwan), New
    York Times, Xinhua, and Salon.com.

    /GALE Phase 1 Arabic Newsgroup Parallel Text Part 2/ ~ 145K words
    (263 files) of Arabic newsgroup text and its English translation
    selected from thirty sources. Newsgroups consist of posts to
    electronic bulletin boards, Usenet newsgroups, discussion groups and
    similar forums. This release was used as training data in Phase 1 of
    the DARPA-funded GALE program.

    /GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2/ ~
    a total of 24 hours of Chinese broadcast conversation selected from
    three sources: China Central TV (CCTV), Phoenix TV, and Voice of
    America.  This release was used as training data in Phase 1 of the
    DARPA-funded GALE program.

    /GALE Phase 1 Chinese Newsgroup Parallel Text Part 1/ ~ 240K
    characters (112 files) of Chinese newsgroup text and its English
    translation selected from twenty-five sources.  Newsgroups consist
    of posts to electronic bulletin boards, Usenet newsgroups,
    discussion groups and similar forums. This release was used as
    training data in Phase 1 of the DARPA-funded GALE program.

    /Japanese Web N-gram Corpus Version 1/ ~ contains n-grams (unigrams
    to seven-grams) and their observed counts in 250 billion tokens of
    Japanese web data collected in July 2007. All text was converted to
    UTF-8 and segmented using the publicly available segmenter MeCab.
    The set contains 3.2 billion n-grams total.

    /NIST MetricsMATR08 Development Data/ ~ contains sample data
    extracted from the NIST Open Machine Translation (MT) 2006
    evaluation.  Data includes the English machine translations from 8
    systems and the human reference translations for 25 Arabic source
    language newswire documents, along with corresponding human
    assessments of adequacy and preference.  This data set was
    originally provided to NIST MetricsMATR08 participants for the
    purpose of MT metric development.
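
The constituent-to-dependency conversion mentioned in the CoNLL 2008
item above rests on head percolation rules: for each parent label, an
ordered preference list of child labels. A minimal sketch, assuming a
hypothetical Node class and rule table (this is not the shared task's
actual conversion code):

    class Node:
        """Constituent tree node; leaves carry a word and token index."""
        def __init__(self, label, children=(), word=None, index=None):
            self.label, self.children = label, list(children)
            self.word, self.index = word, index

    def find_head(node, rules):
        """Return the lexical head (a leaf) of a constituent by
        following head percolation rules, falling back to the
        rightmost child when no rule matches."""
        if not node.children:
            return node
        for preferred in rules.get(node.label, ()):
            for child in node.children:
                if child.label == preferred:
                    return find_head(child, rules)
        return find_head(node.children[-1], rules)

    def dependencies(node, rules, deps=None):
        """Collect (dependent_index, head_index) arcs: each non-head
        child's lexical head attaches to the constituent's head."""
        if deps is None:
            deps = []
        if node.children:
            head = find_head(node, rules)
            for child in node.children:
                child_head = find_head(child, rules)
                if child_head is not head:
                    deps.append((child_head.index, head.index))
                dependencies(child, rules, deps)
        return deps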

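Similarly, for the two web n-gram corpora above, "n-grams and their
observed counts" can be made concrete with a toy counter. The released
sets were of course built with large-scale pipelines over web data;
this sketch only illustrates the data model, and assumes segmentation
has already been done upstream (the Chinese corpus ships a simple
segmenter; the Japanese one used MeCab):

    from collections import Counter

    def count_ngrams(tokens, max_n=5):
        """Count every n-gram from unigrams up to max_n over a
        pre-segmented token sequence."""
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts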

2009 Subscription Members are automatically sent all MY2009 data as it 
is released.  2009 Standard Members are entitled to request 16 corpora 
for free from MY2009.   Non-members may license most data for research use.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
