[Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Jul 22 20:29:20 UTC 2014
*- Fall 2014 Data Scholarship Program -*
/New publications:/
*- 2009 NIST Language Recognition Evaluation Test Set -*
*- GALE Arabic-English Word Alignment Training Part 3 -- Web -*
*- GALE Phase 2 Chinese Newswire Parallel Text Part 1 -*
------------------------------------------------------------------------
*Fall 2014 Data Scholarship Program*
Applications are now being accepted through Monday, September 15, 2014,
11:59PM EST for the Fall 2014 LDC Data Scholarship program! The LDC Data
Scholarship program provides university students with access to LDC data
at no cost.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing
their intended use of the data. The proposal should state which data the
student plans to use and how the data will benefit their research
project as well as information on the proposed methodology or algorithm.
Applicants should consult the LDC Catalog
<https://catalog.ldc.upenn.edu/> for a complete list of data distributed
by LDC. Note that a handful of LDC corpora are available only to members
of the Consortium. Applicants are advised to select no more than two
databases.
(2) Letter of Support. Applicants must submit one letter of support from
their thesis adviser or department chair. The letter must confirm that
the department or university lacks the funding to pay the full
non-member fee for the data and verify the student's need for data.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.
*New publications*
(1) 2009 NIST Language Recognition Evaluation Test Set
<https://catalog.ldc.upenn.edu/LDC2014S06> contains approximately 215
hours of conversational telephone speech and radio broadcast
conversation collected by LDC in the following 23 languages and
dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari,
English (American), English (Indian), Farsi, French, Georgian, Hausa,
Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish,
Ukrainian, Urdu and Vietnamese.
The goal of the NIST (National Institute of Standards and Technology)
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE)
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the
baseline of current performance capability for language recognition of
conversational telephone speech and to lay the groundwork for further
research efforts in the field. NIST conducted language recognition
evaluations in 1996 <http://www.itl.nist.gov/iad/mig/tests/lre/1996/>,
2003 <http://www.itl.nist.gov/iad/mig/tests/lre/2003/>, 2005
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/> and 2007
<http://www.itl.nist.gov/iad/mig/tests/lre/2007/>. The 2009
<http://www.itl.nist.gov/iad/mig/tests/lre/2009/> evaluation increased
the number of target languages. Most of the test data originated from
multilingual Voice of America (VOA) radio broadcasts assessed as being
of telephone bandwidth in addition to conversational telephone speech.
Further information regarding this evaluation can be found in the
evaluation plan which is included in the documentation for this release.
LDC released the prior LREs as:
2003 NIST Language Recognition Evaluation (LDC2006S31
<https://catalog.ldc.upenn.edu/LDC2006S31>)
2005 NIST Language Recognition Evaluation (LDC2008S05
<https://catalog.ldc.upenn.edu/LDC2008S05>)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04
<https://catalog.ldc.upenn.edu/LDC2009S04>)
2007 NIST Language Recognition Evaluation Supplemental Training Set
(LDC2009S05 <https://catalog.ldc.upenn.edu/LDC2009S05>)
The VOA speech data was collected by LDC in 2000 and 2001 and
constitutes approximately 75% of the test set. The telephone speech was
taken from LDC's Mixer 3 collection recorded between 2005 and 2007.
All test speech segments are presented as a sampled data stream in
standard 8-bit 8-kHz μ-law format. Each segment is stored separately in
a single channel SPHERE format file. The test segments contain three
nominal durations of speech: 3 seconds, 10 seconds and 30 seconds.
Actual speech durations vary, but were constrained to be within the
ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
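As a rough illustration of the format described above, the following
Python sketch expands 8-bit μ-law codes into linear samples using the
standard G.711 expansion, and maps an actual speech duration onto the
nominal 3/10/30-second classes. The function and constant names are
hypothetical and are not part of the release. The sketch assumes the
SPHERE header has already been stripped, so that the input is the raw
μ-law payload.

```python
SAMPLE_RATE_HZ = 8000  # 8-kHz sampling, one byte per sample

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# taken from the ranges quoted in the announcement.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def ulaw_to_linear(byte_value: int) -> int:
    """Expand one 8-bit mu-law code into a signed linear sample (G.711)."""
    u = ~byte_value & 0xFF            # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_segment(payload: bytes) -> list:
    """Decode a raw mu-law payload (header already removed) to samples."""
    return [ulaw_to_linear(b) for b in payload]


def nominal_duration(speech_seconds: float):
    """Return 3, 10 or 30 for a speech duration, or None if out of range."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


samples = decode_segment(bytes([0xFF, 0x00, 0x80]))
print(samples)                 # [0, -32124, 32124]
print(nominal_duration(11.2))  # 10
```

Note that len(payload) / SAMPLE_RATE_HZ gives the length of the file in
seconds, which is not necessarily the same as the amount of speech in it.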
(2) GALE Arabic-English Word Alignment Training Part 3 -- Web
<https://catalog.ldc.upenn.edu/LDC2014T14> was developed by LDC and
contains 217,158 tokens of word-aligned Arabic and English parallel text
enriched with linguistic tags. This material was used as training data
in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation include the
incorporation of linguistic knowledge in word aligned text as a means to
improve automatic word alignment and machine translation quality. This
is accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of
word tags and alignment link tags are designed in the tagging scheme to
describe these translation units and relations. Tagging adds contextual,
syntactic and language-specific features to the alignment annotation.
Other releases available in this series are:
GALE Chinese-English Word Alignment and Tagging Training Part 1 --
Newswire and Web (LDC2012T16 <http://catalog.ldc.upenn.edu/LDC2012T16>)
GALE Chinese-English Word Alignment and Tagging Training Part 2 --
Newswire (LDC2012T20 <http://catalog.ldc.upenn.edu/LDC2012T20>)
GALE Chinese-English Word Alignment and Tagging Training Part 3 --
Web (LDC2012T24 <http://catalog.ldc.upenn.edu/LDC2012T24>)
GALE Chinese-English Word Alignment and Tagging Training Part 4 --
Web (LDC2013T05 <http://catalog.ldc.upenn.edu/LDC2013T05>)
GALE Chinese-English Word Alignment and Tagging -- Broadcast
Training Part 1 (LDC2013T23 <http://catalog.ldc.upenn.edu/LDC2013T23>)
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and
Web (LDC2014T05 <http://catalog.ldc.upenn.edu/LDC2014T05>)
GALE Arabic-English Word Alignment Training Part 2 -- Newswire
(LDC2014T10 <http://catalog.ldc.upenn.edu/LDC2014T10>)
This release consists of Arabic source web data collected by LDC. The
distribution by genre, words, character tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  Segments
Arabic    WB     2,449  154,144  217,158     7,332
Note that word count is based on the untokenized Arabic source, and
token count is based on the tokenized Arabic source.
The Arabic word alignment tasks consisted of the following components:
Normalizing tokenized tokens as needed
Identifying different types of links
Identifying sentence segments not suitable for annotation
Tagging unmatched words attached to other words or phrases
(3) GALE Phase 2 Chinese Newswire Parallel Text Part 1
<https://catalog.ldc.upenn.edu/LDC2014T15> was developed by LDC. Along
with other corpora, the parallel text in this release comprised training
data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains 117,173 tokens of Chinese
source text and corresponding English translations selected from
newswire data collected by LDC in 2007 and transcribed by LDC or under
its direction.
This release includes 167 source-translation document pairs, comprising
117,173 tokens of translated data. Data is drawn from four distinct
Chinese newswire sources: China News Service, Guangming Daily, People's
Daily and People's Liberation Army Daily.
The data was transcribed by LDC staff and/or transcription vendors under
contract to LDC in accordance with Quick Rich Transcription guidelines
developed by LDC. Transcribers indicated sentence boundaries in addition
to transcribing the text. Data was manually selected for translation
according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented
files were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Chinese to
English translation guidelines. Bilingual LDC staff performed quality
control procedures on the completed translations.
Source data and translations are distributed in TDF format. TDF files
are tab-delimited files containing one segment of text along with meta
information about that segment. Each field in the TDF file is described
in TDF_format.text. All data are encoded in UTF-8.
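A minimal Python sketch of reading such a tab-delimited, UTF-8 encoded
file is shown below. It makes no assumptions about the number or meaning
of the fields, which are documented in TDF_format.text. The function name
and the sample rows are hypothetical and serve only to illustrate the
approach.

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line as a list of tab-separated fields."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Hypothetical sample: two segments with three made-up fields each.
sample = "doc1\t0.0\tfirst segment\ndoc1\t1.5\tsecond segment\n"
rows = list(read_tab_delimited(io.StringIO(sample)))
for fields in rows:
    print(fields)
```

For a file on disk, the same function can be given a handle opened with
open(path, encoding="utf-8", newline="").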
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
University of Pennsylvania       Fax: 1 (215) 573-2175
3600 Market St., Suite 810       ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA       http://www.ldc.upenn.edu