[Corpora-List] News from LDC

Tue Jul 22 20:29:20 UTC 2014

*- Fall 2014 Data Scholarship Program -*

/New publications:/

*- 2009 NIST Language Recognition Evaluation Test Set -*

*- GALE Arabic-English Word Alignment Training Part 3 -- Web -*

*- GALE Phase 2 Chinese Newswire Parallel Text Part 1 -*

------------------------------------------------------------------------

*Fall 2014 Data Scholarship Program*

Applications are now being accepted through Monday, September 15, 2014, 
11:59PM EST for the Fall 2014 LDC Data Scholarship program! The LDC Data 
Scholarship program provides university students with access to LDC data 
at no cost.

This program is open to students pursuing both undergraduate and 
graduate studies at an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use and how the data will benefit their research 
project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Catalog 
<https://catalog.ldc.upenn.edu/> for a complete list of data distributed 
by LDC. Due to certain restrictions, a handful of LDC corpora are 
available only to members of the Consortium. Applicants are advised to 
select no more than two databases.

(2) Letter of Support. Applicants must submit one letter of support from 
their thesis adviser or department chair. The letter must confirm that 
the department or university lacks the funding to pay the full 
non-member fee for the data and verify the student's need for data.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.

*New publications*

(1) 2009 NIST Language Recognition Evaluation Test Set 
<https://catalog.ldc.upenn.edu/LDC2014S06> contains approximately 215 
hours of conversational telephone speech and radio broadcast 
conversation collected by LDC in the following 23 languages and 
dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, 
English (American), English (Indian), Farsi, French, Georgian, Hausa, 
Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, 
Ukrainian, Urdu and Vietnamese.

The goal of the NIST (National Institute of Standards and Technology) 
<http://www.itl.nist.gov/iad/> Language Recognition Evaluation (LRE) 
<http://www.itl.nist.gov/iad/mig/tests/lre/> is to establish the 
baseline of current performance capability for language recognition of 
conversational telephone speech and to lay the groundwork for further 
research efforts in the field. NIST conducted language recognition 
evaluations in 1996 <http://www.itl.nist.gov/iad/mig/tests/lre/1996/>, 
2003 <http://www.itl.nist.gov/iad/mig/tests/lre/2003/>, 2005 
<http://www.itl.nist.gov/iad/mig/tests/lre/2005/> and 2007 
<http://www.itl.nist.gov/iad/mig/tests/lre/2007/>. The 2009 
<http://www.itl.nist.gov/iad/mig/tests/lre/2009/> evaluation increased 
the number of target languages. Most of the test data originated from 
multilingual Voice of America (VOA) radio broadcasts assessed as being 
of telephone bandwidth; the rest was conversational telephone speech. 
Further information regarding this evaluation can be found in the 
evaluation plan which is included in the documentation for this release.

LDC released the prior LREs as:

    2003 NIST Language Recognition Evaluation (LDC2006S31
    <https://catalog.ldc.upenn.edu/LDC2006S31>)

    2005 NIST Language Recognition Evaluation (LDC2008S05
    <https://catalog.ldc.upenn.edu/LDC2008S05>)

    2007 NIST Language Recognition Evaluation Test Set (LDC2009S04
    <https://catalog.ldc.upenn.edu/LDC2009S04>)

    2007 NIST Language Recognition Evaluation Supplemental Training Set
    (LDC2009S05 <https://catalog.ldc.upenn.edu/LDC2009S05>)

The VOA speech data was collected by LDC in 2000 and 2001 and 
constitutes approximately 75% of the test set. The telephone speech was 
taken from LDC's Mixer 3 collection recorded between 2005 and 2007.

All test speech segments are presented as a sampled data stream in 
standard 8-bit 8-kHz μ-law format. Each segment is stored separately in 
a single channel SPHERE format file. The test segments contain three 
nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. 
Actual speech durations vary, but were constrained to be within the 
ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
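
As a rough illustration of the format described above, the following 
Python sketch expands 8-bit μ-law bytes to 16-bit linear samples using 
the standard G.711 expansion and maps a measured speech duration to its 
nominal class. The function names are illustrative and not part of the 
release; the sketch works on raw sample bytes and does not parse the 
SPHERE header.

```python
# Sketch only: helper names are hypothetical, not part of the LDC release.

SAMPLE_RATE_HZ = 8000  # 8-kHz sampling, one 8-bit mu-law byte per sample

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# as stated in the release description.
DURATION_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def ulaw_to_linear(byte):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data):
    """Decode a bytes object of mu-law samples into a list of integers."""
    return [ulaw_to_linear(b) for b in data]


def nominal_duration(speech_seconds):
    """Return 3, 10 or 30 for a speech duration, or None if out of range."""
    for nominal, (low, high) in DURATION_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


def samples_for(seconds):
    """Number of samples (and mu-law bytes) for a duration at 8 kHz."""
    return int(round(seconds * SAMPLE_RATE_HZ))
```

Under these assumptions, 30 seconds of audio corresponds to 240,000 
sample bytes, and a segment with 12 seconds of speech falls in the 
10-second class.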

(2) GALE Arabic-English Word Alignment Training Part 3 -- Web 
<https://catalog.ldc.upenn.edu/LDC2014T14> was developed by LDC and 
contains 217,158 tokens of word-aligned Arabic and English parallel text 
enriched with linguistic tags. This material was used as training data 
in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate 
linguistic knowledge into word-aligned text as a means to improve 
automatic word alignment and machine translation quality. This is 
accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. The tagging 
scheme defines a set of word tags and alignment link tags to describe 
these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.
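
One way to picture the two annotation layers is as a small data 
structure: links between groups of source and target tokens, each 
carrying a link tag, plus per-word tags. The Python sketch below is a 
hypothetical in-memory model only. The class names, field names and tag 
labels are placeholders and do not reflect the file format or tag 
inventory of this release.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlignmentLink:
    """One translation unit: source token indices linked to target indices."""
    source_tokens: tuple   # indices into the Arabic token sequence
    target_tokens: tuple   # indices into the English token sequence
    link_tag: str          # placeholder label, not an actual GALE tag


@dataclass
class AlignedSentence:
    source: list
    target: list
    links: list = field(default_factory=list)
    # (side, token index) -> placeholder word tag
    word_tags: dict = field(default_factory=dict)

    def unaligned_target(self):
        """Target token indices not covered by any link."""
        linked = {i for link in self.links for i in link.target_tokens}
        return [i for i in range(len(self.target)) if i not in linked]


# Placeholder tokens and tags, purely for illustration.
sentence = AlignedSentence(
    source=["s0", "s1"],
    target=["t0", "t1", "t2"],
    links=[AlignmentLink((0,), (0, 1), "LINK_TAG_A")],
    word_tags={("target", 1): "WORD_TAG_X"},
)
```

Here one source token is linked to two target tokens, and the third 
target token remains unaligned until an annotator links or tags it.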

Other releases available in this series are:

    GALE Chinese-English Word Alignment and Tagging Training Part 1 --
    Newswire and Web (LDC2012T16 <http://catalog.ldc.upenn.edu/LDC2012T16>)

    GALE Chinese-English Word Alignment and Tagging Training Part 2 --
    Newswire (LDC2012T20 <http://catalog.ldc.upenn.edu/LDC2012T20>)

    GALE Chinese-English Word Alignment and Tagging Training Part 3 --
    Web (LDC2012T24 <http://catalog.ldc.upenn.edu/LDC2012T24>)

    GALE Chinese-English Word Alignment and Tagging Training Part 4 --
    Web (LDC2013T05 <http://catalog.ldc.upenn.edu/LDC2013T05>)

    GALE Chinese-English Word Alignment and Tagging -- Broadcast
    Training Part 1 (LDC2013T23 <http://catalog.ldc.upenn.edu/LDC2013T23>)

    GALE Arabic-English Word Alignment Training Part 1 -- Newswire and
    Web (LDC2014T05 <http://catalog.ldc.upenn.edu/LDC2014T05>)

    GALE Arabic-English Word Alignment Training Part 2 -- Newswire
    (LDC2014T10 <http://catalog.ldc.upenn.edu/LDC2014T10>)

This release consists of Arabic source web data collected by LDC. The 
distribution by genre, words, character tokens and segments appears below:

    Language   Genre   Files   Words     CharTokens   Segments
    Arabic     WB      2,449   154,144   217,158      7,332

Note that word count is based on the untokenized Arabic source, and 
token count is based on the tokenized Arabic source.
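
The relationship between the two counts can be checked with simple 
arithmetic. The short Python sketch below uses only the figures reported 
above; the variable names are illustrative.

```python
# Figures copied from the distribution reported for this release.
files, words, char_tokens, segments = 2_449, 154_144, 217_158, 7_332

# Tokenization splits some untokenized words into several tokens,
# so the token count exceeds the word count.
tokens_per_word = char_tokens / words
words_per_segment = words / segments
segments_per_file = segments / files

print(f"tokens per word:   {tokens_per_word:.2f}")
print(f"words per segment: {words_per_segment:.2f}")
print(f"segments per file: {segments_per_file:.2f}")
```

On these figures, tokenization yields roughly 1.4 tokens per 
untokenized word, segments average about 21 words, and files average 
about three segments.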

The Arabic word alignment tasks consisted of the following components:

    Normalizing tokenized tokens as needed

    Identifying different types of links

    Identifying sentence segments not suitable for annotation

    Tagging unmatched words attached to other words or phrases

(3) GALE Phase 2 Chinese Newswire Parallel Text Part 1 
<https://catalog.ldc.upenn.edu/LDC2014T15> was developed by LDC. Along 
with other corpora, the parallel text in this release comprised training 
data for Phase 2 of the DARPA GALE (Global Autonomous Language 
Exploitation) Program. This corpus contains 117,173 tokens of Chinese 
source text and corresponding English translations selected from 
newswire data collected by LDC in 2007 and transcribed by LDC or under 
its direction.

This release includes 167 source-translation document pairs, comprising 
117,173 tokens of translated data. Data is drawn from four distinct 
Chinese newswire sources: China News Service, Guangming Daily, People's 
Daily and People's Liberation Army Daily.

The data was transcribed by LDC staff and/or transcription vendors under 
contract to LDC in accordance with Quick Rich Transcription guidelines 
developed by LDC. Transcribers indicated sentence boundaries in addition 
to transcribing the text. Data was manually selected for translation 
according to several criteria, including linguistic features, 
transcription features and topic features. The transcribed and segmented 
files were then reformatted into a human-readable translation format and 
assigned to translation vendors. Translators followed LDC's Chinese to 
English translation guidelines. Bilingual LDC staff performed quality 
control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files 
are tab-delimited files containing one segment of text along with meta 
information about that segment. Each field in the TDF file is described 
in TDF_format.text. All data are encoded in UTF-8.
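
Because the files are tab-delimited and UTF-8 encoded, they can be read 
with general-purpose tools. The following Python sketch is a minimal 
reader that makes no assumption about the number or meaning of the 
fields, which are documented in TDF_format.text; the function name is 
illustrative, and any header or comment lines a file may contain are 
returned as ordinary rows.

```python
import csv


def read_tab_delimited(path):
    """Return each non-empty line of a UTF-8, tab-delimited file as a list of fields."""
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the text as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:  # csv yields an empty list for blank lines
                rows.append(fields)
    return rows
```

Each returned row can then be mapped to named fields once the column 
order has been confirmed against the format description shipped with 
the corpus.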

------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
