[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Apr 22 20:10:14 UTC 2014


New publications:

- Domain-Specific Hyponym Relations
- GALE Arabic-English Parallel Aligned Treebank -- Web Training
- Multi-Channel WSJ Audio

------------------------------------------------------------------------

New publications

(1) Domain-Specific Hyponym Relations 
<http://catalog.ldc.upenn.edu/LDC2014T07> was developed by the Shaanxi 
Province Key Laboratory of Satellite and Terrestrial Network Technology 
at Xi'an Jiaotong University <http://www.xjtu.edu.cn/en/>, Xi'an, 
Shaanxi, China. It provides more than 5,000 English hyponym relations in 
five domains: data mining, computer networks, data structures, 
Euclidean geometry and microbiology. All hypernym and hyponym words were 
taken from Wikipedia article titles.

A hyponym relation is an IS-A relation between word senses: the hyponym 
denotes a more specific concept than its hypernym. For example, dog is a 
hyponym of animal, and binary tree is a hyponym of tree structure. 
Applications of domain-specific hyponym relations include taxonomy and 
ontology learning, organizing query results in faceted search, and 
knowledge organization and automated reasoning in knowledge-rich 
applications.
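
Because IS-A is transitive, a set of hyponym relations can be treated as 
a directed graph from hyponym to hypernym. The following is a minimal 
Python sketch of that idea, not part of the release. The first two pairs 
are the examples given above; the third pair and the names 
`build_hypernym_map` and `is_a` are illustrative assumptions and are not 
taken from the corpus.

```python
from collections import defaultdict

# (hyponym, hypernym) pairs. The first two come from the examples above;
# the third is an illustrative assumption, not taken from the corpus.
RELATIONS = [
    ("dog", "animal"),
    ("binary tree", "tree structure"),
    ("binary search tree", "binary tree"),
]


def build_hypernym_map(relations):
    """Map each hyponym to the set of its direct hypernyms."""
    hypernyms = defaultdict(set)
    for hyponym, hypernym in relations:
        hypernyms[hyponym].add(hypernym)
    return hypernyms


def is_a(term, candidate, hypernyms):
    """Return True if `candidate` is reachable from `term` via IS-A links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in hypernyms.get(current, ()):
            if parent == candidate:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


HYPERNYMS = build_hypernym_map(RELATIONS)
```

With this structure, `is_a("binary search tree", "tree structure", 
HYPERNYMS)` follows two links and holds, while the reverse direction 
does not.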

The data is presented in XML format, and each file provides hyponym 
relations in one domain. Within each file, the term, Wikipedia URL, 
hyponym relation and the names of the hyponym and hypernym words are 
included. The distribution of terms and relations is set forth in the 
table below:

Dataset              Terms   Hyponym Relations
Data Mining            278                 364
Computer Network       336                 399
Data Structure         315                 578
Euclidean Geometry     455                 690
Microbiology         1,028               3,533
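
As a quick consistency check, the per-domain counts can be summed. This 
is a small Python sketch using only the figures in the table above; the 
variable names are arbitrary.

```python
# (terms, hyponym relations) per domain, copied from the table above.
COUNTS = {
    "Data Mining": (278, 364),
    "Computer Network": (336, 399),
    "Data Structure": (315, 578),
    "Euclidean Geometry": (455, 690),
    "Microbiology": (1028, 3533),
}

total_terms = sum(terms for terms, _ in COUNTS.values())
total_relations = sum(relations for _, relations in COUNTS.values())

print(f"terms: {total_terms}, relations: {total_relations}")
# prints "terms: 2412, relations: 5564"
```

The total of 5,564 relations agrees with the "more than 5,000" figure 
stated earlier, and the microbiology set accounts for well over half of 
them.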



This data is made available at no cost under the Creative Commons 
Attribution-NonCommercial-ShareAlike 3.0 
<http://creativecommons.org/licenses/by-nc-sa/3.0/> license.



(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training 
<http://catalog.ldc.upenn.edu/LDC2014T08> was developed by LDC and 
contains 69,766 tokens of word-aligned Arabic and English parallel text 
with treebank annotations. This material was used as training data in 
the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological 
and syntactic structures aligned at the sentence level and the 
sub-sentence level. Such data sets are useful for natural language 
processing and related fields, including automatic word alignment system 
training and evaluation, transfer-rule extraction, word sense 
disambiguation, translation lexicon extraction and cultural heritage and 
cross-linguistic studies. In machine translation system development, 
parallel aligned treebanks may improve performance through enhanced 
syntactic parsers, better rules and knowledge about language pairs, and 
reduced word error rate.

In this release, the source Arabic data was translated into English. 
Arabic and English treebank annotations were performed independently. 
The parallel texts were then word-aligned.

LDC previously released Arabic-English Parallel Aligned Treebanks as 
follows:

  * Newswire <http://catalog.ldc.upenn.edu/LDC2013T10>
  * Broadcast News Part 1 <http://catalog.ldc.upenn.edu/LDC2013T14>
  * Broadcast News Part 2 <http://catalog.ldc.upenn.edu/LDC2014T03>

This release consists of Arabic source web data (newsgroups, weblogs) 
collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count 
of files, words, tokens and segments is below.

Language   Files    Words   Tokens   Segments
Arabic       162   46,710   69,766      3,178

Note: Word count is based on the untokenized Arabic source; token count 
is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences 
between words, phrases or groups of words in a set of parallel texts. 
Arabic-English word alignment annotation consisted of the following tasks:

  * Identifying different types of links: translated (correct or
    incorrect) and not translated (correct or incorrect)
  * Identifying sentence segments not suitable for annotation, e.g.,
    blank segments, incorrectly-segmented segments, segments with
    foreign languages
  * Tagging unmatched words attached to other words or phrases
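
One way to picture the result of such annotation is as a set of links, 
each joining a group of source token positions to a group of target 
token positions and carrying a link type. The Python sketch below is 
illustrative only: the class name, field names, type label, helper 
function and the toy transliterated sentence pair are all assumptions 
and do not reflect the file format or label set used in the release.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AlignmentLink:
    source: Tuple[int, ...]  # token positions in the Arabic segment
    target: Tuple[int, ...]  # token positions in the English segment
    link_type: str           # hypothetical label, not the corpus tag set


def unaligned(n_tokens, links, side):
    """Return positions on one side ("source" or "target") covered by no link."""
    covered = {i for link in links for i in getattr(link, side)}
    return [i for i in range(n_tokens) if i not in covered]


# Toy example (assumed, not from the corpus): a transliterated Arabic
# segment and an English rendering.
arabic = ["kataba", "al-walad", "risala"]
english = ["the", "boy", "wrote", "a", "letter"]

LINKS = [
    AlignmentLink(source=(0,), target=(2,), link_type="translated-correct"),
    AlignmentLink(source=(1,), target=(0, 1), link_type="translated-correct"),
    # "a" has no Arabic counterpart here, so it is attached to "letter".
    AlignmentLink(source=(2,), target=(3, 4), link_type="translated-correct"),
]
```

Calling `unaligned(len(english), LINKS, "target")` then lists any 
English positions left without a link, which is one simple way to spot 
words that still need to be linked or attached.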


(3) Multi-Channel WSJ Audio <http://catalog.ldc.upenn.edu/LDC2014S03> 
was developed by the Centre for Speech Technology Research 
<http://www.cstr.ed.ac.uk/> at the University of Edinburgh and contains 
approximately 100 hours of recorded speech from 45 British English 
speakers. Participants read Wall Street Journal texts published in 
1987-1989 in three recording scenarios: a single stationary speaker, two 
stationary overlapping speakers and a single moving speaker.

This corpus was designed to address the challenges of speech recognition 
in meetings, which often occur in rooms with non-ideal acoustic 
conditions and significant background noise, and may contain large 
sections of overlapping speech. Using headset microphones represents one 
approach, but meeting participants may be reluctant to wear them. 
Microphone arrays are another option. MCWSJ supports research in large 
vocabulary tasks using microphone arrays. The news sentences read by 
speakers are taken from WSJCAM0 Cambridge Read News 
<http://catalog.ldc.upenn.edu/LDC95S24>, a corpus originally developed 
for large vocabulary continuous speech recognition experiments, which in 
turn was based on CSR-I (WSJ0) Complete 
<http://catalog.ldc.upenn.edu/LDC93S6A>, made available by LDC to 
support large vocabulary continuous speech recognition initiatives.

Speakers reading news text from prompts were recorded using a headset 
microphone, a lapel microphone and an eight-channel microphone array. In 
the single speaker scenario, participants read from six fixed positions. 
Fixed positions were assigned for the entire recording in the 
overlapping scenario. For the moving scenario, participants moved from 
one position to the next while reading.

Fifteen speakers were recorded for the single scenario, nine pairs for 
the overlapping scenario and nine individuals for the moving scenario. 
Each read approximately 90 sentences.

------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
