[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Apr 22 20:10:14 UTC 2014
New publications:
- Domain-Specific Hyponym Relations
- GALE Arabic-English Parallel Aligned Treebank -- Web Training
- Multi-Channel WSJ Audio
------------------------------------------------------------------------
New publications
(1) Domain-Specific Hyponym Relations
<http://catalog.ldc.upenn.edu/LDC2014T07> was developed by the Shaanxi
Province Key Laboratory of Satellite and Terrestrial Network Technology
at Xi'an Jiaotong University <http://www.xjtu.edu.cn/en/>, Xi'an,
Shaanxi, China. It provides more than 5,000 English hyponym relations in
five domains: data mining, computer networks, data structures,
Euclidean geometry and microbiology. All hypernym and hyponym words were
taken from Wikipedia article titles.
A hyponym relation is a word sense relation of the IS-A type: the hyponym
names a more specific concept than its hypernym. For example, dog is a
hyponym of animal and binary tree is a hyponym of tree structure.
Applications for domain-specific hyponym relations include taxonomy and
ontology learning, query result organization in faceted search, and
knowledge organization and automated reasoning in knowledge-rich
applications.
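Because IS-A relations chain together, a set of hyponym pairs can be
treated as a directed graph and walked upward to find every ancestor of a
term. The Python sketch below illustrates the idea only. The first two
pairs come from the examples above; the third pair and all function names
are assumptions made for illustration and are not taken from the corpus.

```python
from collections import defaultdict


def build_hypernym_map(relations):
    """Map each hyponym to the set of its direct hypernyms."""
    hypernyms = defaultdict(set)
    for hyponym, hypernym in relations:
        hypernyms[hyponym].add(hypernym)
    return hypernyms


def all_hypernyms(term, hypernyms):
    """Collect every ancestor of `term` by following IS-A links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        # .get() avoids creating empty entries in the defaultdict
        for parent in hypernyms.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


relations = [
    ("dog", "animal"),
    ("binary tree", "tree structure"),
    # Assumed pair, added only to show a two-step chain.
    ("tree structure", "data structure"),
]

hypernym_map = build_hypernym_map(relations)
ancestors = all_hypernyms("binary tree", hypernym_map)
```

Tracking visited terms in `seen` keeps the walk finite even if the input
happens to contain a cycle.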
The data is presented in XML format, and each file provides hyponym
relations in one domain. Within each file, the term, Wikipedia URL,
hyponym relation and the names of the hyponym and hypernym words are
included. The distribution of terms and relations is set forth in the
table below:
Dataset              Terms   Hyponym Relations
Data Mining            278                 364
Computer Network       336                 399
Data Structure         315                 578
Euclidean Geometry     455                 690
Microbiology         1,028               3,533
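For readers planning to load the per-domain XML files, a minimal Python
sketch follows. The actual schema is not described in this announcement,
so the element names (relations, relation, hyponym, hypernym), the domain
and url attributes, and the sample content are all hypothetical; consult
the corpus documentation for the real layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use different element names.
SAMPLE = """<relations domain="Data Structure">
  <relation>
    <hyponym url="http://en.wikipedia.org/wiki/Binary_tree">binary tree</hyponym>
    <hypernym url="http://en.wikipedia.org/wiki/Tree_structure">tree structure</hypernym>
  </relation>
</relations>"""


def load_relations(xml_text):
    """Return the domain name and a list of (hyponym, hypernym) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for rel in root.iter("relation"):
        pairs.append((rel.findtext("hyponym"), rel.findtext("hypernym")))
    return root.get("domain"), pairs


domain, pairs = load_relations(SAMPLE)
```

Since each file covers one domain, the same function could be applied to
each of the five files in turn.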
This data is made available at no cost under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0
<http://creativecommons.org/licenses/by-nc-sa/3.0/> license.
(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training
<http://catalog.ldc.upenn.edu/LDC2014T08> was developed by LDC and
contains 69,766 tokens of word-aligned Arabic and English parallel text
with treebank annotations. This material was used as training data in
the DARPA GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological
and syntactic structures aligned at the sentence level and the
sub-sentence level. Such data sets are useful for natural language
processing and related fields, including automatic word alignment system
training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction, and cultural heritage and
cross-linguistic studies. With respect to machine translation system
development, parallel aligned treebanks may improve system performance
through enhanced syntactic parsers, better rules and knowledge about
language pairs, and reduced word error rate.
In this release, the source Arabic data was translated into English.
Arabic and English treebank annotations were performed independently.
The parallel texts were then word-aligned.
LDC previously released Arabic-English Parallel Aligned Treebanks as
follows:
* Newswire <http://catalog.ldc.upenn.edu/LDC2013T10>
* Broadcast News Part 1 <http://catalog.ldc.upenn.edu/LDC2013T14>
* Broadcast News Part 2 <http://catalog.ldc.upenn.edu/LDC2014T03>
This release consists of Arabic source web data (newsgroups, weblogs)
collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count
of files, words, tokens and segments is below.
Language   Files    Words   Tokens   Segments
Arabic       162   46,710   69,766      3,178
Note: Word count is based on the untokenized Arabic source; token count
is based on the ATB-tokenized Arabic source.
The purpose of the GALE word alignment task was to find correspondences
between words, phrases or groups of words in a set of parallel texts.
Arabic-English word alignment annotation consisted of the following tasks:
* Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g.,
blank segments, incorrectly-segmented segments, segments with
foreign languages
* Tagging unmatched words attached to other words or phrases
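The link categories listed above could be modelled along the following
lines. This is a sketch only: the class names, field names and sample
links are assumptions made for illustration and do not reflect the file
format used in the release.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    """The four link categories named in the task description."""
    TRANSLATED_CORRECT = "translated, correct"
    TRANSLATED_INCORRECT = "translated, incorrect"
    NOT_TRANSLATED_CORRECT = "not translated, correct"
    NOT_TRANSLATED_INCORRECT = "not translated, incorrect"


@dataclass(frozen=True)
class Link:
    source_tokens: tuple  # indices into the Arabic token sequence
    target_tokens: tuple  # indices into the English token sequence
    link_type: LinkType


def count_link_types(links):
    """Tally how many links fall into each category."""
    return Counter(link.link_type for link in links)


# Invented links for one segment, for illustration only.
links = [
    Link((0,), (0, 1), LinkType.TRANSLATED_CORRECT),
    Link((1,), (), LinkType.NOT_TRANSLATED_CORRECT),
    Link((2,), (2,), LinkType.TRANSLATED_CORRECT),
]
counts = count_link_types(links)
```

Storing token indices as tuples lets one link cover a phrase or group of
words on either side, matching the stated aim of aligning words, phrases
or groups of words.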
(3) Multi-Channel WSJ Audio <http://catalog.ldc.upenn.edu/LDC2014S03>
was developed by the Centre for Speech Technology Research
<http://www.cstr.ed.ac.uk/> at the University of Edinburgh and contains
approximately 100 hours of recorded speech from 45 British English
speakers. Participants read Wall Street Journal texts published in
1987-1989 in three recording scenarios: a single stationary speaker, two
stationary overlapping speakers and a single moving speaker.
This corpus was designed to address the challenges of speech recognition
in meetings, which often occur in rooms with non-ideal acoustic
conditions and significant background noise, and may contain large
sections of overlapping speech. Using headset microphones represents one
approach, but meeting participants may be reluctant to wear them.
Microphone arrays are another option. Multi-Channel WSJ Audio (MCWSJ)
supports research in large vocabulary tasks using microphone arrays. The
news sentences read by
speakers are taken from WSJCAM0 Cambridge Read News
<http://catalog.ldc.upenn.edu/LDC95S24>, a corpus originally developed
for large vocabulary continuous speech recognition experiments, which in
turn was based on CSR-I (WSJ0) Complete
<http://catalog.ldc.upenn.edu/LDC93S6A>, made available by LDC to
support large vocabulary continuous speech recognition initiatives.
Speakers reading news text from prompts were recorded using a headset
microphone, a lapel microphone and an eight-channel microphone array. In
the single speaker scenario, participants read from six fixed positions.
Fixed positions were assigned for the entire recording in the
overlapping scenario. For the moving scenario, participants moved from
one position to the next while reading.
Fifteen speakers were recorded for the single scenario, nine pairs for
the overlapping scenario and nine individuals for the moving scenario.
Each read approximately 90 sentences.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: 1 (215) 573-1275
University of Pennsylvania      Fax: 1 (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu