[Corpora-List] News from LDC - July 2013
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Jul 23 20:37:53 UTC 2013
- Fall 2013 Data Scholarship Program

New publications:

- Chinese Proposition Bank 3.0
- GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1
------------------------------------------------------------------------
*Fall 2013 Data Scholarship Program*
Applications are now being accepted through September 16, 2013, 11:59PM
EST for the Fall 2013 LDC Data Scholarship program! The LDC Data
Scholarship program provides university students with access to LDC data
at no cost.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) /Data Use Proposal/. Applicants must submit a proposal describing
their intended use of the data. The proposal should state which data the
student plans to use and how the data will benefit their research
project as well as information on the proposed methodology or algorithm.
Applicants should consult the LDC Corpus Catalog
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data
distributed by LDC. A handful of LDC corpora are available only to
members of the Consortium. Applicants are advised to select at most two
databases.
(2) /Letter of Support/. Applicants must submit one letter of support
from their thesis adviser or department chair. The letter must confirm
that the department or university lacks the funding to pay the full
Non-member Fee for the data and verify the student's need for data.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<http://www.ldc.upenn.edu/About/scholarships.html> page.
Students can email their applications to the LDC Data Scholarship
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent
by email from the same address.
The deadline for the Fall 2013 program is Monday, September 16, 2013,
11:59PM EST.
*New publications*
(1) Chinese Proposition Bank 3.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T13> is
a continuation of the Chinese Proposition Bank
<http://www.cs.brandeis.edu/%7Eclp/ctb/cpb/> project, which aims to create
a corpus of text annotated with information about basic semantic
propositions. Chinese Proposition Bank 3.0 adds predicate-argument
annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T07>).
The data sources comprise newswire, magazine articles, various
broadcast news and broadcast conversation programming, web newsgroups
and weblogs.
LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T23>) and
Chinese Proposition Bank 2.0 (LDC2008T07
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>).
This release contains the predicate-argument annotation of 173,206 verb
instances and 14,525 noun instances. The annotation of nouns is limited
to nominalizations that have a corresponding verb. The general
annotation guidelines and the lexical guidelines (called frame files)
for each verbal and nominal predicate are also included in this release.
Below are some statistics about the corpus.
* Total propositions for verbs - 173,206
* Total propositions for nouns - 14,525
* Total verbs framed - 24,642
* Total framesets - 26,467
* Verbs with multiple framesets - 1,337
* Average framesets per verb - 1.07
* Total nouns framed - 1,421
* Total noun framesets - 1,528
* Nouns with multiple framesets - 48
* Average framesets per noun - 1.08
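The averages in the list above can be checked from the other counts. The following is only a minimal sketch: the variable names are made up for illustration, and it assumes each average is simply framesets divided by framed items, rounded to two decimals, which the announcement does not state explicitly.

```python
# Counts copied from the statistics list above.
# Variable names are illustrative, not part of the corpus.
verb_propositions = 173_206
noun_propositions = 14_525
verbs_framed = 24_642
verb_framesets = 26_467
nouns_framed = 1_421
noun_framesets = 1_528

# Assumed definition: average = framesets / framed items, rounded to 2 places.
avg_framesets_per_verb = round(verb_framesets / verbs_framed, 2)
avg_framesets_per_noun = round(noun_framesets / nouns_framed, 2)

# Sum of verb and noun propositions, for comparison with the 187,731
# figure quoted earlier in the announcement.
total_propositions = verb_propositions + noun_propositions

print(avg_framesets_per_verb, avg_framesets_per_noun, total_propositions)
```

Under that assumption the computed values agree with the published 1.07 and 1.08, and the two proposition counts sum to 187,731.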
(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part
1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T14> was
developed by LDC and contains 115,826 tokens of word-aligned Arabic and
English parallel text with treebank annotations. This material was used
as training data in the DARPA GALE (Global Autonomous Language
Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological
and syntactic structures aligned at the sentence level and the
sub-sentence level. Such data sets are useful for natural language
processing and related fields, including automatic word alignment system
training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction and cultural heritage and
cross-linguistic studies. With respect to machine translation system
development, parallel aligned treebanks may improve system performance
with enhanced syntactic parsers, better rules and knowledge about
language pairs and reduced word error rate.
In this release, the source Arabic data was translated into English.
Arabic and English treebank annotations were performed independently.
The parallel texts were then word aligned. The material in this corpus
corresponds to a portion of the Arabic treebanked data in Arabic
Treebank - Broadcast News v1.0 (LDC2012T07
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07>).
The source data consists of Arabic broadcast news programming collected
by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data
is encoded as UTF-8. A count of files, words, tokens and segments is below.
Language   Files   Words    Tokens    Segments
Arabic     28      89,213   115,826   4,824
Note: Word count is based on the untokenized Arabic source. Token count
is based on the ATB-tokenized Arabic source.
The purpose of the GALE word alignment task was to find correspondences
between words, phrases or groups of words in a set of parallel texts.
Arabic-English word alignment annotation consisted of the following tasks:
* Identifying different types of links: translated (correct or
incorrect) and not translated (correct or incorrect)
* Identifying sentence segments not suitable for annotation, e.g.,
blank segments, incorrectly-segmented segments, segments with
foreign languages
* Tagging unmatched words attached to other words or phrases
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
University of Pennsylvania       Fax: 1 (215) 573-2175
3600 Market St., Suite 810       ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA       http://www.ldc.upenn.edu