[Corpora-List] News from LDC - July 2013

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Jul 23 20:37:53 UTC 2013


- Fall 2013 Data Scholarship Program -

/New publications:/

- Chinese Proposition Bank 3.0 -

- GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 -

------------------------------------------------------------------------

*Fall 2013 Data Scholarship Program*

Applications are now being accepted through September 16, 2013, 11:59PM 
EST for the Fall 2013 LDC Data Scholarship program! The LDC Data 
Scholarship program provides university students with access to LDC data 
at no cost.


This program is open to students pursuing both undergraduate and 
graduate studies in an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) /Data Use Proposal/. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use and how the data will benefit their research 
project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog 
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data 
distributed by LDC. Due to certain restrictions, a handful of LDC 
corpora are available only to members of the Consortium. Applicants are 
advised to select at most two databases.

(2) /Letter of Support/. Applicants must submit one letter of support 
from their thesis adviser or department chair. The letter must confirm 
that the department or university lacks the funding to pay the full 
Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<http://www.ldc.upenn.edu/About/scholarships.html> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Fall 2013 program is Monday, September 16, 2013, 
11:59PM EST.


*New publications*


(1) Chinese Proposition Bank 3.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T13> is 
a continuation of the Chinese Proposition Bank 
<http://www.cs.brandeis.edu/%7Eclp/ctb/cpb/> project, which aims to create 
a corpus of text annotated with information about basic semantic 
propositions. Chinese Proposition Bank 3.0 adds predicate-argument 
annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T07>). 
The data sources consist of newswire, magazine articles, various 
broadcast news and broadcast conversation programming, web newsgroups 
and weblogs.

LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T23>) and 
Chinese Proposition Bank 2.0 (LDC2008T07 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>).

This release contains the predicate-argument annotation of 173,206 verb 
instances and 14,525 noun instances. The annotation of nouns is limited 
to nominalizations that have a corresponding verb. The general 
annotation guidelines and the lexical guidelines (called frame files) 
for each verbal and nominal predicate are also included in this release. 
Below are some statistics about the corpus.

  * Total propositions for verbs - 173,206
  * Total propositions for nouns - 14,525
  * Total verbs framed - 24,642
  * Total framesets - 26,467
  * Verbs with multiple framesets - 1,337
  * Average framesets per verb - 1.07
  * Total nouns framed - 1,421
  * Total noun framesets - 1,528
  * Nouns with multiple framesets - 48
  * Average framesets per noun - 1.08
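
The averages and the overall total can be re-derived from the counts in 
the list. The following is a minimal sketch in Python; the variable names 
are illustrative only, and it assumes that "Total framesets" refers to 
verb framesets, which the reported average of 1.07 suggests.

```python
# Counts copied from the statistics list; names are illustrative only.
verb_props = 173_206      # total propositions for verbs
noun_props = 14_525       # total propositions for nouns
verbs_framed = 24_642
verb_framesets = 26_467   # assumed to be the verb-only frameset count
nouns_framed = 1_421
noun_framesets = 1_528

# Sum of verb and noun propositions.
total_annotated = verb_props + noun_props

# Average number of framesets per framed predicate, to two decimals.
avg_verb = round(verb_framesets / verbs_framed, 2)
avg_noun = round(noun_framesets / nouns_framed, 2)

print(total_annotated, avg_verb, avg_noun)
```

The sum of the verb and noun propositions equals the 187,731 figure 
quoted for this release, and the two ratios round to the averages given 
in the list.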


(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T14> was 
developed by LDC and contains 115,826 tokens of word-aligned Arabic and 
English parallel text with treebank annotations. This material was used 
as training data in the DARPA GALE (Global Autonomous Language 
Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological 
and syntactic structures and aligned at the sentence level and the 
sub-sentence level. Such data sets are useful for natural language 
processing and related fields, including automatic word alignment system 
training and evaluation, transfer-rule extraction, word sense 
disambiguation, translation lexicon extraction, and cultural heritage and 
cross-linguistic studies. With respect to machine translation system 
development, parallel aligned treebanks may improve system performance 
through enhanced syntactic parsers, better rules and knowledge about 
language pairs, and reduced word error rate.

In this release, the source Arabic data was translated into English. 
Arabic and English treebank annotations were performed independently. 
The parallel texts were then word aligned. The material in this corpus 
corresponds to a portion of the Arabic treebanked data in Arabic 
Treebank - Broadcast News v1.0 (LDC2012T07 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07>).

The source data consists of Arabic broadcast news programming collected 
by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data 
is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language    Files    Words     Tokens     Segments
Arabic      28       89,213    115,826    4,824

Note: Word count is based on the untokenized Arabic source. Token count 
is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences 
between words, phrases or groups of words in a set of parallel texts. 
Arabic-English word alignment annotation consisted of the following tasks:

  * Identifying different types of links: translated (correct or
    incorrect) and not translated (correct or incorrect)
  * Identifying sentence segments not suitable for annotation, e.g.,
    blank segments, incorrectly-segmented segments, segments with
    foreign languages
  * Tagging unmatched words attached to other words or phrases


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu

