[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Nov 21 21:58:37 UTC 2012


*Spring 2013 LDC Data Scholarship Program*

/New publications:/

*Annotated English Gigaword*
*Chinese-English Semiconductor Parallel Text*
*GALE Phase 2 Arabic Newswire Parallel Text*

------------------------------------------------------------------------

*Spring 2013 LDC Data Scholarship Program*

Applications are now being accepted through January 15, 2013, 11:59PM 
EST for the Spring 2013 LDC Data Scholarship program! The LDC Data 
Scholarship program provides university students with access to LDC 
data at no cost. During previous program cycles, LDC has awarded 
no-cost copies of LDC data to over 25 individual students and student 
research groups.

This program is open to students pursuing undergraduate or graduate 
studies at an accredited college or university. LDC Data Scholarships 
are not restricted to any particular field of study; however, students 
must demonstrate a well-developed research agenda and a bona fide 
inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use and how the data will benefit their research 
project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog 
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of 
data distributed by LDC. Due to licensing restrictions, a handful of 
LDC corpora are available only to members of the Consortium. 
Applicants are advised to select no more than one or two datasets; 
students may apply for additional datasets in a following cycle once 
they have completed processing of the initial datasets and have 
published or presented the work in a juried venue.

(2) Letter of Support. Applicants must submit one letter of support from 
their thesis adviser or department chair. The letter must verify the 
student's need for data and confirm that the department or university 
lacks the funding to pay the full Non-member Fee for the data or to join 
the consortium.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship page 
<http://www.ldc.upenn.edu/About/scholarships.html>.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Spring 2013 program cycle is January 15, 2013, 
11:59PM EST.



*New publications*

(1) Annotated English Gigaword 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T21> 
was developed by Johns Hopkins University's Human Language Technology 
Center of Excellence <http://hltcoe.jhu.edu/>. It adds automatically 
generated syntactic and discourse structure annotation to English 
Gigaword Fifth Edition (LDC2011T07 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07>) 
and also contains an API and tools for reading the dataset's XML 
files. The goal of the annotation is to provide a standardized corpus 
for knowledge extraction and distributional semantics, enabling 
broader involvement by researchers in large-scale 
knowledge-acquisition efforts.

Annotated English Gigaword contains the nearly ten million documents 
(over four billion words) of the original English Gigaword Fifth Edition 
from seven news sources:

  * Agence France-Presse, English Service (afp_eng)
  * Associated Press Worldstream, English Service (apw_eng)
  * Central News Agency of Taiwan, English Service (cna_eng)
  * Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  * Washington Post/Bloomberg Newswire Service (wpb_eng)
  * New York Times Newswire Service (nyt_eng)
  * Xinhua News Agency, English Service (xin_eng)

The following layers of annotation were added:

  * Tokenized and segmented sentences
  * Treebank-style constituent parse trees
  * Syntactic dependency trees
  * Named entities
  * In-document coreference chains

The annotation was performed in a three-step process: (1) the data was 
preprocessed and sentences selected for annotation (sentences with more 
than 100 tokens were excluded); (2) syntactic parses were derived; and 
(3) the parsed output was post-processed to derive syntactic 
dependencies, named entities and coreference chains. Over 183 million 
sentences were parsed.
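The length cutoff in step (1) amounts to a simple token-count filter. 
A minimal sketch in Python, assuming whitespace tokenization (an 
illustrative stand-in; the corpus's actual tokenizer is not specified 
in this announcement):

```python
# Sketch of the step-(1) length filter described above: sentences with
# more than 100 tokens are excluded before parsing. Whitespace splitting
# here is an assumption, not the corpus's real tokenizer.

def select_for_annotation(sentences, max_tokens=100):
    """Keep only sentences whose token count is at most max_tokens."""
    return [s for s in sentences if len(s.split()) <= max_tokens]

sample = ["A short sentence .", " ".join(["token"] * 150)]
print(select_for_annotation(sample))  # the 150-token sentence is dropped
```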


(2) Chinese-English Semiconductor Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T22> 
was developed by The MITRE Corporation <http://www.mitre.org/>. It 
consists of parallel sentences from a collection of abstracts of 
scientific articles on semiconductors published in Mandarin and 
translated into English by translators with particular expertise in 
the technical area. Translators were instructed to err on the side of 
literal translation where required, but to maintain the technical 
writing style of the source and to make the resulting English as 
natural as possible. The specific translation guidelines the 
translators followed are included in this distribution.

There are 2,169 lines of parallel Mandarin and English, totaling 
125,302 characters of Mandarin and 64,851 words of English, presented 
in a separate UTF-8 plain text file for each language. The sentences 
were translated in their original sequential order but are presented 
in a scrambled order that preserves the alignment: parallel sentences 
at identical line numbers are translations. For example, the 31st line 
of the English file is a translation of the 31st line of the Mandarin 
file. The original line sequence is not provided.
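Because alignment is purely by line number, the two files can be 
paired with a simple line-wise read. A hedged sketch in Python (the 
file paths are hypothetical, not names from the release):

```python
# Sketch of pairing the line-aligned files described above. Each file is
# UTF-8 plain text with one sentence per line, and line N of one file is
# the translation of line N of the other. The paths are hypothetical.

def read_parallel(zh_path, en_path):
    """Return a list of (Mandarin, English) sentence pairs, one per line."""
    with open(zh_path, encoding="utf-8") as zh, \
         open(en_path, encoding="utf-8") as en:
        return [(z.rstrip("\n"), e.rstrip("\n")) for z, e in zip(zh, en)]

# Usage (hypothetical filenames):
#   pairs = read_parallel("semiconductor.zh.txt", "semiconductor.en.txt")
#   pairs[30]  -> the 31st Mandarin line and its English translation
```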


(3) GALE Phase 2 Arabic Newswire Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T17> 
was developed by LDC. Along with other corpora, the parallel text in 
this release comprised training data for Phase 2 of the DARPA GALE 
(Global Autonomous Language Exploitation) Program. This corpus 
contains Modern Standard Arabic source text and corresponding English 
translations selected from newswire data collected in 2007 by LDC and 
transcribed by LDC or under its direction.

GALE Phase 2 Arabic Newswire Parallel Text includes 400 
source-translation pairs, comprising 181,704 tokens of Arabic source 
text and its English translation. Data is drawn from six distinct 
Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds Al-Arabi, An 
Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or 
transcription vendors under contract to LDC in accordance with the 
Quick Rich Transcription guidelines 
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V3.pdf> 
developed by LDC. Transcribers indicated sentence boundaries in addition 
to transcribing the text. Data was manually selected for translation 
according to several criteria, including linguistic features, 
transcription features and topic features. The transcribed and segmented 
files were then reformatted into a human-readable translation format and 
assigned to translation vendors. Translators followed LDC's Arabic to 
English translation guidelines. Bilingual LDC staff performed quality 
control procedures on the completed translations.


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora