[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Dec 21 21:01:04 UTC 2012


- Spring 2013 LDC Data Scholarship Program
- Penn Discourse Treebank Version 2.0 Update

New publications:

- GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
- Russian-English Computer Security Parallel Text

------------------------------------------------------------------------

*Spring 2013 LDC Data Scholarship Program*

The deadline for the Spring 2013 LDC Data Scholarship Program is one 
month away! Student applications are being accepted now through 
January 15, 2013, 11:59 PM EST. The LDC Data Scholarship program provides 
university students with access to LDC data at no cost.  This program is 
open to students pursuing both undergraduate and graduate studies in an 
accredited college or university. LDC Data Scholarships are not 
restricted to any particular field of study; however, students must 
demonstrate a well-developed research agenda and a bona fide inability 
to pay.

Students will need to complete an application consisting of a data 
use proposal and a letter of support from their adviser. For further 
information on application materials and program rules, please visit the 
LDC Data Scholarship page 
<http://www.ldc.upenn.edu/About/scholarships.html>.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.


*Penn Discourse Treebank Version 2.0 Update*


The developers of the Penn Discourse Treebank Version 2.0 LDC2008T05 
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T05> 
(PDTB) have updated this release to add metadata to the Wall Street 
Journal (WSJ) news stories in the corpus. The goal is to make PDTB 
files easier to understand as texts and to help distinguish texts 
from different genres within the WSJ.

The metadata includes the following fields:

  * DD: the date the article appeared in the WSJ
  * AN: unique identifier for the article
  * HL: the column name (for regular features such as Who's News,
    Marketing & Media, Technology), its headline and by-line
  * SO: the source of the article
  * IN: manually-assigned codes or keywords for the article
  * CO: manually-assigned codes for companies or other organizations
  * DATELINE: normally the location where the article was filed, but
    sometimes contains unexpected content
  * GV: branch of government or government agency mentioned in the article
  * SBREAKS: the byte position of section breaks present in the file
  * ARTICLEBREAK: separates files that contain more than one article

Contact LDC to obtain the update.



*New publications*

(1) GALE Chinese-English Word Alignment and Tagging Training Part 3 -- 
Web 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T24> was 
developed by LDC and contains 154,541 tokens of word-aligned Chinese and 
English parallel text enriched with linguistic tags. This material was 
used as training data in the DARPA GALE 
<http://projects.ldc.upenn.edu/gale/index.html> (Global Autonomous 
Language Exploitation) program.

Some approaches to statistical machine translation incorporate 
linguistic knowledge in word-aligned text as a means to improve 
automatic word alignment and machine translation quality. This 
is accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. The tagging 
scheme defines a set of word tags and alignment link tags to 
describe these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- 
Newswire and Web (LDC2012T16) and GALE Chinese-English Word Alignment 
and Tagging Training Part 2 -- Newswire (LDC2012T20) are also available 
through LDC.

This release consists of Chinese source web data (newsgroup, weblog) 
collected by LDC in 2008 and 2009. The distribution by words, character 
tokens and segments appears below:

Language    Files    Words     CharTokens    Segments
Chinese     1249     103027    154541        4842


Note that all token counts are based on the Chinese data only. One token 
is equivalent to one character and one word is equivalent to 1.5 characters.
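As a quick arithmetic check of the note above, the following minimal sketch recomputes the characters-per-word ratio from the figures in the table. The variable names are illustrative only, and the closing observation is an inference from the numbers rather than something stated by LDC.

```python
# Figures copied from the distribution table for this release.
words = 103_027
char_tokens = 154_541

# Characters per word; the note above states this is 1.5.
ratio = char_tokens / words
print(f"characters per word: {ratio:.3f}")

# Dividing the character count by 1.5 and rounding reproduces the
# word count, which is consistent with the stated equivalence.
print(round(char_tokens / 1.5) == words)
```

The second check suggests the word figure may be derived from the character count rather than counted independently, but that is only an assumption based on the arithmetic.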

The Chinese word alignment tasks consisted of the following components:

  * Identifying, aligning, and tagging 8 different types of links
  * Identifying, attaching, and tagging local-level unmatched words
  * Identifying and tagging sentence/discourse-level unmatched words
  * Identifying and tagging all instances of Chinese 的 (DE) except when
    they were a part of a semantic link.




(2) Russian-English Computer Security Parallel Text 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T23> was 
developed by The MITRE Corporation <http://www.mitre.org/>. It consists 
of parallel sentences from a set of computer security reports published 
in Russian and translated into English by translators with particular 
expertise in the technical area. Translators were instructed to err on 
the side of literal translation if required, but to maintain the 
technical writing style of the source and to make the resulting English 
as natural as possible. The translators followed specific guidelines for 
translation, and those are included in this distribution.

There are 6,276 lines of parallel Russian and English, with a total of 
60,059 words of Russian and 76,437 words of English, presented in a 
separate UTF-8 plain text file for each language. The sentences were 
translated in their original sequence but are presented in scrambled 
order. The scrambling preserves the pairing, so lines at the same 
line number in the two files are translations of each other. For 
example, the 31st line of the English file is a translation of the 31st 
line of the Russian file. The original line sequence is not provided. 
A further 1,694 untranslated lines (such as code snippets) are included 
as a separate file.
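
Because the two files are aligned by line number, sentence pairs can be recovered by reading them in step. The sketch below is one hedged way to do that in Python. The function names and any file names passed to them are hypothetical, since the announcement does not give the actual file names in the distribution.

```python
def pair_lines(ru_lines, en_lines):
    """Pair Russian and English lines found at identical positions."""
    ru = [line.rstrip("\r\n") for line in ru_lines]
    en = [line.rstrip("\r\n") for line in en_lines]
    # The corpus is described as line-aligned, so a length mismatch
    # means the inputs are not a matching pair of files.
    if len(ru) != len(en):
        raise ValueError(f"line count mismatch: {len(ru)} vs {len(en)}")
    return list(zip(ru, en))


def read_parallel(ru_path, en_path):
    """Read two UTF-8 plain text files and return (russian, english) pairs."""
    with open(ru_path, encoding="utf-8") as ru_file, \
         open(en_path, encoding="utf-8") as en_file:
        return pair_lines(ru_file, en_file)


# Hypothetical usage; substitute the real paths from the distribution:
# pairs = read_parallel("russian.txt", "english.txt")
# russian_31, english_31 = pairs[30]  # the 31st line of each file
```

Failing on a line count mismatch, rather than silently truncating to the shorter file, guards against pairing sentences that are not translations of each other.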


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu



