[Corpora] [Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Nov 21 22:56:29 UTC 2014


*Fall 2014 Data Scholarship Recipients*
*Spring 2015 Data Scholarship Program*
*LDC is now on Twitter*

/New publications:/

*Boulder Lies and Truth*
*GALE Chinese-English Word Alignment and Tagging -- Broadcast Training 
Part 2*
*GALE Phase 2 Chinese Web Parallel Text*

------------------------------------------------------------------------

*Fall 2014 Data Scholarship Recipients*

LDC is pleased to announce the student recipients of the Fall 2014 LDC 
Data Scholarship program 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>. 
The following students have received no-cost copies of LDC data:

    Mohammed Abumatar ~ University of Jordan (Jordan), BSc candidate,
    Computer Engineering.  Mohammed has been awarded copies of MADCAT
    Phase 1-3 Training Data for his work in handwriting recognition.

    Ramy Baly ~ American University of Beirut (Lebanon), PhD candidate,
    Electrical and Computer Engineering.  Ramy has been awarded copies
    of Arabic Treebank Parts 1-3 for his work in opinion mining.

    Abbas Khosravanai ~ Amirkabir University of Technology (Iran), PhD
    candidate, Computer Engineering.  Abbas has been awarded a copy of
    2008 NIST Speaker Recognition for his work in robust speaker
    recognition.

    Phuc Nguyen ~ University of North Texas (USA), PhD candidate,
    Computer Science and Engineering.  Phuc has been awarded a copy of
    Message Understanding Conference (MUC) 7 for his work in named
    entity recognition.


*Spring 2015 Data Scholarship Program*

Applications are now being accepted through Thursday, January 15, 2015, 
11:59PM EST for the Spring 2015 LDC Data Scholarship program. The LDC 
Data Scholarship program provides university students with access to LDC 
data at no cost. During previous program cycles, LDC has awarded no-cost 
copies of LDC data to over 40 individual students and student research 
groups. This program is open to students pursuing both undergraduate and 
graduate studies in an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay.


The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use, explain how the data will benefit their research 
project, and describe the proposed methodology or algorithm.

(2) Letter of Support. Applicants must submit one letter of support from 
their thesis adviser or department chair. The letter must verify the 
student's need for data and confirm that the department or university 
lacks the funding to pay the full non-member fee for the data or to join 
the Consortium.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Spring 2015 program cycle is January 15, 2015, 
11:59PM EST.


*LDC is now on Twitter*

LDC now has a Twitter feed <https://twitter.com/LDCupenn>. Start 
following us today for updates on new corpora releases and the latest 
LDC news.



*New publications*

(1) Boulder Lies and Truth <https://catalog.ldc.upenn.edu/LDC2014T24> 
was developed at the University of Colorado Boulder and contains 
approximately 1,500 elicited English reviews of hotels and electronics 
for the purpose of studying deception in written language. Reviews were 
collected by crowd-sourcing with Amazon Mechanical Turk.

Each review was required to be original and was checked for plagiarism 
against the web. Reviews were annotated with respect to the following 
three dimensions:

    Domain: Electronics (e.g., iPhone) or Hotels

    Sentiment: Positive or Negative

    Truth Value:

        a) Truthful: a review about an object known by the writer,
        reflecting the real sentiment of the writer toward the object
        of the review

        b) Opposition: a review about an object known by the writer,
        reflecting the opposite sentiment of the writer toward the
        object of the review (i.e., if the writer liked the object,
        they were asked to write a negative review; if the writer did
        not like the object, they were asked to write a positive
        review)

        c) Deceptive (i.e., fabricated): a review written about an
        object not known by the writer, either positive or negative in
        sentiment; the objects reviewed were provided via a URL from
        the tasks in (a) and (b)

Each review was judged a total of 30 times: (1) 10 times to evaluate 
its perceived quality (on a scale of 1-5); (2) 10 times for its 
perceived truthfulness (e.g., truthful or somehow deceptive, a lie or 
a fabrication); and (3) 10 times for its perceived sentiment (i.e., 
star rating).
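
For readers who plan to load the corpus, the annotation scheme described 
above could be modeled roughly as follows. This is only a sketch: the 
class names, field names, and value strings are hypothetical and are not 
taken from the corpus documentation, and the actual file layout of the 
release may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Domain(Enum):
    ELECTRONICS = "electronics"
    HOTELS = "hotels"


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class TruthValue(Enum):
    TRUTHFUL = "truthful"      # known object, writer's real sentiment
    OPPOSITION = "opposition"  # known object, opposite of real sentiment
    DECEPTIVE = "deceptive"    # object not known by the writer (fabricated)


# From the description: each review received 10 judgments of each kind.
JUDGMENTS_PER_KIND = 10


@dataclass
class Review:
    """One elicited review plus its crowd judgments (hypothetical layout)."""
    text: str
    domain: Domain
    sentiment: Sentiment
    truth_value: TruthValue
    quality: List[int] = field(default_factory=list)       # perceived quality, 1-5
    truthfulness: List[str] = field(default_factory=list)  # judges' truthfulness labels
    star_rating: List[int] = field(default_factory=list)   # perceived sentiment

    def total_judgments(self) -> int:
        """Number of judgments collected across all three kinds."""
        return len(self.quality) + len(self.truthfulness) + len(self.star_rating)

    def is_fully_judged(self) -> bool:
        """True when every kind of judgment has exactly 10 entries."""
        kinds = (self.quality, self.truthfulness, self.star_rating)
        return all(len(k) == JUDGMENTS_PER_KIND for k in kinds)
```

A fully judged review under this sketch carries 3 x 10 = 30 judgments, 
which matches the total stated above.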

This data is available at no cost under this user license agreement 
<https://catalog.ldc.upenn.edu/license/boulder-lies-and-truth.pdf>.


(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast 
Training Part 2 <https://catalog.ldc.upenn.edu/LDC2014T25> was developed 
by LDC and contains 65,069 tokens of word-aligned Chinese and English 
parallel text enriched with linguistic tags. This material was used as 
training data in the DARPA GALE (Global Autonomous Language 
Exploitation) program.

Some approaches to statistical machine translation include the 
incorporation of linguistic knowledge in word-aligned text as a means to 
improve automatic word alignment and machine translation quality. This 
is accomplished with two annotation schemes: alignment and tagging. 
Alignment identifies minimum translation units and translation relations 
by using minimum-match and attachment annotation approaches. A set of 
word tags and alignment link tags are designed in the tagging scheme to 
describe these translation units and relations. Tagging adds contextual, 
syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) 
programming collected by LDC in 2008.

The Chinese word alignment tasks consisted of the following components:

    Identifying, aligning, and tagging eight different types of links

    Identifying, attaching, and tagging local-level unmatched words

    Identifying and tagging sentence/discourse-level unmatched words

    Identifying and tagging all instances of Chinese 的 (DE) except when
    they were a part of a semantic link


(3) GALE Phase 2 Chinese Web Parallel Text 
<https://catalog.ldc.upenn.edu/LDC2014T26> was developed by LDC. Along 
with other corpora, the parallel text in this release comprised 
training data for Phase 2 of the DARPA GALE (Global Autonomous Language 
Exploitation) Program. This corpus contains Chinese source text and 
corresponding English translations selected from weblog and newsgroup 
data collected by LDC and translated by LDC or under its direction.

This release includes 46 source-translation document pairs, comprising 
66,779 tokens of translated data. Data is drawn from four Chinese weblog 
and newsgroup sources.

Data was manually selected for translation according to several 
criteria, including linguistic features and topic features. The files 
were formatted into a human-readable translation format and assigned to 
translation vendors. Translators followed LDC's Chinese to English 
translation guidelines. Bilingual LDC staff performed quality control 
procedures on the completed translations.


------------------------------------------------------------------------

-- 

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
