[Corpora] [Corpora-List] News from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Fri Nov 21 22:56:29 UTC 2014
Fall 2014 Data Scholarship Recipients
Spring 2015 Data Scholarship Program
LDC is now on Twitter

New publications:
Boulder Lies and Truth
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
GALE Phase 2 Chinese Web Parallel Text
------------------------------------------------------------------------
*Fall 2014 Data Scholarship Recipients*
LDC is pleased to announce the student recipients of the Fall 2014 LDC
Data Scholarship program
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>. The following
students have received no-cost copies of LDC data:
Mohammed Abumatar ~ University of Jordan (Jordan), BSc candidate,
Computer Engineering. Mohammed has been awarded copies of MADCAT
Phase 1-3 Training Data for his work in handwriting recognition.
Ramy Baly ~ American University of Beirut (Lebanon), PhD candidate,
Electrical and Computer Engineering. Ramy has been awarded copies
of Arabic Treebank Parts 1-3 for his work in opinion mining.
Abbas Khosravanai ~ Amirkabir University of Technology (Iran), PhD
candidate, Computer Engineering. Abbas has been awarded a copy of
2008 NIST Speaker Recognition for his work in robust speaker
recognition.
Phuc Nguyen ~ University of North Texas (USA), PhD candidate,
Computer Science and Engineering. Phuc has been awarded a copy of
Message Understanding Conference (MUC) 7 for his work in named
entity recognition.
*Spring 2015 Data Scholarship Program*
Applications are now being accepted through Thursday, January 15, 2015,
11:59PM EST for the Spring 2015 LDC Data Scholarship program. The LDC
Data Scholarship program provides university students with access to LDC
data at no cost. During previous program cycles, LDC has awarded no-cost
copies of LDC data to over 40 individual students and student research
groups. This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and
a bona fide inability to pay.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing
their intended use of the data. The proposal should state which data the
student plans to use and how the data will benefit their research
project as well as information on the proposed methodology or algorithm.
(2) Letter of Support. Applicants must submit one letter of support from
their thesis adviser or department chair. The letter must verify the
student's need for data and confirm that the department or university
lacks the funding to pay the full non-member fee for the data or to join
the Consortium.
For further information on application materials and program rules,
please visit the LDC Data Scholarship
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page.
Students can email their applications to the LDC Data Scholarship
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent
by email from the same address.
The deadline for the Spring 2015 program cycle is January 15, 2015,
11:59PM EST.
*LDC is now on Twitter*
LDC now has a Twitter feed <https://twitter.com/LDCupenn>. Start
following us today for updates on new corpora releases and the latest
LDC news.
*New publications*
(1) Boulder Lies and Truth <https://catalog.ldc.upenn.edu/LDC2014T24>
was developed at the University of Colorado Boulder and contains
approximately 1,500 elicited English reviews of hotels and electronics
for the purpose of studying deception in written language. Reviews were
collected by crowd-sourcing with Amazon Mechanical Turk.
Each review was required to be original and was checked for plagiarism
against the web. Reviews were annotated with respect to the following
three dimensions:
Domain: Electronics (e.g., iPhone) or Hotels
Sentiment: Positive or Negative
Truth Value:
a) Truthful: a review about an object known by the writer, reflecting
the real sentiment of the writer toward the object of the review
b) Opposition: a review about an object known by the writer,
reflecting the opposite sentiment of the writer toward the object of
the review (i.e., if the writer liked the object, they were asked to
write a negative review; if the writer did not like the object, they
were asked to write a positive review)
c) Deceptive (i.e., fabricated): a review, either positive or negative
in sentiment, written about an object not known by the writer;
the objects reviewed were provided via a URL from the tasks in (a)
and (b)
Each review was judged a total of 30 times: (1) 10 times to evaluate
its perceived quality (on a range from 1-5); (2) 10 times with
judgments about its perceived truthfulness (e.g., truthful or
somehow deceptive, a lie or a fabrication); and (3) 10 times for its
perceived sentiment (i.e., star rating).
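The annotation scheme above can be sketched as a small data model. This is only an illustration of the label space described in this announcement, not the corpus's actual file format; the class names, field names, and helper functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


# Hypothetical names: the released corpus may encode these labels differently.
class Domain(Enum):
    ELECTRONICS = "electronics"
    HOTELS = "hotels"


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class TruthValue(Enum):
    TRUTHFUL = "truthful"      # known object, real sentiment
    OPPOSITION = "opposition"  # known object, opposite sentiment
    DECEPTIVE = "deceptive"    # unknown object, fabricated review


@dataclass(frozen=True)
class ReviewLabel:
    """One point in the three-dimensional annotation space."""
    domain: Domain
    sentiment: Sentiment
    truth_value: TruthValue


# Judgment types and per-review counts as stated in the announcement.
JUDGMENTS_PER_REVIEW = {"quality": 10, "truthfulness": 10, "sentiment": 10}


def all_label_combinations():
    """Enumerate every (domain, sentiment, truth value) combination."""
    return [ReviewLabel(d, s, t) for d, s, t in product(Domain, Sentiment, TruthValue)]


def total_judgments(num_reviews):
    """Number of judgments collected for a given number of reviews."""
    return num_reviews * sum(JUDGMENTS_PER_REVIEW.values())
```

Under these assumptions there are 2 x 2 x 3 = 12 possible label combinations, and each review carries 30 judgments, so a collection of roughly 1,500 reviews would involve on the order of 45,000 judgments.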
This data is available at no cost under this user license agreement
<https://catalog.ldc.upenn.edu/license/boulder-lies-and-truth.pdf>.
(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast
Training Part 2 <https://catalog.ldc.upenn.edu/LDC2014T25> was developed
by LDC and contains 65,069 tokens of word-aligned Chinese and English
parallel text enriched with linguistic tags. This material was used as
training data in the DARPA GALE (Global Autonomous Language
Exploitation) program.
Some approaches to statistical machine translation include the
incorporation of linguistic knowledge in word-aligned text as a means to
improve automatic word alignment and machine translation quality. This
is accomplished with two annotation schemes: alignment and tagging.
Alignment identifies minimum translation units and translation relations
by using minimum-match and attachment annotation approaches. A set of
word tags and alignment link tags are designed in the tagging scheme to
describe these translation units and relations. Tagging adds contextual,
syntactic and language-specific features to the alignment annotation.
This release consists of Chinese source broadcast conversation (BC)
programming collected by LDC in 2008.
The Chinese word alignment tasks consisted of the following components:
Identifying, aligning, and tagging eight different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when
they were a part of a semantic link
(3) GALE Phase 2 Chinese Web Parallel Text
<https://catalog.ldc.upenn.edu/LDC2014T26> was developed by LDC. Along
with other corpora, the parallel text in this release comprised
training data for Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and
corresponding English translations selected from weblog and newsgroup
data collected by LDC and translated by LDC or under its direction.
This release includes 46 source-translation document pairs, comprising
66,779 tokens of translated data. Data is drawn from four Chinese weblog
and newsgroup sources.
Data was manually selected for translation according to several
criteria, including linguistic features and topic features. The files
were formatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Chinese to English
translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.
------------------------------------------------------------------------
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: 1 (215) 573-1275
University of Pennsylvania        Fax: 1 (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu