27.2779, FYI: June 2016 Newsletter – LDC

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Wed Jun 29 14:14:40 UTC 2016


LINGUIST List: Vol-27-2779. Wed Jun 29 2016. ISSN: 1069 - 4875.

Subject: 27.2779, FYI: June 2016 Newsletter – LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================


Date: Wed, 29 Jun 2016 10:14:25
From: LDC LDC [ldc at ldc.upenn.edu]
Subject: June 2016 Newsletter – LDC

 
In this newsletter:

Commercial use and LDC data

New publications:
Chinese Treebank 9.0
CHM150
GALE Phase 4 Arabic Weblog Parallel Sentences

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) Chinese Treebank 9.0 consists of approximately two million words of
annotated and parsed text from Chinese newswire, government documents,
magazine articles, various broadcast news and broadcast conversation programs,
web newsgroups, weblogs, discussion forums, chat messages and transcribed
conversational telephone speech. This new data set in the Chinese Treebank
series adds more annotated web data and two new genres – chat messages and
transcribed telephone speech.

There are 3,726 text files in this release, containing 132,076 sentences,
2,084,387 words, and 3,247,331 characters (hanzi or foreign). The data is
provided in UTF-8 encoding, and the annotation uses Penn Treebank-style
labeled brackets. The data is provided in four formats: raw text, word
segmented, POS-tagged, and syntactically bracketed. All files were
automatically verified and manually checked.
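
As a rough illustration of what "Penn Treebank-style labeled brackets" means
in practice, the sketch below reads one bracketed tree and lists its words
with their POS tags. The sample sentence is invented for illustration and is
not taken from the corpus, the function names are hypothetical, and the code
assumes one complete tree per string with no unlabeled outer wrapper; the
files in the release may need additional handling.

```python
def parse_bracketed(text):
    """Parse one Penn Treebank-style bracketed tree into nested lists.

    Assumes every opening bracket is followed directly by a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # the word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return [label] + children

    return read_node()


def leaves(tree):
    """Collect (word, POS tag) pairs from a parsed tree, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(leaves(child))
    return pairs


# Invented sample, UTF-8 text with a clause (IP), noun phrases and a verb.
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
tree = parse_bracketed(sample)
words = leaves(tree)
print(words)
```

Nested lists keep the sketch dependency-free; a treebank reader from an
NLP toolkit would be the more usual choice for working with the full release.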

Chinese Treebank 9.0 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

 
*

(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing
Laboratory of the Faculty of Engineering at the National Autonomous University
of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish
speech, associated transcripts, and speaker metadata. The goal of this work
was to support spoken term detection and forensic speaker identification.

This corpus consists of Mexican Spanish microphone speech from 75 male
speakers and 75 female speakers, recorded in a quiet office environment.
Speakers could answer pre-selected open questions or describe a particular
painting shown to them on a computer monitor. Speaker metadata in this
release includes age, gender, place of birth, place of residence and
parents' nationalities.

CHM150 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. This data is being made available at no cost to
non-member organizations under a research license.

*

(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along
with other corpora, the parallel text in this release comprised training data
for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and
corresponding English translations, selected from newsgroup and weblog data
collected by LDC and translated by LDC or under its direction.

 
The data includes 1,067 source-translation document pairs, comprising 68,346
words (Arabic source) of translated data. 

 
Sentences were selected for translation in two steps. First, files were chosen
using sentence selection scripts provided by GALE program participants SRI
International and IBM. The output was then manually reviewed by LDC staff to
eliminate problematic sentences. Selected files were reformatted into a
human-readable translation format and assigned to translation vendors.
Translators followed LDC's Arabic to English translation guidelines and were
provided with the full source documents containing the target sentences for
their reference. Bilingual LDC staff performed quality control procedures on
the completed translations.

GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
 



Linguistic Field(s): Computational Linguistics





 



------------------------------------------------------------------------------

This year the LINGUIST List hopes to raise $79,000. This money 
will go to help keep the List running by supporting all of our 
Student Editors for the coming year.

Don't forget to check out the Fund Drive 2016 site!

http://funddrive.linguistlist.org/

For all information on donating, including information on how to 
donate by check, money order, PayPal or wire transfer, please visit:
http://funddrive.linguistlist.org/donate/

The LINGUIST List is under the umbrella of Indiana University and
as such can receive donations through the Indiana University Foundation. We
also collect donations via the eLinguistics Foundation, a registered 501(c)
nonprofit organization with the federal tax number 45-4211155. Either
way, the donations can be offset against your federal and sometimes your
state tax return (U.S. taxpayers only). For more information visit the
IRS website, or contact your financial advisor.

Many companies also offer a gift matching program, such that
they will match any gift you make to a non-profit organization.
Normally this entails your contacting your human resources department
and sending us a form that the Indiana University Foundation fills in
and returns to your employer. This is generally a simple administrative
procedure that doubles the value of your gift to LINGUIST, without
costing you an extra penny. Please take a moment to check if
your company operates such a program.


Thank you very much for your support of LINGUIST!
 


----------------------------------------------------------
LINGUIST List: Vol-27-2779	
----------------------------------------------------------






