[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri Aug 24 18:30:49 UTC 2012


------------------------------------------------------------------------
/In this newsletter:/

- LDC and Google Collaboration Results in New Syntactically-Annotated
  Language Resources
- The Future of Language Resources: LDC 20th Anniversary Workshop
- Fall 2012 LDC Data Scholarship Program

/New publications:/

- LDC2012T13: English Web Treebank
- LDC2012T14: GALE Phase 2 Arabic Broadcast Conversation Parallel Text
  Part 2
- LDC2012T12: Spanish TimeBank 1.0

------------------------------------------------------------------------

*LDC and Google Collaboration Results in New Syntactically-Annotated 
Language Resources*

Google Inc. and the Linguistic Data Consortium (LDC) have collaborated to 
develop new syntactically-annotated language resources that enable 
computers to better understand human language. The project, 
funded through a gift from Google in 2010, has resulted in the 
development of the English Web Treebank LDC2012T13 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T13> 
containing over 250,000 words of weblogs, newsgroups, email, reviews and 
question-answers manually annotated for syntactic structure. This 
resource will allow language technology researchers to develop and 
evaluate the robustness of parsing methods in various new web domains. 
It was used in the 2012 shared task on parsing English web text for the 
First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL) 
<https://sites.google.com/site/sancl2012/> which took place at NAACL-HLT 
in Montreal on June 8, 2012. The English Web Treebank is available to 
the research community through LDC's Catalog 
<http://www.ldc.upenn.edu/Catalog/>.

Natural language processing (NLP) is a field of computational linguistic 
research concerned with the interactions between human language and 
computers. Parsing is a discipline within NLP in which computers analyze 
text and determine its syntactic structure. While syntactic parsing is 
already practically useful, Google funded this effort to help the 
research community develop better parsers for web text. The web texts 
collected and annotated by LDC provide new, diverse data for training 
parsing systems.

Google chose LDC for this work based on the Consortium's experience in 
developing and creating syntactic annotations, also known as treebanks. 
Treebanks are critically important to parsing research since they 
provide human-analyzed sentence structures that facilitate training and 
testing scenarios in NLP research. This work extends the existing 
relationship between LDC and Google. LDC has published four other 
Google-developed data sets in the past six years: English, Chinese, 
Japanese and European language n-grams used principally for language 
modeling.

*The Future of Language Resources: LDC 20th Anniversary Workshop*

LDC's 20th Anniversary Workshop is rapidly approaching! The event will 
take place on the University of Pennsylvania's campus on September 6-7, 
2012.

Workshop themes include: the developments in human language technologies 
and associated resources that have brought us to our current state; the 
language resources required by the technical approaches taken and the 
impact of these resources on HLT progress; the applications of HLT and 
resources to other disciplines including law, medicine, economics, the 
political sciences and psychology; the impact of HLTs and related 
technologies on linguistic analysis and novel approaches in fields as 
widespread as phonetics, semantics, language documentation, 
sociolinguistics and dialect geography; and the impact of any of these 
developments on the ways in which language resources are created, shared 
and exploited and on the specific resources required.

Please read more here 
<http://www.ldc.upenn.edu/About/20th_Anniversary_Workshop.html>.

*Fall 2012 LDC Data Scholarship Program*

Applications are now being accepted through September 17, 2012, 11:59PM 
EST for the Fall 2012 LDC Data Scholarship program! The LDC Data 
Scholarship program provides university students with access to LDC data 
at no cost. During previous program cycles, LDC has awarded no-cost 
copies of LDC data to over 20 individual students and student research 
groups.

This program is open to students pursuing either undergraduate or 
graduate studies in an accredited college or university. LDC Data 
Scholarships are not restricted to any particular field of study; 
however, students must demonstrate a well-developed research agenda and 
a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) *Data Use Proposal*. Applicants must submit a proposal describing 
their intended use of the data. The proposal should state which data the 
student plans to use and how the data will benefit their research 
project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog 
<http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data 
distributed by LDC. Due to certain restrictions, a handful of LDC 
corpora are available only to members of the Consortium. Applicants are 
advised to select a maximum of one to two datasets; students may apply 
for additional datasets during the following cycle once they have 
completed processing of the initial datasets and published or presented 
work in some juried venue.

(2) *Letter of Support*. Applicants must submit one letter of support 
from their thesis adviser or department chair. The letter must confirm 
that the department or university lacks the funding to pay the full 
Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, 
please visit the LDC Data Scholarship 
<http://www.ldc.upenn.edu/About/scholarships.html> page.

Students can email their applications to the LDC Data Scholarship 
program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent 
by email from the same address.

The deadline for the Fall 2012 program cycle is September 17, 2012, 
11:59PM EST.


*New publications*

(1) English Web Treebank 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T13> 
was developed by the Linguistic Data Consortium (LDC) with funding 
through a gift from Google Inc. It consists of over 250,000 words of 
English weblogs, newsgroups, email, reviews and question-answers 
manually annotated for syntactic structure and is designed to allow 
language technology researchers to develop and evaluate the robustness 
of parsing methods in those web domains.

This release contains 254,830 word-level tokens and 16,624 
sentence-level tokens of webtext in 1174 files annotated for sentence- 
and word-level tokenization, part-of-speech, and syntactic structure. 
The data is roughly evenly divided across five genres: weblogs, 
newsgroups, email, reviews, and question-answers. The files were 
manually annotated following the sentence-level tokenization guidelines 
for web text and the word-level tokenization guidelines developed for 
English treebanks in the DARPA GALE 
<http://projects.ldc.upenn.edu/gale/index.html> project. Only text from 
the subject line and message body of posts, articles, messages and 
question-answers was collected and annotated.

Non-members may license this data by completing the LDC User Agreement 
for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. 
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address. The first fifty copies of this publication are being made 
available at no charge. After the first fifty copies are distributed, 
the non-member fee of US$175 applies.


(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T14> 
was developed by LDC. Along with other corpora, the parallel text in 
this release comprised training data for Phase 2 of the DARPA GALE 
(Global Autonomous Language Exploitation) Program. This corpus contains 
Modern Standard Arabic source text and corresponding English 
translations selected from broadcast conversation (BC) data collected by 
LDC between 2004 and 2007 and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 includes 
29 source-translation document pairs, comprising 169,488 words of Arabic 
source text and its English translation. Data is drawn from eight 
distinct Arabic programs broadcast between 2004 and 2007 from Aljazeera, 
a regional broadcast programmer based in Doha, Qatar; and Nile TV, an 
Egyptian broadcaster. The programs in this release focus on current 
events topics.

The files in this release were transcribed by LDC staff and/or 
transcription vendors under contract to LDC in accordance with the Quick 
Rich Transcription 
<http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf> guidelines 
developed by LDC. Transcribers indicated sentence boundaries in addition 
to transcribing the text. Data was manually selected for translation 
according to several criteria, including linguistic features, 
transcription features and topic features. The transcribed and segmented 
files were then reformatted into a human-readable translation format and 
assigned to translation vendors. Translators followed LDC's Arabic to 
English translation guidelines. Bilingual LDC staff performed quality 
control procedures on the completed translations.




(3) Spanish TimeBank 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T12> 
was developed by researchers at Barcelona Media 
<http://www.barcelonamedia.org/> and consists of Spanish texts in the 
AnCora corpus <http://clic.ub.edu/corpus/en/ancora> annotated with 
temporal and event information according to the TimeML specification 
language <http://www.timeml.org/site/index.html>.

Spanish TimeBank 1.0 contains stand-off annotations for 210 documents 
with over 75,800 tokens (including punctuation marks) and 68,000 tokens 
(excluding punctuation). The source documents are news stories and 
fiction from the AnCora corpus.

The AnCora corpus is the largest multilayer annotated corpus of Spanish 
and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words 
in Catalan. The AnCora documents are annotated on many linguistic levels 
including structure, syntax, dependencies, semantics and pragmatics. 
That information is not included in this release, but it can be mapped 
to the present annotations. The corpus is freely available from the 
Centre de Llenguatge i Computació (CLiC) <http://clic.ub.edu/ancora>.

Non-members may license this data by completing the LDC User Agreement 
for Non-members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>. 
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address. The publication is being made available at no charge.
------------------------------------------------------------------------


--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------
Linguistic Data Consortium      Phone: 1 (215) 573-1275
University of Pennsylvania        Fax: 1 (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu

