[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 26 21:49:46 UTC 2008


LDC2008T04
*-  OntoNotes Release 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>  -*

LDC2008T05
*-  Penn Discourse Treebank Version 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05>  -*

*-  2007 Member Survey Responses  -*

*-  2008 Publications Pipeline  -*

------------------------------------------------------------------------

*New Publications*


(1) The OntoNotes project is a collaborative effort between BBN 
Technologies, the University of Colorado, the University of 
Pennsylvania, and the University of Southern California's Information 
Sciences Institute. The goal of the project is to annotate a large 
corpus comprising various genres of text (news, conversational telephone 
speech, weblogs, Usenet, broadcast, talk shows) in three languages 
(English, Chinese, and Arabic) with structural information (syntax and 
predicate argument structure) and shallow semantics (word sense linked 
to an ontology and coreference).

OntoNotes Release 1.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21> 
contains 400k words of Chinese newswire data and 300k words of English 
newswire data. The current release, OntoNotes Release 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>, 
adds the following to the corpus: 274k words of Chinese broadcast news 
data and 200k words of English broadcast news data. The current goals 
call for annotation of over a million words each of English and Chinese, 
and half a million words of Arabic over five years. OntoNotes builds on 
two time-tested resources, following the Penn Treebank for syntax and 
the Penn PropBank for predicate-argument structure. Its semantic 
representation will include word sense disambiguation for nouns and 
verbs, with each sense linked to an ontology, as well as coreference 
annotation.


(2) The Penn Discourse Treebank (PDTB) 
<http://www.seas.upenn.edu/%7Epdtb> Project is located at the Institute 
for Research in Cognitive Science at the University of Pennsylvania.  
The goal of the project is to develop a large scale corpus annotated 
with information related to discourse structure. Penn Discourse Treebank 
Version 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05> 
contains annotations of discourse relations and their arguments on the 
one million word Wall Street Journal (WSJ) data in Treebank-2 (LDC95T7) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>.

The PDTB focuses on encoding discourse relations associated with 
discourse connectives, adopting a lexically grounded approach for the 
annotation. The corpus provides annotations for the argument structure 
of Explicit and Implicit connectives, the senses of connectives and the 
attribution of connectives and their arguments. The lexically grounded 
approach exposes a clearly defined level of discourse structure which 
will support the extraction of a range of inferences associated with 
discourse connectives.

The PDTB annotates semantic or informational relations holding between 
two (and only two) Abstract Objects (AOs), expressed either explicitly 
via lexical items or implicitly via adjacency. For the former, the 
lexical items anchoring the relation are annotated as Explicit 
connectives. For the latter, the implicit inferable relations are 
annotated by inserting an Implicit connective that best expresses the 
inferred relation.

Explicit connectives are identified from three grammatical classes: 
subordinating conjunctions (e.g., because, when), coordinating 
conjunctions (e.g., and, or), and discourse adverbials (e.g., however, 
otherwise). Arguments of connectives are simply labeled Arg2 for the 
argument appearing in the clause syntactically bound to the connective, 
and Arg1 for the other argument.  In addition to the argument structure 
of discourse relations, the PDTB also annotates the attribution of 
relations (both explicit and implicit) as well as of each of their 
arguments.

The current release contains 40,600 discourse relation annotations, 
distributed across the following five types: Explicit Relations, Implicit 
Relations, Alternative Lexicalizations, Entity Relations, and No 
Relations. 
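The relation structure described above (a connective linking exactly two arguments, with Arg2 the clause syntactically bound to the connective) can be pictured with a small sketch. The record structure below is purely illustrative — the field names and example sentences are invented for this note and do not reflect the corpus's actual file format:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative PDTB-style relation record (hypothetical field names).
@dataclass
class DiscourseRelation:
    rel_type: str              # "Explicit", "Implicit", "AltLex", "EntRel", or "NoRel"
    connective: Optional[str]  # surface connective, or the inserted one for Implicit
    arg1: str                  # the argument NOT bound to the connective
    arg2: str                  # the clause syntactically bound to the connective

# An Explicit relation anchored by the subordinating conjunction "because":
rel = DiscourseRelation(
    rel_type="Explicit",
    connective="because",
    arg1="The market fell",
    arg2="investors feared a recession",
)
print(rel.rel_type, rel.connective)
```

For an Implicit relation, no connective appears in the text, so the annotator's inserted connective (e.g., "because") would fill the same field.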


*2007 Member Survey Responses*

Please click here 
<https://secure.ldc.upenn.edu/intranet/surveyStatsPublic_2007.jsp?survey_id=1> 
to access a summary of the responses to Questions 1-15 of the 2007 
Member Survey. These questions were sent to all survey recipients.

We also received many suggestions for future releases, among them:

   * More African language publications
   * Gigaword corpora in additional languages
   * More annotated data for a greater variety of uses
   * More parallel text corpora
   * Web blogs and chat room data

Several corpora that would satisfy these needs are prospective 2008 
publications.

The winner of the blind drawing for the $500 benefit for survey 
responses received by January 14, 2008 is Richard Rose of McGill 
University. Congratulations!



*2008 Publications Pipeline*


Membership Year (MY) 2008 is shaping up to be another productive one for 
the LDC. We anticipate releasing a balanced and exciting selection of 
publications.  Here is a glimpse of what is in the pipeline for MY2008. 
(Disclaimer: unforeseen circumstances may lead to modifications of our 
plans.  Please regard this list as tentative.)

    * BLLIP 1994-1997 News Text Release 1 - automatic parses for the
      North American News Text Corpus - NANT (LDC95T21). The parses were
      generated by the Charniak and Johnson Reranking Parser which was
      trained on Wall Street Journal (WSJ) data from Treebank 3
      (LDC99T42). Each file is a sequence of n-best lists containing the
      top n parses of each sentence with the corresponding parser
      probability and reranker score.  The parses may be used in systems
      that are trained on labeled parse trees but require more data
      than is found in the WSJ.  Two versions will be released:  a complete
      'Members-Only' version which contains parses for the entire NANT
      Corpus and a 'Non Member' version for general licensing which
      includes all news text except data from the Wall Street Journal.

    * Chinese Proposition Bank -  the goal of this project is to create
      a corpus of text annotated with information about basic semantic
      propositions. Predicate-argument relations are being added to the
      syntactic trees of the Chinese Treebank Data. This release
      contains the predicate-argument annotation of 81,009 verb
      instances (11,171 unique verbs) and 14,525 noun instances (1,421
      unique nouns). The annotation of nouns is limited to
      nominalizations that have a corresponding verb.

    * English Dictionary of the Tamil Verb - contains translations for
      6,597 English verbs and defines 9,716 Tamil verbs. Each entry
      contains the following: the English entry or head word; the Tamil
      equivalent (in Tamil script and transliteration); the verb class
      and transitivity specification; the spoken Tamil pronunciation
      (audio files in mp3 format); the English definition(s); additional
      Tamil entries (if applicable); example sentences or phrases in
      Literary Tamil, Spoken Tamil (with a corresponding audio file) and
      an English translation; and Tamil synonyms or near-synonyms, where
      appropriate.

    * GALE Phase 1 Arabic Blog Parallel Text -  contains a total of 102K
      words (222 files) of Arabic blog text selected from 33 sources.
      Blogs consist of posts to informal web-based journals of varying
      topical content. Manual sentence units/segments (SU) annotation
      was also performed on a subset of files following LDC's Quick Rich
      Transcription specification.  Files were translated according to
      LDC's GALE Translation guidelines.

    * GALE Phase 1 Chinese Blog Parallel Text - contains a total of 313K
      characters (277 files) of Chinese blog text selected from 8
      sources. Blogs consist of posts to informal web-based journals of
      varying topical content. Manual sentence units/segments (SU)
      annotation was also performed on a subset of files following LDC's
      Quick Rich Transcription specification.  Files were translated
      according to the LDC's GALE Translation guidelines.

    * GALE Phase 1 Arabic Newsgroup Parallel Text - contains a total of
      178K words (264 files) of Arabic newsgroup text selected from 35
      sources. Newsgroups consist of posts to electronic bulletin
      boards, Usenet newsgroups, discussion groups and similar forums.
      Manual sentence units/segments (SU) annotation was also performed
      on a subset of files following LDC's Quick Rich Transcription
      specification.  Files were translated according to LDC's GALE
      Translation guidelines.

    * GALE Phase 1 Chinese Newsgroup Parallel Text - contains a total of
      240K characters (112 files) of Chinese newsgroup text selected
      from 25 sources. Newsgroups consist of posts to electronic
      bulletin boards, Usenet newsgroups, discussion groups and similar
      forums. Manual sentence units/segments (SU) annotation was also
      performed on a subset of files following LDC's Quick Rich
      Transcription specification.  Files were translated according to
      the LDC's GALE Translation guidelines.

    * Hindi WordNet - the first wordnet for an Indian language. Similar
      in design to the Princeton WordNet for English, it incorporates
      additional semantic relations to capture the complexities of
      Hindi.  The WordNet contains 28,604 synsets and 63,436 unique words.
      Created by the NLP group at Indian Institute of Technology Bombay,
      it is inspiring construction of wordnets for many other Indian
      languages, notably Marathi.

    * LCTL Bengali Language Pack  - a set of linguistic resources to
      support technological improvement and development of new
      technology for the Bengali language created in the Less Commonly
      Taught Languages (LCTL) project which covered a total of _
      languages. Package components are: 2.6 million tokens of
      monolingual text, 500,000 tokens of parallel text, a bilingual
      lexicon with 48,000 entries, sentence and word segmenting tools,
      an encoding converter, a part of speech tagger, a morphological
      analyzer, a named entity tagger and 136,000 tokens of named entity
      tagged text, a Bengali-to-English name transliterator, and a
      descriptive grammar created by a PhD research linguist. About
      30,000 tokens of the parallel text are English-to-LCTL
      translations of a "Common Subset" corpus, which will be included
      in all additional LCTL Language Packs.

    * North American News Text Corpus (NANT) Reissue - as a companion to
      BLLIP 1994-1997 News Text Release 1, LDC will reissue the North
      American News Text Corpus (LDC95T21).  Data includes news text
      articles from several sources (L.A. Times/Washington Post, Reuters
      General News, Reuters Financial News, Wall Street Journal, New
      York Times) that have been formatted with TIPSTER-style SGML tags
      to indicate article boundaries and organization of information
      within each article.  Two versions will be released:  a complete
      'Members-Only' version which contains all previously released NANT
      articles and a 'Non Member' version for general licensing which
      includes all news text except data from the Wall Street Journal.
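The BLLIP entry above describes files as sequences of n-best lists, each parse carrying a parser probability and a reranker score. As a hedged sketch of one common use — picking the top parse by reranker score — the tuples and tree strings below are invented for illustration; the actual BLLIP file format may differ:

```python
# Illustrative only: select the highest-reranker-score parse from an
# n-best list of (parser_log_prob, reranker_score, tree_string) tuples.
def best_parse(nbest):
    """Return the tree string with the highest reranker score."""
    return max(nbest, key=lambda item: item[1])[2]

# A hypothetical 2-best list for one sentence:
nbest = [
    (-42.1, 0.3, "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"),
    (-40.7, 0.9, "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"),
]
print(best_parse(nbest))
```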
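The NANT reissue is described as news text formatted with TIPSTER-style SGML tags marking article boundaries. As an illustrative sketch only — the <DOC>, <DOCNO>, and <TEXT> tag names follow common TIPSTER practice, but the reissued corpus's exact markup may differ — iterating over articles might look like:

```python
import re

# Hypothetical TIPSTER-style SGML fragment (contents invented).
sample = """<DOC>
<DOCNO> WSJ940101-0001 </DOCNO>
<TEXT>
Stocks rose on the first trading day of the year.
</TEXT>
</DOC>"""

def iter_articles(sgml):
    """Yield (docno, text) pairs from TIPSTER-style SGML."""
    for m in re.finditer(r"<DOC>(.*?)</DOC>", sgml, re.DOTALL):
        doc = m.group(1)
        docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc)
        text = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
        yield (docno.group(1) if docno else None,
               text.group(1).strip() if text else "")

for docno, text in iter_articles(sample):
    print(docno)
```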



------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                  http://www.ldc.upenn.edu


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora