[Corpora-List] New from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Feb 26 21:49:46 UTC 2008
LDC2008T04
*- OntoNotes Release 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04> -*
LDC2008T05
*- Penn Discourse Treebank Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05> -*
*- 2007 Member Survey Responses -*
*- 2008 Publications Pipeline -*
------------------------------------------------------------------------
*New Publications*
(1) The OntoNotes project is a collaborative effort between BBN
Technologies, the University of Colorado, the University of
Pennsylvania, and the University of Southern California's Information
Sciences Institute. The goal of the project is to annotate a large
corpus comprising various genres of text (news, conversational telephone
speech, weblogs, Usenet, broadcast, talk shows) in three languages
(English, Chinese, and Arabic) with structural information (syntax and
predicate-argument structure) and shallow semantics (word sense linked
to an ontology and coreference).
OntoNotes Release 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21>
contains 400k words of Chinese newswire data and 300k words of English
newswire data. The current release, OntoNotes Release 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>,
adds the following to the corpus: 274k words of Chinese broadcast news
data and 200k words of English broadcast news data. The current goals
call for annotation of over a million words each of English and Chinese,
and half a million words of Arabic over five years. OntoNotes builds on
two time-tested resources, following the Penn Treebank for syntax and
the Penn PropBank for predicate-argument structure. Its semantic
representation will include word sense disambiguation for nouns and
verbs, with each word sense connected to an ontology, and coreference.
(2) The Penn Discourse Treebank (PDTB)
<http://www.seas.upenn.edu/%7Epdtb> Project is located at the Institute
for Research in Cognitive Science at the University of Pennsylvania.
The goal of the project is to develop a large-scale corpus annotated
with information related to discourse structure. Penn Discourse Treebank
Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05>
contains annotations of discourse relations and their arguments on the
one-million-word Wall Street Journal (WSJ) data in Treebank-2 (LDC95T7)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>.
The PDTB focuses on encoding discourse relations associated with
discourse connectives, adopting a lexically grounded approach for the
annotation. The corpus provides annotations for the argument structure
of Explicit and Implicit connectives, the senses of connectives and the
attribution of connectives and their arguments. The lexically grounded
approach exposes a clearly defined level of discourse structure which
will support the extraction of a range of inferences associated with
discourse connectives.
The PDTB annotates semantic or informational relations holding between
two (and only two) Abstract Objects (AOs), expressed either explicitly
via lexical items or implicitly via adjacency. For the former, the
lexical items anchoring the relation are annotated as Explicit
connectives. For the latter, the implicit inferable relations are
annotated by inserting an Implicit connective that best expresses the
inferred relation.
Explicit connectives are identified from three grammatical classes:
subordinating conjunctions (e.g., because, when), coordinating
conjunctions (e.g., and, or), and discourse adverbials (e.g., however,
otherwise). Arguments of connectives are simply labeled Arg2 for the
argument appearing in the clause syntactically bound to the connective,
and Arg1 for the other argument. In addition to the argument structure
of discourse relations, the PDTB also annotates the attribution of
relations (both explicit and implicit) as well as of each of their
arguments.
The current release contains 40,600 discourse relation annotations,
distributed into the following five types: Explicit Relations, Implicit
Relations, Alternative Lexicalizations, Entity Relations, and No
Relations.
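The annotation scheme described above can be pictured as a small record type. The following is only a minimal sketch for illustration: the class and field names (RelationType, DiscourseRelation, explicit_relation) are hypothetical and do not reflect the actual PDTB distribution format or any LDC tool. It encodes just what the description states: five relation types, exactly two arguments, and Arg2 as the argument syntactically bound to the connective.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """The five relation types listed in the announcement."""
    EXPLICIT = "Explicit Relation"
    IMPLICIT = "Implicit Relation"
    ALTERNATIVE_LEXICALIZATION = "Alternative Lexicalization"
    ENTITY_RELATION = "Entity Relation"
    NO_RELATION = "No Relation"


@dataclass
class DiscourseRelation:
    """Hypothetical in-memory record; NOT the PDTB file format."""
    rel_type: RelationType
    arg1: str                          # the argument not bound to the connective
    arg2: str                          # the argument syntactically bound to the connective
    connective: Optional[str] = None   # annotated (Explicit) or inserted (Implicit)
    attribution: Optional[str] = None  # assumed free-text placeholder for attribution


def explicit_relation(connective: str, bound_clause: str, other_clause: str) -> DiscourseRelation:
    """Label the clause bound to the connective as Arg2 and the other clause as Arg1."""
    return DiscourseRelation(
        rel_type=RelationType.EXPLICIT,
        arg1=other_clause,
        arg2=bound_clause,
        connective=connective,
    )


# Invented example sentence: "The market fell because investors panicked."
example = explicit_relation(
    "because",
    bound_clause="investors panicked",
    other_clause="The market fell",
)
print(example.arg1, "|", example.connective, "|", example.arg2)
```

Note that Arg1 and Arg2 are assigned by syntactic binding to the connective, not by the order in which the clauses appear in the text.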
*2007 Member Survey Responses*
A summary of the responses to Questions 1-15 of the 2007 Member Survey
is available at
<https://secure.ldc.upenn.edu/intranet/surveyStatsPublic_2007.jsp?survey_id=1>.
These questions were sent to all survey recipients.
We also received many suggestions for future releases, among them:
* More African language publications
* Gigaword corpora in additional languages
* More annotated data for a greater variety of uses
* More parallel text corpora
* Web blogs and chat room data
Several corpora that would satisfy these needs are prospective 2008
publications.
The winner of the blind drawing for the $500 benefit for survey
responses received by January 14, 2008 is Richard Rose of McGill
University. Congratulations!
*2008 Publications Pipeline*
Membership Year (MY) 2008 is shaping up to be another productive one for
the LDC. We anticipate releasing a balanced and exciting selection of
publications. Here is a glimpse of what is in the pipeline for MY2008.
(Disclaimer: unforeseen circumstances may lead to modifications of our
plans. Please regard this list as tentative.)
* BLLIP 1994-1997 News Text Release 1 - automatic parses for the
North American News Text Corpus - NANT (LDC95T21). The parses were
generated by the Charniak and Johnson Reranking Parser, which was
trained on Wall Street Journal (WSJ) data from Treebank 3
(LDC99T42). Each file is a sequence of n-best lists containing the
top n parses of each sentence with the corresponding parser
probability and reranker score. The parses may be used in systems
that are trained on labeled parse trees but require more data
than is found in WSJ. Two versions will be released: a complete
'Members-Only' version which contains parses for the entire NANT
Corpus and a 'Non Member' version for general licensing which
includes all news text except data from the Wall Street Journal.
* Chinese Proposition Bank - the goal of this project is to create
a corpus of text annotated with information about basic semantic
propositions. Predicate-argument relations are being added to the
syntactic trees of the Chinese Treebank Data. This release
contains the predicate-argument annotation of 81,009 verb
instances (11,171 unique verbs) and 14,525 noun instances (1,421
unique nouns). The annotation of nouns is limited to
nominalizations that have a corresponding verb.
* English Dictionary of the Tamil Verb - contains translations for
6,597 English verbs and defines 9,716 Tamil verbs. Each entry
contains the following: the English entry or head word; the Tamil
equivalent (in Tamil script and transliteration); the verb class
and transitivity specification; the spoken Tamil pronunciation
(audio files in mp3 format); the English definition(s); additional
Tamil entries (if applicable); example sentences or phrases in
Literary Tamil, Spoken Tamil (with a corresponding audio file) and
an English translation; and Tamil synonyms or near-synonyms, where
appropriate.
* GALE Phase 1 Arabic Blog Parallel Text - contains a total of 102K
words (222 files) of Arabic blog text selected from 33 sources.
Blogs consist of posts to informal web-based journals of varying
topical content. Manual sentence unit/segment (SU) annotation
was also performed on a subset of files following LDC's Quick Rich
Transcription specification. Files were translated according to
LDC's GALE Translation guidelines.
* GALE Phase 1 Chinese Blog Parallel Text - contains a total of 313K
characters (277 files) of Chinese blog text selected from 8
sources. Blogs consist of posts to informal web-based journals of
varying topical content. Manual sentence unit/segment (SU)
annotation was also performed on a subset of files following LDC's
Quick Rich Transcription specification. Files were translated
according to LDC's GALE Translation guidelines.
* GALE Phase 1 Arabic Newsgroup Parallel Text - contains a total of
178K words (264 files) of Arabic newsgroup text selected from 35
sources. Newsgroups consist of posts to electronic bulletin
boards, Usenet newsgroups, discussion groups and similar forums.
Manual sentence unit/segment (SU) annotation was also performed
on a subset of files following LDC's Quick Rich Transcription
specification. Files were translated according to LDC's GALE
Translation guidelines.
* GALE Phase 1 Chinese Newsgroup Parallel Text - contains a total of
240K characters (112 files) of Chinese newsgroup text selected
from 25 sources. Newsgroups consist of posts to electronic
bulletin boards, Usenet newsgroups, discussion groups and similar
forums. Manual sentence unit/segment (SU) annotation was also
performed on a subset of files following LDC's Quick Rich
Transcription specification. Files were translated according to
LDC's GALE Translation guidelines.
* Hindi WordNet - the first wordnet for an Indian language. Similar in
design to the Princeton WordNet for English, it incorporates
additional semantic relations to capture the complexities of
Hindi. The WordNet contains 28,604 synsets and 63,436 unique words.
Created by the NLP group at the Indian Institute of Technology Bombay,
it is inspiring the construction of wordnets for many other Indian
languages, notably Marathi.
* LCTL Bengali Language Pack - a set of linguistic resources to
support technological improvement and development of new
technology for the Bengali language, created in the Less Commonly
Taught Languages (LCTL) project, which covered a total of _
languages. Package components are: 2.6 million tokens of
monolingual text, 500,000 tokens of parallel text, a bilingual
lexicon with 48,000 entries, sentence and word segmenting tools,
an encoding converter, a part-of-speech tagger, a morphological
analyzer, a named entity tagger, 136,000 tokens of
named-entity-tagged text, a Bengali-to-English name transliterator, and a
descriptive grammar created by a PhD research linguist. About
30,000 tokens of the parallel text are English-to-LCTL
translations of a "Common Subset" corpus, which will be included
in all additional LCTL Language Packs.
* North American News Text Corpus (NANT) Reissue - as a companion to
BLLIP 1994-1997 News Text Release 1, LDC will reissue the North
American News Text Corpus (LDC95T21). Data includes news text
articles from several sources (L.A.Times/Washington Post, Reuters
General News, Reuters Financial News, Wall Street Journal, New
York Times) that have been formatted with TIPSTER-style SGML tags
to indicate article boundaries and organization of information
within each article. Two versions will be released: a complete
'Members-Only' version which contains all previously released NANT
articles and a 'Non Member' version for general licensing which
includes all news text except data from the Wall Street Journal.
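As an illustration of the n-best lists mentioned in the BLLIP entry above, the sketch below shows how a downstream system might choose one parse per sentence. It is a minimal sketch under stated assumptions: the ScoredParse record, its field names, the example trees and the scores are all invented for illustration, the on-disk format of the release is not modeled, and higher values are assumed to be better for both scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredParse:
    """Hypothetical record for one candidate parse in an n-best list."""
    tree: str              # bracketed parse tree
    parser_score: float    # parser probability (assumed: higher is better)
    reranker_score: float  # reranker score (assumed: higher is better)


def best_by_parser(nbest: List[ScoredParse]) -> ScoredParse:
    """Return the candidate the base parser prefers."""
    return max(nbest, key=lambda p: p.parser_score)


def best_by_reranker(nbest: List[ScoredParse]) -> ScoredParse:
    """Return the candidate the reranker prefers."""
    return max(nbest, key=lambda p: p.reranker_score)


# Invented two-candidate list in which parser and reranker disagree.
nbest = [
    ScoredParse("(S (NP (DT the) (NN dog)) (VP (VBD barked)))", -10.2, 4.1),
    ScoredParse("(S (NP (DT the)) (VP (VB dog) (VBD barked)))", -9.8, 1.3),
]

print(best_by_parser(nbest).tree)
print(best_by_reranker(nbest).tree)
```

Keeping both scores for every candidate lets a user compare the parser's first choice with the reranker's, or apply a different selection rule altogether.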
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: (215) 573-1275
University of Pennsylvania     Fax: (215) 573-2175
3600 Market St., Suite 810     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA     http://www.ldc.upenn.edu