9.598, FYI: LDC Corpora, Lang Universals

LINGUIST Network linguist at linguistlist.org
Tue Apr 21 23:31:37 UTC 1998


LINGUIST List:  Vol-9-598. Wed Apr 22 1998. ISSN: 1068-4875.

Subject: 9.598, FYI: LDC Corpora, Lang Universals

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Editors:  Brett Churchill <brett at linguistlist.org>
          Martin Jacobsen <marty at linguistlist.org>
          Elaine Halleck <elaine at linguistlist.org>
          Anita Huang <anita at linguistlist.org>
          Ljuba Veselinova <ljuba at linguistlist.org>
          Julie Wilson <julie at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/


Editor for this issue: Martin Jacobsen <marty at linguistlist.org>

=================================Directory=================================

1)
Date:  Mon, 20 Apr 1998 16:43:14 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpora from the Linguistic Data Consortium

2)
Date:  Mon, 20 Apr 1998 16:42:01 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpora from the Linguistic Data Consortium

3)
Date:  Mon, 20 Apr 1998 10:00:44 -0700 (MST)
From:  Don Nilsen <don.nilsen at asu.edu>
Subject:  Language Universals: Irony, Language Play, Metaphor, Metonymy

-------------------------------- Message 1 -------------------------------

Date:  Mon, 20 Apr 1998 16:43:14 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpora from the Linguistic Data Consortium




		Announcing NEW RELEASES from the
		    Linguistic Data Consortium

1996 Broadcast News Training Speech Data
1996 Broadcast News Dev. and Eval. Data
1996 Broadcast News Transcripts


The 1996 Broadcast News Speech Corpus contains a total of 104 hours of
broadcasts from ABC, CNN, and CSPAN television networks and NPR and
PRI radio networks with corresponding transcripts. The primary
motivation for this collection is to provide training data for the
DARPA "Hub-4" Project on continuous speech recognition in the
broadcast domain. The speech files are available in a 19-disc training
data set, with one additional disc of development data and one
additional disc of evaluation data. The following programs are
represented in this corpus:

  ABC Nightline
  ABC World Nightly News
  ABC World News Tonight
  CNN Early Edition
  CNN Early Prime News
  CNN Headline News
  CNN Prime Time News
  CNN The World Today
  CSPAN Washington Journal
  NPR All Things Considered
  NPR Marketplace

Transcripts have been made of all recordings in this publication. They
are manually time aligned to the phrasal level and annotated to
identify boundaries between news stories, speaker turn boundaries, and
the gender of the speakers. The released transcripts are in SGML
format; accompanying documentation and an SGML DTD file are included
with the transcription release.  The transcripts are available via ftp.

Because of restrictions imposed by the copyright holders of the news
text, these corpora are available to 1997 and 1998 LDC members only.
Members who wish to receive these corpora MUST SIGN BOTH THE USC AND
THE NPR AGREEMENTS.  These agreements are available on the Linguistic
Data Consortium WWW Home Page at URL

http://www.ldc.upenn.edu/ldc/catalog/index.html.


If you would like to order a copy of these corpora, please email your
request to <ldc at unagi.cis.upenn.edu>. If you need additional
information before placing your order, or would like to inquire about
membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL:

http://www.ldc.upenn.edu/

Information is also available via ftp at ftp.cis.upenn.edu under
pub/ldc; for ftp access, please use "anonymous" as your login name,
and give your email address when asked for a password.
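
The anonymous ftp login described above can be sketched in Python with
the standard ftplib module. This is only an illustration: the host and
directory are the ones named in this announcement, while the function
name list_ldc_files and its ftp_factory parameter are hypothetical, and
the server may not accept connections in this form.

```python
from ftplib import FTP

LDC_FTP_HOST = "ftp.cis.upenn.edu"  # host named in the announcement
LDC_FTP_DIR = "pub/ldc"             # directory named in the announcement


def list_ldc_files(email, ftp_factory=FTP):
    """Log in anonymously and return the names listed under pub/ldc.

    `email` is sent as the password, as the announcement requests.
    `ftp_factory` is a hypothetical hook so the function can be tried
    against a stand-in object without touching the network.
    """
    ftp = ftp_factory(LDC_FTP_HOST)  # ftplib.FTP connects on construction
    try:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(LDC_FTP_DIR)
        return ftp.nlst()
    finally:
        ftp.quit()  # always end the session politely
```

Calling list_ldc_files("you@example.org") would attempt a real
connection; pass a stand-in class as ftp_factory to try the logic
offline.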


-------------------------------- Message 2 -------------------------------

Date:  Mon, 20 Apr 1998 16:42:01 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpora from the Linguistic Data Consortium


		Announcing a NEW RELEASE from the
                   LINGUISTIC DATA CONSORTIUM

			
COMLEX English Syntax Lexicon, Version 3.0


This is a moderately broad coverage English lexicon (with about 38,000
lemmas) developed at New York University under LDC sponsorship. It
contains detailed information about the syntactic characteristics of
each lexical item, and is particularly detailed in its treatment of
subcategorization (complement structures).

In the current dictionary, nouns have 9 possible features and 9
possible complements; adjectives have 7 features and 14 complements;
verbs have 5 features and 92 complements; and adverbs have 11
positional classes and 12 features. The entries for 750 frequent verbs
contain 100 tags each, where a tag includes a pointer to an instance
of that verb in a corpus and the subcategorization appropriate for
that instance.

This latest version of COMLEX Syntax has been updated to include the
adverb classes. We also added diacritics to foreign words, while
retaining the unaccented versions, and performed various other updates
to correct and supplement our lexical entries.  For more details about
this revised version, please contact Adam Meyers at New York
University (meyers at cs.nyu.edu).

This release is accompanied by the COMLEX Syntax Text Corpus, Version
2.0.  The Text corpus consists of material from the following sources:

The Brown Corpus, Francis, W. Nelson, 1964 Brown University, Providence

Wall Street Journal Material, Copyright 1989 Dow Jones, Inc.

San Jose Mercury News, Copyright 1991 San Jose Mercury News

Associated Press, Copyright 1988

Federal Register materials courtesy of IBM; formatted version
copyright 1992, University of Pennsylvania

Computer Library materials copyright owned by Ziff Communications
Company and other parties as their respective interests may appear.

Institutions that have membership in the LDC during the 1998
Membership Year will be able to receive COMLEX Syntax Lexicon 3.0 at
no additional charge, in the same manner as all other text and speech
corpora published by the LDC.  Members who wish to receive this corpus
must sign the COMLEX user agreement.  This agreement is available on
the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/ldc/catalog/index.html.

Nonmembers can receive a copy of COMLEX Syntax Lexicon 3.0 for
research purposes only for a fee of $1500. If you would like to order
a copy of this corpus, please email your request to
ldc at unagi.cis.upenn.edu. If you need additional information before
placing your order, or would like to inquire about membership in the
LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.


-------------------------------- Message 3 -------------------------------

Date:  Mon, 20 Apr 1998 10:00:44 -0700 (MST)
From:  Don Nilsen <don.nilsen at asu.edu>
Subject:  Language Universals: Irony, Language Play, Metaphor, Metonymy

     In response to Arthur Merin's query on "Verbal Irony as a
Language Universal," I have evidence suggesting that it might be, and
even more evidence suggesting that Language Play, Metaphor, and
Metonymy are language universals.  I have bibliographies relating to
these areas for anyone out there who is interested in the current
research.

Don L. F. Nilsen                                  8-)
<don.nilsen at asu.edu> (602) 965-7592; FAX: (602) 965-3451
Executive Secretary
International Society for Humor Studies
English Department
Arizona State University
Tempe, AZ 85287-0302

---------------------------------------------------------------------------
LINGUIST List: Vol-9-598


