8.627, FYI: New Corpus

linguist at linguistlist.org
Tue Apr 29 13:07:56 UTC 1997


LINGUIST List:  Vol-8-627. Tue Apr 29 1997. ISSN: 1068-4875.

Subject: 8.627, FYI: New Corpus

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            T. Daniel Seely: Eastern Michigan U. <seely at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Associate Editors: Ljuba Veselinova <ljuba at linguistlist.org>
                   Ann Dizdar <ann at linguistlist.org>
Assistant Editor:  Sue Robinson <sue at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

               ************************************************
  During the month of April, you may make credit card donations to
  LINGUIST via the Cascadilla Press web site:
                  http://www.cascadilla.com/linglist.html
  If you believe LINGUIST is a valuable service, please contribute to
  the LINGUIST Editorial Support Fund, which pays our student editors.
                ***********************************************


Editor for this issue: T. Daniel Seely <seely at linguistlist.org>

=================================Directory=================================

1)
Date:  Tue, 29 Apr 1997 08:36:27 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  New Corpus from the Linguistic Data Consortium

-------------------------------- Message 1 -------------------------------




              Announcing a NEW RELEASE from the
                   LINGUISTIC DATA CONSORTIUM

      DSO CORPUS OF SENSE-TAGGED ENGLISH NOUNS AND VERBS

This corpus contains sense-tagged word occurrences for 121 nouns and
70 verbs which are among the most frequently occurring and ambiguous
words in English.  These occurrences are provided in about 192,800
sentences taken from the Brown corpus and the Wall Street Journal, and
have been hand-tagged by students at the Linguistics Program of the
National University of Singapore.  WordNet 1.5 sense definitions of
these nouns and verbs were used to identify a word sense for each
occurrence of each word.

In addition to providing the word occurrences in their full sentential
context, the corpus includes complete listings of the WordNet 1.5
sense definitions used in the tagging.

The following example illustrates the format of a sentence with a
sense tag for the word "action", followed by the corresponding
WordNet 1.5 sense definition:

  ca01.db #020 `` These >> actions 8 << should serve to protect in
       fact and in effect the court 's wards from undue costs and its
       appointed and elected servants from unmeritorious criticisms ''
       , the jury said .

  Sense 8
    legal action, action, case, lawsuit, suit -- (a judicial proceeding
    brought by one party against another; "no criminal cases were heard
    while the judge was ill")
      => proceeding, legal proceeding, judicial proceeding,
         proceedings -- (the institution of a legal action)
          => due process, due process of law -- (the administration
             of justice according to established rules and principles)
              => group action -- (action taken by a group of people)
                  => act, human action, human activity -- (something
                     that people do or cause to happen)

(In the actual corpus, all tagged occurrences of a given noun or verb
are stored together in one file, with each full sentence on one line;
all noun and verb word sense definitions are stored together in two
separate files.)
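
As a rough illustration of how one line in the format above might be
read, here is a minimal Python sketch.  It assumes the layout is
exactly as in the example (file name, "#" plus a sentence number, then
the tokenized sentence with the tagged word and its sense number
between ">>" and "<<").  The names TaggedOccurrence and parse_line are
hypothetical and are not part of the corpus distribution; the actual
files may differ in details not shown in this announcement.

```python
import re
from typing import NamedTuple, Optional


class TaggedOccurrence(NamedTuple):
    source: str       # source file label, e.g. "ca01.db"
    sentence_id: str  # sentence number as written, e.g. "020"
    word: str         # the tagged word form, e.g. "actions"
    sense: int        # WordNet 1.5 sense number given in the tag
    sentence: str     # tokenized sentence with the tag markup removed


# Assumed line layout: "<file> #<number> <tokenized sentence>"
LINE_RE = re.compile(r"^(\S+)\s+#(\d+)\s+(.*)$")
# Assumed tag layout: ">> word sense <<"
TAG_RE = re.compile(r">>\s+(\S+)\s+(\d+)\s+<<")


def parse_line(line: str) -> Optional[TaggedOccurrence]:
    """Parse one sense-tagged sentence; return None if it does not fit."""
    line_match = LINE_RE.match(line.strip())
    if line_match is None:
        return None
    source, sentence_id, text = line_match.groups()
    tag_match = TAG_RE.search(text)
    if tag_match is None:
        return None
    word = tag_match.group(1)
    sense = int(tag_match.group(2))
    # Replace the first ">> word sense <<" span with the bare word.
    plain = TAG_RE.sub(lambda m: m.group(1), text, count=1)
    return TaggedOccurrence(source, sentence_id, word, sense, plain)


if __name__ == "__main__":
    example = ("ca01.db #020 `` These >> actions 8 << should serve to "
               "protect in fact and in effect the court 's wards from "
               "undue costs and its appointed and elected servants from "
               "unmeritorious criticisms '' , the jury said .")
    print(parse_line(example))
```

Returning None for lines that do not fit, rather than raising an
error, is only a convenience for scanning a whole file; stricter
handling may be preferable once the real file layout is known.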

This sense-tagged corpus was provided by Hwee Tou Ng of the Defence
Science Organisation (DSO) of Singapore.  It was first reported in the
following paper at ACL-96:

"Integrating Multiple Knowledge Sources to Disambiguate Word Sense:
  An Exemplar-Based Approach," by Hwee Tou Ng and Hian Beng Lee, in
  Proceedings of the 34th Annual Meeting of the Association for
  Computational Linguistics, pages 40-47, Santa Cruz, California, USA,
  June 1996.  ( http://xxx.lanl.gov/abs/cmp-lg/9606032 )

Institutions that have membership in the LDC during the 1997
Membership Year will be able to receive the DSO Corpus of
Sense-Tagged English Nouns and Verbs at no additional charge, in the
same manner as all other text and speech corpora published by the LDC.

Nonmembers can receive a copy of this corpus, for research purposes
only, for a fee of US$100. If you would like to order a copy of this
corpus, please email your request to ldc at unagi.cis.upenn.edu. If you
need additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or call
(215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.cis.upenn.edu/~ldc. Information is also available via ftp
at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.

---------------------------------------------------------------------------
LINGUIST List: Vol-8-627


