9.1355, FYI: LDC new corpus

Wed Sep 30 15:40:49 UTC 1998

LINGUIST List:  Vol-9-1355. Wed Sep 30 1998. ISSN: 1068-4875.

Subject: 9.1355, FYI: LDC new corpus

Moderators: Anthony Rodrigues Aristar: Wayne State U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Reviews: Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Associate Editors:  Martin Jacobsen <marty at linguistlist.org>
                    Brett Churchill <brett at linguistlist.org>
                    Ljuba Veselinova <ljuba at linguistlist.org>

Assistant Editors:  Scott Fults <scott at linguistlist.org>
                    Jody Huellmantel <jody at linguistlist.org>
                    Karen Milligan <karen at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Chris Brown <chris at linguistlist.org>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

Editor for this issue: Brett Churchill <brett at linguistlist.org>

=================================Directory=================================

1)
Date:  Tue, 29 Sep 1998 12:11:25 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  1997 Mandarin Broadcast News Speech and Transcripts

2)
Date:  Tue, 29 Sep 1998 12:12:24 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  Voicemail Corpus - Part I

3)
Date:  Tue, 29 Sep 1998 12:10:48 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  1998 Speaker Recognition Evaluation Test-Set

4)
Date:  Tue, 29 Sep 1998 12:12:59 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  JURIS (Justice Department Retrieval and Inquiry System) Text Corpus

-------------------------------- Message 1 -------------------------------

Date:  Tue, 29 Sep 1998 12:11:25 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  1997 Mandarin Broadcast News Speech and Transcripts

Announcing a NEW CORPUS from the LDC

***************************************************
1997 Mandarin Broadcast News Speech and Transcripts
***************************************************

This collection consists of 30 hours of recorded broadcasts
and transcripts that have been drawn from the following
sources:

  Voice of America (VOA): United States Information Agency Radio
  People's Republic of China Television (CCTV)
  Commercial radio based in Los Angeles, CA. (KAZN-AM)

Of these three sources, the first two make up the bulk of the
collection and are represented in roughly equal amounts; only
a small sample of KAZN-AM recordings is included, owing to the
high proportion of unusable material (commercials, local
traffic reports loaded with California place names, etc.).

The transcripts were created by native speakers of Mandarin
working at the LDC; they are in GB-encoded form, with SGML
tagging to identify story boundaries, speaker turn boundaries,
and phrasal pauses; these tags include time stamps to align
the text with the speech data.  Word segmentation (white space
between words) is included.  A working DTD is provided, and
the markup is consistent with that of the 1997 English and
Spanish Hub-4 collections.
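
For readers who plan to process the transcripts, the following
Python sketch shows one way to decode a GB-encoded line and
recover its white-space-delimited words.  It is only an
illustration under stated assumptions: it assumes the text can be
read with Python's "gb2312" codec, and the <turn startTime=...>
tag in the sample is hypothetical.  The real tag and attribute
names are defined by the DTD shipped with the corpus.

```python
import re

# Matches any SGML-style tag in angle brackets. The specific tag
# names used by the corpus are not assumed here.
TAG = re.compile(r"<[^>]+>")

def words_from_line(raw: bytes, encoding: str = "gb2312") -> list[str]:
    """Decode one transcript line and return its whitespace-separated words."""
    text = raw.decode(encoding)
    text = TAG.sub(" ", text)  # drop markup, keep the transcript text
    return text.split()

# Made-up sample line; the tag and its attribute are hypothetical.
sample = "<turn startTime=12.5> 今天 天气 很 好".encode("gb2312")
print(words_from_line(sample))
```

Because word segmentation is already marked with white space, no
separate Mandarin word segmenter is needed for this step.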

Because of restrictions imposed by the copyright holders, this
corpus is available to 1998 LDC members only. Members who wish
to receive this corpus must sign the 1997 Mandarin Broadcast
News license.  This license can be retrieved from the LDC
website at:

http://www.ldc.upenn.edu/ldc/catalog/nonmem_agree/agreements.html

If you would like to order a copy of this corpus, please email
your request to <ldc at unagi.cis.upenn.edu>. If you need
additional information before placing your order, or would
like to inquire about membership in the LDC, please send email
or call (215) 898-0464.

Further information about the LDC and its available corpora
can be accessed on the Linguistic Data Consortium WWW Home
Page at URL:

http://www.ldc.upenn.edu/

-------------------------------- Message 2 -------------------------------

Date:  Tue, 29 Sep 1998 12:12:24 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  Voicemail Corpus - Part I

	Announcing a NEW CORPUS from the LDC

*************************
Voicemail Corpus - Part I
*************************

The Voicemail Corpus - Part I was created by the following
researchers at IBM:

M. Padmanabhan, G. Ramaswamy, B. Ramabhadran,
P.S. Gopalakrishnan, and C. Dunn.

This CD-ROM corpus consists of 1801 voicemail messages in the
training data set and 42 messages in the development test set,
all collected from volunteers at various IBM sites in the United
States.  The average voicemail message is 31
seconds in duration, and has about 100 words.  Approximately 38%
of the messages correspond to male speakers; the remainder
correspond to females. All messages were transcribed by IBM.
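
The figures above allow a rough estimate of the overall size of
the collection.  The short Python calculation below is a
back-of-the-envelope sketch only; it assumes that the stated
averages apply to all 1801 messages, which the announcement does
not say explicitly.

```python
MESSAGES = 1801        # messages cited in the announcement
AVG_SECONDS = 31       # average duration per message
AVG_WORDS = 100        # approximate words per message
MALE_FRACTION = 0.38   # approximate share of male speakers

total_hours = MESSAGES * AVG_SECONDS / 3600
total_words = MESSAGES * AVG_WORDS
male_messages = round(MESSAGES * MALE_FRACTION)

print(f"about {total_hours:.1f} hours of speech")
print(f"about {total_words:,} words")
print(f"about {male_messages} messages from male speakers")
```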

During the collection period, volunteers were asked to forward
some of their voicemail messages to a local extension number set
up for the purpose of collecting this data. The messages were
then collected periodically from the voicemailbox of this local
extension and added to the database.

DirectTalk6000 (DT6K) software was used to transfer the
voicemail messages to the computer.  DT6K is an application that
runs under the AIX operating system on a host computer, and can
interface to a phone line through special hardware on the host
computer. Note that the data was collected from IBM sites all
over the US whereas the host computer that the DT6K application
was running on was located at a single IBM site. Consequently,
when the application dialed into the phonemail system of an IBM
site in a different state, the voicemail messages were played
out over a long distance line before they were recorded on the
host computer.

The data was sampled at 8 kHz, and recorded in 8-bit u-law
compressed format onto a local disk of the host computer. The
messages were compressed by the proprietary compression
techniques used by the ROLM phonemail system, which is the
phonemail system in use at various IBM locations.
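
As an illustration of the 8 kHz, 8-bit u-law format, the Python
sketch below expands u-law bytes to 16-bit linear samples.  It
assumes standard G.711 u-law coding and a raw stream of sample
bytes with no file header; the announcement does not describe the
file container used on the CD-ROM, so any header would have to be
skipped first.  The function names are made up for this example.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, one u-law byte per sample

def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 u-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                # u-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def decode(raw: bytes) -> list[int]:
    """Decode a headerless run of u-law bytes to linear samples."""
    return [ulaw_to_linear(b) for b in raw]

def duration_seconds(num_bytes: int) -> float:
    """Duration of a headerless u-law recording of the given size."""
    return num_bytes / SAMPLE_RATE_HZ
```

Under these assumptions, a message of average length (31 seconds)
occupies about 248,000 bytes of sample data.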

IBM would like to acknowledge the support of DARPA for funding
this data collection effort under Grant MDA972-97-C-0012 and is
also extremely grateful to George di Simone and Ira Ellis
(Watson telephone system support) for their help in setting up
the data collection process. IBM would also like to thank Dr.
Ellen Eide for helping with the verification of transcripts and
Dr. Salim Roukos, Dr. David Nahamoo, and Dr. Lalit Bahl for
their help and support.  Finally, thanks are due to the various
volunteers who contributed their voicemail messages to the
database.

Institutions that have membership in the LDC during the 1998
Membership Year will be able to receive this corpus in the same
manner as all other text and speech corpora published by the
LDC.

If you would like to order a copy of this corpus, please email
your request to <ldc at unagi.cis.upenn.edu>. If you need
additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or
call (215) 898-0464.

Further information about the LDC and its available corpora can
be accessed on the Linguistic Data Consortium WWW Home Page at
URL:

http://www.ldc.upenn.edu/

-------------------------------- Message 3 -------------------------------

Date:  Tue, 29 Sep 1998 12:10:48 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  1998 Speaker Recognition Evaluation Test-Set

	Announcing a NEW CORPUS from the LDC

********************************************
1998 Speaker Recognition Evaluation Test-Set
********************************************

The 1998 speaker recognition evaluation is part of an ongoing
series of yearly benchmark tests conducted by NIST.  These
tests are intended to provide a stable reference point for
measuring and comparing the performance of diverse methods for
text-independent speaker recognition over the telephone, and
should be of interest to all researchers working in this area
of speech technology development.  The test sets and
evaluation protocols have been designed to be simple, to focus
on core technology issues, to be fully supported, and to be
accessible.

In 1996 and 1997 handset variation was featured as a prominent
technical challenge to be addressed.  While handset variation
remains a formidable challenge, the 1998 evaluation directs
greatest attention toward speaker recognition performance for
the case in which both training and test data are from the
same source.  The speech data were recorded by the LDC between
January and March, 1997; most of the speakers recruited for
this collection were college students from the Great Lakes
(Northern Mid-West) region of the U.S.

Institutions that have membership in the LDC during the 1998
Membership Year will be able to receive this corpus in the
same manner as all other text and speech corpora published by
the LDC.  Nonmembers may purchase the 1998 Speaker Recognition
Evaluation Test-Set for $600.

If you would like to order a copy of this corpus, please email
your request to <ldc at unagi.cis.upenn.edu>. If you need
additional information before placing your order, or would
like to inquire about membership in the LDC, please send email
or call (215) 898-0464.

Further information about the LDC and its available corpora
can be accessed on the Linguistic Data Consortium WWW Home
Page at URL:

http://www.ldc.upenn.edu/

-------------------------------- Message 4 -------------------------------

Date:  Tue, 29 Sep 1998 12:12:59 EDT
From:  LDC Office <ldc at unagi.cis.upenn.edu>
Subject:  JURIS (Justice Department Retrieval and Inquiry System) Text Corpus

	Announcing a NEW CORPUS from the LDC

*******************************************************************
JURIS (Justice Department Retrieval and Inquiry System) Text Corpus
*******************************************************************

The text data contained on this two-CD-ROM set
represent a release of the JURIS (Justice Department
Retrieval and Inquiry System) data collection that
has been made available to the Linguistic Data
Consortium (LDC) by the U.S. Department of Justice.
The time span of the text ranges from the 1700's to
the early 1990's.

There are 1664 individual text files in the corpus,
1011 on the first CD-ROM, and 653 on the second. The
original archive consisted of 219 files ranging
between less than 1 MB and nearly 70 MB in size. In
order to make the data more accessible for research
use, we chose to divide the larger files into pieces,
such that the average file size was about 2 MB when
uncompressed (the largest uncompressed file size is
about 4.5 MB).  Divisions of the files were done at
document boundaries, so all files contain whole
documents.
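
The idea of dividing large files only at document boundaries can
be sketched in Python as follows.  This is a simple greedy
grouping written for illustration, not the procedure actually
used to prepare the corpus, which is not described beyond the
paragraph above.  The function name and the sample sizes are
made up.

```python
def split_at_document_boundaries(doc_sizes, target_bytes):
    """Group consecutive documents into files of roughly target_bytes.

    A document is never split: a file is closed only between documents,
    so a single oversized document ends up in a file of its own.
    """
    files, current, current_size = [], [], 0
    for size in doc_sizes:
        if current and current_size + size > target_bytes:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files

# Made-up document sizes in bytes, with a 2 MB target per file.
sizes = [900_000, 700_000, 800_000, 2_500_000, 300_000]
pieces = split_at_document_boundaries(sizes, target_bytes=2_000_000)
print([sum(p) for p in pieces])
```

The counts given above are consistent with one another: 1011 +
653 = 1664 files.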

There are a total of 694,667 document units in the
corpus, and these can be categorized to some extent
with regard to their content.  The following is a
partial list of categories and their descriptions
drawn from JURIS documentation contained in the
corpus. The terminology and organization of
categories are those used in the JURIS documentation:

 * ADMINISTRATIVE LAW

Published Comptroller General Decisions; Unpublished
Comptroller General Decisions; Opinions of the
Attorney General; Office of Legal Counsel (US Dept.
of Justice); Board of Contract Appeals; ADP Protest
Report (Summary of ADP Procurement Protests before
the GSBCA); Federal Labor Relations Authority Case
Decisions; FLRA Administrative Law Judge Decisions;
Federal Service Impasses Decisions; Decisions and
Reports on Rulings of the Assistant Sec.  of Labor
for Labor Management Relations; Federal Labor
Relations Council Rulings on Requests of the Asst.
Sec. of Labor for Labor Management Relations; HUD
Administrative Law Decisions; Merit System Protection
Board Decisions; Decisions under Immigration and
Nationality Laws; Environmental Protection Agency
General Counsel Opinions; Equal Opportunity
Commission Decisions; Equal Employment Opportunity
Commission Policy Statements; US Office of Government
Ethics Decisions; HHS Department Appeals Board
Decisions.

 * DEPARTMENT OF JUSTICE BRIEFS

Office of the Solicitor General; Civil Division;
Civil Division Trial; Environmental and Natural
Resources Division; Tax Division Criminal Appellate;
US Attorney's Offices; US Trustees' Offices.

 * CASE LAW

U.S. Supreme Court; Federal Reporter, 2nd Series;
Court of Appeals Unpublished Decisions; Federal
Supplement; Federal Rules Decisions; Atlantic 2nd
Reporter (DC only); Bankruptcy Reporter; Courts of
Military Review; Military Justice Reporter; Court of
Claims.

 * FREEDOM OF INFORMATION ACT

FOIA Update Newsletter; DOJ Guide to the FOIA Case
List Publications.

 * FEDERAL REGULATIONS

Code of Federal Regulations; Unified Agenda of
Federal Regulations; Defense Acquisition Regulations.

 * TREATIES AND OTHER INTERNATIONAL AGREEMENTS

United States Treaties and Other International
Agreements; Department of Defense Unpublished
International Agreements.

 * INDIAN LAW

Opinions of the Solicitor (Dept. of Interior);
Ratified Treaties; Unratified Treaties; Presidential
Proclamations; Executive Orders and Other Orders
Pertaining to Indians.

 * IMMIGRATION AND NATURALIZATION LAW

Decisions Under Immigration and Nationality Law;
Title 8 - Code of Federal Regulations; Immigration
Reform and Control Act of 1986, Legislative History;
Equal Access to Justice Act, Legislative History.

 * STATUTORY LAW

Public Laws; United States Code; Executive Orders;
Anti-Drug Abuse Act of 1988; Section-by-Section
Analysis of Anti-Drug Abuse Act of 1988; Criminal
Division Handbook on CCCA; The Organic Laws of the
United States.

 * TAX LAW

US Tax Court Decisions; US Board of Tax Appeals
Decisions; Tax Division's Summons Enforcement
Decisions; Tax Division's Tax Protester Case List;
Tax Division's Criminal Tax Manual; Tax Division's
Criminal Tax Indictment/Information Forms; Tax
Division's Standardized Criminal Tax Jury
Instructions; Tax Division's Criminal Section
Newsletter; Tax Court Memorandum Decisions; IRS
Cumulative Bulletin; Tax International Acts; IRS News
Releases; IRS General Counsel Memoranda; IRS Actions
on Decisions; IRS Technical Memoranda.

 * MANUALS

United States Attorney's Manual; United States
Trustees' Manual; Federal Personnel Manual; Federal
Acquisition Regulations; Federal Acquisition
Circulars; Federal Travel Regulation; Federal
Information Resources Management Regulation; Federal
Property Management Regulations; Principles of
Federal Appropriations Law; Justice Department
Acquisition Regulation; Justice Property Management
Regulations.

 * DEPARTMENT OF JUSTICE WORKPRODUCTS

Civil Division Monographs; Civil Division Torts
Branch Handbook on damages under FTCA; Criminal
Division Monographs; Criminal Division Forms;
Criminal Division Guidelines for Drafting
Indictments; Criminal Division Narcotics; Forfeiture,
Prosecution Manual; Criminal Division Directory of
Services; Asset Forfeiture Manuals; Obscenity
Enforcement Reporter; Environmental and Natural
Resources Division Monographs; US Sentencing
Commission's Guidelines Manual; Sentencing Guidelines
Updates.

The text files are all formatted using a set of SGML
tags to mark document boundaries, and to mark major
structural features within documents.  As with file
organization, the markup is derived from the document
structures as provided by the Justice Department.
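
As a rough illustration of how SGML boundary tags let a program
walk through a file one document at a time, consider the Python
sketch below.  The <DOC> and </DOC> tags are hypothetical
stand-ins; the announcement does not name the actual tags, so
they would need to be replaced with the ones found in the corpus
documentation.

```python
import re

# Hypothetical boundary tags; the corpus's real tag names may differ.
DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def iter_documents(text: str):
    """Yield the body of each document found between boundary tags."""
    for match in DOC_PATTERN.finditer(text):
        yield match.group(1).strip()

# Made-up sample containing two documents.
sample = "<DOC>\nfirst document\n</DOC>\n<DOC>\nsecond document\n</DOC>\n"
docs = list(iter_documents(sample))
print(len(docs))
```

Since every file contains only whole documents, a loop of this
kind can process each file independently of the others.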

Institutions that have membership in the LDC during
the 1998 Membership Year will be able to receive this
corpus in the same manner as all other text and
speech corpora published by the LDC.  Nonmembers may
purchase JURIS for $1500.

If you would like to order a copy of this corpus,
please email your request to
<ldc at unagi.cis.upenn.edu>. If you need additional
information before placing your order, or would like
to inquire about membership in the LDC, please send
email or call (215) 898-0464.

Further information about the LDC and its available
corpora can be accessed on the Linguistic Data
Consortium WWW Home Page at URL:

http://www.ldc.upenn.edu/

---------------------------------------------------------------------------
LINGUIST List: Vol-9-1355