Corpora: "The Start of a Stop List at BA" by Barbara J. Flood
Einat Amitay
einat at ics.mq.edu.au
Thu May 18 02:57:05 UTC 2000
Hi all,
I know I'm probably doing something I shouldn't - posting a full text of
an article that doesn't belong to me - but it is so short - and so
relevant that I had to.
Many times corpora people ask about stop lists. This short text tells a
good story, and may add some references to the list some of us maintain.
+:o)
einat
------------------
This article has been accepted for publication in the Journal of the
American Society for Information Science. Copyright © 1999 John Wiley &
Sons, Inc.
The Start of a Stop List at BA
Barbara J. Flood
ARC/Philadelphia Developmental Disabilities Corp.
PDDC/ARC at Libertynet.org
The start was 1961. Biological Abstracts' (BA's) traditional subject
index was falling further and further behind. BA decided to try the
recently introduced keyword-in-context indexing. This title index became
Biological Abstracts' Subject in Context or BASIC, first published in
October-November of 1961 after a year of planning (Biological Abstracts,
1961).
I was given early runs of computer print-outs to look through
editorially. There were inches of line printer pages with index entry
words such as "of," "the," "in," "and." It might be of trivial interest
that "of" was the most prevalent, perhaps because these were titles in
biology. I chose to delete these words and began to compile a list of
words that were going to be automatically stopped from printing. This
became a Stop List. I don't know whether it was the first to be called a
"Stop List." By 1959, Luhn (1959, 1960) suggested types of
non-significant words to be omitted. "Stop words" was used by Parkins
(1963), but "Stop List" was Stevens' (1965) term of choice, and
"stoplist" was used by Fischer (1966).
Members of the editorial department met often to discuss candidate stop
list words generated from the print-outs to make sure that homographs
(such as "a" in "Vitamin A" or "are" as a measure of area) were not
overlooked. Thus, the Stop List was generated from frequency data with
concurrence by committee.
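
As a rough modern illustration of that procedure (not a reconstruction of the 1961 BA programs), the sketch below ranks title words by frequency and skips any word on a hand-maintained list of protected homographs. The function name, the `protected` parameter, and the sample titles are all invented for this example; the committee review is reduced to that one set.

```python
from collections import Counter


def stop_list_candidates(titles, top_n, protected=frozenset()):
    """Return up to top_n of the most frequent title words.

    Words in `protected` (hypothetical stand-in for the committee's
    homograph review, e.g. the "a" of "Vitamin A") are never proposed.
    """
    counts = Counter(word.lower() for title in titles for word in title.split())
    candidates = []
    for word, _count in counts.most_common():
        if word in protected:
            continue  # keep homographs as index entries
        if len(candidates) == top_n:
            break
        candidates.append(word)
    return candidates


# Made-up titles, for illustration only.
sample_titles = [
    "The role of vitamin A in the growth of the rat",
    "Studies of the distribution of algae in lakes",
    "The effect of light on the growth of algae",
]
print(stop_list_candidates(sample_titles, top_n=2, protected={"a"}))
```

In practice the protected set would come from human review of the print-outs, as described above; frequency alone cannot tell the article "a" from the "A" in "Vitamin A".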
Multiple word terms such as Rana esculenta had to be evaluated as to
whether the second word provided a significant index entry. Should the
second word be added to the Stop List? Was it frequent enough? The cost
of adding a word to the Stop List with resultant added computer time had
to be compared to the cost per copy of printing the extra line.
The Stop List grew rapidly. Parkins soon reported (Parkins, 1963) that
14 words prevented eighty percent of the entries for BASIC and that at
the time there were already 1,000 words. This is comparable to the
experience at Chemical Abstracts Service with Chemical Titles; Freeman &
Dyson (1963) report an initial list of 750 words, expanded to 950, and
then culled to 328 words. But each additional comparison added to the
cost of the computer run. Was eighty percent enough? The decision was
made on the basis of frequency. A word that did not show up often was
not worth the extra sort comparison. My favorite examples are
typographical such as "hte" and "fo." Because nobody is going to look up
these words in an index, it isn't important that there be an extra line
or two.
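
The trade-off can be made concrete with a small sketch, assuming (as a simplification) that every word in a title would otherwise generate one index entry. The function name and the sample title are hypothetical, and the per-comparison computer cost mentioned above is not modeled.

```python
def fraction_of_entries_stopped(titles, stop_list):
    """Share of would-be index entries suppressed by the stop list.

    Assumes one index entry per title word, which is a simplification.
    """
    words = [word.lower() for title in titles for word in title.split()]
    if not words:
        return 0.0
    stopped = sum(1 for word in words if word in stop_list)
    return stopped / len(words)


# Made-up title, for illustration only.
print(fraction_of_entries_stopped(["The growth of algae"], {"the", "of"}))
```

Comparing this fraction before and after adding a candidate word shows how little a rare word such as a typo contributes, which is the frequency argument made above.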
Later considerable editorial "augmentation" modified titles, but the
Stop List was the start. The Stop List could be run automatically. It
removed clutter for the user and reduced cost for the producer. The
title index could also be produced in a timely manner. The Stop List
provided an improvement over the raw computer output of titles while
retaining the advantages of a computer-produced keyword-in-context
index.
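
For readers who have not seen one, a keyword-in-context index with a stop list can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the BASIC layout: the function name, the `width` parameter, the punctuation stripping, and the sample title are all invented here.

```python
def kwic_entries(titles, stop_list, width=30):
    """Build sorted (keyword, context line) pairs, one per non-stopped word.

    Each context line shows the keyword at a fixed column, with the text
    before it right-aligned and the text from the keyword on to its right.
    """
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            key = word.lower().strip(".,;:()")
            if not key or key in stop_list:
                continue  # stopped words produce no index line
            left = " ".join(words[:i])
            right = " ".join(words[i:])
            line = f"{left[-width:]:>{width}}  {right[:width]}"
            entries.append((key, line))
    return sorted(entries)


# Made-up title, for illustration only.
for keyword, line in kwic_entries(["The growth of algae"], {"the", "of"}):
    print(line)
```

Without the stop list the same title would yield four lines instead of two, which is the clutter and printing cost the article describes.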
Acknowledgment
I thank the editor and anonymous reviewers for helpful suggestions.
References
Biological Abstracts (1961). Introduction to the BASIC Index Volume 36
part 4, October-November.
Fischer, Marguerite (1966). The KWIC index concept: a retrospective
view, American Documentation, 17:57-70.
Freeman, R. R. & Dyson, G. Malcolm (1963). Development and production of
Chemical Titles, a current awareness index publication prepared with the
aid of a computer, Journal of Chemical Documentation, 3:16-20.
Luhn, H. P. (1959). Keyword in Context Index for Technical Literature
(KWIC Index), Yorktown Heights, N.Y., IBM, Report RC 127. Also in:
American Documentation, 11:288-295, 1960.
Parkins, Phyllis V. (1963). Approaches to vocabulary management in
permuted-title indexing of Biological Abstracts, in Automation and
Scientific Communication Part 1, Proceedings of The 26th Annual Meeting
of the American Documentation Institute, (pp 27-28), Washington, D.C.:
ADI.
Stevens, Mary Elizabeth (1965). Automatic Indexing: A State-of-the-Art
Report. National Bureau of Standards Monograph 91, pp 41, 64-66.
-------------------
--
Einat Amitay
einat at ics.mq.edu.au
http://www.ics.mq.edu.au/~einat