[Corpora-List] Laypersons' applied corpus linguistics

amsler at cs.utexas.edu amsler at cs.utexas.edu
Mon Jan 26 20:32:45 UTC 2009


I think we can say there exists a special category of techniques (of which
content analysis is the most common) which attempt to characterize text
passages based on lexical information alone. They are typically accompanied by
a dictionary which gives content categories to associate with individual words
and phrases in the text. It is hard to say whether content analysis is the
appropriate label for all these techniques.

For example, periodically, there are reports of automated techniques for
assessing the likelihood that essay text answers in quizzes indicate an
understanding of the subject content. The inference typically given is that you
can tell whether someone knows what they are talking about by the words they
use, and can correlate high-scoring essay answers with vocabulary choice alone,
without knowing anything about sentence structure or what was said using those
words.
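A minimal sketch of what such a purely lexical scorer might look like (the word list and scoring scheme here are invented for illustration, not drawn from any actual essay-grading system):

```python
import re

# Hypothetical vocabulary-overlap scorer: rate an essay answer solely by
# which reference-vocabulary words it uses, ignoring sentence structure
# entirely, as the technique described above would.

def vocabulary_score(essay, reference_vocabulary):
    """Return the fraction of the reference vocabulary found in the essay."""
    words = set(re.findall(r"[a-z]+", essay.lower()))
    return len(words & reference_vocabulary) / len(reference_vocabulary)

reference = {"mitosis", "chromosome", "spindle", "cytokinesis"}
answer = "During mitosis the chromosome pairs separate along the spindle."
print(vocabulary_score(answer, reference))  # 0.75
```

Note that the scorer would rate a scrambled version of the same answer identically, which is exactly the limitation of ignoring what was said using those words.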

Content Analysis is primarily a means of evaluating whole texts, as its success
depends upon multiple occurrences of words or phrases elevating the scores of a
number of content categories which those words or phrases are deemed to
indicate in the text. It was developed as a technique for dealing with large
volumes of text at a time when computational linguistics was still solely
concerned with parsing individual sentences. While it doesn't allow for
"understanding" of a text, it has been used by psychologists, psychiatrists,
sociologists and political scientists to characterize attitudes expressed in
the text.  It is sometimes combined with minimal syntactic analysis to
determine the attitudes expressed in the text, such as "hostility toward
authority" or "affection toward animals", etc. This is done primarily by using
proximity of words (such as adjectives to nouns). So, for example, the words
"arrogance", "hate", "dislike", "malicious", etc. are indicators of negative
feelings toward the noun they precede in a sentence. A sequence such as
"dislike doctors" or "hate bureaucrats" could be used to assess the degree of
dislike toward authority figures by the author of the text.

Surprisingly, Wikipedia doesn't have an entry on the General Inquirer (note:
it's Inquirer, not Enquirer). The primary reference book is the 1966 volume,
"The General Inquirer: A Computer Approach to Content Analysis" by Philip J.
Stone, Dexter C. Dunphy, Marshall S. Smith and Daniel M. Ogilvie (MIT Press). A
good background article on the web is: Donald L. Diefenbach's "Historical
Foundations of Computer-Assisted Content Analysis"
(http://facstaff.unca.edu/ddiefenb/ca.html).

Content analysis resulted in the production of dictionaries which indicated the
content categories of words. The General Inquirer was accompanied by the
creation of several dictionaries such as the Harvard Psychosociological
Dictionary, the Stanford Political Dictionary and the Need-Achievement
Dictionary.

In psychiatry, Julius Laffal created "A Concept Dictionary of English" (1973,
Gallery Press) for use in assessing psychiatric patients' states of mind based
on computer processing of transcripts of their therapy sessions.


Also related to content analysis is a technique I'd refer to as "subject
analysis" which involves determining the subject of a text based on counting
occurrences of subject-indicating words and phrases in a text. So, for 
example, "tank", "artillery", "soldier", "army" and "jeep" could be taken as
indicators of the subject "military" and a text which contained a number of
these words would be considered to have "military" content. This technique
works well enough to identify subject categories such as those a newspaper
would assign to its news stories --- or, using a specialized dictionary with
subject codes (e.g., the McGraw-Hill Dictionary of Scientific and Technical
Terms), the subject categories of Scientific American articles.
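Subject analysis amounts to counting indicator words per subject and taking the best-scoring subject. A toy sketch, with a two-subject dictionary invented for illustration:

```python
import re
from collections import Counter

# "Subject analysis" sketch: count occurrences of subject-indicating
# words and assign the subject whose indicators occur most often in the
# text. The tiny subject dictionary below is illustrative only.

SUBJECT_WORDS = {
    "military": {"tank", "artillery", "soldier", "army", "jeep"},
    "finance": {"stock", "bond", "dividend", "market", "broker"},
}

def classify_subject(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for subject, indicators in SUBJECT_WORDS.items():
        scores[subject] = sum(1 for t in tokens if t in indicators)
    subject, score = scores.most_common(1)[0]
    return subject if score > 0 else None

story = "The army moved its artillery forward while a soldier guarded the jeep."
print(classify_subject(story))  # military
```

Requiring several distinct indicators before assigning a subject, as the paragraph above suggests, guards against a single ambiguous word (e.g. "tank" in a plumbing story) triggering a false classification.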

This methodology was described in Walker & Amsler's paper, "The Use of
Machine-Readable Dictionaries in Sublanguage Analysis" (in Grishman &
Kittredge, "Analyzing Language in Restricted Domains", 1986). The dictionary
was based on the subject-codes in the machine-readable tapes of the Longman
Dictionary of Contemporary English produced by Paul Procter.

Dr. Robert A. Amsler
computational lexicologist
Vienna, Virginia

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


