17.104, Review: Corpus Linguistics: Sinclair (2004)

Fri Jan 13 21:23:27 UTC 2006

LINGUIST List: Vol-17-104. Fri Jan 13 2006. ISSN: 1068 - 4875.

Subject: 17.104, Review: Corpus Linguistics: Sinclair (2004)

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Dooley, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Lindsay Butler <lindsay at linguistlist.org>
================================================================  

What follows is a review or discussion note contributed to our 
Book Discussion Forum. We expect discussions to be informal and 
interactive; and the author of the book discussed is cordially 
invited to join in. If you are interested in leading a book 
discussion, look for books announced on LINGUIST as "available 
for review." Then contact Sheila Dooley at dooley at linguistlist.org. 

===========================Directory==============================  

1)
Date: 11-Jan-2006
From: Oliver Streiter < ostreiter at web.de >
Subject: Trust The Text: Language, Corpus and Discourse 

-------------------------Message 1 ---------------------------------- 
Date: Fri, 13 Jan 2006 16:08:26
From: Oliver Streiter < ostreiter at web.de >
Subject: Trust The Text: Language, Corpus and Discourse 

AUTHOR: Sinclair, John McH.
EDITOR: Carter, Ronald 
TITLE: Trust The Text
SUBTITLE: Language, Corpus and Discourse
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2004
Announced at http://linguistlist.org/issues/15/15-2786.html 

Oliver Streiter, National University of Kaohsiung, Taiwan

OVERVIEW

The book under review, ''Trust the Text: Language, corpus and 
discourse'' by John Sinclair, is a collection of 12 papers on written 
discourse structure, lexical structure, phraseology, lexicography and 
linguistic theory. All papers were previously published between 
1982 and 2003, but many of them are not easily accessible: 
some appeared in Festschriften, others are transcripts of 
lectures. The book thus tries to make these papers available to a 
wider audience.

The author, John Sinclair, is one of the most influential and original 
figures in contemporary linguistics. His focus on the analysis of 
spoken language and his practical and theoretical work in corpus 
linguistics, long before this had become mainstream, have influenced 
many linguists and have changed the face of modern linguistics.

SUMMARY

Most of the ideas presented in this collection have been discussed or 
assimilated by the research community and taken as a basis for 
further research. A summary of this follow-up is beyond the scope of 
this review. All this review can do is identify and explain the 
key ideas presented in each paper and then evaluate whether the book 
succeeds in disseminating these ideas.

The book, edited by Ronald Carter, is organized in three parts, 
called 'Foundations', 'The organization of text' and 'Lexis and 
grammar'.

PART I Foundations
Paper 1: Trust the text
This paper argues that the availability of electronic corpora should 
lead to a re-evaluation of linguistic research traditions. It warns 
against projecting proven linguistic techniques upward to areas with 
larger linguistic units. For the analysis of discourse, therefore, new 
techniques and a new framework of description are needed.

One notion introduced is ''prospection'' in spoken discourse. A 
prospection classifies what is going to follow in discourse. Thus, 
in contrast to backward-oriented models, which focus on antecedents 
in the preceding discourse, the paper argues that either the entire 
discourse is encapsulated via a reference in the current sentence 
(examples can be found in Paper 5, pg. 86, e.g. words like 'and', 
'however', 'also' etc.) or the current sentence has been projected by 
the preceding discourse (as when you say ''... has dramatic 
consequence.'', after which what follows will be understood as the 
consequences).

The paper then continues and makes a number of claims which 
challenge established assumptions:

+ The idea of a stable lemma is questioned as different word forms of 
a lemma have different patterns of meaning. 

+ A word that can be used in more than one word class tends to have 
specific meanings associated with each word class. This correlation 
between word class and meaning breaks down when the words form 
part of idiomatic phrases or technical terms.

+ Words may have specific privileges or restrictions on how they are 
used (as subject, in prepositional phrases etc.).

+ Words have subliminal meanings, such as the verb 'happen' which 
refers to something nasty.

+ Grammar is a grammar of meaning and should state which meaning 
corresponds to which grammatical pattern.

+ Words are not selected independently but share meaning 
components which cannot be ascribed to a single word or a single 
morpheme.

+ As a result of the joint selection of related words, these words 
have to give up parts of their meaning. This is referred to 
as 'delexicalization'. Delexicalization is easily visible in 
adjective-noun combinations in which the adjective loses much of its 
meaning, e.g. when it merely stresses part of the meaning of the noun 
(as in 'physical bodies').

Paper 2: The search for units of meaning
This paper proposes a linguistic unit called the 'lexical item', a unit 
of lexical structure which is selected independently and which then 
selects lexical or grammatical patterns for its expression.

That words are not independent units can be seen from compounds, 
phrasal verbs, proverbs etc. Words are more or less dependent on 
each other and this dependence lies somewhere between an 'open 
choice' and an 'idiom'. Open choice represents the 'terminological 
tendency', i.e. the tendency for each word to have a fixed, context-
independent meaning. Idiomaticity represents the 'phraseological 
tendency' where words are selected together and make meanings 
from their combinations. While traditionally the terminological principle 
is seen as central to language, this paper focuses on the 
phraseological tendency.

Phraseological combinations, even if considered to be fixed, allow for 
small variations to fit the phraseological combination into its context. In 
addition, the different components of a phraseological combination 
have distinct functions. This is taken as an argument for their co-
selection.

The phraseological combination 'the naked eye' is analyzed. It is 
shown to consist of a semantic prosody ('difficult'), a semantic 
preference ('see'), a colligation (preposition) and an invariable core, 
i.e. the collocation 'the naked eye', as in 'just visible to the naked 
eye'.

For the phraseological combination 'true feelings' the lexical item 
consists of a semantic prosody ('reluctant'), a semantic preference 
('communicate'), a colligation (possessive) and a collocation ('true 
feelings'), as in 'try to communicate our true feelings'. The semantic 
prosody and the semantic preference can be fused, as 
in 'conceal', 'hide' or 'mask'.

A similar analysis is provided for the verb 'brook', which, because of 
its infrequent usage, might be more independent of the context. But 
even for this verb, a complex lexical unit can be identified if 
sufficient corpus data are available.
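
The kind of analysis summarized above can be approximated with a small 
concordance script. The following is only a minimal sketch in Python 
and not the procedure used in the book: the four sentences are invented 
for illustration and are not corpus data, and the function names 
'left_context' and 'position_counts' are hypothetical. It counts which 
words occur at each position to the left of the invariable core.

```python
import re
from collections import Counter

# Invented example sentences for illustration only; NOT taken from a corpus.
SENTENCES = [
    "The crack is just visible to the naked eye.",
    "These mites are too small to see with the naked eye.",
    "The star is barely visible to the naked eye.",
    "Such detail cannot be seen by the naked eye.",
]


def left_context(sentence, core="the naked eye", span=3):
    """Return up to `span` lowercase tokens left of `core`, or None if absent."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    core_tokens = core.split()
    n = len(core_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == core_tokens:
            return tokens[max(0, i - span):i]
    return None


def position_counts(sentences, core="the naked eye", span=3):
    """Count tokens at each position left of the core (1 = directly adjacent)."""
    counts = {k: Counter() for k in range(1, span + 1)}
    for sentence in sentences:
        context = left_context(sentence, core, span)
        if context is None:
            continue
        for k, token in enumerate(reversed(context), start=1):
            counts[k][token] += 1
    return counts


counts = position_counts(SENTENCES)
for k in sorted(counts):
    print(k, counts[k].most_common())
```

In this toy sample, position 1 is filled only by prepositions (the 
colligation) and position 2 by words of seeing and visibility (the 
semantic preference). Whether such patterns hold in general can only be 
judged on real corpus data.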

PART II The organization of text
Paper 3: Planes of discourse
This paper integrates written language and discourse in one 
framework as both are essentially interactive. Two notions are 
introduced. The 'autonomous plane' of discourse gives access to the 
record of experience of speakers by integrating previous experiences 
in the form of words and phrases in a text structure. The 'interactive 
plane' of discourse is in charge of negotiating between participants, 
selecting the effect of utterances and what features of the outside 
world utterances should incorporate. The organization of written text 
is also managed on the interactive plane, e.g. predictions, 
anticipations, self-reference, discourse labeling and participant 
intervention.

Some operations allow attention to switch between the two 
planes. 'Reports' transfer attention to the autonomous plane within an 
utterance, so that the author does not have to adhere to the fact. 
A 'reference' to the preceding discourse encapsulates the old 
interaction and makes it available on the autonomous plane. 'Quotes', 
however, remain on the interactive plane.

In fiction, then, as in a report, the author no longer avers each 
utterance. However, she does not attribute the utterances to an author 
in the real world either. The evaluation at the end (laughter, moral) 
then marks the return to averral. The notions introduced in this paper 
are illustrated in the analysis of a fragment of fiction.

Paper 4: On the integration of linguistic description
This paper elaborates and illustrates the notions developed in the 
previous paper. It is shown how the identification of the interactive and 
autonomous plane of discourse can be used for a descriptive system 
(annotation scheme) for the analysis of written texts and spoken 
discourse.

Paper 5: Written discourse structure
This paper elaborates ideas presented in Paper 1 in the analysis of 
data. Of central importance is the idea of encapsulation. Each new 
sentence takes over from the previous sentence the status of 'state of 
the text'. By default, each new sentence encapsulates the previous 
one by a reference. This removes the discourse function from the 
previous sentence and leaves mainly a meaning trace in memory, and 
only partially a trace of form. The encapsulation creates coherence, 
and cohesion is defined as the referencing act. Point-to-point 
references, e.g. a pronoun referring to its antecedent, are then 
interpreted mainly with reference to shared knowledge and not to the 
text.

'Logical acts' encapsulate the whole of the previous sentence (e.g. 
through the words 'but', 'therefore') or the previous half of the same 
sentence (e.g. through the words 'and', 'rather'). 'Deictic acts' also 
include the whole of the previous sentence (e.g. 'that', 'this').

A 'prospection' about the next sentence requires the next sentence to 
fulfill the created expectancies if coherence is to be maintained. A text 
is analyzed to illustrate and discuss this notion. Different sub-types of 
prospections, such as prospection through an attribution, internal 
prospection or advanced labelling are introduced.

Paper 6: The internalization of dialogue
This paper tries to link spoken and written discourse in a single 
description and does so in a very original way. The author claims that 
properties of sentence grammar can be understood by relating 
grammatical structures (subordinate clause, relative clause, noun 
phrase etc.) to features of spoken interaction, and that in the 
phylogenetic development of languages these features of spoken 
interaction are internalized (understood as ''creating a (language)-
internal representation of'').

Through the internalization of the 'speaker change', a single speaker 
can change the posture and present conflicting ideas. The speaker, 
when marking this change, is no longer bound by the requirement to 
be coherent in his posture.

Declarative, interrogative or imperative mood can be equally 
understood as internalization of performative aspects of discourse. By 
internalizing them the speaker can now achieve the same speech act 
with a combination of different moods. This extends the range and the 
finesse of mood choices and thus creates an open set of possible 
speech acts.

The internalization of speech acts as subordinate clauses frees them 
from their interactive function. Thus, hypotheses can be formulated by 
the speaker. Through the internalization, the move (i.e. the discourse 
unit) becomes a proposition, the averral becomes a truth value and 
the situational context becomes a possible world. When a move is 
internalized as a restrictive relative clause, this clause may specify 
which referents are included under a denotation by reference to a 
possible world. Prepositional phrases and attributive adjectives are 
derived from these by leaving the truth value unexpressed (e.g. by 
dropping the copula).

Paper 7: A tool for text explication
The author describes the history of text analysis/explication in its 
various forms (stylistics, discourse analysis) as a periodic movement 
between the poles of objectivity (e.g. using descriptive schemes) and 
subjectivity (to achieve a qualitatively rich analysis). In an impressive 
analysis of a small text fragment, the author shows how corpus data 
can be used in a qualitatively rich analysis of discourse strategies 
that is supported by massive objective data.

PART III Lexis and grammar
Paper 8: The lexical item
This paper starts from a historical account of the distinction 
between 'word' and 'lexical item'. The author revives the notion 
of 'lexical item' to describe the vocabulary in more meaningful terms, 
e.g. to account for the fact that a vocabulary is a limited set of 
meaningful items which in text can assume an unlimited number of 
meanings. An alternative model, according to which words are 
exchanged within their paradigm, is rejected as it creates artificial 
meanings and meaning ambiguities which are not felt by a native 
speaker. Instead, a mechanism called 'reversal' is introduced, 
according to which meaning is created from the context and takes 
precedence over the meaning assigned in the vocabulary. When 'lexical 
items' are used in generation, there is less choice than with words 
and almost no ambiguity.

The components of lexical items are those we have seen in Paper 2: 
the core, the semantic prosody (both obligatory), collocation, 
colligation and semantic preference. Through their syntactic flexibility 
(colligation) and semantic flexibility, lexical items allow for a limited 
paradigmatic choice and thus an integration with other lexical items in 
their context. New meanings are created when contextual constraints 
and lexical specifications do not match. The nature of a lexical item is 
illustrated in an analysis of the usage of the verb 'budge'.

Paper 9: The empty lexicon
This paper argues against the conception of language as a simple 
code for a message. According to the author, a message is only part 
of communication and the message cannot be easily separated or 
distilled from the form as many elements are concerned with 
negotiating the interaction and contributing to the message at the 
same time.

Discussing terminology first, the paper contrasts the 'terminological 
tendency', where words have fixed meanings, with the natural flexibility 
and variability of language. The function terminology has in the lexis 
is the same function that sublanguages have in grammar. 
Sublanguages also try to protect a chosen set of patterns and limit 
the influence of contextual factors on meaning. The terminological 
approach and the sublanguage approach are prevalent in a technical 
view of language, e.g. in Natural Language Processing. The technical 
approach is better suited to describing written language, especially 
scientific texts.

A proposal for a lexicon structure is elaborated. It includes two 
sublexica. One is similar to a termbank, the other is the flexible 
lexicon, initially empty. The lexicon learns about vocabulary from text 
and it is constantly updated. The only fixed element in this lexicon is 
its structure. It has three subcomponents, (1) the form of a lexical 
item, (2) an environment and (3) a meaning, together with associations 
between elements of these subcomponents.
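
To make the proposed structure more concrete, here is a minimal sketch 
in Python of the flexible sublexicon only. It is an illustration under 
stated assumptions and not Sinclair's specification: the class name 
'EmptyLexicon', the methods 'observe' and 'meanings_for', the use of 
plain strings for environments and meanings, and the use of simple 
counts as associations are all hypothetical. The termbank-like 
sublexicon is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EmptyLexicon:
    """Flexible lexicon that starts empty; only its structure is fixed."""

    # The three subcomponents named in the review.
    forms: set = field(default_factory=set)
    environments: set = field(default_factory=set)
    meanings: set = field(default_factory=set)
    # Associations between elements of the subcomponents.
    # Assumption: an association is a count of (form, environment, meaning).
    associations: dict = field(default_factory=lambda: defaultdict(int))

    def observe(self, form, environment, meaning):
        """Update the lexicon from one occurrence found in text."""
        self.forms.add(form)
        self.environments.add(environment)
        self.meanings.add(meaning)
        self.associations[(form, environment, meaning)] += 1

    def meanings_for(self, form, environment):
        """Return meanings seen for a form in an environment, most frequent first."""
        seen = [
            (count, meaning)
            for (f, e, meaning), count in self.associations.items()
            if f == form and e == environment
        ]
        return [meaning for count, meaning in sorted(seen, key=lambda p: (-p[0], p[1]))]


# Usage with hypothetical labels; nothing here is taken from the book.
lexicon = EmptyLexicon()
lexicon.observe("naked eye", "visible to the _", "unaided sight")
lexicon.observe("naked eye", "visible to the _", "unaided sight")
print(lexicon.meanings_for("naked eye", "visible to the _"))
```

The point of the sketch is that no entry exists before text has been 
observed, and that every observation updates the associations.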

Paper 10: Lexical grammar
This paper discusses the notions of lexis and grammar. It explains why 
these notions have been seen historically as two separate entities. A 
model based on this opposition, however, cannot account for 
meaning. Neither the study of the lexis with the help of referential or 
logical semantics, nor the study of grammar can assign meaning to 
syntagmatic patterns (cf. 'the naked eye'). Traditional frameworks 
cannot handle cross-border categories, semantic prosody or the 
vagueness of word classes. Without presenting an alternative model, 
however, the paper finishes with an exemplary analysis, similar to 
what we have seen with 'the naked eye'.

Paper 11: Phraseognomy
This short paper provides an analysis of the phrases 'Society of X' 
and 'Society for X'. This paper does not pretend to provide deeper 
insight beyond the specific example.

Paper 12: Current issues in corpus linguistics
This paper argues, essentially, against a number of ideas that are 
neither referenced nor fully described. The first argument is directed 
against the idea of a fixed, adequate lexicon for the purpose of Natural 
Language Processing and, related to it, the idea of sublanguage. The 
second fusillade goes against small corpora and the third against the 
(over-)annotation of corpora.

CRITICAL EVALUATION

While the overall impression of the book is very positive in terms of its 
intellectual challenges, its linguistic inspirations, the historical 
perspectives it offers and its capacity to bring together different lines 
of research, I cannot spare it some critical remarks.

First, the contributions vary in quality, scope and relevance. 
Paper 11 is nice to read but lacks any import beyond what has been 
stated repeatedly in the book. I found Paper 12 simply 
annoying. This paper epitomizes a writing style in which positions are 
criticized with at most a minimal summary and without reference to a 
specific person, publication or school. I have been forgiving 
throughout the book, seeing this style as the price for the wider view 
the author offers to the reader, but this paper doesn't offer this 
wider view and the discourse slips into unfair and unscientific 
shadow-boxing.

''But when someone says their corpus does not need to get any 
bigger ...'' (pg. 188)
Second, statements such as the one above can only be understood in the 
light of the assumption that corpus linguistics is a scientific paradigm 
defined by the 'exemplary instance of scientific research' (Kuhn 
1996/1962) realized by the author and his colleagues. Sometimes, this 
assumption shows up in half-sentences:
''In corpus linguistics, by contrast, we have to work on the assumption 
that ...'' (pg. 170)
''[T]he vast majority of work with corpora still takes place under the 
assumptions of pre-corpus linguistics'' (pg. 176)

The author thus silently tries to monopolize the term 'corpus 
linguistics' and to assign it the meaning of what Tognini-Bonelli 
identifies correctly as 'corpus-driven approach' within the area of 
corpus linguistics. The author thus denies the label 'corpus linguistics' 
to those researchers who understand corpus linguistics differently, 
e.g. as a (complementary) research method (Biber et al. 1998).

Third, the general tendency in these articles to cite research only 
when it can be integrated en passant or to fire a broadside 
on 'computational linguistics' or 'structural linguistics' is 
counterproductive to the advancement of science. As Kuhn 
(1996/1962) has taught us, new paradigms not only come up with a 
new theory but also with new data. And this is what the author does 
extraordinarily well. But as long as the data of the other paradigms 
cannot be accounted for, or can be shown to be artificial data or 
represent an artificial problem, we have two theories (old and new) 
which describe different data derived from the same world. Much 
would have been gained in this book if, instead of repeatedly 
providing new data for theory verification, an analysis of other 
theories' data had been given (e.g. in Paper 5, the so-called 
donkey sentences of Kamp & Reyle 1993, or in Part III, 
Mel'cuk's 'heavy smoker' (1974) or Pustejovsky's 'fast car' and 'fast 
secretary' (1995)).

Finally, attempts to make the language of the book accessible have 
either not been made or they have not been successful. Sentence 
structure is unnecessarily complex, e.g.:
''This chapter concerns the relation between the two types of patterns 
that are mainly recognized as the means whereby language creates 
meaning.'' (pg. 164)

and sometimes barely understandable:
''A user community that kept clearly separate the language that was 
used in a particular subject-matter area, and whose usage in that area 
differed markedly from its other usage and the usage of comparable 
communities, while remaining largely within the rules of the general 
language - such conditions would identify a sublanguage.'' (pg. 152)
''Professional linguists should not be surprised to experience a rather 
disturbing effect from the massive surge in the availability of evidence 
and the growing sophistication of the tools for examining it and testing 
hypotheses against it that corpus linguistics has brought.'' (pg. 173)

To sum up, the content of the book will serve as a rich source of 
inspiration to those who are involved in corpus linguistic research, 
lexicography and discourse analysis. The book, however, is not suited 
as a general introduction and certainly not as a textbook for 
university courses. The price of the book, the writing style and the 
fragmented presentation of ideas are responsible for the fact that the 
ideas will remain difficult to access.

REFERENCES

Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus 
Linguistics: Investigating Language Structure and Use, Cambridge 
University Press.

Hans Kamp & Uwe Reyle, (1993) From Discourse to Logic. 
Introduction to Model theoretic Semantics of Natural Language, 
Formal Logic and Discourse Representation Theory, Dordrecht, 
Kluwer Academic Publishers.

Thomas S. Kuhn (1996/1962) The Structure of Scientific Revolutions. 
University of Chicago Press, 3rd edition.

Igor A. Mel'cuk (1974) Opyt teorii lingvisticeskix modelej Smysl <=> 
Text. Semantika, sintaksis. Izdatel'stvo ''Nauka'', Moskva.

James Pustejovsky (1995) The Generative Lexicon, MIT Press, 
Cambridge.

Elena Tognini-Bonelli (2001) Corpus Linguistics at Work. Benjamins. 

ABOUT THE REVIEWER

Oliver Streiter teaches computational linguistics and corpus linguistics 
at the National University of Kaohsiung, Taiwan. His current research 
focuses on applications in Computer Assisted Language Learning 
("Gymn@zilla") and a project which aims at the compilation and 
annotation of linguistic resources to support low-density languages.
