16.27, Review: Text/Corpus Ling: Aijmer & Altenberg (2004)

Tue Jan 11 02:00:31 UTC 2005

LINGUIST List: Vol-16-27. Mon Jan 10 2005. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Collberg, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Ogasawara <naomi at linguistlist.org>
================================================================  

What follows is a review or discussion note contributed to our 
Book Discussion Forum. We expect discussions to be informal and 
interactive; and the author of the book discussed is cordially 
invited to join in. If you are interested in leading a book 
discussion, look for books announced on LINGUIST as "available 
for review." Then contact Sheila Collberg at collberg at linguistlist.org. 

===========================Directory==============================  

1)
Date: 10-Jan-2005
From: Rolf Kreyer < rkreyer at uni-bonn.de >
Subject: Advances in Corpus Linguistics 

-------------------------Message 1 ---------------------------------- 
Date: Mon, 10 Jan 2005 20:46:41
From: Rolf Kreyer < rkreyer at uni-bonn.de >
Subject: Advances in Corpus Linguistics 

EDITOR: Aijmer, Karin; Altenberg, Bengt 
TITLE: Advances in Corpus Linguistics 
SUBTITLE: Papers from the 23rd International Conference on English 
Language Research on Computerized Corpora (ICAME 23) Göteborg 22-26 May 
2002 
SERIES: Language and Computers Vol. 49 
PUBLISHER: Rodopi 
YEAR: 2004
Announced at http://linguistlist.org/issues/15/15-2014.html

Rolf Kreyer, University of Bonn

The volume under review is a collection of papers from the 23rd 
International Conference on English Language Research on Computerized 
Corpora and contains a total of 22 articles on 419 pages. The papers cover 
a wide range of topics, which according to the editors "illustrate clearly 
the diversity of research that is characteristic of corpus linguistics 
today" (1). The contributions are subsumed under six "broad -- and 
inevitably overlapping -- categories" (1):
* The role of corpora in linguistic research
* Exploring lexis, grammar and semantics
* Discourse and pragmatics
* Language change and language development
* Cross-linguistic studies
* Software development 
The following synopsis will give a summary of the key points of each of 
the articles. The review will conclude with a critical evaluation.

SYNOPSIS

The first section, 'The role of corpora in linguistic research', starts 
with an article by Michael Halliday, who explores the spoken language 
corpus as a foundation for grammatical theory. Quantitative research into 
spoken language, in his view, will increase our understanding not only of 
spoken language itself but also of language as a whole. It is 
in spoken language that "systemic patterns are established and maintained 
[...], instantial patterns are all the time being created [...] and the 
instantial can become systemic" (25). For instance, patterns, as they are 
described by Hunston/Francis (2000), Halliday claims, will most probably 
develop and change in speech. Here also 'non-standard' patterns like the 
ones below are found (19):
(1) It's been going to've been being taken out for a long time. [of a package
left on the back seat of the car] 
(2) All the system was somewhat disorganized, because of not being sitting 
in the front of the screen. [cf. because I wasn't sitting ...]. 

Such instances should not be dismissed as errors but rather be seen as "productive 
innovations which pass unnoticed in speech but have not (yet) found their 
way into the written language" (19). The transcription of spoken corpora, 
however, is not without problems: it is well-known that meaningful 
prosodic features are often not represented, but in Halliday's view the 
problem of over-transcribing is more serious. For instance, only in 
transcribed speech are homophonous forms such as 'icicle' and 'eye sickle' 
overtly distinguishable; thus "writing systems mask the indeterminacy in 
the spoken language" (16). The analysis of spoken corpora might also prove 
challenging, due to what Halliday calls "the lexicogrammatical bind" (21) 
of corpus research. Obviously, lexical phenomena are more accessible to 
corpus linguistic methods than grammatical ones. Spoken language, however, 
shows a high level of grammatical intricacy and favours grammatical 
systems as opposed to written language, where meaning tends to be conveyed 
through lexis (cf. Halliday 1989). Written language therefore is 
inherently more amenable to corpus linguistic analysis than spoken language. 
So, "especially in relation to a spoken language corpus, there is work to 
be done to discover ways of designing a corpus for the use of 
grammarians" (23).

John Sinclair examines "the roles of intuition and annotation in corpus 
linguistics" (41), thereby trying to clarify the stance of corpus-driven 
as opposed to corpus-based linguists. For Sinclair the "elusive faculty" 
(41) of intuition seems to have a dual status: on the one hand, intuition 
has been shown not to be trustworthy: for the most part invented sentences 
are not of the kind that are usually found in a corpus, and the findings 
that the corpus yields often differ drastically from what has been 
expected. On the other hand, the corpus-driven linguist has "a great 
respect for intuition, and cannot work without it" (56), since it 
organises corpus evidence; as Sinclair puts it: "[t]here is no escape from 
intuition if you have command of the language you are investigating" (47). 
However, while the corpus-based linguist "allows his intuition to overrule 
his corpus data and hence gives primacy to the former" (40), the corpus-
driven linguist tries to keep intuition at bay and is aware of its 
limitations at all times. 

Similar discrepancies seem to divide corpus-based and corpus-driven 
researchers on the topic of annotation: while it seems indispensable to 
the former, it is rather obfuscating to the latter. Sinclair's scepticism 
towards annotation is due to two reasons: firstly, the language models 
that underlie most of the tagging programmes are usually pre-corpus 
models. Unfortunately, these models are not made subject to close scrutiny 
on the basis of corpus evidence but, according to Sinclair, it is usually 
assumed "that the models are basically correct, and [that ...] there is no 
need to open up the whole complexity of language theory and description 
for the sake of some minor blemishes" (52). The second argument against 
annotation is linked to the first one: since pre-corpus language models 
are inadequate for the description of corpus data, human intervention is 
necessary. As a consequence, the process of annotation is not entirely 
replicable, thereby failing the first test of scientific method. However, 
despite his conclusion that "corpus-driven linguists are not likely to 
have much use for annotation" (56), Sinclair concedes that it "has its 
place in application, where quick results are needed and rough-and-ready 
ones will suffice" (56).

Starting off with a short discussion of Chomsky's well-known three levels 
of explanatory, descriptive and observational adequacy (1964: 62-3), Leech 
argues that "a more realistic account of the main strata of investigation 
in linguistics" (62) could be arrived at by the following hierarchy:

THEORY: formal [and functional] characterization or explanation of 
language as a phenomenon of the human mind and of society. 
DESCRIPTION: formal [and functional] characterization of a given language, 
in terms of theory. 
DATA COLLECTION: collection of observations which a description, and 
ultimately a theory, has to account for [e.g. corpora] (62). 

In order to explore the relation between the above levels and in order "to 
argue against the common assumption that corpus linguistics is concerned 
with 'mere data collection' or 'mere description'" (62), Leech describes 
two short-term diachronic case studies on modal auxiliaries and 
grammatical changes relating to colloquialization. Both studies are based 
on the Brown, LOB, Frown and FLOB corpora and two spoken mini-corpora 
extracted from the SEU and the ICE-GB corpora. Leech emphasizes that the 
description of corpus data does not necessarily lead to true statements 
about a language as such. The corpus linguist always has "to bear in mind 
some hazardous assumptions which can be made in moving from data 
description to language" (70), for instance, the well-known issues of 
representativeness and of interpreting statistical significance. 

This, however, should not lead to discarding the corpus linguistic 
enterprise as such. Rather, these hazards should be regarded as a reminder 
that corpus-linguistic results usually are provisional and that "further 
corroborating evidence as well as means of increasing accuracy and 
reliability" (71) need to be sought. Finally, in moving from the level 
of description to the level of theory, the researcher will have to find 
explanations for empirical data: for instance, the decline of modals 
between the 1960s and the 1990s, which Leech describes, might be accounted 
for by language-internal factors, such as processes of grammaticalization, 
or by external factors, such as colloquialization, democratization or 
Americanization. On the whole, then, "corpus linguistics is not purely 
observational or descriptive in its goals, but also has theoretical 
implications" (61).

Section 2, 'Exploring lexis, grammar and semantics', starts with an 
article by Joybrato Mukherjee who investigates the place of corpus data in 
a usage-based cognitive grammar. The author tries to show "that corpus 
linguistics and cognitive linguistics are not at all mutually exclusive 
but can fruitfully complement each other in developing a genuinely usage-
based model of [...] speakers' knowledge of the underlying language 
system" (96). In particular, the author uses an analysis of the 
ditransitive verb GIVE in ICE-GB to illustrate how the lexical and 
constructional networks of cognitive grammar (e.g. Langacker 1999) can be 
refined by incorporating corpus data. Firstly, corpora provide 
frequencies, which in turn yield insights into the strength of the 
different links between a particular lexical item and the constructions in 
which it can occur. In the case of GIVE, for instance, it is found that 
38% of all tokens occur in the pattern 'GIVE + Oi + Od'. The second most 
frequent pattern, 'GIVE + Od', accounts for 23.2% of the data. These 
patterns are supposed to be more deeply entrenched in the cognitive system 
than the other less frequent patterns of GIVE. In addition, corpus data 
also provide insights into the context-dependent principles that are at 
work in the selection of a particular pattern. The author, for instance, 
finds that the pattern 'GIVE + Od' is used only in those cases where the 
recipient is either retrievable from the context or where the 
specification of the recipient is irrelevant. Thus, Mukherjee 
claims, "corpus-linguistic methodology obviously opens up new and 
promising perspectives in cognitive linguistics" (97).

Caroline David puts 'putting verbs' to the test of corpora. In particular, 
she attempts to outline a new typology of 'putting verbs' by taking into 
account quantitative data from the corpora Brown, Frown, LOB, FLOB and the 
BNC. The first part of her paper is concerned with PUT, SET, PLACE, and 
LAY. The author finds that PUT is the most frequent of the four and is 
more likely to occur in idiomatic structures than the other three. This 
the author counts as evidence for "generalness of meaning" (102). The 
other three, in contrast, seem to be associated with a particular way of 
putting, namely a rather careful way. PUT, therefore, "is considered the 
prototypical verb of the general process of putting with little additional 
information regarding the way things are displaced" (105) while the other 
three "are classified together as a kind of manner of putting" (105). The 
second part of the paper concerns verbs of the SPRAY/LOAD class, namely 
LOAD, COIL and FILL. Here, the author is mainly concerned with syntactic 
alternations of the following kind: 
(3) I loaded school trunks on to the car. 
(4) I loaded the car with school trunks. 

The author claims that in example (3) "the default interpretation is that 
all the trunks are loaded, irrespective of whether the car is 'full' or 
not" (107). Constructions of type (3), therefore, usually take 
a 'quantification' reading and are thus similar to constructions with COIL. 
In the second case, however, a qualification, namely that the car is now 
full, is emphasized. Constructions of type (4) thus resemble those with 
FILL-verbs, such as CLOAK, FLOOD, or SOAK. 

Peter Willemse explores the relationship between 'esphoric' reference 
(cataphoric reference within the same nominal group) and pseudo-definite 
NPs, i.e. NPs that "are formally definite but in fact realize presenting 
rather than presuming reference" (117). Willemse focuses on pseudo-
definite NPs in unmarked existential constructions, since their semantics 
entail that the postverbal NP is indefinite. A formally definite 
postverbal NP will therefore always have 'pseudo'-definite referential 
status, as does the NP "the usual sleazy reasons for that" in the following 
(122):
(5) The Woody Allen-Mia Farrow breakup [...] seems to have everyone's 
attention. There are the usual sleazy reasons for that, of course - the 
visceral thrill of seeing the extremely private couple's dirt in the 
street, etc.

On the basis of 200 tokens from the Bank-of-English corpus, the author 
tries to find a "motivation of the use of the definite article in [...] 
the pseudo-definite NPs" (130). Willemse provides two possible 
explanations: 
(i) The postverbal NP may have 'dual reference', i.e. it may refer to a 
type, which is usually hearer-old, and a token, which is usually hearer-
new. In example (5) above, for instance, the specific reasons for the 
public fascination are introduced into the discourse and, therefore, 
hearer-new. However, the general type of reason that explains such 
attention is assumed to be known to the hearer, i.e. hearer-old. 
(ii) The other explanation lies in what Willemse calls "a relation of 
[...] 'forward bridging' within the NP" (131). In example (6) below, the 
definite article in 'the shrunken head' is licensed through the fact 
that "the identity of its referent is recoverable by virtue of an 
experiential connection with the entity introduced by the second NP: a 
head is a part of (the body of) a boy" (123). In such cases, as in (6), 
the definite article is therefore motivated by esphoric reference (123).
(6) In a room outside the court he talked with the French prosecuting 
counsel, [...]. There was the shrunken head of a Polish boy. 

In his article 'Why "an angel rides in the whirlwind and directs the 
storm"', Jonathan Charteris-Black analyses the use of metaphor in 
political corpora. On the basis of the 51 Inaugural Addresses of the 
American Presidents and the political manifestos of the Labour and the 
Conservative parties from 1945 to 1997, the author explores the similarities 
and the differences between types of American and British political 
discourse. With regard to similarities, for instance, Charteris-Black 
finds that POLITICS IS CONFLICT is the most frequently used metaphor in 
the two corpora. This conflict either shows in action for "abstract social 
goals that are positively evaluated" (138) or in action against "social 
phenomena that are negatively evaluated" (138), as shown in these examples 
(138, 139):
(7) While continuing to defend and respect the absolute right of 
individual conscience .... 
(8) [...] we intend to continue our fight against all form of social 
injustice. 

More interesting, perhaps, are the differences between the two corpora. For 
instance, the author finds that the fire metaphor is only used in the 
American corpus. This may be due to the fact that the fire metaphor was 
used by George Washington in the context of liberty. Apparently, "the 
metaphorical link between fire and liberty has become a source of 
intertextual reference in presidential addresses" (143). On the other 
hand, plant metaphors are only attested in the British manifestos. Again, 
the author suggests a historical-cultural explanation: "the British 
passion for gardening lead[s] to the positive associations of words such 
as 'growth' and 'nurture'" (149). Charteris-Black also reports on metaphor 
borrowing. The conceptual metaphor POLITICS IS RELIGION is well 
represented in the American corpus but is only found in the more recent 
British manifestos; this metaphor seems to have found its way from 
American into British political discourse.

Peter Tan, Vincent Ooi and Andy Chan, in their article on "Signalling 
spokenness in personal advertisements on the Web", discuss the use of 
English as a second language in this register by South East Asians. Within 
this speech community, "English is often relegated to the position of 
a 'neutral' and 'transactional' (as opposed to 'interactional') language 
where 'affect' (emotion) is played down" (151). The question now arises as 
to how English language resources are employed for informal, private and 
personal means in personal advertisements (PA) by South East Asians. In 
particular, the authors want to analyse "to what extent [...] resources of 
spoken discourse [are] relied on in PA" (163). To this end, they compare 
the frequencies of augmenters (e.g. 'very', 'a lot', or 'really') and 
mitigators (e.g. 'somewhat', 'a bit' or 'only') in a corpus of South East 
Asian adverts with their usage in a spoken and a written subcorpus of ICE-
SIN (the Singapore component of the International Corpus of English). On 
the basis of this data, the authors find "that personal advertisers tend 
to make use of features of spokenness" (163). However, it would 
be "premature to say at this stage that Netspeak in South East Asia is 
closely associated with the norms of spoken language although it seems to 
be an important contributor to the norms associated with personal 
advertisements" (163).

"Textual colligation: a special kind of lexical priming" by Michael Hoey 
opens up the third section of the proceedings, "Discourse and Pragmatics". 

Hoey advocates a view that regards "textual relationships (interactive, 
linear, cohesive, hierarchical and structural) as dependent upon and 
created by the lexis of the language in a manner not exhausted by the 
demands of the individual text" (173), thereby claiming a vital role for 
corpus linguistic methods and findings in text linguistic research. In 
analogy to the term 'colligation', which captures the interdependencies of 
lexis and syntax, the author employs the term 'textual colligation' to 
denote the "positive and negative preferences of a lexical item with 
regard to [...] textual features" (174) such as participation in cohesive 
chains or occurrence as part of the theme in a Theme-Rheme relation. 

An analysis of a 100-million-word corpus, predominantly of Guardian newspaper text, 
shows, for instance, that the lexical items 'army', 'baby', or 'political' 
occur as members of cohesive chains, whereas 'afterwards', 'best' 
or 'particularly' seem to show no tendency to form such chains, i.e. these 
lexical items have a negative preference with regard to the textual 
feature 'cohesion'. Words such as 'reason' or 'option', on the other hand, 
are neutral in this respect; they may occur in cohesive chains but if so, 
the chains are usually short. With regard to the feature 'occurrence as 
theme', Hoey finds that in 75% of 294 instances 'sixty' occurs as part of 
the theme in a Theme-Rheme relation; interestingly, orthography seems to 
be relevant here, since '60' does not show this tendency. The preferences 
of lexical items for particular textual features should not be analysed in 
isolation from each other. The simultaneous occurrence of a lexical item 
in two or more textual features will lead to highly interesting 
generalizations: for instance, an item that "has a positive preference for 
both Theme and cohesive chains [...] will inevitably have a positive 
preference for Thematic Progression" (177). Moreover, textual-colligation 
analysis need not stop at the word level. The lexical items 
within a phrase may share certain preferences for textual features and 
thus create a particular 'colligational prosody'.

Hilde Hasselgard explores "adverbials in IT-cleft constructions" on the 
basis of data drawn from the British component of the International Corpus 
of English (ICE-GB). In particular, Hasselgard focuses on two aspects: (1) 
the information structural role of the adverbial, and (2) the discourse 
function of the whole IT-cleft construction. As to the first point, the 
author reports a marked difference in the information structure of clefts 
with adverbials as opposed to the other kinds of cleft constructions: "IT 
clefts with adverbials occur by far most commonly with cleft clauses 
conveying new information (86%), while the cleft clauses of IT-clefts in 
general seem to be divided about equally between given and new 
information" (200). The author's discussion of the discourse functions of 
adverbial-IT-clefts draws largely on Johansson's (2002) fourfold 
taxonomy, which distinguishes contrast, topic launching, topic linking and 
summative functions, all of which Hasselgard finds attested in her data, 
too. However, she adds a further function, namely thematization, which 
serves "to make extra clear what is to be understood as the theme and the 
rheme of a sentence" (204), as in the following example (204): 
(9) It is with much regret that I find it necessary to send you a copy of 
the enclosed letter which is self explanatory. 

According to Hasselgard, the writer here "wants to give thematic 
prominence to the regret he/she feels" (204). In addition, she suggests 
that thematization might be regarded as superordinate to Johansson's four 
discourse functions. For instance, if the focused constituent in a cleft 
construction is especially marked off as the theme, this may also serve to 
mark the theme as contrastive or it may be employed to introduce a new 
topic into the discourse.

Section 3 concludes with Bernard De Clerck's article "on the pragmatic 
functions of 'let's' utterances" in the spoken part of ICE-GB. 
Prototypically, these utterances "have the directive illocutionary force 
of a proposal for joint action [... where] the speaker commits herself to 
an action and seeks the addressee's agreement" (217). However, 'let's' 
utterances may also assume speaker or hearer orientation. In the first 
case, the construction may be used to secure the addressee's agreement to 
an action that the speaker is currently carrying out. In the case of 
hearer-orientation, the utterance may "camouflage an authoritative speech 
act as a collaborative one" (219). In both cases, the idea of joint action 
recedes into the background. Most frequently, 'let's' is used in a 
conversational function, namely to influence the flow of conversation. In 
this case "they are more like announcements of a topical shift that round 
off the present topic and introduce the next step in the talk" (225). This 
function involves interesting sociolinguistic consequences: 'let's' as a 
conversational imperative "seem[s] to be part of the repertoire of [...] 
interactionally more powerful speakers, who present the conversation as a 
joint enterprise, but actually try to control it by restricting the 
hearer's influence to a minimum" (226). A minor function of the 
construction is to present the speaker's evaluations or feelings at an 
interpersonal level, as in example (10) below, where the speaker evaluates 
the hearer's behaviour. Again, the prototypical aspect of 'proposal for 
joint action' is no longer present in such cases (228):
(10) A: God you really know how to put someone down don't you
B: Oh let's not get touchy touchy. 

The fourth section on "Language change and language development" starts 
off with a paper by Thomas Kohnen, who provides a diachronic case study of 
English directives, thereby addressing a number of "methodological 
problems in corpus-based historical pragmatics". Such problems, for 
instance, include what Kohnen calls 'pragmatic false friends', 
i.e. "constructions which, against a contemporary background, suggest a 
wrong pragmatic interpretation" (239). Example (11) (taken from 
Shakespeare's 'The Merry Wives of Windsor') is a case in point (239):
(11) Ford: Blesse you sir.
Fal.: And you sir: would you speake with me? 

In this case, the utterance 'would you speake with me?' should not be 
understood as a request but as "a real question which serves to identify 
the man who wanted to talk to Falstaff" (240). Modern English does not 
allow this interpretation. Another methodological issue, not surprisingly, 
is the lack of sufficient data. This may be balanced by concentrating on 
individual text types or genres and their functional profiles. On the 
whole, Kohnen argues for what he calls 'structured eclecticism': 
diachronic pragmatic analysis should be based on "a deliberate selection 
of typical patterns which we trace by way of representative analysis 
throughout the history of English" (238). Furthermore, "a diachronic 
analysis of speech acts should be embedded in a reasonably stable 
functional profile of text types" (242). This method is put into practice 
in a diachronic analysis of English directives. The author finds that, on 
the whole, there seems to be a move away from the explicit and direct 
forms of directives (e.g. imperatives) to more indirect alternatives, such 
as interrogative realisations. As the underlying motivation for this 
development, Kohnen identifies "the growing importance of considerations of 
politeness" (246), which entails a reduction of possibly face-threatening 
speech acts.

Liselotte Brems discusses "degrees of delexicalization and 
grammaticalization" in measure nouns (MNs) such as 'bunch(es) of' or 
'heap(s) of', and attempts to clarify "the status of the MNs [...] within their 
respective NPs" (250). In particular, two analyses seem appropriate: the 
MN may either function as the head of the bi-nominal NP of which it is a 
part, as in (12) (250), or it may be regarded as a quantifier of the 
second NP within the construction, as in (13) (251). Other instances, such 
as (14) (250), are not easily decided. 
(12) The fox, unable to reach a bunch of grapes that hangs too high, 
decides that they were sour anyway. 
(13) But then, when I needed one, there were a load of excuses as to why I 
couldn't borrow one. 
(14) We still have to move loads of furniture and other stuff. 

The general structural status of MNs, therefore, is far from clear. As an 
answer to this problem, Brems suggests regarding "the developments observed 
in MN constructions [...] as a case of ongoing delexicalization and 
grammaticalization in MNs" (251). In particular, delexicalization is 
understood as a precursor to grammaticalization, i.e. the "gradual 
broadening of collocational scatter [... and the] loosening of the 
collocational requirements imposed by the MN" (256) paves the way for "the 
re-interpretation of the MN as a quantifier" (256). Her corpus study of 
MNs reveals different degrees of synchronic grammaticalization. For 
instance, 'heaps of' is used as a quantifier in 65.6% of all cases, 
whereas only 4.7% of the tokens of the semantically related 'piles of' 
occur in the same function. According to Brems, these findings can be 
explained by the fact that 'pile' is associated with a "feature of 
verticality and constructional solidity" (261) which blocks processes of 
semantic generalization. On the other hand, 'heap' lends itself more 
easily to delexicalization (and subsequent grammaticalization) since it 
is "in itself more vague and simply profiles an undifferentiated mass" 
(261).

Göran Kjellmer investigates the use of 'yourself' as "a general-purpose 
emphatic-reflexive". In traditional grammar, the relation between the personal 
pronoun 'you' and its reflexive counterparts 'yourself' and 'yourselves' 
is treated as fixed and stable. However, Kjellmer comes up with a large number 
of 'deviant' uses of 'yourself' in the CobuildDirect and the BNC corpora 
which seem to imply "an ongoing extension of its semantic range, and 
consequently an increasing lack of precision" (270). In (15) below, for 
example, 'yourself' unambiguously has plural reference (272):
(15) Well can you sort that out amongst yourself [...] 

Kjellmer reports on even more deviant (and also rarer) cases, where the 
plural that the reflexive pronoun refers to is not limited to the second 
person (273): 
(16) [...] we were told to use physical resources like deep breathing and 
actually making yourself sit down and making yourself go floppy. 

Apparently, 'yourself' has "become more general in its application" (273). 
Furthermore, similar to 'you' as a substitute for the missing generic 
personal pronoun in English, 'yourself' also seems to be used generically. 
A most illustrative example is given in (17) where 'yourself' refers back 
to generic 'one' (274):
(17) [...] in an engineering course one concerns yourself only with how to 
apply and harness phenomena 

A possible final stage of the changing use of 'yourself', in Kjellmer's 
view, might be witnessed in the following examples (275): 
(18) I like boxing because it means I can defend yourself if you ever 
needed to 
(19) Pete's gone down to the shop and got yourself a bottle of whisky  

Here, the reflexive pronoun is used specifically with reference to non-
second-person entities. On the whole, Kjellmer argues that 'yourself' 
might be regarded as "a general-purpose emphatic reflexive pronoun" (275) 
which "has become a close reflexive pronoun copy of [... 'you'] by getting 
rid of constraining features in its later stages of development" (275).

Clive Souter explores "aspects of spoken vocabulary development in the 
Polytechnic of Wales Corpus of Children's English [POW]". Although the 
corpus is fairly small (roughly 61,000 words) and was originally 
compiled to study syntactic and semantic development in children from 6 to 
12, Souter argues that "it does have great value for researchers into 
child language development, TEFL [Teaching English as a Foreign Language] 
syllabus designers and course-book authors" (280) and sets out to show the 
potential of POW for the study of children's vocabulary development. 
However, as Souter points out, results have to be interpreted with great 
care due to limitations of corpus size and corpus compilation. For 
instance, the data show that the active vocabulary of children in the 
corpus increases by only around 50 words per year, which, however, might be 
an artifact due to "the limited activities used to elicit speech from the 
children" (279), such as Lego building or conversation with adults about 
games or TV. The author also reports on a difference in frequency of the 
most common affirmative or negative expressions (e.g. 'yeah', 'yes', 'no' 
or 'can't') among boys and girls: boys, in general, seem to prefer 
positives while girls more frequently use negatives. Again, the 
interpretation of the results is difficult. They might indicate a general 
trend but the frequencies might also be explained as a consequence of 
corpus compilation - the author concedes: "[p]erhaps Lego building elicits 
more positive responses from boys and more negative responses from girls" 
(285). More interesting is the finding that the vocabulary of boys and 
girls used in similar contexts only partly overlaps. No more than half of 
the words boys and girls use are used by both sexes, whereas the other 
half seems to be sex-specific. This feature, as Souter points out, is 
worth more investigation and then might indeed turn out to be "promising 
and perhaps disturbing, from the point of view of syllabus and course 
material designers" (288).

In the last paper of section 4, Roumiana Blagoeva describes the use 
of "demonstrative reference as a cohesive device in advanced learner 
writing". In particular, she is interested in "the under/overuse of the 
demonstratives 'this', 'that' and their plural variants 'these', 'those'" 
(298) by advanced Bulgarian learners of English. As a basis for comparison 
she chooses the Bulgarian sub-corpus of the International Corpus of 
Learner English, the British component of the Louvain Corpus of Native 
English essays, a sub-corpus of the BNC from the domains 'Applied 
Science', 'Social Science' and 'World Affairs', and a collection of 
Bulgarian texts similar to the BNC sub-corpus. Her analysis shows, for 
instance, that 'near'-demonstratives, i.e. 'this' and 'these', are 
underused by Bulgarian learners when compared to British students while at 
the same time the 'remote' types of demonstratives are overrepresented. 
This cannot be accounted for by L1 interference, since the Bulgarian 
equivalents of 'that' and 'those' show a very low frequency in the 
Bulgarian corpus. Rather, the author suggests, a reason seems to lie in 
the teaching material that is used in Bulgaria: although Bulgarian, 
similarly to English, distinguishes near and remote demonstratives, the 
distinction between the English counterparts seems to be overlooked in 
teaching materials: "learners are left with the impression [...] that 
both 'this' and 'that' [...] could be used indiscriminately to point to 
any word, phrase or longer stretch of text" (304). Interestingly, both 
Bulgarian and British students show a high proportion of 'this' 
and 'these' in comparison to the BNC sub-corpus. Blagoeva suggests that 
this might be due to "an influence on learner production by the nature of 
the text type" (305). Furthermore, the author contends that learners of a 
foreign language at some point stop learning and mainly seem to be focused 
on remedying remaining mistakes in the field of lexis and syntax rather 
than developing skills to arrive at "a more target-like way of producing 
coherent texts" (306), which, of course, would include a native-like use 
of demonstratives.

In "Translation as semantic mirrors", the first paper of section 5, Helge 
Dyvik describes a method for identifying wordnet relations (e.g. synonymy 
or hyponymy) on the basis of parallel corpora. The basic assumption 
underlying Dyvik's approach is that "semantically closely related words 
ought to have strongly overlapping sets of translations, and words with 
wide meanings ought to have a larger number of translations than words 
with narrow meanings" (311). The results he presents are extracted 
manually from the 2.6 million word English-Norwegian Parallel Corpus 
(ENPC). Searching for a particular Norwegian or English word form in the 
corpus will yield all the original sentences that contain this word form 
and its translations into English or Norwegian, respectively. From this 
set of translations, a human analyser can then compile a list of possible 
translations of the word form in question. These lists form the basis for 
further analyses. The information they contain may, for instance, be used 
to distinguish different senses of a particular word. The Norwegian 
word 'tak', for example, is translated into 'roof', 'ceiling', 'cover', 
'grip' and 'hold'. These five word forms are in turn translated into 
various Norwegian words, which form a number of sets that all 
contain 'tak' but also partially intersect. The translations for 
English 'roof' and 'ceiling', for instance, overlap not only in 'tak' but 
also in Norwegian 'hvelving'. Similarly, the translations for 'grip' 
and 'hold' share Norwegian 'tak' and 'grep'. These two pairs of 
translation sets, however, do not intersect apart from 'tak' itself. One 
can thus conclude that Norwegian 'tak' has at least two distinct senses, 
namely 'roof/ceiling' and 'grip/hold'. After different senses have been 
individuated, semantic fields can be established on the basis of overlaps 
of translation sets. 'Beautiful', for instance, translates into 'vakker' 
and 'nydelig'. These, in turn, translate not only into 'beautiful' but 
also into 'cute' and 'cute'/'delicious', respectively. It follows 
that 'beautiful', 'cute' and 'delicious' are part of the same semantic 
field. Further procedures assign lexical features to individual entries 
and eventually lead to lattices that reveal hyperonym and hyponym 
relations among senses, and even identify sub-senses and near-synonyms of 
each individual sense.
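
The sense-individuation step can be illustrated with a minimal Python 
sketch. It is not Dyvik's implementation: the function name and the 
grouping rule (merge two translations whenever their back-translation sets 
share a member other than the source word) are assumptions made for 
illustration, and the data are only the 'tak' forms quoted in this review. 
'Cover' is left out because no back-translations are reported for it.

```python
from itertools import combinations

# Back-translation sets as reported in the review for Norwegian 'tak'.
# 'cover' is omitted: the review lists no Norwegian translations for it.
INVERSE_IMAGES = {
    "roof": {"tak", "hvelving"},
    "ceiling": {"tak", "hvelving"},
    "grip": {"tak", "grep"},
    "hold": {"tak", "grep"},
}


def sense_groups(word, inverse_images):
    """Group the translations of `word` into candidate senses.

    Two translations end up in the same group when their back-translation
    sets share at least one member other than `word` itself (assumed rule).
    """
    groups = [{t} for t in inverse_images]
    for a, b in combinations(inverse_images, 2):
        shared = (inverse_images[a] & inverse_images[b]) - {word}
        if not shared:
            continue
        group_a = next(g for g in groups if a in g)
        group_b = next(g for g in groups if b in g)
        if group_a is not group_b:
            group_a |= group_b
            groups = [g for g in groups if g is not group_b]
    return groups


SENSES = sense_groups("tak", INVERSE_IMAGES)
for sense in SENSES:
    print(sorted(sense))
```

With these four entries the sketch yields two groups, one for 'roof' 
and 'ceiling' and one for 'grip' and 'hold', which matches the two senses 
described above.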

Åke Viberg analyses "physical contact verbs in English and Swedish from 
the perspective of crosslinguistic lexicology". On the basis of data drawn 
from the English Swedish Parallel Corpus (ESPC), the author presents an 
extensive and highly detailed comparison of the English verbs 'strike', 
'hit' and 'beat' with their primary Swedish translation 'slå'. The author 
finds several interesting differences between the items at issue. 
'Strike', 'hit' and 'beat' in their prototypical usage as a "bodily action 
verb", for instance, most frequently take human beings as objects. This, 
however, only seems to be a tendency, "whereas it is more or less a 
requirement of Swedish 'slå'" (332). Furthermore, the Swedish verb occurs 
with a human subject in 70% of 
all instances. The English counterparts show a mixed picture: while 'beat' 
with 72% of human subjects is similar to 'slå', 'strike' and 'hit' are not 
(41% and 48%, respectively). With these verbs "natural disasters, economic 
crises, wars and diseases" (334) seem to be frequent subjects. The same 
subjects in Swedish usually co-occur with a different verb, 
namely 'drabba', which could roughly be translated as 'afflict'. 
Similarly, if the subject is a projectile (e.g. a bullet), English 'hit' 
is the most frequent verb, whereas Swedish again does not use 'slå' 
but 'träffa', meaning 'hit a target'. It follows that, generally, 'slå' "is 
grounded more firmly in sensorimotoric experience of limb movement" (349), 
which prototypically makes use of arm and hand. For the English 
counterparts the sensorimotoric aspect does not seem to be as central.

Anna-Lena Fredriksson aims "to discuss different approaches to the notion 
of theme and to show how parallel corpora can successfully be used for 
cross-linguistic analyses of theme" (353). The author starts off with a 
description of theme and rheme in Systemic Functional Grammar (SFG) as 
laid out in Halliday (1994). However, SFG "has a strong orientation 
towards English which is a potential problem for using it in other 
languages" (354). One problem arises out of the V2 requirement in Swedish, 
since this leads to a different distribution of clause elements with an 
initial non-subject, as example (20) illustrates (EO = English Original; 
ST = Swedish Translation; LIT = Literal Translation) (361, adapted): 
(20) (a) EO: Surely I'd been freed from those painful memories long ago.
(b) ST: Visst hade jag för länge sedan blivit befriad från de där plågsamma minnena.
LIT: Surely had I for long ago become freed from those painful memories. 

In (20a) 'surely' and 'I' make up the theme. In the Swedish translation, 
due to the V2 constraint, the two thematic components are separated by the 
auxiliary verb. The question that arises is where to locate the 
theme-rheme transition point. Fredriksson suggests a split theme, which "(in a 
declarative clause) can be defined as including all elements preceding the 
finite verb plus the postverbal subject" (365). Thus, the thematic 
elements 'surely' and 'I' of the English original can also be treated as 
thematic in the Swedish translation. Furthermore, the author questions 
Halliday's notion of 'topical theme'. In his approach, the thematic part 
of the clause contains one and only one experiential element, the topical 
theme, so "everything that follows the topical theme constitutes the 
rheme" (356). However, Fredriksson allows for several experiential 
elements in the theme. Accordingly, "[t]he concept 'topical theme' has no 
function in [... her] approach" (366). This modified understanding of the 
concept 'theme', in her view, is equally applicable to English and to 
Swedish data.

In their paper "Welcoming children, pets and guests" Elena Tognini Bonelli 
and Elena Manca search for translationally equivalent units in two 
comparable corpora, namely Italian texts that advertise 'Agriturismo' and 
English material that promotes 'Farmhouse Holidays'. The English corpus 
indicates that the notion of 'welcome' is central to the whole genre: a 
total of 324 instances of this word are attested in the data. 
Surprisingly, the 'prima facie' Italian equivalent 'benvenuto' and its 
related forms occur only 4 times in the Italian corpus. Translation 
equivalence, therefore, does not seem to be located at the word level. 
Rather, translation should always consider the context in which a 
particular word occurs. The authors therefore suggest a three-stage model 
of successive contextualisation for identifying translationally equivalent 
units. First, a collocational profile of the word to be translated should 
be established. For the word 'welcome' the corpus yields as 
collocates 'children', 'pets'/'dogs' and 'visitors'/'guests'. In a second 
step, the translator should try to find 'prima facie' translational 
equivalents for the respective collocates. In the current example these 
would be 'bambini', 'animali' and 'ospiti'. The final step would then try 
to identify collocates of these equivalents in L2. For instance, to find a 
suitable translation for 'welcome' in the context of 'guests' 
or 'visitors', the translator should compare the concordances of 'welcome' 
+ 'guests'/'visitors' with the concordance of 'ospiti'. In the English 
corpus, the nouns at issue are found to occur regularly in the 
structure 'Vb BE + 'welcome' + 'to'-infinitive' ('guests are welcome to 
relax'). The concordance of 'ospiti', on the other hand, shows that the 
Italian equivalent to this structure is the Italian modal 'potere' and its 
inflected forms, as in 'gli ospiti potranno fruire'. Obviously then, 
translation equivalents are often not found at the word level. Rather, 
translation should aim at "identifying and comparing syntagmatic units 
that share certain contextual feature with the view of identifying a 
similar function" (383).
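
The first stage of this model, the collocational profile, can be sketched 
in a few lines of Python. This is only an illustration under stated 
assumptions: the function name, the window size and the whitespace 
tokenisation are hypothetical, and the sample sentences are invented 
around the collocates mentioned above rather than taken from the authors' 
corpora (only 'guests are welcome to relax' is quoted in the paper).

```python
from collections import Counter


def collocates(sentences, node, window=3):
    """Count the words occurring within `window` tokens of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


# Invented sample lines, for illustration only.
SAMPLE = [
    "guests are welcome to relax",
    "children are welcome",
    "pets are welcome",
]

PROFILE = collocates(SAMPLE, "welcome")
print(PROFILE.most_common())
```

Running the same function over the L2 corpus with the 'prima facie' 
equivalents of the collocates as node words would correspond to the third 
stage.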

In the last article of section 5, Natalie Kübler reports on her experience 
with "using WebCorp in the classroom for building specialized 
dictionaries". As the title already indicates, Kübler followed pedagogical 
objectives that are different from language teaching, namely "teaching 
students how to extract lexical and syntactic information to build 
customised dictionaries for machine translation (MT) in languages for 
specific purposes" (387). The particular register envisaged in this 
experiment was computer science, more specifically, the most recent user 
manuals of the operating system Linux (HOWTOs). In this particular field 
of computer science, new terms are coined almost regularly. Therefore, 
existing parallel corpora of HOWTOs, although providing useful information 
for translation of the more recent HOWTOs, "tend to become insufficient or 
slightly obsolete, even though they can be regularly updated" (395). The 
web, on the other hand, will contain most of the neologisms in this field. 
Accordingly, accessing the internet via WebCorp may be a useful way of 
balancing the shortcomings of finite corpora. The term 'buffer', for 
instance, occurs as part of five different compounds in the parallel 
corpus of English and French HOWTOs. However, terms that were coined after 
the translation of the HOWTOs will not be included. Here WebCorp can help 
to supplement findings from finite corpora, since French computer 
scientists often use English terms together with their French 
translations: the search for 'buffer' in the French domain (.fr) yields 
two more recent compounds together with the appropriate French 
translations, namely 'buffer overflow' and 'heap buffer overflow'. 
Accordingly, Kübler concludes that "WebCorp [...] is ideal for 
complementing and updating the information extracted from time-bound 
specialised finite corpora" (398).

The final section, 'Software development', consists of an article by 
Antoinette Renouf, Andrew Kehoe and David Mezquiriz, who discuss "some 
issues in extracting linguistic information from the web". The article 
provides insights into the WebCorp project, which was launched at the 
University of Liverpool at the end of 2000 in order to investigate "the 
usability of the Web as a linguistic resource, and [... to identify and 
address] some of the problems of retrieval and analysis that it presents" 
(404). In particular, the authors describe issues that are pertinent in 
regard to the WebCorp tool, which allows the internet to be used as a 
corpus. Issues discussed include the fact that search engines are 
constantly changing, thereby reducing the comparability of results: "corpus linguists 
[...] each access different pages, and different pages at each time. Thus 
the linguistic sample is not constant" (409). Furthermore, Web text may 
not easily be transformed into a format that meets linguistic data 
requirements. In this context, the authors mention the problem of 
providing sentence-length concordances: since Web text is untagged 
only "few clues exist at surface level as to sentence boundary" (410). The 
automatic retrieval of sentences therefore poses considerable problems. 
Nevertheless, WebCorp provides a number of useful ways to exploit the web 
linguistically. For instance, searches with wildcards serve to search the 
web for phrases. More elaborate searches may be used to discover new or 
unconventional forms: the string '[he|she|I] text* [him|her|me]', for 
example, "reveals that 'text' not only functions as a verb but as an 
uninflected past tense verb" (413), as in (21) below:
(21) The next time I text him, he didn't reply (413)

In addition, web information can be exploited by the WebCorp tool to 
refine searches. This, for example, 
includes the specification of text types or genre via the Open Directory 
or Yahoo, or a limitation to certain domains, such as '.net' or '.ac.uk'. 
Domains may also be combined by Boolean operators. The next steps that the 
authors sketch out lead one to hope that eventually the WebCorp tool will 
turn out to be a highly useful means that opens up the web for corpus linguistic 
research.
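
The kind of pattern search described here can be approximated offline with 
a regular expression. The sketch below is not WebCorp's query syntax or 
implementation; it merely assumes that the bracketed alternatives behave 
like alternation and that 'text*' matches any word form beginning 
with 'text'. The function name is hypothetical, and the test sentence is 
the one quoted from the paper.

```python
import re

# Assumed regex rendering of the string '[he|she|I] text* [him|her|me]'.
PATTERN = re.compile(r"\b(he|she|I)\s+(text\w*)\s+(him|her|me)\b")


def find_hits(text):
    """Return every matching pronoun + 'text*' + pronoun sequence."""
    return [match.group(0) for match in PATTERN.finditer(text)]


HITS = find_hits("The next time I text him, he didn't reply")
print(HITS)
```

The match is case-sensitive, so a sentence-initial 'He' or 'She' would be 
missed unless the pattern is compiled with re.IGNORECASE.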

CRITICAL EVALUATION

Karin Aijmer and Bengt Altenberg have edited an excellent selection of 
papers. The articles (apart from two or three exceptions maybe) are of a 
very high quality and highly stimulating and show impressively the 
relevance of corpus linguistic research to linguistics in general. 
Furthermore, the diversity of topics covered will make this volume an 
interesting read for linguists of almost any area: from functionalists to 
cognitive linguists, from synchrony to diachrony, from syntacticians to 
text linguists and even translators.

Also, the variety of corpora analysed by the contributors shows the wealth 
of material which corpus linguistics nowadays has at its disposal: in 
addition to the use of standard monolingual and parallel corpora, some 
contributors quite convincingly show how smaller special purpose corpora 
can be exploited: the HOWTOs corpus used by Kübler and the 'agriturismo' 
and 'farmhouse holidays' corpora by Tognini Bonelli and Manca are just two 
examples. In this context, mention must also be made of attempts to open 
up the worldwide web as a possible source of data; its relevance for 
future corpus linguistics, in my view, can hardly be overestimated. On the 
whole, this large variety of data reported on in this volume leaves no 
doubt as to the flexibility of corpus linguistics approaches in regard to 
data-mining.

A further point concerns the relationship between data and theory and the 
role of corpus linguistics, which "have been debated ever since the rise 
of corpus linguistics" (2). This debate has also found its way into the 
present volume. A number of extremely important issues are discussed by 
renowned linguists such as Michael Halliday, John Sinclair, and Geoffrey 
Leech. The mere fact that aspects like the role of intuition in corpus 
linguistics or the relation of corpus-based and corpus-driven approaches 
are still debated clearly shows the strong dedication of corpus linguists 
to theoretical and fundamental aspects of their approach. This is also 
mirrored in a number of papers that advance far beyond the word-crunching 
and case-studying that corpus linguistics often (and not always 
unfoundedly) has been accused of: Joybrato Mukherjee with his "from-corpus-
to cognition-approach" (85), for instance, impressively shows how corpus 
data can refine cognitive models and thus lead to a more appropriate 
description of the speaker's linguistic knowledge. Michael Hoey, through 
his concept of 'textual colligation', establishes a "theoretical 
relationship between lexis and text-linguistics" (171). Anna-Lena 
Fredriksson uses contrastive corpus data to refine the theoretical notion 
of 'theme'. Even if theoretical aspects are not an explicit focus, the 
papers usually give convincing (theoretical) explanations for their 
findings and, where appropriate, discuss implications for the model of the 
speaker's competence or for the abstract language system.

Nonetheless, critical remarks should be made on two individual 
contributions. The first concerns Tan, Ooi and Chiang's conclusion on the 
use of augmenters in personal advertisements (PA) as opposed to spoken 
(SP) or written (WR) texts. I find it difficult to agree with the authors 
that "PA tends towards SP norms -- but not quite reaching them, in most 
cases" (161). Even if the rare cases 'incredibly' and 'ever' are not taken 
into consideration, we find that only two of the remaining five types, 
namely 'really' and 'too', show similar normalised frequencies in PA and 
SP. In contrast, the normalised frequency of 'very' in PA (29.7) just lies 
between that of 'very' in WR (9.7) and SP (50.1). In addition, the 
frequency of 'a lot' in PA (5.1) is more similar to that in WR (0.6) than 
to that in SP (15.2), and 'lah' is highly frequent in SP (77.2) but 
extremely rare in both PA (0.2) and WR (0.0). Admittedly, the authors 
concede that "the situation is not always that clear-cut" (162). However, 
on the basis of the data presented I would rather claim that the situation is 
not at all clear-cut and that the use of augmenters in PA more strongly 
resembles their use in WR than in SP. Another remark concerns the article 
by Clive Souter: he wants to convince the reader that POW "is worth 
exploring, particularly if you are interested in learning and teaching 
language" (288). At the same time, however, he repeatedly stresses the 
shortcomings of the corpus and the problems that may arise out of the 
corpus's size and the compilation of the material. So I am not quite 
convinced that "interesting lexical information can be gleaned from this 
corpus for EFL instructors and curriculum designers" (279).
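
The comparison of augmenter frequencies made above can be checked with a 
short Python sketch. The figures are the normalised frequencies quoted in 
this review; the use of the absolute difference as a measure of similarity 
is an assumption made here for illustration and not a method taken from 
Tan, Ooi and Chiang.

```python
# Normalised frequencies as quoted in the review (PA, WR, SP).
FREQS = {
    "very": {"PA": 29.7, "WR": 9.7, "SP": 50.1},
    "a lot": {"PA": 5.1, "WR": 0.6, "SP": 15.2},
    "lah": {"PA": 0.2, "WR": 0.0, "SP": 77.2},
}


def closer_register(row):
    """Return the register (WR or SP) whose frequency is nearer to PA."""
    distance_wr = abs(row["PA"] - row["WR"])
    distance_sp = abs(row["PA"] - row["SP"])
    return "WR" if distance_wr < distance_sp else "SP"


CLOSER = {word: closer_register(row) for word, row in FREQS.items()}
for word, register in CLOSER.items():
    print(word, "->", register)
```

On this measure all three items come out nearer to WR, although for 'very' 
the two differences (20.0 and 20.4) are almost equal.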

The proofreading has been good; the number of typos and inconsistencies in 
layout (I found around 15 cases) is within reasonable limits for a book of 
over 400 pages.

On the whole, the volume makes for a highly stimulating and interesting 
read and gives a good insight into current issues and aspects of corpus 
linguistics, showing the vitality and the diversity of the field. Linguists 
from many different branches of linguistics will no doubt profit from the 
papers.

REFERENCES

Johansson, M. (2002): Clefts in English and Swedish: A Contrastive Study 
of IT-clefts and WH-clefts in original texts and translations. PhD 
dissertation, Lund University.

Chomsky, N. (1964): "Current issues in linguistic theory", The Structure 
of Language, ed. by J. A. Fodor & J. J. Katz. Englewood Cliffs, New 
Jersey, 50-118.

Langacker, R. W. (1999): Grammar and conceptualization. Berlin: Mouton de 
Gruyter.

Halliday, M. A. K. (1989): Spoken and Written Language. Oxford: Oxford 
University Press.

Halliday, M. A. K. (1994): An Introduction to Functional Grammar, 2nd ed. 
London: Edward Arnold. 

ABOUT THE REVIEWER

Rolf Kreyer is an Assistant Professor of Modern English Linguistics at the 
English Department of the University of Bonn/Germany. He holds a degree in 
English and Mathematics and has recently finished his PhD thesis, a corpus-
based analysis of inverted constructions in modern written English. His 
research interests include syntax, text linguistics, corpus linguistics 
and theoretical linguistics.
