17.3579, Review: Computational Linguistics: Kehoe; Renouf (2005)

Mon Dec 4 06:42:20 UTC 2006

LINGUIST List: Vol-17-3579. Mon Dec 04 2006. ISSN: 1068 - 4875.

Subject: 17.3579, Review: Computational Linguistics: Kehoe; Renouf (2005)

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews: Laura Welcher, Rosetta Project / Long Now Foundation  
         <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Laura Welcher <laura at linguistlist.org>
================================================================  

This LINGUIST List issue is a review of a book published by one of our
supporting publishers, commissioned by our book review editorial staff. We
welcome discussion of this book review on the list, and particularly invite
the author(s) or editor(s) of this book to join in. To start a discussion of
this book, you can use the Discussion form on the LINGUIST List website. For
the subject of the discussion, specify "Book Review" and the issue number of
this review. If you are interested in reviewing a book for LINGUIST, look for
the most recent posting with the subject "Reviews: AVAILABLE FOR REVIEW", and
follow the instructions at the top of the message. You can also contact the
book review staff directly.

===========================Directory==============================  

1)
Date: 04-Dec-2006
From: Kalyanamalini Sahoo < kalyanamalini at yahoo.com >
Subject: The Changing Face of Corpus Linguistics 

-------------------------Message 1 ---------------------------------- 
Date: Mon, 04 Dec 2006 01:38:00
From: Kalyanamalini Sahoo < kalyanamalini at yahoo.com >
Subject: The Changing Face of Corpus Linguistics 

Announced at http://linguistlist.org/issues/16/16-3514.html 

EDITORS: Renouf, Antoinette; Kehoe, Andrew
TITLE: The Changing Face of Corpus Linguistics
SERIES: Language and Computers Vol. 55
PUBLISHER: Rodopi
YEAR: 2006

Kalyanamalini Sahoo, Zi Corporation, Calgary

The book under review is the edited proceedings of the 24th International
Computer Archive of Modern and Mediaeval English (ICAME) conference, held
in Guernsey in May 2003.  It contains a brief introduction by the editors
followed by 22 contributions from different authors. Each article has an
abstract, endnotes and bibliographic references.  The editors have
organized the articles thematically into 6 sections:  corpus creation,
diachronic corpus study, synchronic corpus study, the web as a corpus,
corpus linguistics and grammatical theory, and a grammar discussion panel.

SUMMARY

The book opens with Sue Blackwell's ''The corpus-user's chorus'', which
praises the vitality of corpus users in carrying out corpus-based
research efficiently. This is followed by an introductory chapter in which
the editors, Antoinette Renouf and Andrew Kehoe, lay out the key concepts of
each chapter and outline the scope of the book. The contributions that
follow reflect a fruitful period in the evolution of the field.

Section 1, 'Corpus Creation', starts with Stefan Dollinger's 'Oh Canada!
Towards the Corpus of Early Ontario English', in which Dollinger introduces
the Corpus of Early Ontario English (CONTE), the first electronic corpus of
a variety of early Canadian English. He considers Ontarian English texts,
focusing on the selection of authors and texts, which plays a major
role in corpus compilation.  He exemplifies three genres of the corpus
(diaries, letters and newspaper texts) from the period 1776 to 1899, and
also addresses the problem of transcribing Late Modern English handwriting.

This is followed by Clemens Fritz's 'Favoring Americanisms? <ou> vs. <o>
before <l> and <r> in Early English in Australia: A corpus-based approach'.
Like Dollinger, Fritz deals with classic theoretical dilemmas for
the diachronic corpus linguist: at what point in its history is a language
variety to be regarded as representative or fully-formed?  What is the
crucial selectional criterion for corpus compilation: the language of the
texts themselves, or the geographical circumstances of the settlers?  Fritz
works with the Corpus of Oz Early English (COOEE), which contains about two
million words. The corpus is structured on chronological lines and takes
into account various registers and text types including court minutes,
parliamentary proceedings, private letters and diaries, reports, memoirs,
narratives, legal texts and petitions. One characteristic spelling
difference between American English and British English is found in <ou>
vs. <o> in words of the hono(u)r type. Australian English lies in between
the standards followed by the two other varieties. The author shows that
this is not due to an increasing influence of American English on
Australian English, but is the result of the historical development from
'English in Australia' to 'Australian English'. He suggests that the
education and the origin of the author, as well as the semantics of a
particular word and the period when it was written, all play a significant
role in determining the choice between -or and -our. 
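
The kind of count that underlies such a comparison can be sketched in a few
lines of Python. This is only an illustration, not Fritz's actual procedure:
the stem list, the suffix pattern and the sample sentence are all assumptions
invented for the example.

```python
import re
from collections import Counter

# Hypothetical stems of the hono(u)r type; not taken from Fritz's study.
STEMS = ("hono", "colo", "favo", "labo", "neighbo")

# Allow a few common inflections, then require a word boundary so that
# words such as "laboratory" or "honorary" are not counted.
SUFFIX = r"(?:s|ed|ing|able)?\b"


def count_our_or(text):
    """Count -our and -or spellings of the stems listed above."""
    counts = Counter()
    for stem in STEMS:
        our_pattern = rf"\b{stem}ur{SUFFIX}"
        or_pattern = rf"\b{stem}r{SUFFIX}"
        counts["our"] += len(re.findall(our_pattern, text, flags=re.IGNORECASE))
        counts["or"] += len(re.findall(or_pattern, text, flags=re.IGNORECASE))
    return counts


# Invented sample sentence, used only to exercise the function.
sample = ("He did me the honour of a visit; his favor and honor are "
          "well known, and the labour was great.")
counts = count_our_or(sample)
share_our = counts["our"] / (counts["our"] + counts["or"])
print(counts["our"], counts["or"], share_our)
```

Run separately on texts grouped by period, register or author origin, the
share of -our spellings would give one figure per group to compare.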

The next article is by Ian Lancashire, who discusses the Lexicons
of Early Modern English (LEME), a compendium of lexicographic and
bibliographical material that builds on the unique information
provided by his EMEDD (Early Modern English Dictionaries Database).  LEME
documents what speakers of English thought about their language over the
lifetimes of authors like Sir Thomas More, William Shakespeare, John
Milton, and John Dryden, covering the period served by the Short-Title and
Wing catalogues, from the advent of printing to the early eighteenth
century. It lists word-entries alphabetically by lemmatized headword, and
then chronologically by lexicon date.  The author shows how LEME serves
as a source of 'contemporary comments', quotations potentially useful in
illustrating word usage.

Introducing the HEDGEHOG database of 18th and 19th century EFL pedagogical
and reference works, Manfred Markus discusses 'EFL dictionaries, grammars
and language guides from 1700 to 1850: testing a new corpus on points of
spoken-ness'.  He examines the corpus for features of spoken-ness,
analyzing typically spoken types of sound and syllable reduction,
morphemic and lexical colloquialisms, as well as syntactic, semantic,
pragmatic and idiomatic features of spoken English.

Antonio Miranda García, Javier Calle Martín, David Moreno Olalla and Gustavo
Muñoz González conclude the section with a report on their electronic
database of the Old English work ''Apollonius of Tyre'', with reference to
the performance of a newly-developed concordancing software tool.  They
present the Old English Concordancer (OEC), a new tool for processing an
annotated corpus of Old English, which goes beyond the prototypical
operations of similar programmes (lists, indexes, concordances, statistical
information, queries, etc.).  OEC retrieves general and specific
morpho-syntactic information from an annotated OE corpus.  It allows
lemma-based studies as well as some simple syntactic research at sentence
level, answers morphological queries and generates statistical information
including absolute and relative values of items, the distribution of words,
lemmas, class and/or accidence [inflection], vocabulary profiles, etc.
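
A lemma-based concordance of the general kind described here can be sketched
as below. This is a minimal Python illustration and makes no claim about how
OEC itself is implemented: the (form, lemma) annotation format, the function
name `lemma_kwic` and the Modern English sample are all assumptions.

```python
def lemma_kwic(tagged, lemma, width=2):
    """Return (left, node, right) contexts for every token with the lemma.

    `tagged` is a list of (form, lemma) pairs, a hypothetical annotation
    format chosen for this sketch.
    """
    forms = [form for form, _ in tagged]
    hits = []
    for i, (form, token_lemma) in enumerate(tagged):
        if token_lemma == lemma:
            left = " ".join(forms[max(0, i - width):i])
            right = " ".join(forms[i + 1:i + 1 + width])
            hits.append((left, form, right))
    return hits


# Invented sample: two different word forms share the lemma "go".
tagged = [
    ("she", "she"), ("went", "go"), ("home", "home"), ("and", "and"),
    ("he", "he"), ("goes", "go"), ("away", "away"),
]

hits = lemma_kwic(tagged, "go")
for left, node, right in hits:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12}  {node}  {right}")

# Absolute and relative frequency of the lemma in the sample.
absolute = len(hits)
relative = absolute / len(tagged)
```

Searching by lemma rather than by surface form is what lets a single query
retrieve both 'went' and 'goes' in this sample.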

Section 2, 'Diachronic Corpus Study', starts with Maurizio Gotti's study of
the semantic and functional evolution of the verbs SHALL and WILL from 1350
to the present day. The paper analyses the evolution of the use of SHALL and
WILL for the expression of the predictive function, using data drawn from
both diachronic and synchronic corpora.

Anneli Meurman-Solin & Päivi Pahta's article 'Circumstantial adverbials in
discourse: a synchronic and a diachronic perspective' presents a study of
adverbials with the grammaticalised connectives 'seeing' and 'considering',
appearing in corpora from 1550.  Drawing on electronic corpora ranging
from historical ones, such as the Helsinki Corpus of Older Scots (HCOS), the
Corpus of Scottish Correspondence (CSC) and the Corpus of Early English
Medical Writing (CEEM), to present-day English ones, such as the British
National Corpus (BNC) and the International Corpus of English - Great
Britain (ICE-GB), the authors distinguish 'circumstance' from other semantic
roles of contingency. They demonstrate how, chiefly because of their
thematic potential, circumstantial adverbials can be used in specific
functions in genres as different from one another as 'letter' and 'medical
treatises'.

Building on Leech's 1966 categorisation of formal features, Caren auf dem
Keller discusses 'Changes in textual structures of book advertisements in
the ZEN corpus'. She reviews the changes in the textual structures of book
advertisements in early modern English newspapers covering the period from
1671 to 1791, and provides a detailed overview of the textual components and
graphic markers used in the eighteenth century.

Next comes Marianne Hundt's paper ''Curtains like these are selling right in
the city of Chicago for $1.50'' - The mediopassive in American 20th-century
advertising language.  Studying the mediopassives in a corpus of late
nineteenth and twentieth-century American mail order catalogues, Hundt
notes an increase in use, which contradicts a claim by Leech (1966).

Geoffrey Leech & Nicholas Smith discuss grammatical changes in American and
British written English in the Brown Corpus (AmE, 1961), the LOB Corpus
(BrE, 1961), the Frown Corpus (AmE, 1992) and the FLOB Corpus (BrE, 1991).
The authors use the POS-tagged versions of these corpora to track frequency
changes in grammatical usage in written English between 1961 and 1991/2 and
to compare the changes in American and British English.  They note a
significant increase in the use of semi-modals, the present progressive,
that-relativization, proper nouns, s-genitives, verbs and negative
contractions and, on the other hand, a significant decrease in the use of
core modals, the passive voice, wh-relativization, and of-genitives.  They
discuss these changes in terms of general diachronic processes such as
colloquialization and Americanization. They also note that the changes in
AmE are more extreme than those in BrE.
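
Comparing frequencies across corpora of this kind usually involves
normalizing the raw counts and testing the difference. The Python sketch
below shows one common way to do so, under stated assumptions: the
log-likelihood statistic is a widely used choice in corpus linguistics but
is not confirmed here as the authors' own test, and the counts and corpus
sizes are made-up figures, not results from the paper.

```python
import math


def per_million(count, corpus_size):
    """Normalize a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size


def percent_change(old, new):
    """Percentage change from an earlier to a later frequency."""
    return (new - old) / old * 100


def log_likelihood(a, b, c, d):
    """Log-likelihood for counts a and b in corpora of sizes c and d."""
    # Expected counts if the item were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


# Invented counts for one grammatical category in an earlier and a later
# corpus of one million words each.
count_1961, count_1991 = 500, 400
size_1961, size_1991 = 1_000_000, 1_000_000

freq_1961 = per_million(count_1961, size_1961)
freq_1991 = per_million(count_1991, size_1991)
change = percent_change(freq_1961, freq_1991)
ll = log_likelihood(count_1961, count_1991, size_1961, size_1991)
print(freq_1961, freq_1991, change, ll)
```

A larger log-likelihood value means the difference between the two corpora
is less likely to be due to chance. The percentage change gives the
direction and size of the shift.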

Section 3 contains a fairly representative spread of synchronic studies of
present-day English.

Mats Deutschmann explores sociolinguistic variation in the act of
apologizing in the spoken part of the British National Corpus (BNC).  He
investigates 'apology formulae', as exemplified by the lexemes 'afraid',
'apologise', 'apology', 'excuse', 'forgive', 'pardon', 'regret' and
'sorry'.  Analyzing more than 3,000 examples of apology forms, he notes
that in the BNC, young and middle-class speakers favour the use of the
apology form, although only minor gender differences in apologizing are
apparent.  He shows that formulaic politeness is an important linguistic
marker of social class and that corpus linguistic methodology can
successfully be used in socio-pragmatic research.

Göran Kjellmer takes a metalinguistic stance on the problem of semantic and
referential ambiguity of certain lexemes in the modern-day English of the
Cobuild Direct Corpus. Discussing 'How recent is recent? On overcoming
interpretational difficulties', he shows that the words 'recent' and
'recently' are ambiguous between the meanings of 'not long before the
present time' and 'not long before the time of the event described'. He
illustrates how to resolve the ambiguity and claims that the disambiguation
phenomenon sheds some light on the process of textual interpretation and
comprehension.

Ute Römer's article 'Looking at looking: Functions and contexts of
progressives in spoken English and 'school' English' focuses on the
shortcomings of pedagogical texts in their unnatural representation of
present-day verb usage. Studying the use of progressive forms in large
collections of spoken British English and in a small corpus of
'spoken-type' texts from German EFL textbooks, she investigates the
differences between English as it is used in natural communicative
situations and the type of English pupils are confronted with in a foreign
language teaching context. In view of the discrepancies found between
'real' spoken English and so-called 'school' English, she argues that
if linguists, teachers, and textbook writers aim at achieving a greater
degree of naturalness or authenticity in English language teaching, corpus
evidence must be taken more seriously.

Gabriel Ozón maintains the focus on verbs in his detailed study of
'Ditransitives, the Given Before New principle, and textual retrievability:
a corpus-based study using ICECUP'.  Exploring English double object
constructions, he tries to find out if corpus studies can help track and
confirm the divergences in the use of these constructions.

Anna-Brita Stenström represents contrastive corpus linguistics with her
study of the functions of the Spanish pragmatic marker 'pues' and
its English equivalents 'cos' and 'well'.  Discussing various functions
at the syntactic, discursive and pragmatic levels, she shows that 'well'
corresponds to 'pues' in most of its functions, except on the syntactic
level, where 'cos' is the only equivalent. Like 'pues', 'well' and 'cos'
have been grammaticalized, but 'cos' less so than 'well', which partly
explains its fairly restricted use.

Section 4 reflects a recent change in the definition of 'corpus' with the
emergence of the World Wide Web. The potential of web-based text is
increasingly recognized, as it offers rare, obsolescent and brand-new
language use not found in existing corpora, and several corpus linguists
are engaged in making it a more readily usable source of language data.
Three 'Web and Corpus' initiatives are presented here: WebCorp, developed
by the Research Unit headed by the editors of this book; WebPhraseCount,
developed by Josef Schmied and team; and GlossaNet, developed by Cédrick
Fairon.  The papers in this section focus on tools for extracting and
analysing corpus data.

Barry Morley discusses 'WebCorp: A tool for online linguistic information
retrieval and analysis'. The WebCorp project has demonstrated how the Web
may be used as a large corpus of text for linguistic research.  Morley
presents the improved functionality of WebCorp, such as the ability to
specify the web domain for a search, the production of internal collocates,
alphabetical sorting on left and right context, and concordance filtering.

In 'Diachronic linguistic analysis on the web with WebCorp', Andrew Kehoe
also reports on WebCorp, and on the heuristics that he has developed to
overcome the obstacle to diachronic study of web text caused by the absence
of reliable date-marking. He discusses the dating mechanisms available on
the Web and the date query facilities offered by standard web search
engines, assessing their usefulness for linguistic analysis and describing
how the WebCorp system has been adapted to support diachronic analysis.

In 'New ways of analysing ESL on the WWW with WebCorp and WebPhraseCount',
Josef Schmied discusses how software tools can be developed to interface
with search engines and help linguists to make use of the world-wide web in
their work.  He demonstrates the potential of WebPhraseCount, a tool
devised to measure the relative frequency of individual aspects of language
use across the English language varieties on the web.  He shows how tools
like WebCorp and WebPhraseCount can be used by advanced language learners
as well as linguists interested in variation in English world-wide. 

In ''I'm like, ''Hey, it works!'': Using GlossaNet to find attestations of the
quotative (be) 'like' in English-language newspapers'', Cédrick Fairon &
John V. Singler discuss another automatic web text retrieval and analysis
system, GlossaNet, which downloads text from certain newspaper web sites
and executes complex linguistic queries on it. They discuss how GlossaNet
monitors newspapers, analysing the texts using the programs and linguistic
resources of a corpus parser.

The papers in Section 5 'Corpus Linguistics and Grammatical Theory' raise
some of the theoretical concerns which attest to the maturity of the field,
emerging in the light of extensive empirical observation and experience. 
In 'Corpus linguistics and English reference grammars', Joybrato Mukherjee
reviews some major English reference grammars: the new Cambridge
Grammar of the English Language (CamGr), the Comprehensive Grammar of the
English Language (CGEL), and the Longman Grammar of Spoken and Written
English (LGSWE). He discusses major conceptual and methodological
differences between these grammars and asks how far these need to be
informed by corpus data. 

He argues that the combination of CGEL and LGSWE provides a first important
step towards genuinely corpus-based reference grammars in that a
theoretically eclectic descriptive apparatus of English grammar is
complemented by qualitative and quantitative insights from corpus data. He
emphasizes that future corpus-based grammars need to be optimized with
regard to the transparency of corpus design and corpus analysis and the
balance between general and genre-specific language data. 

Christian Mair discusses 'Tracking ongoing grammatical change and recent
diversification in present-day standard English: the complementary role of
small and large corpora'.  He stresses the need for closer cooperation
between the two traditions in corpus linguistics: (1) a ''small-and-tidy''
approach which emphasizes detailed philological analysis of clean corpora,
and (2) a ''big and messy'' one which stresses the advantages to be gained
from the computer-assisted analysis of vast quantities of dirty data.
Taking the example of the get-passive, he argues that there are aspects of
this well-studied and fairly common construction which cannot be
investigated even in a very large closed corpus such as the BNC, although
good results can easily be obtained from the World Wide Web.  He emphasizes
that in spite of its obvious shortcomings as a corpus, the Web is an
indispensable source of data for the study of infrequent and recent
linguistic phenomena.

In the article 'but it will take time... points of view on a lexical grammar
of English', Michaela Mahlberg takes time phrases to demonstrate how a
'lexical' grammar can reveal more about the semantics of language in use
than a more surface-structural pattern grammar such as that of Hunston and
Francis (2000).

The volume is rounded off in section 6 by Jan Aarts's 'Corpus linguistics,
grammar and theory: Report on a panel discussion at the 24th ICAME
conference', where the main focus is on the impact of corpora on English
reference grammars. The panelists address the characteristics of a
reference grammar and the corpus-linguistic methodology appropriate for the
writing of such a grammar as well as for corpus-linguistic research in
general.  This chapter is a fitting conclusion to a volume that
provides a very perceptive overview of the field of corpus linguistics and
grammatical theory.

EVALUATION

This edited volume of papers in corpus linguistics deals primarily with
corpora of English. The book has been compiled with a lot of thought.  It
covers a great deal of ground in over 400 pages: a wide range of topics,
from corpus creation to corpus analysis, evaluation and the use of the
World Wide Web as a corpus; several fields, including EFL, ESL, contrastive
studies, grammatical theory, lexicography, semantics and socio-pragmatics;
and various tools, such as WebCorp, WebPhraseCount, GlossaNet and
concordancing software. The volume shows just how diverse and complex
corpus-based research can be. The richness of the book can be credited not
only to the editors' vast experience and knowledge in selecting and
arranging chapters by theme and style, but also to the authors'
presentation of developments in linguistic research.  In particular, the
inclusion of the report on the panel discussion is very useful in bringing
to light the topics currently in focus in the corpus-linguistics research
community.

However, there are certain shortcomings as well.  Although information can
be retrieved through various tools like WebCorp, WebPhraseCount, OEC etc.,
the volume does not discuss dialectal corpora, whose variation in spelling
could pose a challenge to these techniques and tools. Frequency profiling,
concordancing, n-grams and keyword methods all suffer from problems of
unreliability when applied to dialectal corpora.  Secondly, the volume does
not discuss how corpora can be used by language learners themselves,
although Schmied touches on the issue lightly and demonstrates how advanced
language learners can make use of tools like WebCorp and WebPhraseCount.

Of course, the editors have rightly justified the title of the book, 'The
Changing Face of Corpus Linguistics', by acknowledging the recent change in
the definition of 'corpus' accompanying the availability of texts on the
World Wide Web, and by emphasizing the maturing of the field from corpus
building to corpus analysis and evaluation. But making use of the text
available on the World Wide Web is not that simple. Although the use of the
web as a corpus is becoming more and more common these days, it raises the
question of how such large amounts of data can be cleaned, encoded,
annotated, stored, and shared.  In particular, clearance of copyright for
web data, as well as for other corpus data, is a vital issue.

Overall, the book is an extremely valuable resource not only for
professional corpus linguists but also for beginners who want to understand
the wider field of corpus linguistics, including the historical
developments it has undergone.  A plus point of the book is the inclusion
of many useful figures, tables and URLs that serve to capture the research
findings in a concrete manner for the reader.  The volume is concerned with
issues relevant to linguists using corpora to carry out purely linguistic
studies, and does not move far into the allied discipline of natural
language processing (NLP).

REFERENCES

Hunston, S. and G. Francis (2000). 'Pattern Grammar. A corpus-driven
approach to the lexical grammar of English'. Amsterdam: Benjamins.

Leech, G.N. (1966). 'English in Advertising'. London: Longman. 

ABOUT THE REVIEWER

Kalyanamalini Sahoo works on computational morphology and South Asian
languages for the Zi Corporation, Calgary, Canada. She is primarily
interested in computational morphology and syntax.
