15.681, Review: Text/Corpus Linguistics: Hickey (2003)

LINGUIST List linguist at linguistlist.org
Tue Feb 24 19:10:45 UTC 2004


LINGUIST List:  Vol-15-681. Tue Feb 24 2004. ISSN: 1068-4875.

Subject: 15.681, Review: Text/Corpus Linguistics: Hickey (2003)

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Ogasawara <naomi at linguistlist.org>
 ==========================================================================
What follows is a review or discussion note contributed to our Book
Discussion Forum.  We expect discussions to be informal and
interactive; and the author of the book discussed is cordially invited
to join in.

If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for review." Then contact
Sheila Dooley Collberg at collberg at linguistlist.org.

=================================Directory=================================

1)
Date:  Sat, 31 Jan 2004 09:20:41 +0100
From:  Stefan Th. Gries <STGries at sitkom.sdu.dk>
Subject:  Corpus Presenter

-------------------------------- Message 1 -------------------------------

AUTHOR:    Hickey, Raymond
TITLE:     Corpus Presenter
SUBTITLE:  Software for language analysis [...]
PUBLISHER: John Benjamins
YEAR:      2003

Announced at http://linguistlist.org/issues/14/14-2813.html


Stefan Th. Gries, University of Southern Denmark

NOTES:
An easier-to-read PDF file with this review which also offers
screenshots to exemplify some points can be found at:
http://people.freenet.de/Stefan_Th_Gries/research/CP_review.pdf

Italics are indicated here by underscores before and after a word.

DESCRIPTION/SUMMARY OF THE BUNDLE

1) The book
Part I (p. 1-27) is a brief introduction to corpus linguistics. This
part provides an overview of some corpus-linguistic terminology
(types of corpora, tagging, corpus headers, etc.) and a brief section
entitled 'examining corpora', which introduces a few basic notions
such as _concordance_, _tagging_ and _lexical cluster analysis_ (each
with about two paragraphs). P. 14 to 27 discuss very briefly a few
sample analyses of corpus data (mainly Irish English data); these
involve frequency data for the presence or absence of particular
linguistic forms, relative frequencies of collocations and
relative frequency data in an investigation of an author's style.

Part II of the book (p. 28-183) consists of descriptions of the modules
or programs of the Corpus Presenter suite (henceforth CP). Much of
this part corresponds to the help files of the software bundle.

Part III of the book consists of several appendices.  Appendix 1 and 2
of the book (p. 184-9) provide information about the installation;
while Appendix 3 (p. 190-200) lists a set of common commands,
i.e. commands which are found in several parts of CP. Appendix 4
(p. 201-4) describes the file interface of CP, most of which will be
familiar to most users of Windows. Appendix 5 (p. 205-7) gives some
troubleshooting information and Appendix 6 (p. 208-9) introduces three
additional dataset files describing three corpora from the ICAME
CD-ROM. Finally, this part contains a glossary of short definitions of
corpus-linguistic and a few statistical terms.

Part IV of the book (p. 237-76) is a description of A Corpus of Irish
English, which is followed by a general bibliography, a glossary for
this corpus and a combined subject/name index.

2) The software
CP offers multiple functions for compiling, annotating and
processing corpora.

- CP can search for strings in texts in order to output them
as a concordance:
   -- CP allows for detailed specification of strings and
corpus text settings including those for case-sensitive
searches, sentence and word delimiters, punctuation signs;
   -- in addition, one can look for single expressions,
lists of expressions and larger expressions (by specifying
the left and the right part of a complex expression as well
as their maximal distance);
   -- an especially useful option is the possibility to
include special characters and symbols from every installed
font in the search pattern;
   -- one can specify stop words and output options such as
full sentence or x words to left/right of the search word;
   -- one can specify Cocoa settings to include only files
with particular attributes in the search;

- CP can generate collocation frequencies for words in a span of
max. eight words around a search/node word;

- CP can generate some text statistics (word counts) as well as
(regular or reverse) word lists of individual words and/or clusters of
up to 8 words;

- CP can perform search and replace operations on files to alter texts
(e.g. tagging and normalizing files for lemmatization);

- CP can collate files and compile corpora by collecting and
manipulating files of various sorts, organize the files and their
contents hierarchically and add header information (e.g. Cocoa) for
future searches;

- CP can convert files to different file types, manipulate their
attributes (e.g. date stamps, extensions) and do lots of file handling
operations (cut, copy, paste, duplicate, merge, etc.);

- CP offers a few useful functions that go beyond some limitations of
the Microsoft Windows OS such as a cumulative clipboard or an undo
buffer for deleted text.

The program comes with many different help files, FAQ files and a
brief tutorial with 44 slides and written/read-aloud instructions
(.WAV format).
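
For readers unfamiliar with the two central operations in the list
above, concordancing and collocate counting, the following is a
minimal Python sketch. It is not CP's implementation: the tokenizer is
deliberately naive, and the function names (tokenize, kwic,
collocates) and the sample sentence are assumptions made purely for
illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, width=4):
    """Yield (left context, node, right context) for each hit of `node`."""
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, tok, right


def collocates(tokens, node, span=4, stop_words=frozenset()):
    """Count words occurring within `span` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(w for w in window if w not in stop_words)
    return counts


# Hypothetical sample text, not taken from any of the corpora discussed.
text = "A great deal of work and a good deal of luck made a great deal of difference."
tokens = tokenize(text)

# Concordance with three words of context on either side of the node word.
for left, node, right in kwic(tokens, "deal", width=3):
    print(f"{left:>20}  {node}  {right}")

# Immediate neighbours of the node word, ignoring the stop word "of".
print(collocates(tokens, "deal", span=1, stop_words={"of"}).most_common())
```

The span parameter plays the role of the window around the node word
(CP limits it to eight words); the stop word set corresponds to the
stop word option mentioned above.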

CRITICAL EVALUATION

The CP suite is a set of programs that offers a vast range of
possibilities for working with corpus data. It was mainly tested on a
notebook computer (Pentium III 1000 with a 20GB hard disk and 256MB
RAM) running an English Windows XP Professional; some additional tests
were performed on a desktop computer (Athlon XP 1800+ with a 40GB hard
disk and 640MB RAM) running a German Windows 2000 (both systems are
completely updated in terms of Microsoft Service Packs etc.). The
program was tested by myself alone, but in order to make the
evaluation slightly more objective, I also asked a colleague for her
opinion on some issues. In order to discuss some of CP's properties, I
will make reference to a few concordancing programs, namely WordSmith
Tools 3 (WST), MonoConc Pro 2.2 (MCP) and WinConcord 2.0 (WC).

1) Speed and power of CP

The author (henceforth RH) stressed that "a special fast retrieval
mode has been incorporated into _Corpus Presenter_ to minimize the
time one has to wait for returns to be made during searches"
(<CP_GUIDE.RTF>). However, at least my own experiments do not support
this assessment (with one exception mentioned below, all timing
experiments were performed immediately after a system reboot).

- Concordancing example 1: On the above-mentioned notebook computer,
searching the Brown Corpus for the word form _best_ using the most
powerful 'text retrieval' level lasted an astonishing 35.7 seconds
(and some 35.94 seconds for _after_; CP Main Programme's own output)
even though all settings concerning Unicode were optimized for the
processing of plain ASCII files. By contrast, MCP took about 2 seconds
to load the file and 4 seconds to produce the concordance.

- Concordancing example 2: On the above-mentioned notebook computer,
searching 674 files from the BNC part A (without tags) for the word
form _best_ once took 1030.48 seconds (with several applications open
but unused in the background) and 334.21 with no other applications
running (CP Main Programme's own output) even though all settings
concerning Unicode were again optimized for the processing of plain
ASCII files. By contrast, WST took about 57 seconds to produce the
desired concordance ...

- Concordancing example 3: On the above-mentioned notebook computer,
simply finding out that my British National Corpus (BNC) directory
contains 4,054 .TXT files required 47 seconds (CP Flash) - both MCP
and WST need about a second or less.

- Word list examples: Making a simple word list of the Brown Corpus
(without the reversed list) required 469.46 seconds with CP Main
Programme (and I canceled CP Flash after 30 minutes!) but only 11
seconds with MCP. In this connection, it is worth pointing out that
the program has an upper limit of 32,000 words for word lists. RH does
claim that this is "virtually ample for every corpus" (p. 59), but the
word list for the one-million Brown Corpus already had about 10,000
entries so it is easy to find corpora whose word lists will exceed
this limit: the word list of the one-million FROWN Corpus mentioned
in RH's book itself has 50,000+ lines (processing time with CP Main
Programme: 781.41 seconds; processing time with WST: 7 seconds), and
the word list of the 100-million-word BNC available from
A. Kilgarriff's website at http://www.itri.bton.ac.uk/~Adam.Kilgarriff/
has 938,000+ lines ...

- Merging files: With CP File Manager, merging 15 text files (with an
overall size of 6,758KB) required 4:08 minutes.

All in all, thus, CP is rather slow, especially when compared to other
contemporary programs.
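
To make the word list task in the examples above concrete, here is a
minimal Python sketch of a frequency word list, a reverse word list
and a cluster count. It is an illustration only, not a
reconstruction of how CP, MCP or WST work internally: the function
names are hypothetical, "reverse" is taken to mean sorting by word
endings (an assumption about CP's usage of the term), and the only
figure taken from the review is the 32,000-entry limit.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens):
    """Frequency list sorted by descending frequency, then alphabetically."""
    return sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))


def reverse_word_list(tokens):
    """Word types sorted by their reversed spelling, which groups shared endings."""
    return sorted(set(tokens), key=lambda w: w[::-1])


def clusters(tokens, n):
    """Count contiguous n-word clusters (n-grams)."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def exceeds_limit(tokens, limit=32_000):
    """Check whether the number of word types exceeds a word list limit."""
    return len(set(tokens)) > limit


# Hypothetical toy text standing in for a corpus file.
tokens = tokenize("The cat sat. The cat ran. The dog sat.")

print(word_list(tokens))
print(reverse_word_list(tokens))
print(clusters(tokens, 2).most_common(1))
print(exceeds_limit(tokens))
```

Since the whole list is built from a single pass over the tokens, a
check like exceeds_limit costs almost nothing once the text has been
read.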

2) Ease and convenience of use of CP

My own impression of the usability of CP is rather negative,
especially when compared to the other three corpus programs mentioned
above, which I work with regularly for teaching and research. While I
do admit that the range of functions is large and that I may not be
able to do justice to all features the suite has to offer, I am not
very happy with a variety of features. My main concerns are as
follows:

2a) The modules of CP

The CP suite comes with a wide variety of different modules and is
intended to bring together modules to carry out a huge number of
different tasks into a single suite, which basically sounds like a
good idea. If CP and other similar programs such as WST, MCP and WC
were located on a 'modularity scale', then MCP and WC would have the
simplest structure such that all commands can be accessed from a
single window with one menu bar; by contrast, WST is a suite with
three modules doing different corpus jobs plus four modules for file
handling etc.; and CP is a suite of 27 modules in five
groups. Compared to the other programs, CP's structure, thus, appears
relatively complex, an impression that was involuntarily confirmed by
some unguided experimentation: while I could use many capabilities of
WST, MCP & WC without having looked at any documentation, now after
several years' experience doing corpus-linguistic research with
different corpus programs and Perl scripts I was unable to do a simple
corpus search with CP without having looked at the documentation
provided with the bundle.

A related point of criticism is that many of the modules serve
purposes for which even the most modestly equipped (corpus) linguist
probably already has resources available that can perform (most of)
what is needed. For example, for the potential buyer it is worth
pointing out that more than half of the 27 modules provided on the
CD-ROM are applications that strongly resemble Microsoft Windows,
Microsoft Office or OpenOffice products:

- CP Slide, a program which "will group any set of files into a list
which one can page through like slides on a projector (from one to the
next, without interruption, on a clear screen)" (<CP_GUIDE.RTF>), a
set of functions many of which Microsoft PowerPoint or OpenOffice
Impress can perform;

- CP Browser, a web browser, which provides functions most of which
Netscape Navigator, Internet Explorer etc. already provide;

- CP File Manager, CP FileManager Lite and CP Quick Backup, a program
"similar to the file manager but slightly different in its
organization" (<CP_GUIDE.RTF>), all allow you to perform various file
manipulation and storing operations; most, though not all, of these
can of course be performed by the regular Windows Explorer or other
(freeware) programs; the same holds for the module CP Find Text;

- CP Diary: a program that is intended to remind you of important
dates and allows you to have a yet-to-do list, i.e. it offers part of
the functionality of Microsoft Outlook etc.;

- CP Jotter "provides a small and very quick version of the fuller
text editors of [CP]" and, thus, does the same thing as <Notepad.EXE>
on every Windows system (or TextPad or UltraEdit or ...); in addition,
CP also has a command 'view returns storage' which provides yet
another window where you can enter data for later storage just like
<Notepad.EXE>; and there's also CP Text Editor and CP Text Tool, which
are text editing utilities ...;

- as if the previously mentioned text editing modules were not enough,
there is also CP Word Processor, which does the same things as
Microsoft Word or OpenOffice Writer;

- CP Easy Chart "will generate a pie, bar or line chart from any
series of input numbers" (<CP_GUIDE.RTF>), which is of course what one
normally uses Microsoft Excel / OpenOffice Calc for;

- CP Database Editor and the separate Database Manager serve the
purpose of processing database files (e.g. .DBF), a function for which
again most people use Microsoft Excel or OpenOffice Calc;

- CP Internet Editor allows you to edit your homepage(s) and,
therefore, does the same thing as Microsoft Frontpage or any other
freely available HTML editor;

- CP Control Centre: a small module that gives you access to a variety
of system setting options most of which are already accessible from
the Windows Control Panel ...;

My further discussion of CP below will not address all of these
modules at the same level of detail since many of the modules and/or
their functions are not relevant in a more narrowly defined
corpus-linguistic sense; in addition, many options of these
'non-corpus-linguistic' modules I have tested were not superior in
functionality to their Windows/Office counterparts anyway. For
example, the possibilities to generate charts with CP Easy Chart's
chart options appear to be much less sophisticated than, say,
Microsoft Excel's options especially since the latter can generate
graphs directly from automatically updated pivot tables without the
whole lot of manual effort required by CP Easy Chart. Also (a minor
point though), many of the modules themselves contain commands which
are nice little gimmicks but which add little to the linguistic
functionality/utility of this corpus processing suite.  Examples for
these include the possibility to access a calculator or the time/date
from several modules, the possibilities of adjusting color and/or
wallpaper or font settings for many modules, the possibility to access
CP Jotter from some modules' menus, the option to view RH's CV
etc. - these options probably don't really hurt, but they do of course
inflate the number of commands beyond what is necessary and
easily/intuitively manageable ...

2b) CP and Windows

Another important usability issue is concerned with the way CP
integrates into, or makes use of the capabilities of, the Microsoft
Windows operating system. While RH emphasizes that the program is
designed "afresh, utilizing to a maximum the possibilities of the
newer operating system" (p. 28), I would not quite agree with this
assessment. Consider, for an admittedly painfully detailed example,
the installation of CP:

Upon double-clicking <setup.exe>, the program copies some files onto
the hard disk and opens a window with (i) three installation options
(installing CP, installing a database software and installing a sample
corpus of Irish English) and (ii) a huge "Installation Advice
Text". Among other things, this text explains, firstly, that the
program is installed into the folder <C:\Corpus Presenter> and - if
the program is installed elsewhere - that the links to the 27 modules
must be altered manually!

Secondly, the installation process is split into two different
steps. (This information is provided in the advice text twice, once
before a list of the contents of the CD-ROM and once again after this
list; the confusion is increased by the fact that, in the second
occurrence of the otherwise identical text segment, a different path
is used.) The first step is called on by clicking on "Installing
Corpus Presenter" so that all the program files are copied to one's
hard disk; you cannot specify where the files are installed unless you
manipulate Windows settings about default program directories. In the
second step, Windows system files are copied. Surprisingly, you are
prompted where these system files should be installed, and if you
decide to install CP into a different directory (e.g. <D:\>), then the
system files are copied to the system directories where they belong
anyway, but the directory which you chose for installation only
contains .EXE files for CP Programme Launcher and CP itself, while all
the other 477 files CP has installed before are still in the directory
CP mentioned before, namely, on an English Windows machine,
<C:\Program Files\Corpus Presenter>.

Similar comments hold for the database manager and the sample corpus:
you need to install the database manager separately (which is ok), but
CP expects it to be located in a particular directory without spaces
in the name, and the sample corpus is simply installed to
<C:\Corpus_Irish_English> regardless of where you would want to have
it ...

Although RH explains in the final paragraph of this advice text that
many of the shortcomings are due to the Windows operating system, it
remains completely mysterious to me why the user cannot simply enter
the desired path for all to-be-installed components, and the program
organizes itself internally as it needs to and outputs the requisite
links as is customary with nearly every other Windows program I
know. The way it is now, the installation process and its result are
painful if you do not know your way around Windows quite well; your
system partition <C:\> is then cluttered with different directories
that you would perhaps have preferred to be on an 'applications'
partition or on a 'corpus' partition. In addition, the uninstallation
with the Windows control panel did not remove all parts of the
installation properly: the corpus as well as the files in <C:\Program
Files\Corpus Presenter> and <D:\Corpus Presenter> simply remained on
the hard disk.

Unfortunately, there are many more inconvenient things that contradict
the claim that CP utilizes the possibilities of the newer Windows
system to a maximum:

- Some of the programs seem to adopt the previous color settings of
the desktop rather than having their own settings. Doesn't sound like
a big deal? Well, on a notebook with a black desktop it can result in
your not being able to read the black text in the overview windows of
CP Programme Launcher until you have figured out how to change the
two available (and misleading) color settings.

- When you start CP Programme Launcher, you get to see what looks like
a menu bar at the top but is not one: Rather than opening a menu with
commands to choose from, each expression in this menu-like section is
already a
command in itself. For users with a well-entrenched knowledge of the
Windows system, this is at first perplexing, which is why buttons
should have been used here in the first place.

- Why are nearly all program windows opened such that they cover the
whole screen and hide all other applications? When you turn to the
help function to get information on some window, the help screen hides
the window for whose options you look for clarification. When you open
another module, it hides all other software which you might have
needed to see (e.g. to enter data from it into CP Easy Chart). And why
aren't the windows that cover the whole screen maximized so that a
click on the restore down icon would reasonably reduce the window
size? And then, some windows don't allow downsizing or maximizing at
all: CP Easy Chart does - CP Flash doesn't.

- In some programs (e.g. CP File Manager and CP Main Programme), lists
of files can be sorted by clicking on a column heading (e.g. size,
type etc.) - in others (e.g. CP Create Data Set), they cannot.

- Right-clicking does not always open a context-dependent menu
(sometimes it just does the same as left-clicking and sometimes it
just offers to perform one particular action), and in windows
consisting of several horizontally or vertically separated frames,
you can often not change the window parts' relative sizes to see more
of the more important information although this is of course standard
in all Windows applications.

- While there is a huge number of commands available in the 'help'
menu (twenty in CP! - even Excel XP only has eight commands, as has
WST), many of them don't seem to belong there (what does benchmarking
the system, the system information, running the graphics program CP
Easy Chart or exploring CP's home directory have to do with a user
struggling with CP's many settings and looking for help?).  Also, CP
does not afford Windows users the by now familiar option of a help
index to enter key words describing your problem in order to retrieve
a list of all help topics related to this notion. Finally, not all
help texts are really useful: when I tried out the interactive tagging
function of CP Text Tool, I was confronted with the window in which I
had to enter the words and the tags for the tagging. Since I did not
immediately understand the makeup of the window (containing four text
fields, seven buttons, four fields to tick and some text information),
I clicked the help button of this window. However, instead of getting
information on how the information must be entered, I got eleven lines
of text (taken from p. 160f. of the book), nine of which explain what
tagging is and that semantic tagging is in general not possible plus
two lines telling you that you must enter maximally 512 input forms
and none of which explain the buttons or fields of the window from
which this 'help' box was accessed in the first place! (If I do the
same in some window in Excel, I get precise information on all buttons
and all fields of the respective window ...)  Also, the window offers
the option to tag forms as words or strings - neither does the
corresponding section of the book explain what a string is nor is the
word 'string' listed in the index of the whole book ...

- Windows programs usually allow the user to enter data into several
fields of an input window by jumping from data field to data field by
pressing the TAB key; CP does the same, but - at least in CP Easy
Chart - the program does not switch from one field to the immediately
adjacent and thematically related one, but arbitrarily to some other
field, which doesn't make entering data any easier ...

2c) Some other functionality quibbles

The following is a list of other shortcomings of some modules which
are not directly related to the integration into Windows; I begin with
CP Main Programme.

- If you want to save your results (of a corpus search) such that
collocates at different positions can be accessed easily, you cannot
simply choose to save it as a text file with tabs as delimiters (at
least I didn't find out how).  Instead, you must save it as a database
file (.DBF), which entails that you must use CP Database Editor (or,
say, Excel) to retrieve the data again and cannot use your favorite
text editor etc. first.

- If a particular search of CP's main program is interrupted, then -
unlike other concordancers - CP does not present the results obtained
so far; it presents none.

- If you wish to INcrease the number of collocates to be displayed in
a results window, you do so counterintuitively by clicking an arrow
pointing DOWNwards.

- While CP can output the collocates of a particular search word, it
is not quite easy to locate this option: all other concordancers I
know simply have a command called 'collocations' (or some equally
telling name), but in CP you have to find out somehow that the command
(in CP Main Programme) is called 'restructure return lines', which is
not only very unintuitive but also somewhat difficult to find since
(i) there is no help index (cf. above), (ii) _collocate_ and
_collocation_ cannot be found using the find function of the main help
text and (iii) the word _collocation_ only occurs twice in the whole
program folder (as determined with a grep tool), neither occurrence of
which explains this function. The only way to find this option if you
don't already know it is the index of the book where the third page
entry for _collocates_ points you to the right page in the book for
this option.

- If you want to search for bipartite expressions where
one part can be instantiated by several different forms (such as
inflectional forms of one lemma, say, _put_, _puts_, _putting_), then
you can use the option of editing an input list - but you cannot
simply edit the list by entering a few words and do a search, you must
either load an existing list or enter the list manually and save it.

- Surprisingly, CP cannot sort concordance lines according to a
user-specified position in the vicinity of the search word: you can
only sort concordance output according to the leftmost word of a cell
and the word _sort_ or any derivative is not even mentioned in the
index of the book, something I find strange for a program (suite) the
main purpose of which is handling text(s).

The module CP Create Data Set also deserves some comments.  If you do
not simply load text files
as a corpus but want to compile a corpus, CP needs information on how
the corpus is organized. You can either simply create a text file with
a particular format with any text editor providing this information
for CP by yourself or, alternatively, you can use this
module. However, although the module is explained on only three pages
in the handbook, it is relatively complex, and its output is the very
same text file description of the corpus. In other words, one must
again enter all information for each corpus file separately and
manually. In addition, several windows this module opens are not
discussed in the book or the corresponding section of the help file
(which are identical anyway) and handling the module is not always
intuitive to say the least:

- I have not been able to find out how the order of files is changed
using CP Create Data Set (other than, of course, by manually editing
the text file itself);

- subheadings of a corpus must make reference to empty dummy files;

- deleting nodes from your corpus structure does not really delete the
nodes until the data set file has been saved so you must work with
empty nodes and empty files etc.

Other shortcomings of this module are, again, due to the fact that
Windows has not been utilized fully. Why can I highlight all corpus
files
which I want to assign to a lower structure level in my corpus, but
cannot also change their level assignment all in one go? Why does this
module not allow me to simply load a list of files and convert them to
a dataset by providing information as to the structure of the corpus?
(Guess what - you have to turn to a different module for this option,
namely CP Flash, but when you read the section on CP Create Data Set
to find out whether such a possibility exists, the book doesn't tell
you - you must find out for yourself some other way!) Why is it not
possible to use drag and drop options etc. to determine the structure
of the corpus? Why is there no assistant to guide you through the
creation of the corpus structure (just like Excel has a guide for
pivot/contingency tables and WST has a brief guide to generate a
concordance)? I don't know.

There are similar usability problems throughout CP. I cannot discuss
all of them here since the review is already (too?) lengthy so a few
final examples must suffice for the moment: First, CP Quick Note makes
it possible to structure a text using embedded table of contents
markers. These markers can be embedded using the very same module
... but not with the menu 'Insert' as every normal user would expect
- rather, to insert these markers the menu you have to open is called
... 'Display'. Second, the program CP List Processor allows the user
to manipulate one or two lists such that, for example, the lists are
merged or differences between lists are shown. However, there is a
little bug in the program concerning the alphabetical output of the
program since the resulting sorting is not fully alphabetical.

Finally, let us return to the interactive tagging procedure of CP Text
Tool. You open a text file containing words to tag, and you need to
have one file with words to tag and one file with tags. For the
automatic tagging function, you choose the 1-512 word forms to be
tagged with one tag, choose automatic tagging and CP Text Tool adds
the tag to all the word forms; since you can specify more than one
word to be tagged at a time, this is a huge advantage over replacing
functions of, say, Microsoft Word or TextPad. With interactive
tagging, the program goes through the corpus text, stops at every
instance of one of the to-be-tagged word forms and asks the user which
tag to add to the word form. This function is implemented a little
clumsily since you are not simply prompted to choose a tag but have to
use some more mouse-clicks and whenever you want to choose a tag other
than the default one to assign and click on 'reject' in this window,
the list of available tags is recursively added ... Thus, although the
interactive tagging function works basically ok, there is some bug
here that needs correction.
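
The automatic tagging step described above (one tag attached to every
occurrence of up to 512 word forms) can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not CP Text
Tool's implementation: the function name auto_tag, the "_TAG" suffix
format and the example sentence are all hypothetical, and the match
is case-sensitive.

```python
import re


def auto_tag(text, word_forms, tag, sep="_"):
    """Append `tag` to every whole-word occurrence of any form in `word_forms`."""
    # The 1-512 bound mirrors the limit reported for CP Text Tool.
    if not 1 <= len(word_forms) <= 512:
        raise ValueError("expected between 1 and 512 word forms")
    # Longest forms first so that no shorter alternative shadows a longer one.
    ordered = sorted(word_forms, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    pattern = re.compile(rf"\b({alternatives})\b")
    return pattern.sub(lambda m: f"{m.group(1)}{sep}{tag}", text)


# Hypothetical example: tag three inflectional forms of one lemma as V.
tagged = auto_tag("She puts it down and put it back.", ["put", "puts", "putting"], "V")
print(tagged)
```

An interactive variant would replace the substitution lambda with a
callback that asks the user for a tag at each match.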

3) Nitpicking, typos etc.
This section is concerned with only minor errors and some other short
comments/questions in a simple list form.

Concerning the book:

- On p. 8, the first line of the second paragraph of section 3.5 is
garbled.

- On pp. 5f., 42 and 164f., RH mentions the normalization of corpora,
but with the exception of one example buried within a table he
restricts himself to normalizing spelling variants; the issue of
lemmatization would have deserved more emphasis here (for analyses of
author style or collocations) but it is only mentioned once in the
glossary (though, surprisingly, not in the index);

- The notion of tagging is explained briefly under the rubric of '3.2
Versions of corpora' in three paragraphs (p.  5) and once again under
'Tagging a corpus' on p. 8f. - why not put this together?

- Why are the BNC and the Cobuild Bank of English not mentioned at all
(not even in the glossary although this includes several other entries
of words not mentioned in the book; cf. below) although they are
probably the most widely distributed and available corpora?

- The explanations of central corpus-linguistic terms in Part I,
sections '3 Preparing corpora' and '4 Examining corpora', have an
average length of only about two to three paragraphs and thus cover
these issues only very superficially.

- On p. 22f., RH discusses an example of collocate analysis, namely
the frequency of particular collocations for _deal_ in the London-Lund
Corpus. He states that "139 finds were reported in 15 files. 57 finds
were with _great_ as immediate left collocate and 41 finds with _good_
in the same position. The results in a visually effective form can be
shown as a pie chart [...]." However, the pie chart to which RH refers
reports percentages which do not fit the data described in the
text. 57 [_great deal_] out of 139 [* _deal_] are 41% and not the 60%
represented in the chart; the same holds for _good_: 41 [_good deal_]
out of 139 [* _deal_] are 29.5% and not 38.89% ...

- On p. 67, RH describes the command 'Frequently asked questions' in
the module CP Main Programme as follows: "The third text file aims at
answering typical questions which users might come up with who have
started working with the present corpus. This file should preferably
be written by someone who has been connected with the compilation of
the corpus." However, when the test corpus is loaded and one tries to
access the FAQ for this corpus using this option, what one gets is the
FAQ for the CP Main Programme, not for the corpus, and it is unclear
why this should have been "written by someone who has been connected
with the compilation of the corpus."

- On p. 72, section 1.3 ('Normalising texts') consists of one
paragraph only and accidentally interrupts an otherwise coherent and
numbered description of the search parameters available in CP Main
Programme.

- On p. 73, RH gives an example of a word search using the frame
option by suggesting that "if you wished to find all instances of
negated adjectives in a text then you could enter a frame consisting
of _un_ and _able_ [...]" - obviously, this search would not produce
all negated adjectives since _inadequate_ and _impossible_ would not
be retrieved.

- On p. 81, RH states that the file extension for the output of a
concordance as a plain text file is .OUT, but in the program it's
.TXT.

- On p. 92, a sentence runs "This very is important for [...]."
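
Two of the points above (the pie-chart percentages on p. 22f. and the
frame search on p. 73) can be checked mechanically. The following is
only a minimal Python sketch: the counts are the ones quoted from the
book, whereas the function names, the word list and the regular
expression are illustrative assumptions and merely approximate a
frame search; they do not reproduce how CP itself implements it.

```python
import re


def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Counts quoted from the book (p. 22f.): 139 finds of "deal" in total.
TOTAL_DEAL = 139
print("great deal:", share(57, TOTAL_DEAL), "%")
print("good deal: ", share(41, TOTAL_DEAL), "%")


def frame_pattern(prefix: str, suffix: str) -> re.Pattern:
    """Build a regex for words that start with prefix and end with suffix.

    Hypothetical stand-in for a frame search; not CP's own implementation.
    """
    return re.compile(
        rf"\b{re.escape(prefix)}\w*{re.escape(suffix)}\b", re.IGNORECASE
    )


# Illustrative word list: only the un-...-able forms are retrieved,
# so negated adjectives with other prefixes are missed.
words = "unable unthinkable inadequate impossible"
hits = frame_pattern("un", "able").findall(words)
print(hits)
```

Under these assumptions, the frame retrieves the _un_..._able_ forms
but neither _inadequate_ nor _impossible_, which is the gap noted
above.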

Some minor comments concerning the glossary:
- Some of the explanations of statistical terms in the glossary (which
are not mentioned in the book otherwise) are far from optimal. For
example, to define an alternative hypothesis as "an assumption in
statistics that two variables are different" (p. 211) is perhaps a
little too low-level even for a glossary definition.

- Similarly, the definition of a Chi-square test (p. 213) is
grammatically incorrect ("A common test in linguistics is to determine
if the probability that a difference between sets of values is due to
chance alone.") and much too vague since the above sentence also
characterizes a t-test, a U test, an ANOVA etc. Also, I am not sure
that most scholars would subscribe to the following statement: "A
typical cut-off point for significance is p<0.001" (p. 213).

- I do not know why a corpus processing book needs glossary entries
for _cookie_, _email_, _inkjet printer_, _laser printer_, _PC_,
_RS232C_ and _TFT_.

- I would not equate _lemma_ and _lexeme_.

I already mentioned that large parts of Part II of the book are just
the help files of the program, but sometimes the book itself is also a
little redundant. For instance, a part of the general description of
CP's main module on p. 30f. is repeated verbatim in the more detailed
section on p. 48f. In addition, the names of the modules used in the
book are sometimes not identical to the names used in CP Programme
Launcher, the application from which RH recommends accessing all other
modules. For example, the book has sections on CP Easy Chart and CP
Structure, but in CP Programme Launcher the very same modules are
called CP Chart Generator and CP Structured Texts respectively. This
is of course no big deal (which is why it is in this nitpicking
section of the evaluation in the first place). Likewise, the book does
not discuss the modules in the order in which they are listed in CP
Programme Launcher, although this would (i) be easier to follow and
(ii) be perfectly possible, since the listing in CP Programme Launcher
is completely arbitrary. Neither point speaks in favor of careful
editing.

Unfortunately, the book is not very well organized in terms of
software learnability either, which is probably a direct consequence
of Part II of the book largely being the help files. It would have
been extremely useful if the book had provided at least one sample
analysis which is designed in such a way as to lead the beginning user
through the many modules (perhaps in combination with a website)
unlike the sample analyses in the book which make no reference at all
to how exactly they would have been generated with CP. Let me give an
example of what I would have appreciated very much. One could have
some corpus files on the publisher's webpage which one would then turn
into a hierarchically structured corpus using the text editing modules
and CP Create Data Set. Then this corpus could be tagged and
lemmatized using CP Text Tool. Then one could perform a sample
analysis on the basis of this corpus, for example of the collocational
differences between _strong_ and _powerful_ (to use the textbook
example), with CP Main Programme or CP Flash, and finally use CP Easy
Chart and CP Slide to prepare a presentation of the results and CP
Internet Editor to present the results on a website. For all of this,
the website could provide interim results to allow the user to check
whether he has mastered the tasks so far. Sadly, however, none of this
is provided, although it would have enhanced the value and quick
learnability of the bundle by many orders of magnitude.

Concerning the software:

- The installation advice text contains the same paragraph twice (with
different paths, though); cf. above.

- In the rubric 'Retrieving information' of the CP help, there are
three paragraphs numbered §2.

- In several modules, when a window outputs certain results, you can
change the size of the window as such, but not the size of the part of
the window that contains the output; i.e. you get a larger window with
the same information (the PDF version of this review shows an example
from CP Flash).

- The fact sheet for the installed test corpus starts with "Some
essential facts abou the Test Corpus."

- Sometimes, the program uses somewhat idiosyncratic commands: instead
of clicking on 'ok' to close a window and accept what one has
entered/changed, one has to click on 'conclude.'

Lastly, although CP is a very recent program, it does not have some of
the added-value gimmicks that competing programs offer (it is only
fair to repeat here that it of course also has functions these
competitors do not have).  For example, CP does not provide
corpus-based statistics such as indices of collocational strength
etc. (like, say, Michael Barlow's Collocate). Also, although the issue
of analyzing style is brought up repeatedly in the book, CP does not
allow for the automatic identification of key words in texts (unlike
WST).
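
To illustrate what an index of collocational strength involves, here
is a minimal Python sketch of pointwise mutual information (MI) for
adjacent word pairs. MI is only one of several commonly used
association measures; the sketch is not meant to reproduce what
Collocate or any other program computes, and the function name `pmi`
and the toy token list are assumptions made purely for illustration.

```python
import math
from collections import Counter


def pmi(tokens: list[str], w1: str, w2: str) -> float:
    """Pointwise mutual information (log base 2) of the adjacent pair w1 w2.

    Probabilities are simple relative frequencies; returns -inf if the
    pair never occurs in the token list.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return float("-inf")
    p_pair = pair / (n - 1)   # n - 1 adjacent pairs in n tokens
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))


# Toy data, invented for illustration only.
tokens = "a great deal of work and a good deal of fun is a great deal".split()
print(pmi(tokens, "great", "deal"))
print(pmi(tokens, "deal", "great"))
```

A positive value means the pair occurs more often than the individual
word frequencies would lead one to expect; with realistic corpora one
would also want frequency thresholds, since MI is known to overrate
rare pairs.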

CONCLUSION

All in all, I am the first to admit that CP is a program that offers
many functions that can be useful for the compilation, annotation and
processing of corpora. I also freely admit that the evaluation of
usability is by definition a relatively subjective task. I also
believe, however, that many of the flaws I have pointed out above
render the program much more difficult to use than competing
products. From what I have seen, the (only) positive side I have been
able to detect is in fact the large number of functions, i.e. 'what
the program does'. The negative sides of CP concern 'how the program
does it': (the larger part of) the program is

- difficult, inconvenient and counterintuitive to handle, sometimes
violating even elementary usability principles;

- overloaded with many redundant functions (containing four text
editors alone) that are part and parcel of regular operating systems
and office software;

- painfully slow to execute even some of the most basic concordancing
tasks.

Note in this connection that many functions of CP (other than the
hierarchical corpus compilation functions of course) are available as
(parts of regular) office suites and freeware programs. In addition,
the software book is largely identical to the many help files that
come with the software and sloppily edited in many ways. Although I
had been waiting quite some time for the program after having seen it
announced as commercially available soon, I am rather disappointed
with the final result and hope that the most frustrating bugs will be
addressed in an update soon.

ABOUT THE REVIEWER

Stefan Th. Gries is Associate Professor at the Department of Business
Communication and Information Science at the University of Southern
Denmark. His research interests lie mainly in corpus linguistics and
linguistic methodology, esp. the syntax-lexis interface as well as
corpus-based, quantitative approaches to word-formation processes
(e.g. blending and suffixation), syntactic variation (dative
movement, particle movement etc.) and semantic issues (near synonyms,
word senses etc.). He is currently co-editing two volumes on corpora
in cognitive linguistics and is also one of the editors-in-chief of a
new journal, Corpus Linguistics and Linguistic Theory, to be launched
in 2005.


---------------------------------------------------------------------------

If you buy this book please tell the publisher or author
that you saw it reviewed on the LINGUIST list.

---------------------------------------------------------------------------
LINGUIST List: Vol-15-681


