15.1087, Disc: Re: Reply to review of Corpus Presenter

LINGUIST List linguist at linguistlist.org
Fri Apr 2 04:10:42 UTC 2004


LINGUIST List:  Vol-15-1087. Thu Apr 1 2004. ISSN: 1068-4875.

Subject: 15.1087, Disc: Re: Reply to review of Corpus Presenter

Moderators: Anthony Aristar, Wayne State U. <aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Sheila Collberg, U. of Arizona
	Terence Langendoen, U. of Arizona

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Sarah Murray <sarah at linguistlist.org>
 ==========================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
=================================Directory=================================

1)
Date:  Wed, 24 Mar 2004 23:42:18 +0100
From:  "Stefan Th. Gries" <STGries at sitkom.sdu.dk>
Subject:  Re: 15.9891, Hickey's response to my review of his program/book

-------------------------------- Message 1 -------------------------------

Date:  Wed, 24 Mar 2004 23:42:18 +0100
From:  "Stefan Th. Gries" <STGries at sitkom.sdu.dk>
Subject:  Re: 15.9891, Hickey's response to my review of his program/book


For a more reader-friendly version of my review, please go to
http://people.freenet.de/Stefan_Th_Gries/Research/CP_review.pdf .

Apart from all communication on this matter (including this text),
this page provides some screenshots of CP and, to make the discussion
of CP's different modules easier to locate, highlights all occurrences
of CP's program names in blue. In what follows I would
like to briefly comment on some of the points raised in Hickey's
response to my review. The first part is concerned with aspects of
Hickey's book/software (henceforth CP), the second with the more
general tone of Hickey's response.

I agree with Hickey that my review did not cover all aspects of the
book/software that one could have mentioned. The book alone comprises
nearly 300 pages and the software offers an extremely vast range of
functions (as I also state in my review), so every review - especially
one due within six weeks at most - is by necessity selective, even
though this one is already extraordinarily long. Given the vast
range of functions the program has to offer, I think it is only
natural that not all features can be discussed (to the satisfaction of
the author).

Then, it goes without saying that I accept responsibility for, and
regret, all errors or misrepresentations that are not due to my
selective emphasis, although below I also have something to say about
how they arose in the first place. However, it is necessary to set
straight some of the points that Hickey criticizes about my review and
I will also demonstrate below that some points of critique mentioned
by Hickey are due to his not having read the review thoroughly enough.

As a first example, Hickey complains about my having neglected the
presentation of corpora, the retrieval techniques and the Corpus of
Irish English, but on the one hand, the review does mention that CP
can be used to compile and annotate corpora hierarchically and it
mentions the installation of the Corpus of Irish English. On the other
hand, while I certainly agree that I could have discussed these issues
more prominently, I found it more important to discuss the way CP
deals with corpora in general rather than one corpus in particular,
and I consider this a reviewer's legitimate option.

Second, Hickey objects to my comparing CP with other software. I do
not see why this is problematic. Sure, CP is a tool that is very
different from competing products such as MonoConc Pro 2.2, WordSmith
Tools 3/4 (WST) and WinConcord 2 - no doubt about that - but (i) by
far the largest part of the review is not concerned with comparing CP
to competing products anyway (WST is mentioned 11 times on 9 pages
with 6,544 words) and (ii) I do not see any reason why I as a
reviewer should not be allowed to compare selected aspects of a
program with competing applications. Moreover, I do not know how well
Hickey knows the competing products he refers to: (a) He simply
affirms that CP offers the most flexible retrieval tools without
offering any evidence (and MonoConc Pro's retrieval options are in
fact extremely versatile). (b) Where I give evidence about the speed
of the program, Hickey simply states he doubts my evidence (which, as
I mention in my review, consists of the statistics CP itself outputs!)
but does not provide a single argument or perhaps a comparable figure to
support his point. Lastly, while I do agree that some may consider
speed less relevant nowadays, my own experience is different:
searching the BNC or merely larger parts of it for regular expressions
of varying degrees of complexity or determining significant collocates
for tens of thousands of adjectives can be so time-consuming that
speed sometimes matters quite a lot. Be that as it may, reporting
performance statistics cannot really be wrong by definition ...

Third, let me now turn to the so-called factual misrepresentations.
For one, Hickey states that, contrary to what I wrote, "[i]t certainly
is possible to sort concordance returns on words to the left and right
of the keyword (this can be done for up to 8 words each side of the
keyword)." If the function 'restructure return lines' is activated for
a particular concordance, CP outputs a bipartite window, the lower
part of which provides absolute and relative frequencies of collocates
of some defined span (1, 2, 3, 4, 6, 8 words on one or two flanks -
why not also 5 and 7?). The upper part of this window now provides a
part of each concordance line in tabular form such that the line is
split into as many slots as were previously defined. It is true that
this part of the output can then be sorted (to sort by the first word
to the right of the node/search word, you must click on its column
name, which is "Item 1", the same name as that of the column for the
first word to the left) and copied to the clipboard (and only then
saved to disc). However, it is not the complete concordance line that
can be dealt with this way: only the previously defined part of the
concordance line is sorted, which is why context further away from
the node/search word cannot be accessed this way. Thus, unless I have
missed some other function, I do not see my statement disproven.
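
To make the distinction concrete, here is a minimal Python sketch of
sorting complete concordance lines by a word to the left or right of
the node word. It is not how CP or any other program mentioned here
implements sorting; the function names (kwic, sort_by_right,
sort_by_left), the span parameter and the toy sentence are all
assumptions made for illustration. The point it shows is that each hit
keeps its whole left and right context while only one position serves
as the sort key.

```python
def kwic(tokens, node, span=4):
    """Return one (left, node, right) triple per occurrence of `node`.

    `left` and `right` are lists holding up to `span` context words.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.append((left, tok, right))
    return hits


def sort_by_right(hits, position=1):
    """Sort hits by the word `position` slots to the right of the node.

    Lines that have no word in that slot get an empty key and sort first.
    """
    def key(hit):
        right = hit[2]
        return right[position - 1].lower() if len(right) >= position else ""
    return sorted(hits, key=key)


def sort_by_left(hits, position=1):
    """Sort hits by the word `position` slots to the left of the node."""
    def key(hit):
        left = hit[0]
        return left[-position].lower() if len(left) >= position else ""
    return sorted(hits, key=key)


# Toy example (invented text, not taken from any corpus).
tokens = "the cat sat on the mat and the dog sat near the cat".split()
hits = kwic(tokens, "the", span=3)

for left, node, right in sort_by_right(hits, position=1):
    # The full context travels with each line; only the key changes.
    print(f"{' '.join(left):>15} [{node}] {' '.join(right)}")
```

Because the sort key is computed from the stored context and the
triple itself is what gets reordered, the span used for display and
the position used for sorting are independent of each other in this
sketch.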


In this connection, Hickey states "Gries does not like my terminology
- 'restructure return lines' - but as a native speaker of English I
beg to maintain that this is an acceptable description of this
function." While Hickey simply skips over my point of critique that
this function is too difficult to locate since no help index is
available and no entries for "sort" or "restructure return lines" (!
;-) ) exist in the index of the book, he is completely right: I do not
like his terminology. I pointed out some other idiosyncratic names of
functions in his program, and I just leave it to the readers to decide
whether it is really just due to chance that most, if not all, other
programs use the command "sort" for sorting, as do some programming
languages (e.g. Python and R, the latter of which also has "order").


In order to address the only other case of "factual
misrepresentation" Hickey cares to mention (in spite of the multitude
of errors he implies there are), I have to quote him again at
length. He states "Gries thinks that the analysis of style is not
treated in Corpus Presenter, but the special text editor, CP Text
Tool, has a function for Lexical Clustering analysis which does
precisely that. It will allow users to determine the occurrence of
stylistic features in a flexible manner and so help them answer such
questions as text authorship. Lexical Clustering is mentioned on
several occasions, including the various guides available within the
Launcher so Gries should have seen this is if he had looked at the
material properly."


Unfortunately, Hickey himself has not cared to read my review with the
necessary attention. Here's what I said in my review: "Lastly,
although CP is a very recent program, it does not have some of the
added-value gimmicks that competing programs offer (it is only fair to
repeat here that it of course also has functions these competitors do
not have). For example, CP does not provide corpus-based statistics
such as indices of collocational strength etc. (like, say, Michael
Barlow's Collocate). Also, although the issue of analyzing style is
brought up repeatedly in the book, CP does not allow for the automatic
identification of key words in texts (unlike WST)." As it turns out,
CP really cannot output collocation statistics, but more
importantly, while it can output lexical clusters in the way that WST
can, it cannot compute key words as defined in WST, which I explicitly
referred to. Key Words in WordSmith takes two corpora (one 'research
corpus', one 'reference corpus'), checks the frequencies of all words
in both corpora and then outputs key words sorted by their p-values
(where key words are words which are significantly overrepresented
within the research corpus as compared to the reference corpus
[measured in terms of Chi-square tests or the Log-likelihood test]).
Hickey's Lexical Cluster Analysis does not do this, and I have not
been able to locate any other such function in his program, which is
why this claim of his remains as much in need of support as many
others.
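
As a rough illustration of the kind of computation described above,
the following Python sketch scores each word of a research corpus
against a reference corpus with a log-likelihood test on a 2x2
frequency table and returns the overrepresented words sorted by
p-value. It is a sketch under stated assumptions, not WordSmith's
actual implementation: the function names (log_likelihood, key_words),
the alpha threshold of 0.05 and the toy corpora are invented for
illustration, and both corpora are assumed to be non-empty lists of
tokens.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood statistic (G2) for a 2x2 table.

    a, b: frequency of the word in the research / reference corpus
    c, d: total number of tokens in the research / reference corpus
    """
    total = c + d
    word_total = a + b
    other_total = total - word_total
    observed = [a, b, c - a, d - b]
    expected = [word_total * c / total, word_total * d / total,
                other_total * c / total, other_total * d / total]
    # Cells with an observed count of 0 contribute nothing to the sum.
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return max(g2, 0.0)  # guard against tiny negative rounding errors


def key_words(research_tokens, reference_tokens, alpha=0.05):
    """Return (word, G2, p) for overrepresented words, sorted by p."""
    research = Counter(research_tokens)
    reference = Counter(reference_tokens)
    c, d = len(research_tokens), len(reference_tokens)
    results = []
    for word, a in research.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / c <= b / d:
            continue  # not overrepresented in the research corpus
        g2 = log_likelihood(a, b, c, d)
        # Upper-tail p-value of a chi-square distribution with 1 df.
        p = math.erfc(math.sqrt(g2 / 2))
        if p < alpha:
            results.append((word, g2, p))
    return sorted(results, key=lambda r: r[2])


# Toy corpora (invented counts, for illustration only).
research = ["corpus"] * 30 + ["the"] * 70
reference = ["corpus"] * 2 + ["the"] * 98

for word, g2, p in key_words(research, reference):
    print(f"{word}\tG2={g2:.2f}\tp={p:.3g}")
```

The p-value uses the identity that, for one degree of freedom, the
chi-square upper tail equals erfc(sqrt(x / 2)), which keeps the sketch
within the standard library. A Chi-square test could be substituted
for log_likelihood without changing the rest of the procedure.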


Let me turn to the final factual point. I pointed out before that my
main problems with CP do not derive from its functionality, i.e. what
the program can do. Let me state it as directly as possible: the
functionality of CP is great, it can do more than any other corpus
program I have ever seen. My main quibble is with usability, i.e. how
the program lets you do it. I do not wish to bore the readers with all
the details of the original review but refer them to it instead, but
let me just say two sentences about Hickey's complaint that I devote
too much space to criticizing the many utilities CP offers. First, I
think it is only fair to point out to the potential buyer that many of
the twenty-something modules the program contains just do what the
operating system or other freely available software can already
do. For some people, this may be an interesting argument against the
bundle, and thus this is something that should be pointed out in a
review. Second, the usability of a program is not enhanced by crowding
it with many modules or functions one is later invited to delete or
neglect - rather, a program should offer its functionality in such a
way that the user can make intuitive choices from a reasonably small
set of alternatives: a help menu with twenty different entries that
even includes a command to benchmark the system is simply not the most
usable way to design a program, and the fact that no other program
goes to similar extremes testifies to this point.


As to the book, Hickey objects to my lack of appreciation of its
structure and maintains that there were good reasons for it. I
cannot substantially comment on this point since Hickey does not
mention the good reasons he alludes to but just states that, if I
don't like the structure, he can't help it. This is doubtlessly
correct, but neither does it constitute a rational argument nor do I
see why I as a reviewer should not be entitled to criticize the
structure of a book (not to forget the many editing errors / typos),
especially when I also outline a constructive proposal for what a
structure that is, from my point of view, didactically more feasible
might look like, as I do at the end of my review. I am, however, very
happy to
learn that my review has - in spite of all its limitations - already
resulted in some bugs being fixed.


Let me finally say something about the general tone of his reply.
First, I (and at least three other colleagues who have read his reply)
cannot fail to notice the ad hominem undertone underlying (parts of)
his reply. I do not see in what way it is relevant to a discussion of
the merits (or lack of them) of my review that I am "one Stefan
Gries", "a German academic at the University of Southern Denmark", or
that Hickey has never heard of me before. Similarly, Hickey states
that "Corpus Presenter works properly and fulfils the functions which
it claims to perform (Gries acknowledges this, if only begrudgingly)".
What is the purpose of salting his reply by attributing such emotional
states to me? Neither did CP perform all the functions Hickey claimed
it to perform - remember Hickey's own statements about the bugs he
fixed in response to my review? Why is there already an upgrade if CP
already performed all functions as intended? - nor is there any
statement in my review that can straightforwardly be interpreted as
begrudging if one has not already built up some prejudices. Quite
the contrary: I mentioned clearly that I had been looking forward to
the program for quite some time after having seen it announced as
commercially available soon. And as usual, note that Hickey simply
attributes this emotional state to me, but - as before - does not cite
any sentence whatsoever to support this attribution. I would have
welcomed a more sober exchange than the one that has now actually
taken place, but it is instructive in this connection to not just
consider the reply Hickey posted to the LinguistList, but also the
reply he had sent to me personally earlier, in which a compound
involving a vulgar German verb for 'to defecate' plays a prominent
role in characterizing my review (go to
http://people.freenet.de/Stefan_Th_Gries/Research/CP_review.pdf to
access the original review with example screenshots and all following
communication). This will therefore be my final statement in this
matter.


Stefan Th. Gries
University of Southern Denmark

---------------------------------------------------------------------------
LINGUIST List: Vol-15-1087


