Corpora: Summary: Using corpora to correct grammar

Mark Davies mdavies at ilstu.edu
Mon Oct 15 13:03:00 UTC 2001


Thanks to the following people who responded to my query about using large
corpora as a database to help correct student essays.

Pete Whitelock
Mike O'Connell
Gosse Bouma
John Milton
Oliver Mason
Tom Vanallemeersch
Tony Berber Sardinha

The replies not only address the methodological issues involved, but also
offer good insight into some of the practical issues of creating the lists
of n-grams that would form the database.

Since nearly all of the replies were sent directly to me, rather than to
the list, I'll re-post them here.


From: Pete Whitelock <pete at sharp.co.uk>
From: Gosse Bouma <gosse at let.rug.nl>

What you suggest has been tried, for English, at the Educational Testing
Service - see:

Martin Chodorow and Claudia Leacock. 2000. An unsupervised method for
detecting grammatical errors. In Proceedings of the 1st Annual Meeting of
the North American Chapter of the Association for Computational Linguistics,
140-147.

Their web page is at
http://www.etstechnologies.com/scoringtech-scientists.htm

----------------------------------------------------------------

From: Mike O'Connell <Michael.Oconnell at Colorado.EDU>

   A similar project at CU-Boulder uses Latent Semantic Analysis (latent
semantic indexing) to permit comparisons of essays to known standards. The
basic technology is derivative of work on information retrieval, but they're
also trying out language models using n-grams of various lengths for
different purposes.  They have a research project called, I believe, Summary
Street, which does automatic grading of essays by comparing them to a model
based on graded essays, although I don't think they've formally applied the
n-gram language-model type of approach that you're describing in any
published work.
   So, while there are similarities, the projects aren't exactly the same,
but anyway, just FYI.
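
A rough sketch of this kind of LSA comparison, purely for illustration (this
is not the CU-Boulder system; scikit-learn, the 100-dimension default and the
function name are placeholder choices):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(model_essays, new_essay, n_dims=100):
    # Term-document matrix over the graded model essays plus the new essay.
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(list(model_essays) + [new_essay])
    # Reduce to a latent semantic space; dimensions capped by vocabulary size.
    svd = TruncatedSVD(n_components=min(n_dims, X.shape[1] - 1))
    Z = svd.fit_transform(X)
    # Cosine similarity of the new essay (last row) against each model essay.
    return cosine_similarity(Z[-1:], Z[:-1])[0]

The similarity scores can then be mapped onto the grades of the closest
model essays.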

----------------------------------------------------------------

From: John Milton <lcjohn at ust.hk>

I tried somewhat the opposite approach, but ran into some of the same
problems that I would have using your approach, at least with the type of
errors produced by my students. I originally thought it would be a good
idea to extract a set of grammatically impossible collocations from a large
(25 million word) corpus of Cantonese speakers' English texts and flag
these whenever a student produced them. The trouble is that the corpus
that I have (essays written by students graduating from secondary school
in Hong Kong with barely passing grades in English) contained very few
'illegal' short collocations. Their biggest problem is in the use of
particles (e.g. "This will benefit to me."). It's easy to figure out the
intralingual confusion that results in this type of error, but it's
difficult in practice to flag reliably (e.g., they also drop the
aux verb and produce "This benefit to me."). I wrote a toy program
anyway to see what type of reliability I might get, and in fact got too
many false positives and false negatives for it to be useful. Attacking
these types of problems would require an authorable grammar checker whose
analysis, especially parsing, is really reliable. I know of no such
program currently available (e.g. I tried L&H's Chinese grammar checker a
few years ago).
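
For illustration, a toy flagger of this kind takes only a few lines once the
list exists (this is not Milton's program; the error collocations below are
invented examples, not items from the Hong Kong corpus):

import re

# Known-impossible collocations drawn from a learner error corpus (invented
# examples here, for illustration only).
ERROR_COLLOCATIONS = {
    ("benefit", "to", "me"),   # "This will benefit to me."
    ("discuss", "about"),      # "discuss about the problem"
}

def flag_collocations(text, error_set=ERROR_COLLOCATIONS):
    # Tokenise crudely, then test every 2- and 3-word window against the list.
    words = re.findall(r"[a-z']+", text.lower())
    lengths = {len(c) for c in error_set}
    flags = []
    for i in range(len(words)):
        for n in lengths:
            chunk = tuple(words[i:i + n])
            if len(chunk) == n and chunk in error_set:
                flags.append((i, " ".join(chunk)))
    return flags

print(flag_collocations("This will benefit to me in the future."))
# -> [(2, 'benefit to me')]

As Milton notes, the hard part is not the matching but deciding what belongs
on the list, and handling near-variants such as the dropped auxiliary.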

Nevertheless, I'd be interested in hearing how your approach works with
the types of strings your students produce.

----------------------------------------------------------------

From: Oliver Mason <oliver at clg.bham.ac.uk>

Just an additional idea: you could do the same not only with words, but
also with POS tags, or even chunked phrase tags, which might give you
wider coverage of grammatical errors.  Depending on the granularity of
the tagset, certain errors such as word order or noun-verb agreement
might be detectable.

You could then flag something as "verb expected" or "plural noun
expected" when there is a mismatch.  I'm not sure exactly how that would
work, as it is more likely to involve probabilities than absolute
judgements.
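
A rough sketch of how such a POS-tag n-gram check might look (the tagger,
the reference counts and the cut-off are all assumed here, and a real system
would want probabilities rather than a hard threshold):

from collections import Counter

def pos_trigrams(tags):
    # All consecutive triples of POS tags in one sentence.
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def build_reference(tagged_sentences):
    # tagged_sentences: POS-tag sequences from an already-tagged reference corpus.
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(pos_trigrams(tags))
    return counts

def flag_unusual(tags, reference_counts, min_count=1):
    # Flag tag trigrams seen fewer than min_count times in the reference corpus.
    return [tri for tri in pos_trigrams(tags)
            if reference_counts[tri] < min_count]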

Anyway, your project sounds like a very interesting idea.

----------------------------------------------------------------

From: Tom Vanallemeersch <tom.vanallemeersch at lantworks.com>

You may be able to solve the problem of looking up word clusters in a way that
doesn't need an "industrial strength" database - though a lot of memory will
still be needed. This can be done using a so-called Patricia array, which is
created by storing a number of text positions in an array and sorting them on
the string starting at each position.

In your case, you'd have to store the text positions at which words start, and
sort these positions on the word starting at the position (e.g. if the word
"house" is at position 15400 and the word "single" at position 10000, then
position 15400 would precede position 10000 in the array). Then, using binary
search, it is possible to find in the Patricia array whether a word or word
group (any length) is present in the corpus, and what the frequency of the
word/word group is.
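
A simplified, in-memory sketch of this lookup, using word indices rather than
byte offsets into a corpus file, so as written it only scales to modest
corpora:

def build_position_array(tokens):
    # One entry per word position, ordered by the word sequence starting there.
    # Each comparison copies a suffix slice, so this naive version is for
    # small corpora only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_frequency(tokens, positions, ngram):
    key = list(ngram)
    n = len(key)

    def prefix(i):
        return tokens[i:i + n]

    # Lower bound: first position whose n-word prefix is >= the query.
    lo, hi = 0, len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(positions[mid]) < key:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first position whose n-word prefix is > the query.
    hi = len(positions)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(positions[mid]) <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo - first

tokens = "the house is in the town near the house".split()
positions = build_position_array(tokens)
print(ngram_frequency(tokens, positions, ["the", "house"]))  # 2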

There are some memory issues:
- given a 50 million word corpus, the Patricia array will need 50 million x 4
bytes = roughly 200 MB; so you would need at least 256 MB of RAM if you use a
PC
- when creating or searching the array, access to the corpus is needed; this
may mean that the program which creates/searches the array has to do a lot of
lookups on the hard disk (unless you can store the whole corpus in RAM, which
seems difficult); this is less of a problem for lookup than for creating the
array

----------------------------------------------------------------

From: Tony Berber Sardinha <tony4 at uol.com.br>

I've been thinking about such a tool for a long time - a collocation / pattern
checker would be a great tool for language learners.

I tried this back in 1995/1996 for a paper I presented at the Aston Corpus
Seminar. At the time, extracting n-grams was very hard on a Windows/DOS-based
PC, but today this is much simpler using a program such as WordSmith Tools
(with the clusters option activated) or scripts such as the one included in
the Brill tagger distribution (bigram-generate.prl, which can be adapted to
find 3-, 4-, 5-grams, etc. and then run using DOS versions of awk and Windows
ActivePerl).

More recently, I compared the frequency of 3-grams in two learner corpora, one
of texts written by Brazilian EFL students and the other of essays written by
students from an Anglo-Brazilian bilingual school. The comparison was carried
out using lists of the most frequent 3-grams in English, representing 10
frequency bands. I then used Unix-like tools running in DOS (uniq, grep, etc.) to
find how many 3-grams of each frequency band were present in each corpus. This
is similar to the kind of frequency analysis that Paul Nation's 'range' program
produces.
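
That tally is also easy to reproduce with a short script, assuming the band
lists have already been prepared (stored here as sets of space-joined
3-grams; the names are placeholders):

from collections import Counter

def trigrams(tokens):
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def band_profile(learner_tokens, bands):
    # bands: dict mapping band name -> set of reference 3-grams for that band.
    profile = Counter()
    for tri in set(trigrams(learner_tokens)):   # count trigram types, not tokens
        for name, members in bands.items():
            if tri in members:
                profile[name] += 1
                break
    return profile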

One practical problem I see in doing this kind of work is extracting n-grams
from a large corpus such as the BNC on a PC - WordSmith Tools will stop
processing the corpus on my machine (P III 550 MHz, 128 MB RAM) after a few
million words. One possible solution is to split the corpus into samples,
extract the clusters from those samples and then join the lists with the
'Merge Lists' function. The disadvantage here is that clusters with a
frequency below the cut-off point (e.g. 1) in two or more separate corpus
samples will not be included in the final merged list, resulting in inexact
frequencies for the whole corpus. The other practical problem is that
WordSmith Tools will not allow you to pull out n-grams of frequency 1 in the
learner data, since the minimum frequency is 2, but this can be overcome by
'cheating' a little: just choose the corpus texts twice, so that n-grams which
originally had a frequency of 1 will then have a frequency of 2 and will thus
be included in the wordlist.
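
Both limitations disappear if the n-grams are counted with a small script
instead of WordSmith; a minimal sketch (file handling and tokenisation are
simplified, and a corpus the size of the BNC would still strain memory on
such a machine):

import re
from collections import Counter

def extract_ngrams(paths, n=3):
    # Stream the corpus file by file into one counter: no list-merging step,
    # and clusters of frequency 1 are kept.
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            tokens = re.findall(r"[A-Za-z']+", f.read().lower())
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

# counts = extract_ngrams(["essay1.txt", "essay2.txt"], n=3)
# counts.most_common(20)   # includes clusters with frequency 1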

A problem of a more 'conceptual' kind is of course that clusters formed by
adjacent words only will not represent the full range of pre-fabs in English or
Spanish, and so perfectly acceptable patterns in learner compositions may be
marked as 'suspect' because they did not match any n-grams in the native
language reference corpus.



====================================================
Mark Davies, Associate Professor, Spanish Linguistics
4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
309-438-7975 (voice) / 309-438-8083 (fax)
http://mdavies.for.ilstu.edu/

** Corpus design and use / Web-database programming and optimization **
** Historical and dialectal Spanish and Portuguese syntax / Distance education **
=====================================================


