Corpora: Using corpora to correct grammar

Mark Davies mdavies at ilstu.edu
Wed Oct 10 16:14:40 UTC 2001


Is anyone aware of projects that have used very large corpora as a database
to help correct compositions written by learners of that language?

For example, you might have a 40-50 million word corpus of, let's say,
Spanish.  First you'd extract all of the 1-, 2-, and 3-word clusters in the
corpus and import these into a database.  (Sounds hard, but it's doable --
I'm working on something like that right now).  Then you'd have a web-based
form, for example, where Spanish students could input their 500-1000 word
composition.  The script would check every single word and every two- and
three-word cluster in the composition to see whether these appear in the
database built from the multi-million word corpus.  At the one-word level, it
would simply work like a spell checker.  At the two- and three-word (and
longer) level, it would work like a modified grammar checker (except that
the database has access to a frequency listing for each bigram or trigram,
whereas a grammar checker works on more abstract rules).
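
Just to make the extraction step concrete, here's a rough sketch in Python of
what it might look like.  To be clear, "corpus.txt", the table layout, and the
tokenization are all just placeholder assumptions on my part, sqlite3 is only a
small-scale stand-in for a real server database, and counting everything in
memory would only work for a corpus much smaller than 40-50 million words:

    import re
    import sqlite3
    from collections import Counter

    # Count every 1-, 2-, and 3-word cluster in the corpus.  For simplicity,
    # clusters are counted within each line, and a "word" is just a run of
    # word characters; a real system would want proper sentence handling.
    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            words = re.findall(r"\w+", line.lower())
            for n in (1, 2, 3):
                for i in range(len(words) - n + 1):
                    counts[" ".join(words[i:i + n])] += 1

    # Store the clusters and their frequencies so the checking script
    # can look them up later.
    db = sqlite3.connect("ngrams.db")
    db.execute("CREATE TABLE IF NOT EXISTS ngrams "
               "(ngram TEXT PRIMARY KEY, freq INTEGER)")
    db.executemany("INSERT OR REPLACE INTO ngrams VALUES (?, ?)", counts.items())
    db.commit()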

If the specific two- or three-word string matches a record in the database
(from the multi-million word corpus), then it's marked "OK".  If not -- or
if it appears in the corpus at a frequency below a certain threshold --
then the two- or three-word string is marked as "suspect".  In this case,
the script would then look up other forms of the same lemma (and perhaps
synonyms for the words as well) in the two- and three-word strings of the
database, and suggest some of these as options to the students.
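
A minimal sketch of that checking step might look like the following; the
frequency threshold is an assumed cut-off that would need tuning, and the
lemma/synonym suggestion step is left out because it would need a separate
lemma table:

    import re
    import sqlite3

    THRESHOLD = 3   # assumed cut-off; would need tuning against the real corpus
    db = sqlite3.connect("ngrams.db")

    def freq(ngram):
        """Frequency of a cluster in the corpus database (0 if absent)."""
        row = db.execute("SELECT freq FROM ngrams WHERE ngram = ?",
                         (ngram,)).fetchone()
        return row[0] if row else 0

    def suspect_clusters(composition):
        """Return every 2- and 3-word cluster that is rare or absent."""
        words = re.findall(r"\w+", composition.lower())
        suspects = []
        for n in (2, 3):
            for i in range(len(words) - n + 1):
                cluster = " ".join(words[i:i + n])
                if freq(cluster) < THRESHOLD:
                    suspects.append(cluster)
        return suspects

    # e.g. suspect_clusters("yo dije a ella que no") would presumably flag
    # "dije a ella", assuming that string is rare in the corpus.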

Of course, the generative grammar idea is that there are an infinite number
of sentences, so you wouldn't want to try this on 5-, 10-, or 15-word
strings -- chances are they wouldn't match anything in the 40-50 million
word corpus.  At the level of two- and three-word strings, though, I think
you'd find a much narrower range of entries in the database, and therefore
a non-match would be a much more reliable signal that the string is
"suspect".  (Actually, the number of unique trigrams in a 50 million word
corpus ends up being about 20-30 million distinct strings, so it's not THAT
limited.  And obviously, it requires an "industrial strength" database
(Oracle, SQL Server, etc.) to handle a dataset this size efficiently.)
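
One practical note on that last point: with a table of that size, you'd
probably want to fetch all of a composition's clusters in one (or a few)
queries rather than one query per cluster.  A hedged sketch of that, again
with sqlite3 only standing in for the server database:

    import sqlite3

    def bulk_freqs(db, clusters):
        """Look up many clusters at once; anything not found gets frequency 0."""
        freqs = {c: 0 for c in clusters}
        todo = list(freqs)
        for i in range(0, len(todo), 500):   # stay under per-query parameter limits
            chunk = todo[i:i + 500]
            query = ("SELECT ngram, freq FROM ngrams WHERE ngram IN (%s)"
                     % ",".join("?" * len(chunk)))
            for ngram, freq in db.execute(query, chunk):
                freqs[ngram] = freq
        return freqs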

The reason I'm asking about such a project is that I'm teaching a
mid-level Spanish composition course next year, and I'd like to have
some way to automate correction of some of the most common, low-level
errors ("tener un buen tiempo", "yo dije a ella", etc.).  Higher-level stuff
(disjoint agreement, semantics at the sentential level, etc.) would be way
beyond the capability of such a program.  But I do think it has some
potential for low-level, narrow, clause-internal phenomena.

Anyway, have there been projects similar to this in the past?  If so, any
references would be appreciated.  I'll summarize for the list if there's
interest.  Thanks in advance.

Mark Davies

====================================================
Mark Davies, Associate Professor, Spanish Linguistics
4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
309-438-7975 (voice) / 309-438-8083 (fax)
      http://mdavies.for.ilstu.edu
** Historical and dialectal Spanish and Portuguese syntax **
** Corpus design and use / Web-database scripting /  Distance education **
=====================================================


