[Corpora-List] token clustering tool - summary
Murk Wuite
Murk at polderland.nl
Wed Jun 16 14:58:39 UTC 2004
Dear all,
A few weeks ago I posted the following message to the corpora list:
"Does anyone know of a tool (or algorithm), preferably available freely for research purposes, that takes as its input a corpus only and produces as its output clusters of tokens that occur close to each other relatively often?"
Quite a few people responded (thanks!). This message serves as a summary of my findings.
I had deliberately posed a somewhat vague question, as I believed this would maximize the number of responses. Predictably, 'precision' suffered: some respondents referred me to tools that perform a different task from the one I had in mind. I will leave these tools out of the following discussion. So, what exactly did I want the tool to do?
I've written a tool that analyzes novel Dutch compounds (it outputs their constituent structure). I'm using it as part of a spell checker: if a token in a text cannot be matched to an item in the spell checker's lexicon, it might be a novel compound (unlike in English, the constituents of Dutch compounds are not separated by spaces). However, a spelling error will sometimes result in something that looks like a novel compound, and I have been trying to build a tool that recognizes when a novel compound is really a misspelling. The likelihood that a token which can be analyzed as a novel compound was indeed intended as such (rather than being a misspelling) would be boosted if it could be ascertained that the text surrounding the novel compound contains tokens that tend to co-occur with one or more of its constituents.
This is where the 'token clustering tool' could be of use: it would be run on a (tokenized, tagged and lemmatized) corpus and would (ideally) build clusters of tokens such as [army_N,war_N,fight_V], [football_N,score_V,referee_N] etc. In other words, the token clustering tool would examine a corpus and make clusters of tokens that are used 'near each other relatively often'.
A crucial factor in building clusters would be the size of the window in which the tool looks for co-occurring tokens. The newspaper corpus I had available contained markers indicating article boundaries. Ideally, I would be able to set window size to 'one article'.
Preferably, it would also be possible for clusters to share particular tokens, so "bank" could be a member of both a cluster like [money_N,financial_ADJ,bank_N] and [river_N,water_N,bank_N].
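To make the idea of an article-sized window concrete, here is a minimal sketch in Python of what such pair counting might look like. This is only an illustration, not the tool I ended up using; the one-token-per-line input format and the "<article>" boundary marker are assumptions made for the example.

    from collections import Counter
    from itertools import combinations

    def count_pairs_per_article(lines, boundary="<article>"):
        """Count how often two lemmatized tokens occur in the same article."""
        pair_counts = Counter()
        article = []
        for line in lines:
            token = line.strip()
            if token == boundary:              # article boundary: close the window
                for a, b in combinations(sorted(set(article)), 2):
                    pair_counts[(a, b)] += 1
                article = []
            elif token:
                article.append(token)
        for a, b in combinations(sorted(set(article)), 2):   # flush the last article
            pair_counts[(a, b)] += 1
        return pair_counts

    # Toy input: two short 'articles', one token per line
    toy = ["army_N", "war_N", "fight_V", "<article>",
           "football_N", "referee_N", "score_V", "<article>"]
    print(count_pairs_per_article(toy).most_common(3))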
Eventually, I found a suitable (and free) tool, or rather a way of combining elements of multiple tools to suit my needs. It was suggested to me by Amruta D. Purandare and Ted Pedersen of the University of Minnesota, who gave the following instructions:
"1. The N-gram Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html) creates the list of word pairs that co-occur in some window from each other and their association scores. Run programs count.pl, combig.pl and statistics.pl in order. The output of statistics will be the list of word pairs that co-occur in some window and their association scores as computed by tests like log-likelihood, mutual information, chi-squared test etc.
2. Give the output of step 1 to wordvec.pl in SenseClusters Package (http://senseclusters.sourceforge.net/). This program will create a word-by-word association matrix that shows the co-occurrence vector of each word.
3. Cluster these word vectors with (give the output of step 2 to) vcluster program in CLUTO http://www-users.cs.umn.edu/~karypis/cluto/ to get clusters of words."
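For readers unfamiliar with these packages, the following Python sketch shows roughly what steps 2 and 3 amount to. It is not the actual SenseClusters or CLUTO code; it substitutes scikit-learn's KMeans for CLUTO's vcluster purely as a stand-in, and the pair scores are invented toy numbers. Note that this kind of hard clustering assigns every word to exactly one cluster, which is the limitation I describe next.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy association scores for word pairs, roughly as step 1 might produce them
    pair_scores = {
        ("army_N", "war_N"): 8.2, ("army_N", "fight_V"): 6.1,
        ("war_N", "fight_V"): 7.4, ("football_N", "score_V"): 9.0,
        ("football_N", "referee_N"): 5.3, ("score_V", "referee_N"): 4.8,
    }

    # Step 2, conceptually: a word-by-word matrix whose rows are co-occurrence vectors
    words = sorted({w for pair in pair_scores for w in pair})
    index = {w: i for i, w in enumerate(words)}
    matrix = np.zeros((len(words), len(words)))
    for (a, b), score in pair_scores.items():
        matrix[index[a], index[b]] = score
        matrix[index[b], index[a]] = score

    # Step 3, conceptually: cluster the row vectors; every word ends up
    # in exactly one cluster
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
    for cluster_id in sorted(set(labels)):
        print(cluster_id, [w for w, l in zip(words, labels) if l == cluster_id])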
This does indeed produce clusters of words, but a word will never be shared between different clusters. I solved this problem by abandoning steps 2 and 3. This way no actual clustering is performed, but for every word pair in the corpus that co-occurs within some window, an association score is calculated. The test that seemed to me to yield the most sensible association scores was the t-score, but I have not evaluated this formally. Using these association scores, I could compute a composite association score for every compound, expressing the strength of the association between the compound's constituents and the tokens surrounding the compound. Classifying compounds by their 'intendedness' improved once these composite association scores were available, though not as much as I had hoped.
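By way of illustration, the sketch below shows one common form of the t-score for a word pair (computed from observed and expected co-occurrence counts) and one plausible way of combining pair scores into a composite score for a compound. Averaging over constituent/context pairs is a simplification made for this example, not necessarily the exact formula I used, and the counts are invented.

    import math

    def t_score(n11, n1p, np1, npp):
        """t-score for a word pair from its contingency counts: n11 = joint
        co-occurrence count, n1p and np1 = marginal counts of the two words,
        npp = total number of co-occurrence pairs in the sample."""
        expected = (n1p * np1) / npp
        return (n11 - expected) / math.sqrt(n11) if n11 > 0 else 0.0

    def composite_score(constituents, context_tokens, pair_scores):
        """Average the pairwise association scores between a compound's
        constituents and the tokens surrounding it (a simplified stand-in)."""
        scores = [pair_scores.get((c, t), pair_scores.get((t, c), 0.0))
                  for c in constituents for t in context_tokens]
        return sum(scores) / len(scores) if scores else 0.0

    # Toy example: a novel compound analyzed as football_N + referee_N,
    # occurring in a context that mentions score_V and water_N
    pair_scores = {("football_N", "score_V"): 9.0, ("referee_N", "score_V"): 4.8}
    print(t_score(n11=30, n1p=120, np1=80, npp=10000))
    print(composite_score(["football_N", "referee_N"], ["score_V", "water_N"], pair_scores))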
Some of the other tools suggested to me were also capable of the kind of clustering I needed, but only with the method described in the three steps above was I able to cluster a corpus of considerable size. My corpus consisted of articles from Dutch newspapers, and when I performed step 1 with the 'window size' parameter set to equal the article size, the largest subset of the corpus I was able to run the programs on consisted of 38,095 lemmatized nouns, verbs and adjectives from 200 articles. The association scores thus computed did not make a lot of sense, so I increased the size of the corpus and decreased the window size to 5. This way I had just enough computer memory available to run the programs on 1,447,987 lemmatized nouns, verbs and adjectives from 7,500 articles. This produced the association scores I used in my research.
Before I succeeded in clustering using the three steps above, I tried out the programs available through http://www.lboro.ac.uk/research/mmethods/research/software/stats.html, suggested to me by Maarten Jansonius of the University of Louvain-la-Neuve (Belgium). None of them were usable for the task at hand. Maarten also pointed out WordSmith Tools (commercial software); version 4, while still in beta, can be used freely for about a month, but I haven't tried it out.
Other tools that were suggested to me and might be useful, but which I did not try out, are the following:
- Steven Bird of the University of Melbourne noted that it would be easy to write the kind of program I needed using NLTK (the Natural Language Toolkit, http://nltk.sourceforge.net/).
- Eric Atwell of Leeds University suggested the WEKA machine-learning toolkit, downloadable free from http://www.cs.waikato.ac.nz/ml/weka/. However, he points out that "WEKA seems to have problems with large corpus datasets".
- Normand Peladeau of Provalis Research pointed out some commercial software (www.simstat.com/wordstat.htm).
Once more I would like to thank everyone who responded to my original post. If you have any questions or suggestions, please do not hesitate to mail me.
Best wishes,
Murk Wuite
MA student at the Department of Language and Speech, Katholieke Universiteit Nijmegen, The Netherlands