[Corpora-List] Yet another Web1T tool

Deniz Yuret dyuret at ku.edu.tr
Mon Nov 26 16:55:02 UTC 2007


Hi,

I finally cleaned up one of the tools I used to access Web1T data for
the last SemEval competition.  You can download it at:

http://www.denizyuret.com/src/glookup

It is similar to some of the other tools out there: you give it a list
of ngram patterns containing arbitrary wildcards, and it outputs their
counts.  Some extra features:

1. It outputs up to three different counts for patterns with wildcards:
the total count, the number of unique matching ngrams, and the
number of unique right words (necessary for Kneser-Ney discounting).
2. Optionally it will also output the ngrams that matched the pattern
and their counts.  This is useful if you are trying to figure out the
most likely words that go in a given context.
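
The two features above can be illustrated with a minimal Python sketch.
This is not glookup's code: the function names, the `_` wildcard token,
and the toy counts are all assumptions made for illustration, and
"unique right words" is taken here to mean the distinct final tokens of
the matching ngrams.

```python
from collections import namedtuple

WILDCARD = "_"  # hypothetical wildcard token, not necessarily glookup's syntax

PatternCounts = namedtuple(
    "PatternCounts", "total unique_ngrams unique_right_words"
)

def matches(pattern, ngram):
    """True if ngram has the pattern's length and agrees on every non-wildcard token."""
    return len(pattern) == len(ngram) and all(
        p == WILDCARD or p == w for p, w in zip(pattern, ngram)
    )

def matching_ngrams(pattern, ngram_counts):
    """Return (ngram, count) pairs matching the pattern, most frequent first."""
    pat = tuple(pattern.split())
    hits = [(ng, c) for ng, c in ngram_counts.items() if matches(pat, ng)]
    return sorted(hits, key=lambda item: item[1], reverse=True)

def pattern_counts(pattern, ngram_counts):
    """Compute the three counts described in item 1 for a single pattern."""
    hits = matching_ngrams(pattern, ngram_counts)
    total = sum(count for _, count in hits)
    right_words = {ngram[-1] for ngram, _ in hits}
    return PatternCounts(total, len(hits), len(right_words))

# Toy data standing in for the real ngram files (made-up counts).
toy = {
    ("the", "cat", "sat"): 50,
    ("the", "dog", "sat"): 30,
    ("the", "cat", "ran"): 20,
    ("a", "cat", "sat"): 10,
}

print(pattern_counts("the _ _", toy))
print(matching_ngrams("the _ sat", toy))
```

On the toy data, the pattern `the _ _` matches three ngrams with a total
count of 100 and two distinct right words (`sat` and `ran`).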

It usually takes a couple of hours to make one pass through the whole
compressed Google ngram database and output the counts.  I also have
on-line access tools that use binary search and search-engine
style indexing (allowing you to use wildcards).  I can try cleaning
those up too if there is any interest.
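
As a rough idea of what binary-search lookup over sorted ngram data can
look like, here is a small in-memory Python sketch.  It is not the
on-line access software mentioned above: the `lookup` function and the
"ngram<TAB>count" line format are assumptions, and a real tool would
presumably search the files on disk rather than a list in memory.

```python
import bisect

def lookup(sorted_lines, ngram):
    """Return the count of an exact ngram, or 0 if it is absent.

    sorted_lines must be sorted in Python string order, with each
    entry formatted as "ngram<TAB>count" (an assumed format).
    """
    key = ngram + "\t"
    i = bisect.bisect_left(sorted_lines, key)
    # Any line starting with key sorts at or after key, so the first
    # such line, if it exists, is exactly at position i.
    if i < len(sorted_lines) and sorted_lines[i].startswith(key):
        return int(sorted_lines[i].split("\t")[1])
    return 0

# Made-up entries, sorted once up front.
lines = sorted(["the cat\t50", "the cat sat\t7", "the dog\t30"])

print(lookup(lines, "the cat"))
print(lookup(lines, "the cow"))
```

Each lookup takes a logarithmic number of comparisons, which is what
makes interactive access practical compared with a full pass over the
data.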

best,
deniz

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


