[Corpora-List] Get 1T: Web 1T processing software version 0.2.2

Toby Hawker toby at it.usyd.edu.au
Fri Oct 26 01:46:50 UTC 2007


Get 1T: Web 1T processing software version 0.2.2

Get 1T is a new software tool which allows many pre-computed queries,
including simple wildcard patterns, to be run over Web 1T, the
trillion-word n-gram corpus, in a single pass. It is released under the
GPL for POSIX systems.

The authors of Get 1T are pleased to announce our first major public
release: 0.2.2, available at http://get1t.sourceforge.net/

FEATURES

Version 0.2.2 of Get 1T allows users to construct an input file
containing n-gram queries to be run over the Web 1T corpus, and then
extracts the counts for those queries in a single run. Get 1T can run
many millions of queries in a single pass over the corpus on recent
desktop machines.

Queries are placed in a single input file and passed to the program via
a command line option. Query format is as follows:

"word1 word2 word3" will exactly match the 3-gram "word1 word2 word3"
(case-insensitive matching is the default and can be turned off via a
command-line option).

"word1 <*> word3" will match any 3-gram with "word1" as the first token
and "word3" as the last token; "<*>" is the wildcard and stands for any
single token.

Multiple wildcards in a single query, such as "word1 <*> <*>", are
supported.
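
The matching rules above can be illustrated with a small Python sketch.
This is not Get 1T's implementation or interface: the function names are
hypothetical, the corpus is represented as plain (n-gram, count) pairs,
and summing the counts of all n-grams that match a wildcard query is an
assumption about how results are reported. It also checks every query
against every n-gram, which would not scale to millions of queries; it
only shows the query semantics.

```python
WILDCARD = "<*>"  # wildcard token, as described in the query format


def parse_query(line, case_sensitive=False):
    """Split a query line into tokens, lower-casing unless case-sensitive."""
    tokens = line.split()
    if not case_sensitive:
        tokens = [t if t == WILDCARD else t.lower() for t in tokens]
    return tuple(tokens)


def matches(query, ngram, case_sensitive=False):
    """Return True if the n-gram has the query's length and every
    non-wildcard position holds the same token."""
    tokens = ngram.split()
    if len(tokens) != len(query):
        return False
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return all(q == WILDCARD or q == t for q, t in zip(query, tokens))


def count_matches(queries, ngram_counts, case_sensitive=False):
    """Make one pass over (ngram, count) pairs and total the counts
    for each query (assumption: wildcard matches are summed)."""
    parsed = [parse_query(q, case_sensitive) for q in queries]
    totals = {q: 0 for q in queries}
    for ngram, count in ngram_counts:
        for original, query in zip(queries, parsed):
            if matches(query, ngram, case_sensitive):
                totals[original] += count
    return totals
```

For example, given the made-up pairs ("the quick fox", 10) and
("The slow fox", 5), the query "the <*> fox" matches both n-grams under
the default case-insensitive setting, and only the first when case
sensitivity is switched on.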

AVAILABILITY

Get 1T is freely available, distributable and modifiable under the GNU
General Public Licence (GPL) version 2 or later.

SUPPORTED PLATFORMS

Get 1T is targeted at POSIX systems with gzip in the path. It has been
tested under Linux and Mac OS X.

Users will need to compile it from source (it has been tested with
recent versions of GCC).

FUTURE RELEASES

In forthcoming releases of the software we plan to add the following
features:
 - a second tool which can build a lossy (hash-based) compression of the
  corpus that fits in RAM on modern machines, giving *approximate*
  frequencies for a given n-gram via on-the-fly queries without massive
  infrastructure requirements
 - the capacity to use the corpus in uncompressed form, trading off disk
  space for speed
 - more powerful query specifications
 - co-occurrence counting, to find the frequency with which two tokens
  occur in the same n-gram
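
To give a feel for what a lossy, hash-based store of n-gram counts might
look like, here is a minimal Python sketch of a count-min sketch, one
well-known structure of this kind. The announcement does not say which
scheme the planned tool will use, so the technique, the class name and
the parameters below are all assumptions made for illustration.

```python
import hashlib
from array import array


class CountMinSketch:
    """Fixed-size table of counters; n-grams themselves are never stored,
    so memory use is independent of how many distinct n-grams are added."""

    def __init__(self, width=1 << 16, depth=4):
        self.width = width    # counters per row (hypothetical default)
        self.depth = depth    # number of independent hash rows
        self.rows = [array("Q", [0]) * width for _ in range(depth)]

    def _indexes(self, key):
        """Yield one (row, column) pair per row, using a per-row salt."""
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count):
        """Add an n-gram's count to one counter in every row."""
        for row, idx in self._indexes(key):
            self.rows[row][idx] += count

    def estimate(self, key):
        """Return the smallest counter for the key. Collisions can only
        inflate counters, so this never underestimates the true count."""
        return min(self.rows[row][idx] for row, idx in self._indexes(key))
```

Lookups are approximate because different n-grams can share counters;
taking the minimum over several rows limits the overestimate, and a
larger width trades more memory for better accuracy.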
