[Corpora-List] Get 1T: Web 1T processing software version 0.2.2
Toby Hawker
toby at it.usyd.edu.au
Fri Oct 26 01:46:50 UTC 2007
Get 1T is a new software tool that runs many queries prepared in advance,
including simple wildcard patterns, over Web 1T, the trillion-word
n-gram corpus, in a single pass. It is released under the GPL for
POSIX systems.
The authors of Get 1T are pleased to announce our first major public
release: 0.2.2, available at http://get1t.sourceforge.net/
FEATURES
With version 0.2.2 of Get 1T, users construct an input file of n-gram
queries, and the counts for all of those queries are then extracted in a
single run over the Web 1T corpus. Get 1T can run many millions of
queries in a single pass over the corpus on recent desktop machines.
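To illustrate why a single pass is enough for exact queries, here is a
minimal Python sketch. It is not Get 1T's implementation. It assumes the
corpus files are gzip-compressed text with one "n-gram<TAB>count" entry
per line; the function name exact_counts and the file paths are
hypothetical.

```python
import gzip


def exact_counts(queries, gz_paths):
    """Look up exact n-gram queries in one streaming pass over gzip files.

    Assumption: each line has the form "w1 w2 ... wn<TAB>count".
    """
    # All queries are held in a dict, so each corpus line costs one lookup
    # no matter how many queries were supplied.
    wanted = {query: 0 for query in queries}
    for path in gz_paths:
        # gzip.open in text mode decompresses lazily, line by line.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                ngram, sep, count = line.rstrip("\n").rpartition("\t")
                if sep and ngram in wanted:
                    wanted[ngram] += int(count)
    return wanted
```

Because the queries sit in memory and the corpus is only streamed, memory
use grows with the number of queries rather than with the size of the
corpus.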
Queries are placed in a single input file, which is passed to the program
via a command-line option. The query format is as follows:
- "word1 word2 word3" matches exactly the 3-gram "word1 word2 word3"
(case-insensitive matching is the default and can be turned off via a
command-line option)
- "word1 <*> word3" matches any 3-gram with "word1" as the first token
and "word3" as the last token, "<*>" being the wildcard token
- multiple wildcards in a single query, such as "word1 <*> <*>", are
supported
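The query format above can be sketched in Python as follows. This is an
illustration of the matching rules only, not Get 1T's code. It assumes
corpus lines of the form "n-gram<TAB>count", and it assumes that the
counts of all n-grams matching a wildcard query are summed (the
announcement does not say how matches are reported). The names
parse_query, masked_variants and count_queries are hypothetical.

```python
from itertools import product

WILDCARD = "<*>"


def parse_query(text, case_sensitive=False):
    """Split a query or n-gram into tokens, lowercasing by default."""
    tokens = text.split()
    if not case_sensitive:
        tokens = [tok.lower() for tok in tokens]
    return tuple(tokens)


def masked_variants(tokens):
    """Return every pattern that matches this n-gram.

    Each position is either kept or replaced by the wildcard, giving at
    most 2**n patterns (32 for a 5-gram).
    """
    variants = set()
    for mask in product((False, True), repeat=len(tokens)):
        variants.add(
            tuple(WILDCARD if hide else tok for tok, hide in zip(tokens, mask))
        )
    return variants


def count_queries(queries, ngram_lines, case_sensitive=False):
    """Sum corpus counts per query (summing is an assumption of this sketch)."""
    totals = {parse_query(q, case_sensitive): 0 for q in queries}
    for line in ngram_lines:
        ngram, _, count = line.rstrip("\n").rpartition("\t")
        tokens = parse_query(ngram, case_sensitive)
        for variant in masked_variants(tokens):
            if variant in totals:
                totals[variant] += int(count)
    # Keys are the normalised query strings.
    return {" ".join(query): total for query, total in totals.items()}
```

Enumerating the wildcard patterns of each n-gram, rather than testing
every query against every n-gram, keeps the work per corpus line constant
even with millions of queries loaded.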
AVAILABILITY
Get 1T is freely available, distributable and modifiable under the GNU
General Public Licence (GPL) version 2 or later.
SUPPORTED PLATFORMS
Get 1T is targeted at POSIX systems with gzip in the path. It has been
tested under Linux and Mac OS X.
Users will need to compile it from source (it has been tested with
recent versions of GCC).
FUTURE RELEASES
In forthcoming releases of the software we plan to add the following
features:
- a second tool which builds a lossy (hash-based) compression of the
corpus that fits in RAM on modern machines, giving *approximate*
frequencies for a given n-gram through on-the-fly queries, without
massive infrastructure requirements
- the capacity to use the corpus in uncompressed form, trading off disk
space for speed
- more powerful query specifications
- co-occurrence counting, to find the frequency with which two tokens
occur in the same n-gram
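As a rough illustration of the first planned feature, the sketch below
shows one simple way a hash-based table can trade accuracy for memory.
The announcement does not describe the intended design, so this is a
stand-in technique: n-grams are hashed into a fixed number of slots and
colliding n-grams share a counter. The class name ApproxCounts and the
slot count are hypothetical.

```python
import hashlib


class ApproxCounts:
    """Fixed-size table of counts indexed by a hash of the n-gram.

    The n-gram strings themselves are not stored, so memory depends only
    on the number of slots. Collisions make estimates too high, never too
    low, and an unseen n-gram may report a non-zero count.
    """

    def __init__(self, num_slots):
        self.slots = [0] * num_slots

    def _slot(self, ngram):
        digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.slots)

    def add(self, ngram, count):
        self.slots[self._slot(ngram)] += count

    def estimate(self, ngram):
        return self.slots[self._slot(ngram)]


table = ApproxCounts(num_slots=1024)
table.add("the cat sat", 10)
table.add("the dog sat", 5)
approx = table.estimate("the cat sat")  # at least 10; more on a collision
```

A larger table lowers the chance of collisions at the cost of more RAM,
which is the trade-off the word "lossy" refers to.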