[Corpora-List] Web service for Web1T?

lec3jrw at leeds.ac.uk lec3jrw at leeds.ac.uk
Thu Oct 25 20:45:59 UTC 2007


Hello,

Back-of-an-envelope calculations suggest to me that, with judicious use of
hashes, you ought to be able to squash the entire database down to less than
30GB; but you'd only realistically be able to query it on the frequency of a
given n-gram... which, I think, was the task originally suggested. If you wanted to
ask "which n-grams have frequency x", or "how many n-grams contain the word y
at position z" the indices get bigger than that very quickly. But again,
judicious use of hashes might go a long way. There might be some mileage to be
had by using a tree structure whereby each 4-gram, for example, is stored as a
reference to two 2-grams; that might achieve some compromise on index size
versus query time.
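
A rough sketch of the two ideas above, in Python, might look like this. It is
only an illustration under stated assumptions: the class and function names
(ngram_key, HashedCounts, BigramPairIndex) are made up for the example, the
8-byte BLAKE2b key is an arbitrary choice, and in-memory dicts stand in for
whatever on-disk structure a real index would need.

```python
import hashlib


def ngram_key(tokens, nbytes=8):
    """Return a fixed-width hash key for an n-gram (a tuple of tokens).

    Assumption: 8 bytes is wide enough that collisions are tolerable;
    a real index would have to budget for them explicitly.
    """
    text = " ".join(tokens).encode("utf-8")
    return hashlib.blake2b(text, digest_size=nbytes).digest()


class HashedCounts:
    """Frequency lookup only: hash(n-gram) -> count, no n-gram text stored."""

    def __init__(self):
        self._counts = {}

    def add(self, tokens, count):
        self._counts[ngram_key(tokens)] = count

    def frequency(self, tokens):
        # None plays the role of NULL for an unseen n-gram.
        return self._counts.get(ngram_key(tokens))


class BigramPairIndex:
    """Store each 4-gram as a pair of references to two 2-grams."""

    def __init__(self):
        self._bigram_ids = {}  # (w1, w2) -> small integer id
        self._fourgrams = {}   # (left_id, right_id) -> count

    def _intern(self, bigram):
        # Reuse the existing id, or hand out the next free one.
        return self._bigram_ids.setdefault(bigram, len(self._bigram_ids))

    def add(self, tokens, count):
        if len(tokens) != 4:
            raise ValueError("expected a 4-gram")
        left = self._intern(tuple(tokens[:2]))
        right = self._intern(tuple(tokens[2:]))
        self._fourgrams[(left, right)] = count

    def frequency(self, tokens):
        left = self._bigram_ids.get(tuple(tokens[:2]))
        right = self._bigram_ids.get(tuple(tokens[2:]))
        if left is None or right is None:
            return None
        return self._fourgrams.get((left, right))
```

The hashed table answers "what is the frequency of this n-gram" and nothing
else, since the n-gram text is not recoverable from the key. The pair index
shares each 2-gram between all the 4-grams that contain it, at the cost of two
extra lookups per query.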

As an aside - it also seems to me that throwing away those n-grams with fewer
than 40 occurrences might not be as useful, for many tasks, as throwing away
those whose frequencies are within a given threshold of what would be expected
by chance, i.e. given the known frequencies of their component (<n)-grams.
Smaller might in fact be better in a sense.
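
To make that concrete for the simplest case (2-grams and their component
1-grams), here is a small Python sketch. The independence model, the use of a
log ratio, and the default threshold are all assumptions chosen for
illustration, as are the function names; nothing here is taken from the Web1T
release itself.

```python
import math


def expected_bigram_count(count_w1, count_w2, total_tokens):
    """Count expected by chance if the two words occurred independently."""
    return count_w1 * count_w2 / total_tokens


def is_worth_keeping(observed, count_w1, count_w2, total_tokens,
                     min_log_ratio=1.0):
    """Keep a 2-gram only if its observed count differs enough from chance.

    Assumption: "differs enough" means |ln(observed / expected)| exceeds
    min_log_ratio; the default of 1.0 is arbitrary.
    """
    expected = expected_bigram_count(count_w1, count_w2, total_tokens)
    if observed <= 0 or expected <= 0:
        return False
    return abs(math.log(observed / expected)) > min_log_ratio
```

With a million tokens and component counts of 1000 and 2000, the expected
count is 2; a 2-gram seen twice would be dropped as unsurprising, while one
seen 200 times would be kept.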

Justin Washtell


Quoting idcl <idcl at idcl.co.uk> on Thu 25 Oct 2007 15:22:28 BST:

> Paul is quite right.
>
> I've loaded up the singletons and all the tuples with count > 40 into a SQL
> server DB.  I've trimmed out tuples with punctuation in and I have ignored
> case.  With all those constraints, I'm looking at a 14GB database.  Had I
> allowed case sensitivity and not constrained the count, I would guess the DB
> would be up around half a terabyte.  Which is quite large.
>
> Iain
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Deane, Paul
> Sent: 24 October 2007 14:35
> To: Adam Kilgarriff; corpora at uib.no
> Subject: Re: [Corpora-List] Web service for Web1T?
>
> I'm not aware of any thus far, though it will be great when someone does
> it.  As it happens, we're working on making use of the Google database
> at ETS for internal research purposes.  It takes a LOT of machine to
> index and serve up that data with any speed, and the scaleups to making
> it available as a database with fast turnarounds on queries are pretty
> sizeable.
>
> Paul Deane
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
> Of Adam Kilgarriff
> Sent: Wednesday, October 24, 2007 9:20 AM
> To: corpora at uib.no
> Subject: [Corpora-List] Web service for Web1T?
>
> Has anyone set up a web service for Google's Web1T database (eg, at the
> simplest, user inputs an n-gram of English and gets back its frequency or
> NULL)?
>
> Adam
>
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>




