[Corpora-List] Limiting queries to online database

Mark Davies Mark_Davies at byu.edu
Mon May 14 17:46:56 UTC 2012


Bill,

>> Last week I found over 8 million queries for co-occurrences of apparently random pairs of word forms (e.g. 'entertainer mussel') coming from several IP addresses in Beijing.

I had a similar showdown 2-3 days ago as well. Looks like someone in China is looking for some n-grams data, but isn't willing to pay for it. Why not just purchase the Google n-grams data, or something like the data from http://www.ngrams.info/?

>> 1. What constitutes a reasonable number of queries per day to tolerate from a single robot user, after which access would be denied or limited?  

For my corpora (http://corpus.byu.edu) I have different limits, depending on whether the user is, for example, a professor or grad student in languages/linguistics, a teacher, or a "non-researcher" (which is what these bots would be). The limits range from about 100 to 1000 queries per day.

>> 2. How can I implement such access restrictions?  I am using the Nginx server, MySQL / Sphinx and PHP on a Debian Linux platform.  I know how to block an IP address completely, but have no good strategy for simply limiting such traffic.

For my corpora, once an IP address (or a machine) has done about 20 queries, the user has to "register" (takes 20-30 seconds). After that, all queries are logged in a database via the user's email address and password. Once a user reaches the daily limit, access is turned off. It's not a very fun solution for some users (although 500-1000 queries a day is enough for most), but it does stop the worst abuse from the types of bots that you have mentioned.
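
As a rough illustration only (this is not the code behind corpus.byu.edu), the scheme described above might be sketched in Python along the following lines. The class name QueryGate, the per-tier figures, and the in-memory dictionaries are all assumptions made for the example; on a PHP/MySQL site the counts would more likely live in a database table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical tiers: the figures are illustrative points within the
# "about 100-1000 queries per day" range, not the actual limits in use.
DAILY_LIMITS = {"researcher": 1000, "teacher": 500, "non-researcher": 100}

# Queries allowed per IP address before registration is required (assumed value).
ANONYMOUS_ALLOWANCE = 20


class QueryGate:
    """In-memory sketch of a register-then-daily-quota policy."""

    def __init__(self):
        self.anonymous_counts = defaultdict(int)  # ip -> queries so far
        self.daily_counts = defaultdict(int)      # (email, day) -> queries that day

    def check(self, ip, email=None, tier="non-researcher", today=None):
        """Return 'ok', 'register', or 'limit_reached'.

        The query is counted only when the result is 'ok'.
        """
        today = today or date.today()

        # Unregistered visitors get a small allowance keyed on IP address.
        if email is None:
            if self.anonymous_counts[ip] >= ANONYMOUS_ALLOWANCE:
                return "register"
            self.anonymous_counts[ip] += 1
            return "ok"

        # Registered users are counted per account and per calendar day.
        key = (email, today)
        if self.daily_counts[key] >= DAILY_LIMITS[tier]:
            return "limit_reached"
        self.daily_counts[key] += 1
        return "ok"
```

Counting by account rather than by IP address is what makes the daily limit hold even when a bot spreads its requests over several addresses, as long as registration itself is not automated.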

Best,

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of William H Fletcher [fletcher at usna.edu]
Sent: Monday, May 14, 2012 11:26 AM
To: corpora at uib.no
Subject: [Corpora-List] Limiting queries to online database

Hello,

My site http://phrasesinenglish.org/ provides query interfaces to databases
derived from the BNC. Over the past week, performance has suffered from
extraordinarily high query traffic from a handful of IP addresses.  I am
seeking advice on what is a realistic limit to queries from one user and
how to limit traffic from a single IP efficiently.

In the past my policy was to place no limits on number of queries or size of
datasets returned on the assumption that this generous approach facilitates
research.  Occasionally I have hit a server with thousands of queries, but
at a maximum pace of 1-2 per second. Most users on the site submit at most a
few dozen queries per day.  On rare occasions I have seen short bursts of
say 20,000-60,000 queries from a single IP address.

Last week I found over 8 million queries for co-occurrences of
apparently random pairs of word forms (e.g. 'entertainer mussel') coming
from several IP addresses in Beijing.  Now, over the last day or so there
have been almost 6 million queries from one IP address in Seoul (110-120 per
second).  It's a valuable stress-test for my server, but I fear the
degradation of response times will drive away regular users.

1. What constitutes a reasonable number of queries per day to tolerate from
a single robot user, after which access would be denied or limited?

2. How can I implement such access restrictions?  I am using the Nginx
server, MySQL / Sphinx and PHP on a Debian Linux platform.  I know how to
block an IP address completely, but have no good strategy for simply
limiting such traffic.

Many thanks in advance for any feedback you can give.

Regards,
Bill Fletcher


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


