The Baidu search engine has a voracious appetite for content and crawls one of my sites aggressively. It’s bad enough having to deal with the load generated by bots from large technology companies with vast resources, but it’s another thing entirely when those bots crawl from dozens of IP addresses simultaneously and routinely browse thousands of URLs disallowed in my robots.txt. After months of periodic alerts from my server about high resource usage, not to mention complaints from my users about the site being “slow”, I’ve finally had enough.
I used nginx’s map and limit_req modules to selectively apply a rate limit to Baidu’s requests, and now the server’s resource usage has improved. Blocking these requests entirely would have been simpler, but I didn’t feel comfortable completely stifling the flow of information, and this powerful, flexible technique could be useful again in the future.
No Respect for robots.txt
According to the Baiduspider help center, Baidu respects the robots exclusion protocol, and I even found a testing tool that confirms my robots.txt file is valid. Curiously, my site’s daily access log shows a client that identifies itself as “Baiduspider” making over 2,500 requests to URLs that are disallowed by my robots.txt:
# cat /var/log/nginx/access.log | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
2521
Not that I need a reason, but I designate these URLs as off limits to bots on purpose: they’re dynamic search pages with countless permutations of input parameters that generate equally countless pages of results. Crawling these pages adds nothing of value to Baidu’s search index and wastes resources on my server in the process. If it’s content they want, my robots.txt actually points to a nice, neat sitemap that conveniently lists all content pages in machine-readable XML. Sadly, my access logs show the Google, Bing, and Yandex bots each requesting this sitemap several times per day, while Baidu doesn’t request it even once.
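For reference, the relevant rules in a robots.txt like mine would look roughly like the following sketch; the paths match the ones seen in the access log above, but the sitemap URL is a placeholder:
User-agent: *
# dynamic search pages with endless permutations of parameters
Disallow: /browse
Disallow: /discover
Disallow: /search-filter

# machine-readable list of all content pages
Sitemap: https://mysite.org/sitemap.xml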
Baidu doesn’t respect my server resources or indexing preferences. This is not responsible harvesting!
Note: it might be possible to submit your sitemap to Baidu’s webmaster tools — if you can navigate their site in Chinese!
Mapping and Request Limiting
The nginx map module works similarly to an if–then–else construct, allowing you to set the value of one variable depending on the value of another variable. I will combine this with a clever use of the limit_req module to create a rate limit that only punishes Baidu.
Add the following code to the global http block of the nginx configuration (not in a server block):
map $http_user_agent $limit_bots {
    ~Baiduspider 'baidu';

    # requests with an empty key are not evaluated by limit_req
    # see: http://nginx.org/en/docs/http/ngx_http_limit_req_module.html
    default '';
}

limit_req_zone $limit_bots zone=badbots:1m rate=1r/m;
This mapping uses the nginx built-in variable $http_user_agent as the source and a new variable $limit_bots as the target. Depending on whether or not the request’s user agent matches the regular expression, $limit_bots will either be set to the value “baidu” or, by default, an empty string.
Next, a limit_req_zone called “badbots” is created with 1 MB of shared memory and a limit of one request per minute. The clever trick here is the use of $limit_bots as the zone’s key, because requests with an empty key are not evaluated and are thus not subject to rate limiting.
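If you want to confirm that the mapping behaves as expected before enforcing any limits, one option (purely illustrative, not part of my actual configuration) is to temporarily log the mapped variable with a custom log format in the http block:
# log the client address, user agent, and the value that the map produced
log_format botcheck '$remote_addr "$http_user_agent" limit_bots=$limit_bots';
access_log /var/log/nginx/botcheck.log botcheck;
Requests from Baiduspider should show limit_bots=baidu in this log, while everything else should show an empty value.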
The last step is to assign this zone to a location block in an nginx server block:
location / {
    # rate limit for poorly behaved bots, see limit_req_zone in the http block above
    limit_req zone=badbots;

    # Send requests to Tomcat
    proxy_pass http://tomcat_http;
}
After checking the configuration syntax with nginx -t and reloading the daemon with nginx -s reload, you will be ready to test the new mapping.
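On most systems that boils down to something like this (assuming your user can manage the nginx daemon, for example via sudo):
# test the configuration syntax and only reload the daemon if it passes
sudo nginx -t && sudo nginx -s reload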
Test the Configuration
My favorite tool for testing HTTP requests is the Python-based command-line utility httpie. I find it much easier to set and view request and response headers using httpie than with a web browser’s developer tools.
To test the new mapping and rate-limiting configuration, I send two requests to the server while mimicking part of the Baidu bot’s user agent:
$ http --print h https://mysite.org/ User-Agent:'Baiduspider'
HTTP/1.1 200 OK
$ http --print h https://mysite.org/ User-Agent:'Baiduspider'
HTTP/1.1 503 Service Temporarily Unavailable
Great success! The first request succeeds with an HTTP 200, and the second fails with an HTTP 503 because the zone only allows one request per minute.
You can adjust the configuration further depending on your specific use case, for example adding more patterns to match in the mapping configuration, changing the error response code, or allowing short bursts of requests with the limit_req directive’s burst parameter. See the references below for more information.
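As a rough sketch of those adjustments, something like the following could work; the extra Sogou pattern, the burst value of five, and the 429 status code are illustrative assumptions rather than something I actually run:
# in the global http block, as before:
map $http_user_agent $limit_bots {
    ~Baiduspider 'baidu';

    # hypothetical: throttle another aggressive crawler with its own key
    ~Sogou 'sogou';

    default '';
}

limit_req_zone $limit_bots zone=badbots:1m rate=1r/m;

# in the server block:
location / {
    # queue up to five excess requests before rejecting the rest
    limit_req zone=badbots burst=5;

    # respond with 429 Too Many Requests instead of the default 503
    limit_req_status 429;

    proxy_pass http://tomcat_http;
}
Because each matched bot gets its own key, each one is counted and limited separately against the same one-request-per-minute rate.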