Fighting Bots Via Their Bad Requests

7 June 2009, 9:45 am

Last week I started looking through my “page not found” (404) errors. It’s been interesting, to say the least: nearly 800 bad requests since then, most of them attacks of various kinds or attempts to mislead people. I handle most of these attacks or probes by re-routing the request to a robot block system based on Daniel Webb’s Bot-trap – A Bad Web-Robot Blocker. The code blocks below show a sample of the commands I have added to my site’s .htaccess file. Warning: Messing with your .htaccess can break your web site. Be Careful!

Scanning for Remote File Include Vulnerabilities. There are many bugs in many different programs out there on the web, so when someone scans my server for one of them, they aren’t up to any good. Commonly requested files include errors.php, contact.php, and advanced1.php. There are (were) compromised servers at:
- http://www.eyepro.net//assets/images/id1.txt
- http://www.eyepro.net//assets/images/master-id.txt
- http://www.graal-plus.zp.ua//images/roxx.jpg
- http://grupowh.com/sqli/fx29id.txt
- http://www.ecobook.or.kr/ecobook/data/ecobook/1132289642/copyright.txt
- http://www.centrsoft.ru/logo.jpg
- http://www.cookieez.com/image.jpg
- http://largeface.com/gnuboard4/style/sid.txt
- http://harvestusa.org///administrator//includes/id1.txt
Someone at these IP addresses was scanning:
- 80.67.20.178
- 62.193.227.12
- 212.193.241.25
- 70.245.218.25
- 203.185.28.194
- 200.30.136.59
- 74.55.117.34
- 74.36.117.160
- 94.23.200.54
- 211.189.18.73
These probes are now being dealt with in real time.
See Fx29ID cmd for a little more information.

RewriteRule errors\.php /robot-trap/ [L]
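
The rule above only catches one file name. Remote file include probes usually carry the address of the remote payload in a query-string parameter, so a condition on the query string can catch them whichever script is requested. What follows is a minimal sketch, assuming the probes follow that usual pattern, that mod_rewrite is enabled, and that /robot-trap/ is the trap location. The host example.com in the comment is a placeholder, not a host from the lists above.

```apache
# Sketch only: assumes mod_rewrite is available and /robot-trap/ is the trap.
RewriteEngine On

# Match any query-string parameter whose value starts with a remote URL,
# e.g. ?page=http://example.com/id1.txt (example.com is a placeholder).
RewriteCond %{QUERY_STRING} (^|&)[^=&]*=(https?|ftp):// [NC]

# Do not rewrite requests that are already inside the trap.
RewriteCond %{REQUEST_URI} !^/robot-trap/

# Hand the request to the trap; the original query string is passed along.
RewriteRule ^ /robot-trap/ [L]
```

Two caveats apply. Legitimate links that pass a full URL as a parameter (a redirect or share link, for example) would also land in the trap, and a URL-encoded payload (http%3A%2F%2F) slips past because %{QUERY_STRING} is not decoded before matching.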
Requests for my favicon.ico in a directory other than the root. The only thing the requests have in common is that they all come from GoogleToolbar 6.1; Windows 6.0; MSIE 7.0. This is a silly request. Is there ever a reason for a favicon to be anywhere other than the root of the site? I’ve now dealt with these as well.
RewriteRule .+/favicon\.ico$ /favicon.ico [L]
There were a very few requests for a malformed URL that was actually part of my server’s file system. That’s been fixed.
RewriteRule var/www/html / [L]
I discovered a Java-based browser (probably a spam bot seeking email addresses) that kept tripping one URL that wasn’t being rewritten correctly. So I fixed my rewrite rule.
- 213.93.203.217, Java/1.6.0_04
- 24.132.227.22, Java/1.6.0_13
- 77.211.115.58, Java/1.6.0_04
- 82.192.63.216, Java/1.6.0_13
- 84.124.194.76, Java/1.6.0_13
Now to decide how to deal with these “web browsers.” Based on the traffic activity (one request every 1-2 seconds, no images or style sheets requested at all), should I even allow these browsers to access my web site? I’ve decided not to, and now block access to my web site by Java user agents. See How To Block Java User-Agents for someone else’s similar approach to the Java problem.

RewriteCond %{HTTP_USER_AGENT} Java
RewriteRule ^(.*)$ - [F]
For some bizarre reason, the MSNBot (shouldn’t that be the BingBot now?) is sending requests that include an anchor. Its requests show up as “GET /url-stuff-in-here/#respond HTTP/1.1”. I can’t see how the hash symbol (pound symbol) is being logged at all, and I can’t reproduce the problem. Helpfully, MSN does include a help URL in its user agent string: “msnbot/2.0b (+http://search.msn.com/msnbot.htm)”
Just try visiting that msn.com URL. You get bounced over to a site at live.com, which is now Microsoft’s Bing search engine. Using Firefox 3, OmniWeb, or Opera, after “signing in” I kept getting into a loop where it would ask me to join the community, and I couldn’t get past that point. I finally fired up VirtualBox and used IE8 under Windows 7. Sigh. And Microsoft wonders why people loathe them so. (Remember Bing stands for But It’s Not Good.)
Speaking of Microsoft, some poor souls are still using IE 6, with the Discussion bar turned on. Discussion bar requests (/_vti_bin/ or /MSOffice/) are now being redirected to the IE6 page at BrowserUpgrade.info.
RewriteRule (_vti_bin|MSOffice) http://www.browserupgrade.info/ie6/ [R=301,L]
A few web browsers were badly broken, and were requesting things like http://www.planetmike.comhttp://www.planetmike.com/2006/08/. I fixed this.
RewriteRule www\.planetmike\.com(.*) $1 [R=301,L]
A few bots were attempting to poison my referral logs, or to otherwise do Bad Things, by repeatedly requesting files on my site with a space in the file name. And to make things look legitimate, they set the referrer to be a search at Google.
RewriteCond %{HTTP_REFERER} google\.com
RewriteCond %{REQUEST_URI} .*\ .*
RewriteRule ^.*$ /robot-trap/space.php [L]
A few requests for labels.rdf. This file is part of a standard for labelling how family-friendly your web site is. See the Family Online Safety Institute Labelling Page for information. This is the next-generation site labelling method after the PICS method, which was used in the ’90s. For now I’ll let these requests return a 404.
Bad BrowserAgents. No browser agent at all? Denied! An agent of “anonymous” is not allowed either. Especially when your IP address is registered to Korea or Brazil.
BrowserMatchNoCase "^$" spambot=1
BrowserMatchNoCase "^anonymous$" spambot=1
Order deny,allow
Deny from env=spambot
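
The Order and Deny directives above are Apache 2.2 access-control syntax. On Apache 2.4 they only work through the mod_access_compat module; the native equivalent uses Require. Here is a minimal sketch of the same block in 2.4 style, assuming mod_setenvif and mod_authz_core are loaded and that the server allows FileInfo and AuthConfig overrides in .htaccess:

```apache
# Flag requests with an empty User-Agent, or one that is exactly "anonymous".
BrowserMatchNoCase "^$" spambot=1
BrowserMatchNoCase "^anonymous$" spambot=1

# Allow everyone except requests carrying the spambot flag.
<RequireAll>
    Require all granted
    Require not env spambot
</RequireAll>
```

The RequireAll wrapper is needed because a negated Require cannot grant access on its own; it can only veto a request that another Require line has allowed.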
And there have been a few links on my own site that were misspelled, or had other stupid mistakes. Those have been fixed.

This is a start. I’ll post updates as more unique cases appear.

Filed under Site-details, Technology, Web-design


2 Comments

John Duckworth says:

December 8, 2011 at 7:36 am

Very well written article most informative.

Using 301 vs. RewriteRule « PlanetMike's Technology Journal says:

November 15, 2012 at 8:57 am

[…] few weeks ago, I wrote out a series of steps aimed at Fighting Bots Via Their Bad Requests. After watching my logs since then, I’ve noticed I made an incredibly stupid mistake. Bad […]
