PlanetMike.com

Blog

Michael Clark's journal of important and not-so-important thoughts.

You are currently browsing the PlanetMike’s Technology Journal weblog archives for September, 2006.





Archive for September, 2006

SnapBot Appears to be a Broken, Bad Spider

Friday, September 29th, 2006 9:51 am

This appeared in one of my web site’s server logs:

38.98.19.116 - - [10/Sep/2006:02:33:46 -0400] "GET /2005/08/28/postname/feed:http://www.example.com/comments/feed/ HTTP/1.0" 404 12824 "-" "Snapbot/1.0"

Ugh! Tons of 404s, as this badly behaved spider appended the site’s comments feed URL to the end of each URL it requested. SnapBot is apparently related to Snap.com. I’ve emailed Snap.com asking for clarification of their spider’s intentions, and what their identifier is for my robots.txt file.
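To get a feel for how many of these 404s each spider generates, the log can be tallied by user agent. This is only a rough sketch: it assumes the log is in Apache's "combined" format with plain straight quotes, and the names `LOG_RE` and `count_404s_by_agent` are mine, made up for illustration. The two sample lines are taken from the logs quoted in this post.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_404s_by_agent(lines):
    """Return a Counter mapping user-agent string -> number of 404 responses."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        # Lines that do not fit the assumed format are silently skipped.
        if match and match.group("status") == "404":
            counts[match.group("agent")] += 1
    return counts

sample = [
    '38.98.19.116 - - [10/Sep/2006:02:33:46 -0400] "GET /2005/08/28/postname/feed:http://www.example.com/comments/feed/ HTTP/1.0" 404 12824 "-" "Snapbot/1.0"',
    '38.98.19.121 - - [07/Sep/2006:07:18:22 -0400] "GET / HTTP/1.0" 200 25168 "-" "Snapbot/1.0"',
]

for agent, hits in count_404s_by_agent(sample).most_common():
    print(hits, agent)
```

Run against a real log file, you would pass the open file object instead of `sample`.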

The other related issue is that they are using multiple servers to do their spidering. One IP address requests robots.txt, another requests a page, then yet another requests the content of the page (CSS and images). I feel like that is bad, but I’m pondering it. I’ll need to look to see how Google handles images. I think Google doesn’t do anything with images; it wants to read only the text content.

38.98.19.105 - - [07/Sep/2006:07:18:22 -0400] "GET /robots.txt HTTP/1.0" 200 - "-" "Snapbot/1.0"
38.98.19.121 - - [07/Sep/2006:07:18:22 -0400] "GET / HTTP/1.0" 200 25168 "-" "Snapbot/1.0"
ip.add.re.ss - - [07/Sep/2006:07:18:47 -0400] "GET / HTTP/1.1" 200 7399 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /images/filename1.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /images/photos/filename2.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/style.css HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickbgcolor.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickbg.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickheader.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:49 -0400] "GET /wp-content/themes/sitename/images/kubrickfooter.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"

It also seems very wrong to continue to spider and index the page when the server returns a 404 error. And note the multiple IP addresses.

38.98.19.80 - - [17/Sep/2006:00:28:51 -0400] "GET /2006/02/26/bad-file-name/ HTTP/1.1" 404 3552 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.68 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/style.css HTTP/1.1" 200 9814 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.69 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickbg.jpg HTTP/1.1" 200 875 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.67 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickbgcolor.jpg HTTP/1.1" 200 353 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.84 - - [17/Sep/2006:00:28:52 -0400] "GET /favicon.ico HTTP/1.1" 200 10134 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.85 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickheader.jpg HTTP/1.1" 200 29681 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.81 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickfooter.jpg HTTP/1.1" 200 3439 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
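One way to make the "multiple IP addresses" pattern easier to see is to group the requesting addresses by network. The sketch below is just an illustration: the helper name `hosts_by_subnet` is made up, and grouping on a /24 is my own assumption about how these machines are allocated, not something Snap.com has confirmed. The addresses are the seven from the log excerpt above.

```python
import ipaddress
from collections import defaultdict

def hosts_by_subnet(ips, prefix=24):
    """Group IPv4 address strings by their enclosing network (default /24)."""
    groups = defaultdict(set)
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(network)].add(ip)
    return dict(groups)

# The seven client addresses from the 17/Sep/2006 excerpt.
ips = [
    "38.98.19.80", "38.98.19.68", "38.98.19.69", "38.98.19.67",
    "38.98.19.84", "38.98.19.85", "38.98.19.81",
]

for network, hosts in hosts_by_subnet(ips).items():
    print(network, len(hosts), sorted(hosts))
```

If one page view is spread over many hosts in the same small network, all with the same user agent and within a second or two, it is a reasonable hint (not proof) that a single crawler is behind them.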

For more information on SnapBot, see “Snapbot and Snap.com - My Last Word and Verdict” and “SnapBot and the Linux Firefox Revelation.”


Spamhaus Lawsuit spam

Thursday, September 28th, 2006 12:43 am

I received some spam on Wednesday afternoon from someone really ticked at Spamhaus. It came to my abuse email address. It came from a throwaway Yahoo email address. Comments are supposed to go to yet another Yahoo email address. I reported the addresses to Yahoo. The last line of the message was “Committee to stop Spamhaus censorship and Blackmail.” If the CTSSCAB were so concerned about this, they’d include some real contact info. A domain name. A phone number. Typical spammer method of operation. Idiots. The message came from VS-516_VDS-102 ([64.46.36.172]).

Bitacle’s User-agent String

Wednesday, September 27th, 2006 5:22 pm

The Bitacle web thief was using this user-agent identifier up through 25/Sep/2006:04:05:00 -0400. After that point, they identified their RSS crawler as “Mozilla/5.0 (X11; U; Linux i686; en-EN; rv:1.8.0.4) Gecko/20060614 Fedora/1.5.0.4-1.2.fc5 Firefox/1.5.0.4 pango-text”. This is based on them using this IP address: 81.172.117.28.

Copyright Information

Wednesday, September 27th, 2006 5:10 pm

Just so everyone is clear, the blog entries and other content on PlanetMike.com are copyrighted by Michael Boyd Clark, and should not be posted to any other web site. The RSS feeds I provide are for the use of the readers of my blog, not to make it easier to steal my content and put it on your own web site.

Please look carefully at the web address in the URL field of your browser. It should read ‘http://www.planetmike.com’. In case you see a web address containing the word ‘bitacle’ or ‘bitacle.org’, you’re not looking at the original page on which this text was posted. If this is the case, the text you are reading right now might be incorrect or out of date. After I place a post on my weblog, I always try to keep published information up to date, or incorporate additional information, which I receive from readers. You will never find this information on bitacle.org.

Bitacle.org copies the content of weblogs without permission of the author, the holder of copyrights or the licensee. By visiting bitacle.org, you create income for the people who run bitacle.org, at the expense of me and other owners of a weblog, without permission and often without respecting copyrights and/or terms of use as in a license. So please, next time you want to view my posts, do so by using the web address of my weblog, which is ‘http://www.planetmike.com’. Please make a bookmark of my weblog’s address, if you would like to visit it again.

The Value of e-SocietyRobot?

Tuesday, September 26th, 2006 8:57 am

One of my web sites has been spidered by the e-SocietyRobot spider. Its web site is at http://www.yama.info.waseda.ac.jp/~yamana/es/, which is slightly more legible using Babelfish. e-SocietyRobot hit 4,549 pages (no MP3 files, luckily) but still used 51MB of traffic. It is not a search engine; it is some unknown research project attempting to spider the web, and they have no plans to make their indexed pages available. So should I try to block that robot? Of course I want the search engines to spider my sites. But I don’t want to help some anonymous “research” project. Maybe they are spammers. Maybe they are going to use my site to feed into some splogs.
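If I do decide to block it, the usual tool is a robots.txt rule, and Python's standard `urllib.robotparser` can check what a given rule set would allow before it goes live. A minimal sketch, with loud assumptions: I am guessing that the robot matches the token "e-SocietyRobot" in robots.txt (I have not confirmed its real identifier), the helper name `is_allowed` is made up, and of course robots.txt only stops spiders that choose to obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one robot, leave everyone else alone.
# "e-SocietyRobot" as the User-agent token is an assumption.
ROBOTS_TXT = """\
User-agent: e-SocietyRobot
Disallow: /

User-agent: *
Disallow:
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("e-SocietyRobot", "Googlebot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "http://www.example.com/"))
```

The empty `Disallow:` under `User-agent: *` means "nothing is off limits," so the search engines keep spidering as before.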

On a related issue, I wish that spiders would give an accurate referrer. Even if the referrer was another page in my own site, it would be useful to know where they are coming from. Does anyone know why they don’t?

Re: Verizon needs to stop jerking me around

Saturday, September 23rd, 2006 8:05 am

In “Verizon needs to stop jerking me around,” Derek pegs why junk mail sucks. And as a bonus, he pegs why Verizon sucks.

Just yesterday my office’s T1 line went down. I called the helpful UUNet number to discover Verizon owns UUNet, or MCI, or whoever the heck it is. I have more important things to do than keep up with the name of the company currently providing a service I have had for over 5 years now. It used to be that when the line went down (which was rare) I’d call the 800 #, a technician would answer the phone, he would ping my router, and if the ping didn’t work, tada, a ticket was opened.

Now how does Verizon handle the same problem, my T1 is down? The phone is answered by a “customer service representative.” She isn’t helpful. Friendly, but not helpful. She takes down my trouble description. And enters it into some computer. Then an engineer will look at the problem to see if it really is down. Then they will start working on it. I should expect a call back in about an hour. Yes, an hour. Totally unacceptable. It actually took about 90 minutes before I got a callback. Luckily it was fixed by that point. Tough, on Monday I start searching for a new broadband provider.

And yes, the first thing the 800 # says is to try to get help on their web site. It’s the same thing they said when I called in about my DSL being down a few months ago. I hate systems that are created by someone who doesn’t use the system.

Happy OneWebDay!

Friday, September 22nd, 2006 5:50 am

How has the web impacted your life? It sure has had an impact on mine. I’m a full time webmaster, a career that didn’t even exist 15 years ago. I have tens of thousands of people visit PlanetMike.com every month, something that surely wouldn’t have happened if I were printing a dead tree magazine of some sort. My wife and I review Washington DC area theatre, and have thousands of people visit the site. We got started for under $100. A dead tree magazine reviewing local theatre? Mailing out the same info? Updated once per month, with a lead time of two to three weeks? Not very relevant or likely to be useful.

So, Happy OneWebDay to everyone!

Update on Verizon’s Supplier Surcharge

Sunday, September 17th, 2006 8:15 pm

In “Verizon BS: Supplier Surcharge” I talked about Verizon’s increased prices for their DSL offerings. They’ve apparently decided to drop the new fee. Good. Prices should be going down, not up. The USA is lagging behind the rest of the developed world in bandwidth availability.

Dear Valued Verizon Online Customer,

Effective immediately, Verizon Online is dropping its previously announced plans to impose a DSL Supplier Surcharge. We are eliminating this surcharge in response to customer concerns. The supplier surcharge has not been included in your bill.

Thank you for choosing high speed Verizon Online DSL. We appreciate and value your business.

Sincerely,

Verizon Online
Broadband Customer Care Team

Copyright © 1997-2008 Michael Boyd Clark
PlanetMike’s Technology Journal is proudly powered by WordPress