
Web robot names considered, and rejected

Google's is "Googlebot"
Yahoo's is "Slurp"
Cuill's is "Twiceler"

It makes sense to have a friendly robot user agent, so nervous webmasters won't ban it. You don't want to call your crawler 'sitejacker' or something. Unfortunately, my favorite candidates were:

Crawlhammer
Webraker
Lurchy
Client9

hmmm. :-|

"Oh no! It's CrawlHammer!!"

If even in your heart you hide the urls ... there it shall rake for them...

...

Does anyone know what the purpose of a '+' in front of a URL in the robot's user-agent is? Some sites put in the '+', others don't...

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)

Gigabot/3.0 (http://www.gigablast.com/spider.html)
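
Since the '+' is inconsistent, anything that reads these strings out of a log probably has to tolerate both forms. Here is a minimal Python sketch of that idea. The function names and the regular expression are my own illustration, not any standard; the assumption is that the info URL is the first http(s) URL in the string and that it ends at whitespace, ';' or ')'.

```python
import re

# Assumption: the info URL is the first http(s) URL in the user-agent,
# optionally preceded by '+', and it ends at whitespace, ';' or ')'.
INFO_URL = re.compile(r"\+?(https?://[^\s;)]+)")


def info_url(user_agent):
    """Return the crawler's info URL without any leading '+', or None."""
    match = INFO_URL.search(user_agent)
    return match.group(1) if match else None


def has_plus(user_agent):
    """Return True if the info URL is written with a leading '+'."""
    match = INFO_URL.search(user_agent)
    return match is not None and match.group(0).startswith("+")


if __name__ == "__main__":
    agents = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
        "Gigabot/3.0 (http://www.gigablast.com/spider.html)",
    ]
    for agent in agents:
        print(has_plus(agent), info_url(agent))
```

With or without the '+', the same URL comes out, so for log analysis at least the prefix makes no difference.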

Comments (9)

Mark:

My favorite robot name was the old Interpix robot, "iSpi"

It crawled for images for the Image Search feature that they supplied to Yahoo in the mid-to-late 1990s.

A bunch of Google old-timers came together today on an email thread to discuss the background on the '+'. I'll spare you the story and just let you know that you don't need to put a plus sign in the user-agent.

Thanks Matt! But I'd still love to hear the story... :)

I'd recommend something like Slimey, the worm that Oscar the Grouch watches over, but webmasters might be a bit leery of worms as well. :)

Let's dissect what the fears generally are:
1. It might go the way of Cuill and take down the damn webserver (we had to ban Cuill's IP range for doing this).
2. It might just be a scraper.

So, if you can get something that conveys the "I'll go slowly and not steal from you" message, win for you.

How about...
Safeslug
Snaildex
Charlotte (you know, from Charlotte's Web)

Here are some of my favorites from our logs:

DuckDuckBot/1.0: I'll play this with my kids this weekend.

focuseekbot: Do you pronounce that the F-U seek bot?

Following the + in the URL meme, how about ++ before https?
CityTwist/0.1;++https://

I would use humor as my approach to gaining brand recognition. Call it the FART crawler. It will be on Yahoo! News tomorrow morning :))

Well, it seems the decision is made. Today I saw a visit from Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/).

When I first saw it in my logs I was suspicious and thought "yet another content thief", but the name and the landing page are indeed friendly enough to let this crawler crawl. :-)

Hahahaha, I reached this blog because of the ScoutJet name in my logs. It's cool! Congrats :)

I gave my crawler the name SBSearch as I lacked the imagination to come up with a real name. I think friendly is a good way to go, along with a good explanation of what the bot does at the URL you include.

I was actually considering naming it SecretAgent at first but decided it would be too scary.

Simon Byholm
Secret Search Engine Labs


About

This page contains a single entry from the blog posted on April 16, 2008 9:29 AM.

The previous post in this blog was Cluster map propagation in Amazon Dynamo.

The next post in this blog is Microsoft "hits back" at Google with re-launch of 4-year old Newsbot.
