Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners huddle in forum bunkers, anxiously scanning their logs for suspect new visitors so they can quickly issue bot and IP bans.
Cuill, the search startup from ex-Googlers that is anticipated to launch soon, seems to have run a rather high-rate crawl when they were getting started, one that generated a large number of robots.txt bans. Here is a list of sites that have banned Cuill's user-agent, "Twiceler".
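For context, a robots.txt ban on a specific crawler is only two lines: a group keyed to the bot's user-agent token, followed by a blanket disallow. A site that wants Twiceler gone adds roughly this (illustrative, not any particular site's actual file):

    User-agent: Twiceler
    Disallow: /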
A well-behaved crawler needs to follow a set of loosely defined behaviors to be 'polite': don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by, e.g., downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, and so on.
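To make that checklist concrete, here's a minimal sketch of a polite fetch in Python using only the standard library. The "MyBot" user-agent string, the bot-info URL, and the one-second fallback delay are placeholder assumptions, not anyone's actual crawler:

    import time
    import urllib.robotparser
    import urllib.request
    from urllib.parse import urlparse

    # Hypothetical bot identity; the URL should point at a page explaining the bot.
    USER_AGENT = "MyBot/0.1 (+http://example.com/bot.html)"
    DEFAULT_DELAY = 1.0  # assumed fallback: seconds between hits to the same host

    robots_cache = {}  # host -> RobotFileParser
    last_fetch = {}    # host -> timestamp of the last request to that host

    def polite_fetch(url):
        host = urlparse(url).netloc

        # Fetch and cache robots.txt once per host.
        if host not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            rp.read()
            robots_cache[host] = rp
        rp = robots_cache[host]

        # Obey the exclusion rules before anything else.
        if not rp.can_fetch(USER_AGENT, url):
            return None

        # Respect an explicit Crawl-delay if the site sets one, else our default.
        delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        elapsed = time.time() - last_fetch.get(host, 0)
        if elapsed < delay:
            time.sleep(delay - elapsed)

        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        last_fetch[host] = time.time()
        return body

Caching robots.txt and tracking the delay per host (rather than per URL) is what keeps a crawler from hammering a single server even when its frontier is stuffed with URLs from that one site.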
Apart from the widely recognized challenges of building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big four (Google, Yahoo, MSN, and Ask) make it that much harder for a new entrant to gain a foothold. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to have a significant impact on the aggregate perceived quality of a major new engine.
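That kind of allowlist boils down to robots.txt groups naming the big four's crawler tokens (Googlebot, Slurp, msnbot, Teoma) and shutting everyone else out; this is an illustrative reconstruction, not any site's actual file:

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: Teoma
    Disallow:

    User-agent: *
    Disallow: /

(An empty Disallow line means "allow everything" for that agent; only the wildcard group at the end blocks the rest.)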
My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that each new site rejection stings a little. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)