July 19, 2010

If blekko sees its shadow, 6 more weeks of beta

blekko has (finally!) entered private beta...TechCrunch has the details on a preview of our new search engine.

Am I insane for trying to build a new search engine from scratch? Maybe... but blekko is pretty cool anyway. :-)

blekko is introducing a novel search syntax we call slashtags. Using simple tags to refine your queries (e.g. /date, /demblogs, /people, /health, /satire, etc.), you can quickly filter search results to just the sites you want, change the way results are sorted, and more.

We have hundreds of slashtags you can get started with on blekko, plus we have a toolbox to let you make and share your own.
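To make the syntax concrete, here is a minimal Python sketch of how a query might be split into plain search terms and slashtags. It is purely illustrative and not blekko's actual parser: the function name `split_slashtags` and the rule that a slashtag is any whitespace-separated token starting with `/` are assumptions.

```python
def split_slashtags(query):
    """Split a query into (terms, slashtags).

    Assumption: a slashtag is any whitespace-separated token that starts
    with '/' and has at least one character after it. Hypothetical helper,
    not blekko's real parser.
    """
    terms, tags = [], []
    for token in query.split():
        if token.startswith("/") and len(token) > 1:
            tags.append(token[1:].lower())  # store the tag without its slash
        else:
            terms.append(token)
    return " ".join(terms), tags


print(split_slashtags("global warming /demblogs /date"))
```

The tags could then drive filtering (restrict to a set of sites) or sorting (e.g. by date), while the remaining terms go to the regular query path.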

Furthermore, we intend to be fully open about our crawl and rank data for the web. We don't believe security through obscurity is the best way to drive search ranking quality forward. So we have a set of tools on blekko.com which let you understand what factors are driving our rankings, and let you dive behind any url or site to see what their web search footprint looks like.

So what took so long? It turns out that it's really friggin hard to build a search engine from scratch. Especially a good one. We've built our system from the ground up, with a multi-billion page index, a prioritized crawl, and new ranking and anti-spam technology. We also have a ton of classifiers firing on every url we crawl, many of which are powering built-in slashtags.

Drop me a note if you'd like to get on the list for our private beta. It really is a beta; we still have some bugs to fix before opening the site completely. But we could use your help to shake out the system before our launch.

I'll leave you with blekko's founding principles. We call them the Web Search Bill of Rights:

  1. Search shall be open
  2. Search results shall involve people
  3. Ranking data shall not be kept secret
  4. Web data shall be readily available
  5. There is no one-size-fits-all for search
  6. Advanced search shall be accessible
  7. Search engine tools shall be open to all
  8. Search & community go hand-in-hand
  9. Spam does not belong in search results
  10. Privacy of searchers shall not be violated

Here is my co-founder Mike's take on why blekko is cool.

June 18, 2010

Smile for the camera, Rich

June 12, 2010

New Dog

Finally got the yard fenced in, plus hard lobbying from the kids ... so 5 years since winston, I got a new dog.

November 12, 2009

The future of business journalism

Pay for play? Sigh.

From: Sally Bailey <Sally.bailey@amgl.co.uk>
Date: Thu, Nov 12, 2009 at 6:42 AM
Subject: Blekko Raises $2.5 Million - Front Cover Proposal - ACQ Magazine

Good afternoon Rich, I hope you are well.

Many thanks for your time a short while ago and your offer of assistance on this matter. We have discussed this internally and would like to elevate this to a front cover position.

The recent $2.5 million raised obviously marks an important addition to your company portfolio. The Blekko brand is highly regarded and stands out in a competitive sector.

We wish to propose elevating the coverage to the front cover of the magazine, if of interest to you. I am sure this greater exposure will be appetising considering our "penetration" into the sector. Within the report we will be discussing the deal itself but also examining the wider market.

You may already be familiar with the magazine but for your convenience I have included below the FLEXIPAGE link to our most recent issue:

http://view.vcab.com/...

The cover opportunity includes:

  • A Front Cover Headline
  • Contents page reference
  • Editorial content within the magazine over a set number of pages
  • Electronic reproduction of the coverage
  • A hard copy of the complete edition (plus further copies as required)

There are 3 options available depending on the amount of editorial content you desire:

  • Option 1 – 1 page of editorial - £940.00 +VAT
  • Option 2 – 2 pages of editorial - £1940.00 +VAT
  • Option 3 – 4 pages of editorial - £2940.00 +VAT

If you would like to proceed with the front cover report please reply confirming which option you are interested in and the applicable cost.

Your thoughts are needed urgently as these are popular positions with limited space available.

If you have any queries at all please so not hesitate in contacting me. I will await your feedback.

Kindest Regards

Sally Bailey - ACQ Magazine

Mainland UK Switchboard - 0044 870 242 7021
Mainland UK Facsimile - 0044 870 242 7023
Email – sally.bailey@amgl.co.uk

Website - www.amgl.co.uk

October 15, 2009

blekko is hiring software engineers


blekko is building a disruptive general-purpose web search engine. We are hiring software engineers.

Web search is not only one of the most important technologies of our time, but it is also incredibly fun to work on because it requires cutting-edge algorithms from a wide range of disciplines. It is one of the hardest startup challenges today – but the monetization is much higher than anything else on the web, and there are fewer credible competitors than most people think.

Our team has founded multiple successful startups and held leadership positions at major tech companies such as Google, Sun and Netscape/AOL. We have funding from top-tier venture investors and a roster of highly prominent Silicon Valley angels including Marc Andreessen, Ron Conway, and two early Googlers.

Our crawl/index/search/query code is implemented on top of a distributed storage system that supports integrated map job execution, data replication, scalability, and fault tolerance. The programming model is similar to Google BigTable, but the application-level code tends to be more high-level and pleasant to work with than a typical high performance distributed application.

We are looking for talented software engineers who enjoy working on big systems, appreciate the productivity wins of interpreted languages and good API design, want to work on advanced search applications at web scale, and are:

  • Highly productive coders, self-motivated and able to learn new skills quickly
  • Intellectually curious and more pragmatic than theoretic
  • Comfortable in a small-company, startup environment

Pluses:

  • (In descending order of importance): UNIX/Linux, Perl, C/C++, Javascript, HTML/CSS
  • Search, particularly web search
  • Large-scale distributed systems (e.g., Map/Reduce, Hadoop, distributed filesystems, clustered databases)
  • Deep systems knowledge of operating systems, I/O, and networks
  • Applied math, statistics and/or machine learning, particularly as applied to ranking and classification
  • Degree in computer science or related area, especially masters and PhD
  • Industry experience, especially in startups or domain-relevant Internet companies
  • Interest in potential leadership opportunities as the company scales

Blekko is located in Redwood Shores, California across from the main Oracle campus. If interested, please contact blekkojobs@blekko.com

July 28, 2009

There’s No Such Thing As A Google Killer

Google is an amazing story. In a little more than 10 years, they have built not only a multi-billion dollar company that employs thousands of people, but also the world’s strongest brand. This is an anomalous story that perhaps may never be repeated.

So let’s just get this out of the way: there is no such thing as a Google killer. No company is going to play David to their Goliath and slay them with a well-aimed stone from a slingshot. Google is here to stay.

Why do I bring this up? I am one of the founders of a search start-up. One that recently raised money from a couple of great venture capital firms. So whenever anything is printed about us, or even comes up in casual conversation, the term “Google killer” gets bandied about. Again, I think you’re as likely to see a Google-killer as you are to find Sasquatch or the Loch Ness monster.

So why join a search start-up then? Because I don’t believe to be successful in this business you need to be a Google-killer. In fact, trying to be a Google-killer is probably the one sure way not to succeed.

If you were to start a soft-drink company, would you be a Coke killer? Would you create a product that tasted exactly like Coke and put it in a red can? Of course not, that would be product suicide. You make something that tastes different and package it differently – Snapple, Red Bull, Vitamin Water.

We think the same about search. Google isn’t going anywhere. We think there are a lot of problems that search isn’t addressing right now that it could be. And that’s where we want to play. Own a category or die. So no, we’re not a Google killer. But stay tuned for more…

June 1, 2009

Bingram BetaHoo - poking at a few Bing queries

I like Bing! Bing.com is live and it looks really cool. Very fast, clean UI, strong navigational results, nice extra features like the hover panes, aggressive title relevance, plus all the vertical sub-engines. People like it.

That said it's brand new and we all want to kick the tires.

Search engines are built out of a lot of layered systems. One part can be working great but be subverted by another part that has a gap. Like any product there are always bugs to be fixed and improvements to be made. So launch day isn't the final word on relevance. But it's interesting to survey a variety of results to poke around.

  • Overall the navigational results seem very strong.

  • Bing is doing aggressive title rewriting to boost perceived relevance. Google has done some of this for a while - note the title change on the same url based on the query - [skrentablog] vs. [rich skrenta].

    The "Skrenta, Rich" title came from dmoz.

    Bing is going farther. Sometimes it makes the result look better than Goog's, e.g. [san carlos art and wine fair]. But others are odd, like result #3 for [mike arrington]. That funny-looking title looks like it came from anchortext.

  • Bing's indexing of *.blogspot.com seems really limited. For instance [radish king] doesn't turn up radishking.blogspot.com. Site:blogspot.com on bing returns an estimate of just 560k results. Compared to Google (340m) and Yahoo (230m), Bing's blogspot index seems tiny. Other blogspot sites I've gone looking for are missing too. I wonder if this is some kind of rank or index penalty given the large amount of blogspot spam, or if there is some other issue with their crawl.

  • [michael arrington] vs. [mike arrington]. TechCrunch is #2 for Michael Arrington, but is way down at the bottom of the page for Mike Arrington. This seems to be the fault of the section-ized results; it's under a heading called "Mike Arrington Blog". As others have noted I'm not a big fan of sections or universal search style sections on result pages. It's unfortunate to see a strong result for the query get pushed that far down.

  • Bing, like Google, returns Dogpile and AltaVista for [search engine]. (Yahoo looks like they manually pinned a couple of results for this query.)

Overall the few bugs I've seen are relatively minor issues in the scheme of the entire product and I'm sure will eventually be addressed by the Bing engineers. It's so cool to have a powerful new engine out with interesting results. Kudos, Microsoft!

April 21, 2009

Topix passes USA Today to become #1 online site for Gannett, Tribune and McClatchy

Four years after our deal to sell a majority of Topix to the top three US newspaper companies, Topix becomes the #1 online property for Gannett, Tribune and McClatchy.

Congrats to the Topix team on the fantastic recent site growth!

April 9, 2009

blekko's ambient cluster health visualization

When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge. Traditional pager alert systems can often either log too many events, which makes people tune them out, or they miss non-fatal but still serious server sickness, such as degraded disk/cpu/network performance or subtle application errors.

This becomes especially true when the cluster and application are designed for high availability. If the application is doing its best to hide server failures from the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.

We called these "analog failures" at Topix. There was a fairly complicated chain of processing for incoming stories that had been crawled. Crawl, categorize, cluster, dedup, roboedit, push to front ends, and push to incremental search system. Once an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long time to notice. The problem was that, while overall we had half as many stories on the site, most pages still had new stories coming in, so we didn't notice that anything was wrong.
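One way to catch this kind of analog failure is to compare current volume against a recent baseline instead of waiting for a hard error. Here is a minimal Python sketch; the daily story counts and the 30% drop threshold are made-up assumptions, not anything from the actual Topix system.

```python
from statistics import median


def volume_dropped(history, today, max_drop=0.30):
    """Return True if today's count is more than max_drop below the
    median of recent history. Threshold and inputs are illustrative."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline <= 0:
        return False
    return (baseline - today) / baseline > max_drop


recent = [10200, 9800, 10050, 9900, 10100]  # hypothetical stories per day
print(volume_dropped(recent, 5100))  # half the sources gone
print(volume_dropped(recent, 9700))  # ordinary day-to-day noise
```

A check like this looks at the aggregate, which is exactly what was invisible when every individual page still had fresh stories on it.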

Sometimes a server has a messed up failure, like its networking card starts losing 50% of its packets, but stuff is still getting through. Or a drive is in the process of failing, and its read/write rate is 10% of normal, but it hasn't failed enough to be removed from service yet. The cpu overheated and is running at a fraction of its normal speed. There seem to be limitless numbers of unusual ways that servers can fail.

At blekko, there are dozens of stats we'd ideally like to track per host:

  • How full is each of the disks?
  • Are there any SMART errors being reported from the drives?
  • Are we getting read or write errors?
  • What is the read/write throughput rate? Sometimes failures degrade the rate substantially, but the disk continues to function
  • What is the current disk read latency?
  • Is packet loss occurring to the node?
  • What is the read/write network throughput?
  • What is the cpu load?
  • How much memory is in use?
  • How much swap is being used?
  • How big is the kernel's dirty page cache?
  • What are the internal/external temperature sensors reading?
  • How many live filesystems are on the host vs. dead disks?
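A couple of the stats above can be gathered with nothing but the Python standard library. This is only a sketch of the idea, not our collector: the function name, the field names, and the choice of `/` as the path to check are assumptions, and things like SMART errors or temperature sensors need platform-specific tools that are not shown here.

```python
import os
import shutil


def host_snapshot(paths=("/",)):
    """Collect a small subset of per-host stats. Hypothetical helper;
    field names and default path are illustrative."""
    snap = {"disk_used_fraction": {}}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        if usage.total > 0:
            snap["disk_used_fraction"][path] = usage.used / usage.total
    try:
        snap["load_1m"] = os.getloadavg()[0]  # 1-minute load average
    except (AttributeError, OSError):
        snap["load_1m"] = None  # not available on this platform
    return snap


print(host_snapshot())
```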

Other stats pertain to our cluster datastore:

  • How many buckets are on each host?
  • Is the host above or below goal for its number of buckets?
  • What is the outbound write lag from the host?
  • What is the maximum seek depth for a given path/bucket?
  • Do we have three copies of every bucket (R3)?
  • If we're not at R3, how many bucket copies are occurring?
  • For running mapjobs, what is their ETA + read/write/error rate?
  • Are the ram caches fully loaded?
  • Are we crawling/indexing, and how does the rate compare with the historical rate?

The first step is to start putting the stats you want to be able to see into a big status table. But at 175 hosts, the table is kind of long, and it's hard to spot developing problems in the middle of the table.

So we have been experimenting with mapping system stats onto different visualizations, so we can tell at a glance the overall state of hundreds of servers, and spot minor problems before they grow.

A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture. This table shows overall disk usage by host. The color of the tile shows the disk usage: red is 90%, orange is 80%, yellow is 70%, blue is below 60%. Dead filesystems on the node are represented by grey bars inside the tile. The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at a given level of usage.
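The color mapping and worst-to-best ordering are simple enough to sketch in a few lines of Python. The thresholds come from the description above; what to show between 60% and 70% is not stated, so treating that range as blue is an assumption, and the host names are made up.

```python
def tile_color(used_fraction):
    """Map disk usage to a tile color. The 60-70% range is not specified
    in the post; showing it as blue is an assumption."""
    if used_fraction >= 0.90:
        return "red"
    if used_fraction >= 0.80:
        return "orange"
    if used_fraction >= 0.70:
        return "yellow"
    return "blue"


def tiles_worst_first(usage_by_host):
    """Return (host, color) pairs sorted from fullest to emptiest."""
    ordered = sorted(usage_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [(host, tile_color(used)) for host, used in ordered]


usage = {"host-a": 0.55, "host-b": 0.93, "host-c": 0.81, "host-d": 0.72}
for host, color in tiles_worst_first(usage):
    print(host, color)
```

Sorting before drawing is what makes the grid readable: the red tiles bunch up in one corner, so the size of that corner is the number you care about.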

Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers. Each bucket is stored three times. If we have three copies of every bucket, we're at "R3". This is the standard healthy state of the system.

Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if some buckets do not have three copies, and the cluster is degraded. So we have a grid of the buckets in our system, color coded to show whether there are 0/1/2/3 copies of the bucket.

In the above picture, the set of buckets in red have only 1 copy. The yellow buckets have 2 copies, and the green have three. We have a big monitor with this display in our office; if it ever shows anything but a big green "3", folks notice and can investigate.
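The bookkeeping behind that grid can be sketched in Python as follows. This is a toy model, not our datastore code: the host-to-bucket map, the function names, and the 4-bucket example (instead of 4096) are all assumptions for illustration.

```python
from collections import Counter


def copies_per_bucket(buckets_by_host, num_buckets):
    """Count how many live hosts hold a copy of each bucket."""
    counts = Counter()
    for buckets in buckets_by_host.values():
        counts.update(set(buckets))  # a host counts once per bucket
    return [counts[b] for b in range(num_buckets)]


def replication_level(copies):
    """The cluster is only as healthy as its least-replicated bucket."""
    return min(copies)


hosts = {"h1": [0, 1, 2, 3], "h2": [0, 1, 2], "h3": [0, 1, 3], "h4": [2, 3]}
print(replication_level(copies_per_bucket(hosts, 4)))  # all buckets have 3 copies

del hosts["h4"]  # simulate a host dropping out
print(copies_per_bucket(hosts, 4))
print(replication_level(copies_per_bucket(hosts, 4)))
```

The per-bucket list is what gets colored in the grid, and the minimum is the single big number on the monitor.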

For variety we've experimented with other ways to show data. This display is showing the fraction of a path in our datastore which has been loaded into the ram cache. Ram cache misses will fall back to disk, so it's not necessarily apparent to the user if the ram cache isn't loaded or working. But the disk fetch is much slower than the ram cache, so it's good to know if some machines have crashed and the ram cache isn't at 100%.

Other parts of the display are standard graphs for data aggregated across all of the servers. These are super useful to spot overall load issues.

We're still experimenting with finding the best data to collect and show. But the ambient displays so far are a big win. Obvious issues are immediately visible to everyone in our office. And people will walk by and look at the deeper graphs and sometimes spot issues. Taking the data from being something where you would have to proactively type a cli command or click around on some web forms, to displays that engineers will stop and look at for a few minutes on their way to/from getting a coffee or soda has been a big improvement in our awareness and response to cluster issues.

April 8, 2009

Bryn turned me into a muppet

March 14, 2009

The news medium has a message: "Goodbye"

Every so often there's a story about a technophobe executive so out of touch a secretary has to print out their email every morning so they can read it on paper and dictate replies.

That's what the print newspaper is, of course. Why on earth would you print all that stuff out? Over a hundred pages, most of which you're not going to read, with the crease down the middle of the front page photo, story jumps everywhere, a carbon-footprint disaster to produce, distribute and recycle. It's absurd.

Back in 1980 newspapers were the main way that bytes flowed into people's homes. Radio and TV for audio/video, but the newspaper delivered the bytes that were read like the text-based web.

I once worked out some rough back-of-napkin estimates on the number of text bytes in the paper. It was only delivered once during the day, but if you average the bytes across the entire 24 hour period it came out to be about the rate of a 300 baud modem. The newspaper was the internet.
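Running the napkin math in the other direction gives a feel for the size. This Python snippet works backward from the 300 baud figure to a daily byte count; the original estimate of the paper's text is not reproduced here, and the 10 bits per character framing (8 data bits plus start and stop bits) is an assumption about how the modem rate converts to bytes.

```python
BAUD = 300                # bits per second on the wire
BITS_PER_CHAR = 10        # assumption: 8 data bits + 1 start + 1 stop bit
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BAUD / BITS_PER_CHAR
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second:.0f} bytes/second")
print(f"{bytes_per_day / 1e6:.2f} MB of text per day")
```

So under these assumptions a 300 baud pipe running around the clock moves roughly 2.6 MB of text a day.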

It was mostly one way - except for all those classified ads and the letters to the editor. It was really a lot more like AOL, since it was centrally controlled and edited.

But it did represent the sole text byte pipe into the home. And so it contained every content vertical, all in one package. National news, world news, local community sections. Little league scores and the NFL. Weather, stock tables, TV listings, home sales. Advertising, both national, local and personal. Games and political commentary and the police blotter. Everything.

Fortified by the high cost of the printing press and the limited radius of delivery trucks there was a natural local monopoly to these things. And indeed, they were a wonderful business, a so-called license to print money. Huge fortunes were made.

That's all over now of course. The subsidy that classifieds supplied for bureaus in distant cities is gone. The class of professional reporters as we know them is going to be smaller and funded differently.

I was at the TechCrunch office welcoming party last night, and was struck by how unassuming the offices were. This was the big move up, of course. They were still unpacking after moving out of Mike Arrington's house. But it was a small office with a few desks scattered around, a handful of computers. I've toured the massive AP newsroom, rebuilt in 2004 to cater to every desire of a journalist. The Reuters newsroom had pods that look like they were inspired by Norad in Wargames, with circular banks of monitors around central stations, all showing live feeds or charts from various sources. The old Mercury News offices were vast.

TechCrunch was a modest affair by comparison. So this is where it all happens..., I thought. This is what the modern business press looks like now.

Get used to it.

November 22, 2008

Detecting spam from http headers?

Greg Linden describes a paper about finding spam simply by inspecting the returned http headers:

In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
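The general shape of that idea can be sketched in Python: turn the header block into features, then score them. This is a toy stand-in, not the paper's classifier. The function names, the "count matching features" rule, and the example spam feature set are all assumptions made up for illustration.

```python
def header_features(raw_headers):
    """Turn a raw HTTP response header block into a set of lowercased
    'name: value' features. Hypothetical helper."""
    features = set()
    lines = raw_headers.strip().splitlines()
    for line in lines[1:]:  # skip the status line
        if line[:1] in (" ", "\t"):
            continue  # folded continuation of the previous header
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        features.add(f"{name.strip().lower()}: {value.strip().lower()}")
    return features


def looks_spammy(raw_headers, spam_features, threshold=2):
    """Toy rule: flag a response that matches at least `threshold`
    features from a set of spam indicators."""
    return len(header_features(raw_headers) & spam_features) >= threshold


raw = """HTTP/1.1 200 OK
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html
"""
made_up_spam_features = {"server: fedora", "accept-ranges: bytes"}
print(sorted(header_features(raw)))
print(looks_spammy(raw, made_up_spam_features))
```

Everything interesting is hidden in how the feature set gets learned, which is where the choice of crawl sample matters.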

After running web crawls for the past year and finding all manner of spam, I have to say I'm skeptical this technique would really catch much spam on the actual web. Among the top 10 http header features they identify as spam-predictors are:

  • Accept-Ranges: bytes
  • Content-Type: text/html; charset=iso-8859-1
  • Server: Fedora
  • X-powered-by: php/4
  • 64.225.154.135

These are pretty standard-looking headers. Let's look at some actual spam though and see if we can see anything funny.

$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html

Very spammy site, but totally vanilla headers. How about some rolex watch spam:

$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html

Again, pretty vanilla. Plus this technique isn't going to work at all for spam hosted within trusted domains. Here's some cialis spam smeared onto a my.nbc.com page:

$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000;
        expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/

but very fishy headers! :-)

It's incredibly difficult to get a high quality random sample of the web. You can't factor crawler strategy bias out of the sample, and any small sample is not necessarily going to be very representative.

If the researchers did find good coverage with quirky headers and even individual ip addresses, I suspect that the crawl they're using may be over-weighted in pages from a few servers that spewed out a lot of urls/virtual hosts.

November 21, 2008

Thank heaven for tax refunds

In 2000 before the dot-com meltdown I bought a few cases of french bordeaux. Even though I like bordeaux, it half-seemed like a silly purchase at the time, but when the wine arrived I was happy because the bordeaux had risen in value since I purchased it, but due to the stock market death-spiral my accounts had gone down in the meantime. win, sorta.

Unfortunately there was also a bmw 540 that I decided was too indulgent to buy and passed on. Afterward I kicked myself -- it would have been free. I would have exercised some netscape options I had to buy it. I held onto them, eventually they declined in value until they were worthless. I should have bought the car!

I saw a joke circulating at the time that beer would have yielded a better return than some stocks. The beer bottles could be returned for the 5 cent deposit, but stocks became worthless. Plus you would get to drink the beer.

Now we're going through it again, but even worse. The banker line now is that it's not the return on your capital that you should be worried about, it's the return of your capital.

I just got a state of California tax refund check. Normally it's inefficient to pay too much withholding, essentially lending the government your money interest-free until tax time. In this case though it turned out to be a decent investment. :-|

November 14, 2008

Cold calls, cold response

Every few days cold-calling salespeople show up at our office unannounced to pitch us on insurance, lease deals, laser toner, office supplies, voip plans, bottled water, etc.

We have an open office. So when they enter, 11 people immediately look up at them. This can apparently be somewhat intimidating, based on their flummoxed reactions. They usually ask for a business card so they can call us later. I sometimes offer them mine, since my card doesn't have a phone number on it. Then they beat a hasty retreat.

Lately we've been trying a new tactic - not acking their presence when they come in. There's no receptionist (of course), and it's not clear who they should attempt to speak with. None of us really want to listen to their pitch or take their flier anyway, so playing the game of chicken with the other folks in the office sort of emerged as a default behavior. Who will be the first to crack at their nervousness, make eye contact, and thus become the dupe left holding the flier or handing out their business card?

I almost feel sorry for them. Almost!

November 2, 2008

Lucy on Elections

It's hard being a campaign worker.
We're completely at the mercy of our candidate.
We do all the work, and the candidate gets all the credit.
We ring doorbells, and make the posters, and build up the candidate's image.
And then he says something stupid, and ruins everything we've done.

The next time I do any campaigning, it's going to be for myself!

      -- Lucy, You're (not) elected, Charlie Brown

October 29, 2008

Retro Conservation Advertising

The modern green/eco movement is bringing back the idea of eating local, having a garden, saving energy, etc. and pointing out the links between items (like bottled water and oil).

But we've been here before. Check out these WWI gov't posters.


"Don't waste paper - a pound of paper wasted is a pound of fuel wasted"


"Keep the home garden going"

Check out all the detailed instructions in that one. Public education indeed.

More posters...

October 23, 2008

What's up Rich

If blogging is dead it must be time to start Skrentablog up again. Apologies for letting the blog go dormant the last little while, I've had my head down in technology. Quick update: 200 servers, 11 employees, lots of code. Crawl, index, test, repeat.

We hired a naming firm to come up with a better name than 'blekko', they did a great job. Down to two candidates. Testing them.

We built a wicked cluster platform to run our stuff. It's kind of like bigtable from the top-down api view but is an integrated design, vs. the layered impedance mismatches with stuff like gfs/chubby. No masters, all swarm algos. We crawl/index/serve into structured storage. It's very fast, has integrated mapjobs, and is really easy to program on top of. I'll post more details about it in the future.

More posts to come, I promise.

May 1, 2008

blekko is hiring

blekko is building a new search engine from scratch and I'm looking to hire a few more coders.

Search is an absolutely fascinating problem to work on for a bunch of reasons. For one thing you have to scale the thing before getting the first user. You can't just start with a server or two and add more when the users come. Step 1 is to copy the internet onto your cluster. Step 2 is to analyze it...

The componentry is remarkably deep.

Search is like 7 hard problems wrapped into a stack. Distributed systems, html analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI. And scale... Apart from the sexy high end algos there are also the boring 10-year old system libraries and off-the-shelf tools that crack under stress and sometimes need a look. You open the hood and wonder how the thing ever worked in the first place...

Plus there is always something fresh and new every day mining through the vast sordidness of the many billions of pages on the web. You expect to be amazed at the endless varieties of crazy porn domains and new approaches to webspam. But there are equal horrors in the small, finding pathological charset issues, previously-undiscovered abominable server implementations, psychopathic website owners. The web is a reactive fuzz test.

I know there are some great coders out there reading this blog who would have a blast working on some of the pieces here that need to get built. This is a great opportunity to join an experienced team early building a big system from the ground up. If you think you might be interested, send me an email and we can chat.

fyi our interviews always have coding tests. Primarily we are looking for folks who love to write code and are good at it. :)

How Fake Luxury Conquered the World

The legend says that once upon a time there was a General Motors. This General Motors, GM for short, had a car and a brand for every need, along the plan developed by the great Alfred Sloan prior to the Second World War. There were Chevrolets for regular folk, Pontiacs for the cautious old people (and, thanks to John Z. Delorean's development of the 1964 GTO, for angry young people as well), Buicks and Oldsmobiles for doctors and successful businessmen, and Cadillacs at the very top, for the most successful men in the land.
...
It would have stayed that way forever, but one day a mysterious yet important man at GM had a mysterious yet important idea: Executives should drive cars from their own division!

Which leads to every division of GM building their own version of the Cadillac.

Read more: How Fake Luxury Conquered The World

(thanks Bryn for the tip)
