<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
   <channel>
      <title>Skrentablog</title>
      <link>http://www.skrenta.com/</link>
      <description></description>
      <language>en</language>
      <copyright>Copyright 2009</copyright>
      <lastBuildDate>Mon, 01 Jun 2009 16:17:03 -0800</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>Bingram BetaHoo - poking at a few Bing queries</title>
         <description><![CDATA[I like Bing!  Bing.com is live and it looks really cool.  Very fast, clean UI, strong navigational results, nice extra features like the hover panes, aggressive title relevance, plus all the vertical sub-engines.  <a href="http://www.techcrunch.com/2009/06/01/apparently-bing-is-something-of-a-hit/">People like it.</a>

<p> 
That said, it's brand new and we all want to kick the tires.
<p>

Search engines are built out of a lot of layered systems.  One part can be working great but be subverted by another part that has a gap.  Like any product there are always bugs to be fixed and improvements to be made.  So launch day isn't the final word on relevance.  But it's interesting to survey a variety of results to poke around.

<p>

<ul>

<li>Overall the navigational results seem very strong.
<p>

<li>Bing is doing aggressive title rewriting to boost perceived relevance.
Google has done some of this for a while - note the title change on the same url
based on the query - 
    [<a href="http://www.google.com/search?q=skrentablog">skrentablog</a>] vs.
    [<a href="http://www.google.com/search?q=rich+skrenta">rich skrenta</a>].

<p>

The "Skrenta, Rich" title came from dmoz.
<p>
Bing is going farther.  Sometimes it makes the result look better than Goog's, e.g. 
    [<a href="http://www.bing.com/search?q=san+carlos+art+and+wine+fair">san carlos art and wine fair</a>].

But others are odd, like result #3 for [<a href="http://www.bing.com/search?q=mike+arrington">mike arrington</a>].
That funny-looking title looks like it came from anchortext.
<p>

<li>Bing's indexing of *.blogspot.com seems really limited.  For instance [<a href="http://www.bing.com/search?q=radish+king">radish king</a>] doesn't turn up radishking.blogspot.com.
Site:blogspot.com on Bing returns an estimate of just 560k results.  Compared to Google (340m) and Yahoo (230m), Bing's blogspot index
seems tiny.  Other blogspot sites I've gone looking for are missing too.  I wonder if this is some kind of rank or index penalty given the
large amount of blogspot spam, or if there is some other issue with their crawl.

<p>

<li>
    [<a href="http://www.bing.com/search?q=michael+arrington">michael arrington</a>] vs.
    [<a href="http://www.bing.com/search?q=mike+arrington">mike arrington</a>].
TechCrunch is #2 for Michael Arrington, but is way down at the bottom of the page for Mike Arrington.  This seems to be the fault of the section-ized results; it's under a heading called "Mike Arrington Blog".  As others have noted I'm not a big fan of sections or universal search style sections on result pages.  It's unfortunate to see a strong result for the query get pushed that far down.

<p>


<li>Bing, like Google, returns Dogpile and AltaVista for [<a href="http://www.bing.com/search?q=search+engine">search engine</a>].
(Yahoo looks like they manually pinned a couple of results for this query.)

<p>


</ul>

<p>

Overall the few bugs I've seen are relatively minor issues in the scheme of the entire product and I'm sure will eventually be addressed by the Bing engineers.
It's so cool to have a powerful new engine out with interesting results.  Kudos, Microsoft!

<p>
]]></description>
         <link>http://www.skrenta.com/2009/06/bingram_betahoo_poking_at_a_fe.html</link>
         <guid>http://www.skrenta.com/2009/06/bingram_betahoo_poking_at_a_fe.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Mon, 01 Jun 2009 16:17:03 -0800</pubDate>
      </item>
            <item>
         <title>Topix passes USA Today to become #1 online site for Gannett, Tribune and McClatchy</title>
         <description><![CDATA[Four years after our deal to sell a majority of Topix to the top three US newspaper companies, <a href="http://blog.topix.com/archives/000231.html">Topix becomes the #1 online property for Gannett, Tribune and McClatchy</a>.
<p>
Congrats to the Topix team on the fantastic recent site growth!

<p>]]></description>
         <link>http://www.skrenta.com/2009/04/topix_passes_usa_today_to_beco.html</link>
         <guid>http://www.skrenta.com/2009/04/topix_passes_usa_today_to_beco.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Tue, 21 Apr 2009 08:54:08 -0800</pubDate>
      </item>
            <item>
         <title>blekko&apos;s ambient cluster health visualization</title>
         <description><![CDATA[<img src="/blekko-status/monitors.jpg"><p>

When you have several hundred servers in a cluster, knowing the state and health of all of them can be a challenge.
Traditional pager alert systems can often either log too
many events, which makes people tune them out, or they miss non-fatal but still serious server sickness, such as
degraded disk/cpu/network performance or subtle application
errors.  <p>
This becomes especially true when the cluster and application are designed for high availability.  If the
application is doing its best to hide server failures from
the user, it's often not apparent when a serious problem is developing until the site fails in a more public or obvious way.  <p>

We called these "analog failures" at Topix.  There was a fairly complicated chain of processing for incoming stories that had been crawled.  Crawl, categorize, cluster, dedup,
roboedit, push to front ends, and push to incremental search
system.  Once an engineer mistakenly deleted half of the sources from our crawl, and it took us a disturbingly long
time to notice.  The problem was that, while overall we had
half as many stories on the site, most pages still had new
stories coming in, so we didn't notice that anything was wrong.  <p>

Sometimes a server has a messed up failure, like its networking card starts losing 50% of its packets, but stuff
is still getting through.  Or a drive is in the process of
failing, and its read/write rate is 10% of normal, but it
hasn't failed enough to be removed from service yet.  The
cpu overheated and is running at a fraction of its normal
speed.  There seem to be limitless numbers of unusual ways
that servers can fail.  <p>

At blekko, there are dozens of stats we'd ideally like to track per host:


<ul>
    <li>How full is each of the disks?
    <li>Are there any SMART errors being reported from the drives?
    <li>Are we getting read or write errors?
    <li>What is the read/write throughput rate?  Sometimes failures degrade the rate substantially, but the disk continues to function
    <li>What is the current disk read latency?
    <li>Is packet loss occurring to the node?
    <li>What is the read/write network throughput?
    <li>What is the cpu load?
    <li>How much memory is in use?
    <li>How much swap is being used?
    <li>How big is the kernel's dirty page cache?
    <li>What are the internal/external temperature sensors reading?
    <li>How many live filesystems are on the host vs. dead disks?
</ul>

<p>

Other stats pertain to our cluster datastore:

<ul>
    <li>How many buckets are on each host?
    <li>Is the host above or below goal for its number of buckets?
    <li>What is the outbound write lag from the host?
    <li>What is the maximum seek depth for a given path/bucket?
    <li>Do we have three copies of every bucket (R3)?
    <li>If we're not at R3, how many bucket copies are occurring?
    <li>For running mapjobs, what is their ETA + read/write/error rate?
    <li>Are the ram caches fully loaded?
    <li>If we're crawling/indexing, what is the rate compared with the historical rate?
</ul>

<p>

<img src="/blekko-status/bstat10.png"><p>


The first step is to start putting the stats you want to be
able to see into a big status table.  But at 175 hosts, the
table is kind of long, and it's hard to spot developing
problems in the middle of the table.  <p>

So we have been experimenting with mapping system stats onto
different visualizations, so we can tell at a glance the
overall state of hundreds of servers, and spot minor
problems before they grow.  <p>

<img src="/blekko-status/bstat11.png"><p>

A table with 175 rows is pretty long, but you can fit 175 squares into a very small picture.  This table shows overall
disk usage by host.  The color of the tile shows the disk
usage: red is 90%, orange is 80%, yellow is 70%, blue is
below 60%.  Dead filesystems on the node are represented by
grey bars inside the tile.  The whole grid is sorted worst-to-best, so it's easy to see the fraction of hosts at
a given level of usage.  <p>
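As a rough sketch of how such a tile grid could be computed (this is not blekko's actual code; the host names are made up, and the post doesn't say how the 60-70% band is colored, so it is lumped in with blue here):

```python
# Thresholds follow the legend described in the post. The 60-70% band is not
# specified there, so treating it as "blue" is an assumption.
def tile_color(usage_pct):
    if usage_pct >= 90:
        return "red"
    if usage_pct >= 80:
        return "orange"
    if usage_pct >= 70:
        return "yellow"
    return "blue"


def usage_grid(disk_usage_by_host):
    """Return (host, color) tiles sorted worst-to-best by disk usage."""
    ranked = sorted(disk_usage_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [(host, tile_color(pct)) for host, pct in ranked]


# Hypothetical sample data: host name -> percent of disk in use.
sample = {"host-a": 55.0, "host-b": 93.5, "host-c": 81.2, "host-d": 72.0}
tiles = usage_grid(sample)
for host, color in tiles:
    print(host, color)
```

Sorting before coloring is what makes the fraction of hosts at each level readable at a glance: all the red tiles end up together at the front of the grid.  <p>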

<img src="/blekko-status/bstat6.png"><p>

Our datastore uses a series of buckets (4096 in our current map) to spread the data across the servers.  Each bucket is
stored three times.  If we have three copies of every
bucket, we're at "R3".  This is the standard healthy state of the system.  <p>

Because fetch/store operations will route around failures, it's not at all apparent from the view of the application if
some buckets do not have three copies, and the cluster is
degraded.  So we have a grid of the buckets in our system,
color coded to show whether there are 0/1/2/3 copies of the
bucket.  <p>
<img src="/blekko-status/bstat5.png"><p>

In the above picture, the set of buckets in red have only 1 copy.  The yellow buckets have 2 copies, and the green have
three.  We have a big monitor with this display in our
office; if it ever shows anything but a big green "3", folks
notice and can investigate.  <p>
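A minimal sketch of the bookkeeping behind a display like this, assuming each host can report the set of bucket ids it holds (the input shape, the function names and the color for zero copies are all assumptions, not a description of the real datastore):

```python
from collections import Counter

NUM_BUCKETS = 4096   # bucket count given in the post
TARGET_COPIES = 3    # "R3"

# Colors for 1/2/3 copies follow the post; the color for 0 copies is an assumption.
COLORS = {0: "black", 1: "red", 2: "yellow", 3: "green"}


def copy_counts(host_buckets, num_buckets=NUM_BUCKETS):
    """host_buckets: dict of host -> iterable of bucket ids stored on that host."""
    counts = Counter()
    for buckets in host_buckets.values():
        counts.update(set(buckets))  # a host counts at most once per bucket
    return [counts.get(b, 0) for b in range(num_buckets)]


def replication_level(counts):
    """The cluster is only at R3 if the least-replicated bucket has 3 copies."""
    return min(counts)


def bucket_colors(counts):
    return [COLORS[min(c, TARGET_COPIES)] for c in counts]


# Tiny hypothetical cluster: 4 buckets spread over 3 hosts.
hosts = {"h1": [0, 1, 2], "h2": [0, 1, 3], "h3": [0, 2, 3]}
counts = copy_counts(hosts, num_buckets=4)
level = replication_level(counts)
colors = bucket_colors(counts)
print("R%d" % level, colors)
```

Taking the minimum over all buckets is the point of the big number on the monitor: one under-replicated bucket is enough to drop the whole display below "3".  <p>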

<img src="/blekko-status/bstat12.png"><p>


For variety we've experimented with other ways to show data.  This display is showing the fraction of a path in our
datastore which has been loaded into the ram cache.  Ram
cache misses will fall back to disk, so it's not
necessarily apparent to the user if the ram cache isn't loaded or working.  But the disk fetch is
much slower than the ram cache, so it's good to know if some
machines have crashed and the ram cache isn't at 100%.  <p>

<img src="/blekko-status/bstat14.png">
<img src="/blekko-status/bstat15.png"><p>

Other parts of the display are standard graphs for data
aggregated across all of the servers.  These are super
useful to spot overall load issues.  <p>
We're still experimenting with finding the best data to collect and show.  But the ambient displays so far are a big win.  Obvious issues are immediately visible to everyone in
our office.  And people will walk by and look at the deeper
graphs and sometimes spot issues.  Moving the data from
something you would have to proactively type a
cli command or click around on some web forms to see, to
displays that engineers will stop and look at for a few
minutes on their way to/from getting a coffee or soda, has
been a big improvement in our awareness of and response to cluster
issues.
]]></description>
         <link>http://www.skrenta.com/2009/04/blekkos_ambient_cluster_health.html</link>
         <guid>http://www.skrenta.com/2009/04/blekkos_ambient_cluster_health.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Thu, 09 Apr 2009 11:39:49 -0800</pubDate>
      </item>
            <item>
         <title>Bryn turned me into a muppet</title>
         <description><![CDATA[<a href="http://www.flickr.com/photos/skrenta/3424759754/">
<img src="/images/muppet-skrenta-1-m.jpg">
</a>
]]></description>
         <link>http://www.skrenta.com/2009/04/bryn_turned_me_into_a_muppet.html</link>
         <guid>http://www.skrenta.com/2009/04/bryn_turned_me_into_a_muppet.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Wed, 08 Apr 2009 11:51:51 -0800</pubDate>
      </item>
            <item>
         <title>The news medium has a message: &quot;Goodbye&quot;</title>
         <description><![CDATA[Every so often there's a story about a technophobe executive so out of touch a secretary has to print out their email every morning so they can read it on paper and dictate replies.  <p>

That's what the print newspaper is, of course.  Why on earth would you print all that stuff out?  Over a hundred pages, most of which you're not going to read, with the crease down the middle of the front page photo, story jumps everywhere, a carbon-footprint disaster to produce, distribute and recycle.  It's absurd.  <p>

Back in 1980 newspapers were the main way that bytes flowed into people's homes.  Radio and TV for audio/video, but the newspaper delivered the bytes that were read like the <a href="http://www.marksonland.com/2009/02/my_web_is_text_based.html">text-based web.</a> <p>

I once worked out some rough back-of-napkin estimates on the number of text bytes in the paper.  It was only delivered once during the day, but if you average the bytes across the entire 24 hour period it came out to be about the rate of a 300 baud modem.  The newspaper <b>was</b> the internet.  <p>
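Running that claim backwards as a sanity check: a sketch, assuming a 300 baud modem moves 300 bits per second and each character costs 10 bits on the wire (8 data bits plus a start and a stop bit).  The original napkin numbers aren't reproduced here; this only shows what a sustained 300 baud works out to per day.

```python
BAUD = 300                       # bits per second for a 300 baud modem
BITS_PER_CHAR = 10               # assumption: 8 data bits + start bit + stop bit
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

chars_per_second = BAUD // BITS_PER_CHAR            # 30 characters per second
bytes_per_day = chars_per_second * SECONDS_PER_DAY  # sustained over 24 hours

print(bytes_per_day)  # 2592000, i.e. roughly 2.6 MB of text per day
```

So the estimate implies a daily paper carrying something on the order of 2.6 MB of text.  <p>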

It was mostly one way - except for all those classified ads and the letters to the editor.  It was really a lot more like AOL, since it was centrally controlled and edited.  <p>

But it did represent the sole text byte pipe into the home.  And so it contained every content vertical, all in one package.  National news, world news, local community sections.  Little league scores and the NFL.  Weather, stock tables, TV listings, home sales.  Advertising, both national, local and personal.  Games and political commentary and the police blotter.  Everything.  <p>

Fortified by the <a href="http://www.shirky.com/weblog/2009/03/newspapers-and-thinking-the-unthinkable/">high cost of the printing press</a> and the limited radius of delivery trucks there was a natural local monopoly to these things.  And indeed, they were a wonderful business, a so-called license to print money.  Huge fortunes were made.  <p>

That's all over now of course.  The subsidy that classifieds supplied for bureaus in distant cities is gone.  The class of professional reporters as we know them is going to be smaller and funded differently.  <p>

I was at the TechCrunch office welcoming party last night, and was struck by how unassuming the offices were.  This was the big move up, of course.  They were still unpacking after moving out of Mike Arrington's house.  But it was a small office with a few desks scattered around, a handful of <a href="http://www.flickr.com/photos/ztil301/2400128938/"><img src="/images/ap-newsroom-s.jpg" align=right border=0></a> computers.  I've toured the massive AP newsroom, <a href="http://www.allbusiness.com/services/business-services-miscellaneous-business/4703802-1.html">rebuilt in 2004</a> to cater to every desire of a journalist.  The Reuters newsroom had pods that look like they were inspired by Norad in Wargames, with circular banks of monitors around central stations, all showing live feeds or charts from various sources.  The old Mercury News offices were vast.  <p>

TechCrunch was a modest affair by comparison.  <i>So this is where it all happens...</i>, I thought.  This is what the modern business press looks like now.  <p>

Get used to it.
]]></description>
         <link>http://www.skrenta.com/2009/03/the_news_medium_has_a_message.html</link>
         <guid>http://www.skrenta.com/2009/03/the_news_medium_has_a_message.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Sat, 14 Mar 2009 07:49:18 -0800</pubDate>
      </item>
            <item>
         <title>Detecting spam from http headers? </title>
         <description><![CDATA[Greg Linden <a href="http://glinden.blogspot.com/2008/11/detecting-spam-just-from-http-headers.html">describes a paper about</a>
 finding spam simply by inspecting the returned http headers:
<blockquote><i>
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.
<p>
We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
</i></blockquote>
<p>

After running web crawls for the past year and finding all manner of spam,
I have to say I'm skeptical this technique would really catch much spam
on the actual web.  Among the top 10 http header features they identify
as spam-predictors are:

<ul>
<li>Accept-Ranges: bytes
<li>Content-Type: text/html; charset=iso-8859-1
<li>Server: Fedora
<li>X-powered-by: php/4
<li>64.225.154.135
</ul>

<p>

These are pretty standard-looking headers.  Let's look at some actual spam though and see if we can see anything funny.

<p>

<pre>
$ curl -I http://www.fancieface.com/
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html
</pre>
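<p>
To make the comparison concrete, here is a rough sketch that turns the response above into "header: value" features and checks them against the header predictors listed earlier.  This is not the paper's classifier; exact string matching, the function names and the lowercasing are all assumptions made for illustration.

```python
# Header predictors quoted from the list in the post, lowercased for matching.
SPAM_HINTS = {
    "accept-ranges: bytes",
    "content-type: text/html; charset=iso-8859-1",
    "server: fedora",
    "x-powered-by: php/4",
}


def header_features(raw_headers):
    """Turn `curl -I` style output into a set of lowercase 'name: value' strings."""
    lines = [ln.strip() for ln in raw_headers.strip().splitlines() if ln.strip()]
    status_line, header_lines = lines[0], lines[1:]
    features = {"status:" + status_line.split(" ", 2)[1]}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'Name: value' header
        features.add(name.strip().lower() + ": " + value.strip().lower())
    return features


def hint_overlap(features):
    return features.intersection(SPAM_HINTS)


# The response headers shown above, copied as-is.
RESPONSE = """
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 19:13:11 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.12 OpenSSL/0.9.6b
Last-Modified: Tue, 21 Oct 2008 11:51:10 GMT
ETag: "2081cc-ba62-48fdc22e"
Accept-Ranges: bytes
Content-Length: 47714
Content-Type: text/html
"""

features = header_features(RESPONSE)
matches = hint_overlap(features)
print(sorted(matches))
```

The only overlap is <code>Accept-Ranges: bytes</code>, a header that plenty of ordinary servers send too.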

<p>

Very spammy site, but totally vanilla headers.  How about some rolex watch spam:
<p>
<pre>
$ curl -I http://superjewelryguide.com/300.html
HTTP/1.1 200 OK
Date: Sat, 22 Nov 2008 17:48:26 GMT
Server: Apache
X-Powered-By: PHP/5.2.6
Content-Type: text/html
</pre>
<p>

Again, pretty vanilla.  Plus this technique isn't going to work at all for
spam hosted within trusted domains.  Here's some cialis spam smeared onto
a my.nbc.com page:

<p>
<pre>
$ curl -I http://my.nbc.com/blogs/GaryRobinson/main/2008/10/13/cialis-cheapest-cialis-pills-here
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2 PHP/5.1.6
X-Powered-By: PHP/5.1.6
Wirt: (null)
Content-Type: text/html
Expires: Sat, 22 Nov 2008 19:16:33 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sat, 22 Nov 2008 19:16:33 GMT
Content-Length: 0
Connection: keep-alive
Set-Cookie: pers_cookie_insert_nbc.com_app1_prod_80=1572983360.20480.0000;
        expires=Sat, 22-Nov-2008 23:16:33 GMT; path=/
</pre>
<p>
but very fishy headers!  :-)

<p>

It's incredibly difficult to get a high quality random sample of the web.
You can't factor crawler strategy bias out of the sample, and any small
sample is not necessarily going to be very representative.
<p>
If the researchers did find good coverage with quirky headers and even
individual ip addresses, I suspect that the crawl they're using may
be over-weighted in pages from a few servers that spewed out a lot of
urls/virtual hosts.

<p>

]]></description>
         <link>http://www.skrenta.com/2008/11/detecting_spam_from_http_heade.html</link>
         <guid>http://www.skrenta.com/2008/11/detecting_spam_from_http_heade.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Sat, 22 Nov 2008 11:22:51 -0800</pubDate>
      </item>
            <item>
         <title>Thank heaven for tax refunds</title>
         <description><![CDATA[In 2000 before the dot-com meltdown I bought a few cases of french bordeaux.  Even though I like bordeaux, it half-seemed like a silly purchase at the time, but when the wine arrived I was happy 
because the bordeaux had risen in value since I purchased it, but due to the stock market
death-spiral my accounts had gone down in the meantime.  win, sorta.

<p>

Unfortunately there was also a bmw 540 that I decided was too indulgent to buy and passed on.  Afterward I kicked myself -- it would have been free.
I would have exercised some netscape options I had to buy it.  I held 
onto them, and eventually they declined in value until they were worthless.
I should have bought the car!

<p>

I saw a joke circulating at the time that beer would have yielded a better 
return than some stocks.  The beer bottles could be returned for the
5 cent deposit, but stocks became worthless.  Plus you would get to drink the beer.

<p>
Now we're going through it again, but even worse.  The banker line now is that it's not the return <i>on</i> your capital that you should be worried about, it's the return <i>of</i> your capital.
<p>

I just got a state of California tax refund check.  Normally it's inefficient 
to pay too much withholding, essentially lending the government your money interest-free until tax time.  In this case though it turned out to be a decent investment.  :-|





]]></description>
         <link>http://www.skrenta.com/2008/11/thank_heaven_for_tax_refunds.html</link>
         <guid>http://www.skrenta.com/2008/11/thank_heaven_for_tax_refunds.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Fri, 21 Nov 2008 11:15:31 -0800</pubDate>
      </item>
            <item>
         <title>Cold calls, cold response</title>
         <description><![CDATA[Every few days cold-calling salespeople show up at our office unannounced to pitch us on insurance, lease deals, laser toner, office supplies, voip plans, bottled water, etc.

<p>

We have an open office.  So when they enter, 11 people immediately look 
up at them.  This can apparently be somewhat intimidating, based on their
flummoxed reactions.  They usually ask for a business card so they can
call us later.  I sometimes offer them mine, since my card doesn't
have a phone number on it.  Then they beat a hasty retreat.

<p>

Lately we've been trying a new tactic - not acking their presence when they
come in.  There's no receptionist (of course), and it's not clear who they
should attempt to speak with.  None of us really want to listen to their
pitch or take their flier anyway, so playing the game of chicken with the
other folks in the office sort of emerged as a default behavior.  Who will
be the first to crack at their nervousness, make eye contact, and thus
become the dupe left holding the flier or handing out their business card?

<p>

I almost feel sorry for them.  Almost!



]]></description>
         <link>http://www.skrenta.com/2008/11/cold_call_cold_response.html</link>
         <guid>http://www.skrenta.com/2008/11/cold_call_cold_response.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Fri, 14 Nov 2008 19:45:54 -0800</pubDate>
      </item>
            <item>
         <title>Lucy on Elections</title>
         <description><![CDATA[<img src="/images/lucy_s.jpg" align=right>
<blockquote><i>
    It's hard being a campaign worker.<br>
    We're completely at the mercy of our candidate.<br>
    We do all the work, and the candidate gets all the credit.<br>
    We ring doorbells, and make the posters, and build up the candidate's image.<br>
    And then he says something stupid, and ruins everything we've done.<p>
    The next time I do any campaigning, it's gonna be for myself!<p>
&nbsp; &nbsp; &nbsp; --  Lucy, You're (not) elected, Charlie Brown
</i></blockquote>

]]></description>
         <link>http://www.skrenta.com/2008/11/lucy_on_elections.html</link>
         <guid>http://www.skrenta.com/2008/11/lucy_on_elections.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Sun, 02 Nov 2008 08:28:52 -0800</pubDate>
      </item>
            <item>
         <title>Retro Conservation Advertising</title>
         <description><![CDATA[The modern green/eco movement is bringing back the idea of eating local, having a garden, saving energy, etc. and pointing out the links between items (like <a href="http://www.acterra.org/greenteams/bottledwater.html">bottled water and oil</a>).

<p>

But we've been here before.  Check out these WWI gov't posters.  

<p>
<a href="http://docsouth.unc.edu/wwi/41879/50.html"><img src="/images/dont-waste-paper.jpg"></a><br>
"Don't waste paper - a pound of paper wasted is a pound of fuel wasted"
<p>

<a href="http://docsouth.unc.edu/wwi/41864/50.html"><img src="/images/food-dont-waste.jpg"></a></p><p>

<a href="http://docsouth.unc.edu/wwi/41907/50.html"><img src="/images/food-home-garden.jpg"></a><br>
"Keep the home garden going"
<p>

<a href="http://docsouth.unc.edu/wwi/41943/50.html"><img src="/images/save-coal.jpg"></a><p>
Check out all the detailed instructions in that one.  Public education indeed.
<p>

<a href="http://docsouth.unc.edu/wwi/posters.html">More posters...</a>


]]></description>
         <link>http://www.skrenta.com/2008/10/retro_conservation_advertising.html</link>
         <guid>http://www.skrenta.com/2008/10/retro_conservation_advertising.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Wed, 29 Oct 2008 03:40:00 -0800</pubDate>
      </item>
            <item>
         <title>What&apos;s up Rich</title>
         <description><![CDATA[If <a href="http://www.techcrunch.com/2008/07/13/jason-calacanis-first-new-email-post/">blogging</a> is
<a href="http://www.wired.com/entertainment/theweb/magazine/16-11/st_essay">dead</a> it must be time to start Skrentablog up again.  Apologies for letting the blog go dormant the last little while; I've had my head down in technology.  Quick update:  200 servers, 11 employees, lots of code.  Crawl, index, test, repeat.
<p>
We hired a naming firm to come up with a better name than 'blekko', and they
did a great job.  Down to two candidates.  Testing them.
<p>
We built a wicked cluster platform to run our stuff.  It's kind of like
bigtable from the top-down api view but is an integrated design, vs. the 
layered impedance mismatches with stuff like gfs/chubby.  No masters, all
swarm algos.  We crawl/index/serve into structured storage.  It's very 
fast, has integrated mapjobs, and is really easy to program on top of.
I'll post more details about it in the future.
<p>
More posts to come, I promise.
]]></description>
         <link>http://www.skrenta.com/2008/10/whats_up_rich.html</link>
         <guid>http://www.skrenta.com/2008/10/whats_up_rich.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Thu, 23 Oct 2008 08:43:42 -0800</pubDate>
      </item>
            <item>
         <title>blekko is hiring</title>
         <description><![CDATA[blekko is building a new search engine from scratch and I'm looking to hire a few more coders.

<p>

Search is an absolutely fascinating problem to work on for a bunch of reasons.  For one thing you have to scale the thing before getting the first user.  You can't just start with a server or two and add more when the users come.  Step 1 is to copy the internet onto your cluster.  Step 2 is to analyze it.
<p>

The componentry is remarkably deep.

<p>
Search is like 7 hard problems wrapped into a stack.  Distributed systems, html analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI.  And scale... Apart from the sexy high end algos there are also the boring 10-year old system libraries and off-the-shelf tools that crack under stress and sometimes need a look.  You open the hood and wonder how the thing ever worked in the first place...
<p>

Plus there is always something fresh and new every day mining through the vast sordidness of the many billions of pages on the web.  You expect to be amazed at the endless varieties of crazy porn domains and new approaches to webspam.  But there are equal horrors in the small, finding pathological charset issues, previously-undiscovered abominable server implementations, psychopathic website owners.  The web is a reactive <a href="http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-your-browser-and-your-web-server/">fuzz test</a>.

<p>

I know there are some great coders out there reading this blog who would have a blast working on some of the pieces here that need to get built.  This is a great opportunity to join an experienced team early building a big system from the ground up.  If you think you might be interested, send me an email and we can chat.
<p>

fyi our interviews always have coding tests.  Primarily we are looking for folks who love to write code and are good at it.  :)

]]></description>
         <link>http://www.skrenta.com/2008/05/blekko_is_hiring.html</link>
         <guid>http://www.skrenta.com/2008/05/blekko_is_hiring.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Thu, 01 May 2008 12:11:09 -0800</pubDate>
      </item>
            <item>
         <title>How Fake Luxury Conquered the World</title>
         <description><![CDATA[<blockquote>
The legend says that once upon a time there was a General Motors. This General Motors, GM for short, had a car and a brand for every need, along the plan developed by the great Alfred Sloan prior to the Second World War. There were Chevrolets for regular folk, Pontiacs for the cautious old people (and, thanks to John Z. Delorean's development of the 1964 GTO, for angry young people as well), Buicks and Oldsmobiles for doctors and successful businessmen, and Cadillacs at the very top, for the most successful men in the land.<br>
    ...
<br>
    It would have stayed that way forever, but one day a mysterious yet important man at GM had a mysterious yet important idea: <b><i>Executives should drive cars from their own division!
</i></b>
</blockquote>
<p>

Which leads to every division of GM building their own version of the Cadillac.

<p>
Read more: <a href="http://www.speedsportlife.com/2008/04/29/avoidable-contact-11-how-fake-luxury-conquered-the-world/">How Fake Luxury Conquered The World</a>
<p>
(thanks Bryn for the tip)



]]></description>
         <link>http://www.skrenta.com/2008/05/how_fake_luxury_conquered_the.html</link>
         <guid>http://www.skrenta.com/2008/05/how_fake_luxury_conquered_the.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Thu, 01 May 2008 11:19:55 -0800</pubDate>
      </item>
            <item>
         <title>Microsoft bias in MSN search results, surprise</title>
         <description><![CDATA[I was looking to see what search sites might 
have a particular bug that I (ahem) came across and 
was trying the search for the number 0 in various
places.  There is a pretty good <a
href="http://en.wikipedia.org/wiki/0_(number)">Wikipedia
page</a> about zero.  Zero has a rich and interesting
history and there are many other potentially
reasonable results.

<p>

But I was surprised to see MSN search had demoted their good results below
some crappy ones from MSDN:
<p>
<img src="/images/msn-0.png" width=450>
<p>
Lame!  Falling into an inferior lex position and a 
lower overall relevance page to boost their own network
results...give em credit for being old school.  :)

<p>
...
<p>

I found my bug on Yahoo Search.  I had tried a lot of smaller
engines first because I didn't think a major would have 
this bug.  <b>You can't search for 0 on Yahoo.</b> You
can search for all the other numbers, but not 0 ...

<p>

Why?..  Because 0 is <i>false</i>.  It suggests Yahoo is using a scripting language to front
their search form, and a programmer did something like <code>if ( $query )</code> rather than <code>if ( $query ne '' )</code>.
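<p>
A small sketch of the suspected bug.  Python itself treats the string "0" as true, so the helper below emulates Perl-style string truthiness, where both the empty string and "0" are false; the function names are made up for illustration and nothing here is Yahoo's actual code.

```python
def perl_truthy(value):
    """Emulate Perl's truthiness for a string: '' and '0' are false."""
    return value not in ("", "0")


def has_query_buggy(query):
    # Mirrors `if ( $query )`: the legitimate query "0" is rejected.
    return perl_truthy(query)


def has_query_fixed(query):
    # Mirrors `if ( $query ne '' )`: only the empty string is rejected.
    return query != ""


for q in ["", "0", "1", "zero"]:
    print(repr(q), has_query_buggy(q), has_query_fixed(q))
```

The two checks agree on every input except "0", which is exactly the one query that fails.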

<p>
]]></description>
         <link>http://www.skrenta.com/2008/04/microsoft_bias_in_msn_search_r.html</link>
         <guid>http://www.skrenta.com/2008/04/microsoft_bias_in_msn_search_r.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Thu, 24 Apr 2008 07:45:00 -0800</pubDate>
      </item>
            <item>
         <title>Hypertable architecture talk Wednesday in Palo Alto</title>
         <description><![CDATA[Doug Judd will be discussing the internals and architecture of Hypertable tomorrow in Palo Alto at 6:30pm. 
<p>
<blockquote><i>
Hypertable is an open source, high performance, distributed database modeled after Google's Bigtable. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware. Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology. In this presentation, Doug will give an architectural overview of Hypertable. He will describe some of the key design decisions and will highlight some of the places where Hypertable diverges from the system described in the Bigtable paper.
</i></blockquote>
<p>
<a href="http://www.zvents.com/palo-alto-ca/events/show/81854980-sdforum-software-architecture-modeling-event-architecting-hypertable">More details</a>.

]]></description>
         <link>http://www.skrenta.com/2008/04/hypertable_architecture_talk_w.html</link>
         <guid>http://www.skrenta.com/2008/04/hypertable_architecture_talk_w.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">main</category>
        
        
         <pubDate>Tue, 22 Apr 2008 12:51:56 -0800</pubDate>
      </item>
      
   </channel>
</rss>
