
Kosmix releases Google GFS workalike 'KFS' as open source

Search startup Kosmix has released a C++ implementation of the Google File System as open source. This parallels the existing Hadoop/HDFS project, which is written in Java. The Kosmix team has deep engineering talent and a strong track record, having recently built a web-scale crawler and search engine from scratch. Google has a set of tools that the rest of the industry needs in order to compete... it's cool that folks are stepping up to the task and leveraging the open source model to try to provide some balance.

KFS arrives with an impressive set of features for an alpha release:

  • Incremental scalability - New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.

  • Availability - Replication is used to provide availability in the face of chunkserver failures.

  • Re-balancing - Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.

  • Data integrity - To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.

  • Client side fail-over - During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.

  • Language support - The KFS client library can be accessed from C++, Java, and Python.

  • FUSE support on Linux - By mounting KFS via FUSE, existing Linux utilities (such as ls) can interface with KFS.

  • Leases - The KFS client library uses caching to improve performance. Leases are used to support cache consistency.
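
To make the data integrity and client-side fail-over items concrete, here is a minimal Python sketch. It is not KFS code: `FakeChunkServer`, `ChunkServerDown`, and `read_chunk` are hypothetical names, the chunkservers are simulated in memory, and `zlib.adler32` is just a convenient stand-in checksum. The sketch also folds checksum verification into the client-side loop purely for illustration; the feature list above does not say where KFS performs it.

```python
import zlib


class ChunkServerDown(Exception):
    """Raised when a (simulated) chunkserver cannot be reached."""


class FakeChunkServer:
    """In-memory stand-in for one replica of a chunk (hypothetical)."""

    def __init__(self, name, data, reachable=True):
        self.name = name
        self.data = data
        # Checksum recorded when the data was written.
        self.checksum = zlib.adler32(data)
        self.reachable = reachable

    def read(self):
        if not self.reachable:
            raise ChunkServerDown(self.name)
        return self.data, self.checksum


def read_chunk(replicas):
    """Read from the first healthy replica.

    Returns (data, name of the server that served it, names of replicas
    whose data failed checksum verification). The last list is what a
    real system would hand off for re-replication.
    """
    corrupt = []
    for server in replicas:
        try:
            data, stored_checksum = server.read()
        except ChunkServerDown:
            # Fail over to the next replica; the caller never sees this.
            continue
        if zlib.adler32(data) != stored_checksum:
            # Data no longer matches the checksum taken at write time.
            corrupt.append(server.name)
            continue
        return data, server.name, corrupt
    raise IOError("no healthy replica available")


if __name__ == "__main__":
    payload = b"hello world"
    down = FakeChunkServer("cs1", payload, reachable=False)
    damaged = FakeChunkServer("cs2", payload)
    damaged.data = b"hellp world"  # simulate on-disk corruption
    healthy = FakeChunkServer("cs3", payload)
    print(read_chunk([down, damaged, healthy]))
```

The application calling `read_chunk` only ever sees good data or a single error, which is the sense in which the fail-over is transparent.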

Every startup that scales beyond a single machine needs platform technology to build its application and run its cluster. If enough folks adopt the code and contribute, the hope is that it could become something like the gcc/Linux/Perl of the cluster storage layer.


Comments (8)

I've had the luxury of playing with a pre-release of KFS and can attest to the quality of engineering behind it. The two primary developers of KFS (Sriram and Blake) both hail from NetApp and know a thing or two about filesystems. For the record, I worked with these guys at Kosmix for a couple of years.

We're building a Bigtable-inspired distributed database and were able to smoothly integrate with KFS. KFS is a huge offering to the open source community.

Is there a KFS vs. MogileFS [1] (which is also open source, and has been for a while now) comparison, yet?

[1] http://www.danga.com/mogilefs/

Tim Cullen:

Seems to me with a single meta-data server that single point of failure provides a glaring weakness in this system.

Sriram Rao:

Tim,

You are right that the metadata server is currently a single point of failure. However, to protect against losing the filesystem if the metaserver node goes down, simple mechanisms such as periodically rsyncing the metaserver's checkpoint/logs to remote machines can be used. In such cases, you may lose the last few updates to the filesystem, but at least you won't lose all of the data.

That said, our thoughts are that we can add "shadow" master(s) to the system and improve resiliency. This is something we are thinking about for a future release.

Sriram
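
(Editor's illustration of the rsync approach described in the comment above; a rough sketch only. The directory paths and the host name `backup1` are made up, `build_rsync_command` and `backup_once` are hypothetical helpers, and nothing here reflects how KFS actually lays out its checkpoint and log files.)

```python
import subprocess


def build_rsync_command(src_dir, remote_host, remote_dir):
    """Build an rsync command that mirrors src_dir's contents to a remote host.

    The trailing slash on the source tells rsync to copy the directory's
    contents rather than the directory itself.
    """
    src = src_dir.rstrip("/") + "/"
    return ["rsync", "-a", src, f"{remote_host}:{remote_dir}"]


def backup_once(src_dirs, remote_host, remote_root):
    """Copy each metaserver directory to the backup host once."""
    for src in src_dirs:
        name = src.rstrip("/").split("/")[-1]
        cmd = build_rsync_command(src, remote_host, f"{remote_root}/{name}/")
        # check=True raises CalledProcessError if rsync exits non-zero.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical locations; substitute the real checkpoint and log paths.
    print(build_rsync_command("/var/kfs/meta/checkpoints", "backup1",
                              "/backup/kfs/checkpoints/"))
```

Running `backup_once` from cron or a similar scheduler gives the periodic copy; any updates made after the most recent run are the ones that would be lost if the metaserver died.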

Is it allowed to use the Google File System for your own open source software? Why should Google allow it?

KFS also has an interoperability layer with Hadoop, which should help as well.

Clement:

Will KFS support a cluster of more than 20,000 nodes? It uses persistent connections to identify dead nodes, but I am afraid of what will happen to the master's performance if 20,000 sockets are open at the master.

ashraf:

Hi

I am a beginner with KFS. Can you give me a link that explains how to set up KFS with Hadoop step by step?


About

This page contains a single entry from the blog posted on September 27, 2007 10:12 PM.
