« We Worship MD5, the GOD of HASH | Main | RSS reader shares for Skrentablog »

Crypto vs. the working coder

Working in security tends to make people jumpy and nervous. Most security coders don't understand any of the the crypto internals of the tools they use, so they must rely on a handful of trusted experts like Schneier to tell them what's safe. Even so the algorithms last for about 10-15 years before they're broken.

Kids poke hole in protocols that spent years peer-reviewing their way through the IETF. Implementations are about as secure as swiss cheese, but it doesn't matter since the commercial success of a security product has more to do with its channel marketing strategy than actual security. Rumors surface that some Chinese mathematicians have wrecked part of the functional toolkit we've used for the last decade in all of our products, and it's time to pack up the tents and move, again.

So a culture of nit-picking and paranoia surrounds crypto stuff. If you are using a security algorithm, so the thinking goes, it must be because there is a threat. And if there is a threat, the algorithm must be made perfectly secure.

That may be an appropriate way to think for security products. But it turns out that security techniques are often useful in general programming. MD5 is a great checksum, much better than CRCs. If you have 500 nodes in a cluster, each with some disks, yes I will guarantee you that read/write corruption can occur and get into your app. TCP packets do arrive corrupted, even though they're not supposed to.

Yes, Jenkins is faster. But it's only a 32-bit hash, whereas MD5 is 128. Yes, Whirlpool is more secure. But I don't need a 512 bit checksum. MD5 is a great compromise.

Salts and HMAC are great. But you know what? The reality is that 9 out of 10 websites store your password in the clear. It would be nice if we could get the run-of-the-mill programmer to at least understand how to hash a password before trying to scare them off with the the more advanced stuff. Otherwise they're going to throw up their hands and say their app doesn't really need to be secure anyway.

You can't say MD5 without a geek chorus shouting "It's broken, you must not use it for anything." When regular programmers don't understand the basic utility of these fat hash functions they're missing out though. The fog of confusion hanging over the security space does't benefit Joe coder who could make practical use of these tools in general applications.

The message from security folks is that you shouldn't be using any of their algs for non-secure applications. If you use their stuff, you have to go all the way.

But that's bunk. The engineering tolerances for crypto security are way beyond what the typical application needs for general purpose utility out of these functions. MD5 is a great general-purpose hash. There is useful stuff in between the extremes of a crappy CRC and and SHA-512.

So MD5 away to make your stateless GUIDs and be happy. :)

Comments (10)

Jim McCoy:

If you need 128 bits or other large hash, you can also do something simple like take a SuperFastHash or a Jenkins hash and run it four times, with a standard salt for each iteration, and concatenate the result. This gives you a 128 bit hash function which is still faster than md5 (even though you ran it four times) and which is trivial to parallelize so you gain a linear speed-up on multi-core systems.

Another, smarter, option is to take a hash like Tiger and truncate the output. Tiger is only a percent or two slower than md5 on 32-bit hardware and is significantly faster on 64-bit hardware. By using a truncated version of what is normally a 192 bit hash you also have a backwards-compatible path for the future if the security of your 128 bit version is called into question. At it stands now, when someone develops the inevitable pre-image attack on md5 then most of the use-cases you proposed for md5 are suddenly looking very shaky.

t stratton:

Where did you get the "9 out of 10 websites store your password in the clear"?

I think I can believe it, but my boss is a bit skeptical...

While I agree we do still see some instances of clear text storage of passwords, I would have to say that it's nowhere near 90%. This is purely my own impression, but I would say it's closer to 20%-30% in all but trivial (my dog's web site for example) web sites.

Hmmmm...ran some time tests (on an ancient P4 3.0GHZ).

Hsieh hash ran 4 times on 1024 bytes was approx 10% slower than a single MD5.

I get Tiger times about 55% slower than MD5.

Jenkins however has his "hashlittle2" which basically gives you a 64 bit hash for the same speed as 32 bits. I ran that twice to get 128 bits and it came in about 2X the speed of MD5.

Does that make hashlittle2 run twice with different salts "better" than MD5?

So far, in this discussion, we've mentioned MD5, SHA1, SHA256, CRC32, Hsieh, Jenkins, the (underappreciated) Torek 5381 hash, and Whirlpool.

And you've provided 5 examples of settings to use hash functions in:

* UUID generation

* File integrity checking

* Password hashing

* Hash table indexing

* Random number generation

My challenge to you: pick one of these settings where you believe MD5 is the best choice, and defend the choice. Here, I'll help by tipping my hand:

* The best solution for generating UUIDs is to generate actual UUIDs, for instance by taking the SHA1 of time and PRNG output. Using content digests as a UUID is like using a last name as the primary key in a database.

* The best solution for offline file integrity monitoring is a secure hash function, such as SHA256. File integrity monitoring is checksumming with an adversary. If you don't have an adversary, just use CRC.

* The best solution for password hashing is adaptive hashing, a la Mazieres OpenBSD blowfish hash. You can tweak a setting to make a dictionary search of Mazieres hash take years.

* The best solution for hash table indexing is one of Jenkins, Hsieh, or Torek, any of which are faster than MD5. The point of a hash table is speed.

* The best solution for random generation again depends on whether you have an adversary. If you do, you use the PRNG your crypto service provider offers, and nothing else. If you don't, you use your OS's best RNG; for instance, /dev/urandom.

Is MD5 EVER the best choice?

... I mean, you called it the GOD of HASH.

Jim McCoy:

It seems I had been looking at the Tiger vs. MD5 numbers from a dual core box instead of a single core. Wei Dai has some nice comparisons (http://www.cryptopp.com/benchmarks.html and another link on the page gives P4 numbers for single-core performance.) Depending on the application and runtime environment two hashlittle2 runs seems better than MD5 and on a mutli-core box I suspect multiple Hsieh hashes would end up beating either choice. OTOH, a single MD4 might just be faster than all of them and it will definitely smoke MD5 on a single-core box.

As always, it depends on the application. The problem with MD5 is that it is not the fastest 128bit hash, it is not a secure 128bit hash, and it performs worse on multiple cores or 64bit boxes, so one has to start asking "what is it good for?"

For all of the use cases you presented previously there are better options available for each specific use case. Rabin gives better collision resistance, MD4 or multiple smaller hashes are faster, a "real" PRNG is available without much effort at all, and we all know that there are more secure hashes for applications that need it. The only thing I see MD5 has working in its favor is ubiquity - every language and every architecture has a version available.

Jim McCoy:

BTW, which hash implementations were you using? I highly recommend LibTomCrypt if you care about performance...

> ... I mean, you called it the GOD of HASH.

I blog to help others and also to learn. As it turns out both are aided by getting folks to actually read the stuff. Please pardon the necessary devices. :)

This thread has helped me realize why I need 128 bit hashes, and not 32 bit ones or 512 bit ones. It's because I want 64 bit UIDs and the birthday paradox means I need 128 bits of good hash to have 64 bits of non-colliding material.

If you've dealt with web scale problems you know why I might want to dedup 100B urls or hold a handle for every photo on the web or whatever. If you think you're going to trie them all or make a secondary index or something well good luck with your postgres dentist admin system, I'm sure it's fun to work on.

I would never actually use md5 for a small in-memory hash table, sorry that was badly worded on my part. I should have been more clear about my context. More along the lines of big DHT stuff, not replacements for the hash or dict language support which I assume everyone is already using and doesn't need replacement.

The problem with MD5 is that it is not the fastest 128bit hash, it is not a secure 128bit hash, and it performs worse on multiple cores or 64bit boxes, so one has to start asking "what is it good for?" ... The only thing I see MD5 has working in its favor is ubiquity - every language and every architecture has a version available.

This is the definition of a generalist: okay for a lot of uses, but probably not optimized for any particular case.

We often call the default 'sort' alg in our local langs despite the qsort implmentation not always being the optimal solution for ordering those five items in @foo. But it's there, the items are going to come out sorted, and exec differences are negligible.

aha, I get it now...

All the speed-optimized hashes we have are small ... they're 32 bits. 32 bits is useless, it's not enough for a checksum anymore, the odds of error slipping though are too high given modern data sizes and rates. So we have 128+ bit hashes that were designed to crypto requirements -- and discarded when they fell -- and little 32 bit hashes that have been optimized to heck for speed, but are too small. It's time for fast 128-bit general hashes.

OTOH, a single MD4 might just be faster than all of them and it will definitely smoke MD5 on a single-core box.
You know Jim I had looked at MD4 and wondered given my requirements if the 70% speed advantage over MD5 might make it the ideal hash. I gotta read up more on MD4.

P.S. Your dentist system is gonna be slow!

Bob says to me in email:

Jenkins claims that Hsieh's hash is fast but has funnels. He did like it though, and designed a new hash incorporating the good ideas. It is difficult to sort out the claims because these guys are not exact in which versions of the functions they are talking about, but lookup3.c is probably about as fast, and possibly a better hash. Of course both of these are still only 32 bits.

I actually was using lookup3.c for the Jenkins hash, not his original one. Maybe that is why it was doing so well against Hsieh. Given all of this I would say Jenkins lookup3.c would be my choice over Hsieh since it's as fast, Jenkins claims no funnels, and has a 64 bit version that's just as fast.


md5 is as fast as CRC-32! It has been optimized to hell.

At Topix, I linked Bob Jenkin's lookup3 in with perl, and it was about twice as fast as md5. At the time, I didn't think that it was significant enough speed savings to switch.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)


This page contains a single entry from the blog posted on August 17, 2007 9:56 AM.

The previous post in this blog was We Worship MD5, the GOD of HASH.

The next post in this blog is RSS reader shares for Skrentablog.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.33