Kids poke hole in protocols that spent years peer-reviewing their way through the IETF. Implementations are about as secure as swiss cheese, but it doesn't matter since the commercial success of a security product has more to do with its channel marketing strategy than actual security. Rumors surface that some Chinese mathematicians have wrecked part of the functional toolkit we've used for the last decade in all of our products, and it's time to pack up the tents and move, again.
So a culture of nit-picking and paranoia surrounds crypto stuff. If you are using a security algorithm, so the thinking goes, it must be because there is a threat. And if there is a threat, the algorithm must be made perfectly secure.
That may be an appropriate way to think for security products. But it turns out that security techniques are often useful in general programming. MD5 is a great checksum, much better than CRCs. If you have 500 nodes in a cluster, each with some disks, yes I will guarantee you that read/write corruption can occur and get into your app. TCP packets do arrive corrupted, even though they're not supposed to.
Yes, Jenkins is faster. But it's only a 32-bit hash, whereas MD5 is 128. Yes, Whirlpool is more secure. But I don't need a 512 bit checksum. MD5 is a great compromise.
Salts and HMAC are great. But you know what? The reality is that 9 out of 10 websites store your password in the clear. It would be nice if we could get the run-of-the-mill programmer to at least understand how to hash a password before trying to scare them off with the the more advanced stuff. Otherwise they're going to throw up their hands and say their app doesn't really need to be secure anyway.
You can't say MD5 without a geek chorus shouting "It's broken, you must not use it for anything." When regular programmers don't understand the basic utility of these fat hash functions they're missing out though. The fog of confusion hanging over the security space does't benefit Joe coder who could make practical use of these tools in general applications.
The message from security folks is that you shouldn't be using any of their algs for non-secure applications. If you use their stuff, you have to go all the way.
But that's bunk. The engineering tolerances for crypto security are way beyond what the typical application needs for general purpose utility out of these functions. MD5 is a great general-purpose hash. There is useful stuff in between the extremes of a crappy CRC and and SHA-512.
So MD5 away to make your stateless GUIDs and be happy. :)
Comments (10)
If you need 128 bits or other large hash, you can also do something simple like take a SuperFastHash or a Jenkins hash and run it four times, with a standard salt for each iteration, and concatenate the result. This gives you a 128 bit hash function which is still faster than md5 (even though you ran it four times) and which is trivial to parallelize so you gain a linear speed-up on multi-core systems.
Another, smarter, option is to take a hash like Tiger and truncate the output. Tiger is only a percent or two slower than md5 on 32-bit hardware and is significantly faster on 64-bit hardware. By using a truncated version of what is normally a 192 bit hash you also have a backwards-compatible path for the future if the security of your 128 bit version is called into question. At it stands now, when someone develops the inevitable pre-image attack on md5 then most of the use-cases you proposed for md5 are suddenly looking very shaky.
Posted by Jim McCoy | August 17, 2007 3:04 PM
Posted on August 17, 2007 15:04
Where did you get the "9 out of 10 websites store your password in the clear"?
I think I can believe it, but my boss is a bit skeptical...
Posted by t stratton | August 17, 2007 4:47 PM
Posted on August 17, 2007 16:47
While I agree we do still see some instances of clear text storage of passwords, I would have to say that it's nowhere near 90%. This is purely my own impression, but I would say it's closer to 20%-30% in all but trivial (my dog's web site for example) web sites.
Posted by Phil Peterman | August 17, 2007 5:10 PM
Posted on August 17, 2007 17:10
Hmmmm...ran some time tests (on an ancient P4 3.0GHZ).
Hsieh hash ran 4 times on 1024 bytes was approx 10% slower than a single MD5.
I get Tiger times about 55% slower than MD5.
Jenkins however has his "hashlittle2" which basically gives you a 64 bit hash for the same speed as 32 bits. I ran that twice to get 128 bits and it came in about 2X the speed of MD5.
Does that make hashlittle2 run twice with different salts "better" than MD5?
Posted by Rich Skrenta | August 17, 2007 5:28 PM
Posted on August 17, 2007 17:28
So far, in this discussion, we've mentioned MD5, SHA1, SHA256, CRC32, Hsieh, Jenkins, the (underappreciated) Torek 5381 hash, and Whirlpool.
And you've provided 5 examples of settings to use hash functions in:
* UUID generation
* File integrity checking
* Password hashing
* Hash table indexing
* Random number generation
My challenge to you: pick one of these settings where you believe MD5 is the best choice, and defend the choice. Here, I'll help by tipping my hand:
* The best solution for generating UUIDs is to generate actual UUIDs, for instance by taking the SHA1 of time and PRNG output. Using content digests as a UUID is like using a last name as the primary key in a database.
* The best solution for offline file integrity monitoring is a secure hash function, such as SHA256. File integrity monitoring is checksumming with an adversary. If you don't have an adversary, just use CRC.
* The best solution for password hashing is adaptive hashing, a la Mazieres OpenBSD blowfish hash. You can tweak a setting to make a dictionary search of Mazieres hash take years.
* The best solution for hash table indexing is one of Jenkins, Hsieh, or Torek, any of which are faster than MD5. The point of a hash table is speed.
* The best solution for random generation again depends on whether you have an adversary. If you do, you use the PRNG your crypto service provider offers, and nothing else. If you don't, you use your OS's best RNG; for instance, /dev/urandom.
Is MD5 EVER the best choice?
Posted by Thomas H. Ptacek | August 17, 2007 7:38 PM
Posted on August 17, 2007 19:38
... I mean, you called it the GOD of HASH.
Posted by Thomas H. Ptacek | August 17, 2007 7:39 PM
Posted on August 17, 2007 19:39
It seems I had been looking at the Tiger vs. MD5 numbers from a dual core box instead of a single core. Wei Dai has some nice comparisons (http://www.cryptopp.com/benchmarks.html and another link on the page gives P4 numbers for single-core performance.) Depending on the application and runtime environment two hashlittle2 runs seems better than MD5 and on a mutli-core box I suspect multiple Hsieh hashes would end up beating either choice. OTOH, a single MD4 might just be faster than all of them and it will definitely smoke MD5 on a single-core box.
As always, it depends on the application. The problem with MD5 is that it is not the fastest 128bit hash, it is not a secure 128bit hash, and it performs worse on multiple cores or 64bit boxes, so one has to start asking "what is it good for?"
For all of the use cases you presented previously there are better options available for each specific use case. Rabin gives better collision resistance, MD4 or multiple smaller hashes are faster, a "real" PRNG is available without much effort at all, and we all know that there are more secure hashes for applications that need it. The only thing I see MD5 has working in its favor is ubiquity - every language and every architecture has a version available.
Posted by Jim McCoy | August 17, 2007 9:17 PM
Posted on August 17, 2007 21:17
BTW, which hash implementations were you using? I highly recommend LibTomCrypt if you care about performance...
Posted by Jim McCoy | August 17, 2007 9:18 PM
Posted on August 17, 2007 21:18
> ... I mean, you called it the GOD of HASH.
I blog to help others and also to learn. As it turns out both are aided by getting folks to actually read the stuff. Please pardon the necessary devices. :)
This thread has helped me realize why I need 128 bit hashes, and not 32 bit ones or 512 bit ones. It's because I want 64 bit UIDs and the birthday paradox means I need 128 bits of good hash to have 64 bits of non-colliding material.
If you've dealt with web scale problems you know why I might want to dedup 100B urls or hold a handle for every photo on the web or whatever. If you think you're going to trie them all or make a secondary index or something well good luck with your postgres dentist admin system, I'm sure it's fun to work on.
I would never actually use md5 for a small in-memory hash table, sorry that was badly worded on my part. I should have been more clear about my context. More along the lines of big DHT stuff, not replacements for the hash or dict language support which I assume everyone is already using and doesn't need replacement.
This is the definition of a generalist: okay for a lot of uses, but probably not optimized for any particular case.
We often call the default 'sort' alg in our local langs despite the qsort implmentation not always being the optimal solution for ordering those five items in @foo. But it's there, the items are going to come out sorted, and exec differences are negligible.
aha, I get it now...
All the speed-optimized hashes we have are small ... they're 32 bits. 32 bits is useless, it's not enough for a checksum anymore, the odds of error slipping though are too high given modern data sizes and rates. So we have 128+ bit hashes that were designed to crypto requirements -- and discarded when they fell -- and little 32 bit hashes that have been optimized to heck for speed, but are too small. It's time for fast 128-bit general hashes.
You know Jim I had looked at MD4 and wondered given my requirements if the 70% speed advantage over MD5 might make it the ideal hash. I gotta read up more on MD4.P.S. Your dentist system is gonna be slow!
Posted by Rich Skrenta | August 18, 2007 10:47 AM
Posted on August 18, 2007 10:47
Bob says to me in email:
I actually was using lookup3.c for the Jenkins hash, not his original one. Maybe that is why it was doing so well against Hsieh. Given all of this I would say Jenkins lookup3.c would be my choice over Hsieh since it's as fast, Jenkins claims no funnels, and has a 64 bit version that's just as fast.
Posted by Rich Skrenta | August 18, 2007 10:53 AM
Posted on August 18, 2007 10:53