« Let's Paxos for lunch... | Main | Crypto vs. the working coder »

We Worship MD5, the GOD of HASH

For some time I had been looking for a mutual exclusion algorithm that satisfied my complete list of desirable properties. I finally found one--the N!-bit algorithm described in this paper. The algorithm is wildly impractical, requiring N! bits of storage for N processors, but practicality was not one of my requirements. So, I decided to publish a compendium of everything I knew about the theory of mutual exclusion.

The 3-bit algorithm described in this paper came about because of a visit by Michael Rabin. He is an advocate of probabilistic algorithms, and he claimed that a probabilistic solution to the mutual exclusion problem would be better than a deterministic one. I believe that it was during his brief visit that we came up with a probabilistic algorithm requiring just three bits of storage per processor. Probabilistic algorithms don't appeal to me. (This is a question of aesthetics, not practicality.) So later, I figured out how to remove the probability and turn it into a deterministic algorithm.
    -- Lamport

3N vs. N! Some folks just aren't comfortable with probablistic algorithms. Lamport here clearly knows what he is doing, but still has aesthetic problems with them.

In some people's minds, algorithms should be proveably correct at all times and for all inputs (as with defect-free programming and formal methods). Probabilistic algorithms give up this property. There is always a chance that the algorithm will produce a false result. But this chance can be made as small as desired. If the chance of the software failing is made smaller than the chance of the hardware failing (or of the user spontaneously combusting, or whatever), there's little to worry about.
    -- Bruce Schneier in Dr. Dobb's Journal

The common practical case I run into with coders is that they're unfamiliar with figuring how how big a hash they need to "not worry about" collisions. Here's the rule of thumb.

MD5 Quickie Tutorial

Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2n/2 items, or 2^64 for MD5.

2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

    18,446,744,073,709,551,616   2^64
               100,000,000,000  <2^37

(Another way of putting this is that the expected number of collisions from hasing a set of size 2^k bit strings hashed to m bit strings will be 22k-m collisions. [1])

Other MD5 tips & tricks

  • Unique ID generation

    Say you want to create a set of fixed-sized IDs based on chunks of text -- urls, for example. Urls can be long, with 100+ bytes common. They're varying sizes too. But md5(url) is 16 bytes, consistently, and you're unlikely to ever have a collision, so it's safe to use the md5 as an ID for the URL.

  • Checksums

    Don't trust your disk or your OS to properly detect errors for you. The CRC and protocol checksums they use are weak and bad data can get delivered.

    Instead, bring out an industrial strength checksum and protect your own data. MD5 your data before you stuff it onto the disk, check the MD5 when you read it.

        save_to_disk(data,md5(data))
        ...
        (data,md5) = read_from_disk()
        if (md5(data) != md5)
            read_error
    

    This kind of paranoia is healthy for code -- your module doesn't have to trust the teetering stack of plates if it's doing it's own end-to-end consistency check.

  • Password security

    Suppose you're writing a web app and you're going to have users login. They sign up with an account name and a password. How do you store the password?

    You could store the password in your database, "in the clear". But this should be avoided. If your site is hacked, someone could get a giant list of usernames and passwords.

    So instead, store md5(password) in the database. When a user tries to login, take the password they entered, md5 it, and then check it against what is in the database. The process can then forget the cleartext password they entered. If the site is hacked, no one can recover the list of passwords. Even employees are protected from casually seeing other people's passwords while debugging.

    If you don't store the password, how can you email it to someone if they forget it? Instead of emailing the user their forgotten password, instead invent a new, random password, store the md5 of it in the database, and email the new random password to the user.

    If a site can email you your original password, it's storing it in the clear in its database. Tisk, tisk.

  • Hash table addressing

    There are whole chapters of textbooks devoted to the pitfalls and difficulties of writing hash addressing algorithms. Because most of these algorithms are weak, they require you to rejigger your hash table size to be relatively prime to your original hash table size when you expand it.

    Forget that nonsense. MD5 isn't a weak hash function and you don't need to worry about that stuff. MD5 your key and have your table size be a power of 2. As an engineer, your table sizes should be powers of 2 anyway. Leave the primes to the academics.

  • Random number generation

    The typical library RNG available isn't generally very good. For the same reason that you want your hashes to be randomly distributed, you want your random numbers to actually be random, and not to have some underlying mathematical structure showing through.

    Having random numbers that can't be guessed or predicted can be surprisingly useful. MD5 based sequence numbers were a solution for the TCP sequence number guessing attacks.

    I also recall some players of an old online game who broke the game's RNG, and could predict the outcome of upcoming battles. The library RNG was known, the entire seed state was 32 bits, which was easy to plow throuh to find the seed the game was using. Solution: a stronger RNG, with more internal state, that can't be predicted.

    Here is an md5-based RNG that I wrote some time ago.

  • What if you need more than 16 bytes?

    You can use SHA1 or SHA256, which generate 160 and 256 bits of output, respectively. Or you can chain hashes together to get an arbitrary amount of output material:

        a = md5(s . '0')
        b = md5(s . '1')
    

    Because md5 is cryptographically secure, this is safe. You can make as many unique 16 byte hashes from an input string as you want.

        md5('Rich Skrenta')  = 15ddc636 023977a2 22c3423d a5e8fbee
        md5('Rich Skrenta0') = 4343e346 b4036f80 7015847d cf983010
        md5('Rich Skrenta1') = da79412d c52c47b4 fa7848e4 54f89614
    

  • I heard MD5 was broken and you should use SHA

    For cryptographic purposes, MD5 and SHA have both been broken such that a sophisticated attacker can create multiple documents that intentionally hash to the same value.

    But for practical uses like hash tables, decent RNGs, and unique ID generation, these algorithms maintain their full utility. The alternatives considered are often non-secure CRCs or hashes anyway, so a cryptographic hash weakness is not a concern.

    If you're concerned about some nefarious actor leaving data around designed to deliberately cause hash collisions in your algorithm, throw a secret salt onto the end beginning of the material that you're hashing:

            hash = md5(s . 'xyzzy')  [good point]
            hash = md5('xyzzy' . s)
    

  • Isn't MD5 overkill?

    Folks sometimes say MD5 is "overkill" for a lot of these applications. But it's good, cheap, strong, and it works. It's not going to cause you problems if you use it. You're not going to ever have to debug it or second guess it. If you have perf problems, and suspect MD5, and then go profile your code, it's not going to be MD5 that's causing your problems. You're going to find that it was something else.

    But if you feel you absolutely must leave the path and look for some faster hashes, check out Bob Jenkins' site. [Also see the Hsieh hash, it looks very good.]

  • How fast is MD5?

    About as fast as your disk or network transfer rate.

    AlgorithmSizeMB/s
    MD4128165.0
    MD512898.8
    SHA-116058.9

    These are 2004 numbers from the perl Digest implementation.

Be happy and love the MD5.

TrackBack

Listed below are links to weblogs that reference We Worship MD5, the GOD of HASH:

» Crypto vs. the working coder from Skrentablog
Working in security tends to make people jumpy and nervous. Most security coders don't understand any of the the crypto internals of the tools they use, so they must rely on a handful of trusted experts like Schneier to tell... [Read More]

» URL Shortening: Hashes In Practice from Coding Horror
I've become a big fan of Twitter. My philosophy is, when in doubt, make it public, and Twitter is essentially public instant messaging. This suits me fine. Well, when Twitter is actually up and running, at least. Its bouts... [Read More]

» You're Probably Storing Passwords Incorrectly from Coding Horror
In Rainbow Hash Cracking, I described an emerging class of password attacks known as Rainbow Tables. But I wasn't very clear exactly why the attack is so dangerous. The web is nothing if not a maze of user accounts... [Read More]

Comments (15)

Codeless:

SHA1 is not broken yet. No known SHA1 collisions have ever been created (only for implementations with fewer than 80 rounds). Have you noticed that pretty much all new software projects for the last couple of years have chosen SHA1 in preference to MD5? There's a reason for that.

Yeah of course. SHA has been the default choice over MD5 for crypto for the past 10 years. But "SHA is broken" (Schneier). Guess it depends on what you mean by "broken". If you're designing crypto, SHA isn't good enough anymore. If you're not, MD5 or even MD4 are good enough.

I's a good idea to salt password hashes too. Instead of storing 'md5(password)' store 'salt + delimiter + md5(salt + password)' where salt is a random value for each user. This avoids someone running a dictionary attack against your plain-md5 password list and compromising poorly-chosen passwords.

How cheap is it? What kind of machine did
those benchmarks from?

Saying and MD5 is cheap is probably true in
a general sense, and in programmer time.
Stating that it should be used as a general
hashing function for small data structures,
might be a little dangerous.

I highly recommend Bob Jenkins site.

Cheers..

mind:

point taken about md5 being 'good enough' for your purposes. yes, sha1 is cryptographically broken, move to sha512/sha256 ("sha2").

however i will add.. your handwaving example of using

hash = md5(s . 'xyzzy')

to overcome a malicious party who wants to cause collisions is not going to work. appending the same thing to two messages that hash the same will yield two new messages that hash the same. in fact, this is what makes it so easy to create arbitrary messages which hash to the same thing (it does not require a "sophisticated attacker" at all)

Good point, corrected.

The times came from the perl man page for Digest.
"These numbers was achieved Apr 2004 with ActivePerl-5.8.3 running under Linux on a P4 2.8 GHz CPU."

Poke:

You're going to index hash tables with MD5? Who knew you could address 2^128 bytes of RAM.

For SMALL hashtables using MD5 isn't really a win. For LARGE hashtables (hundreds of megs) they can be a big win because you use direct addressing and size your hashtable more efficiently.

Also, Hashes as primary keys in databases isn't a good idea for clustered DBs like INNODB or Oracle.

There are are some tricks here though (we use them).

They're a great primitive though since they're deterministic and you can perform content routing based on URL.

Kevin

If you need arbitrary length data, you should use the HMAC variant of MD5 or SHA rather than merely concatenating.

http://en.wikipedia.org/wiki/HMAC

Joshua

1) MD5 is expensive if you do enough of it. It becomes noticeable. When you start doing 2,000,000,000+ MD5's a day you feel it.

2) You can optimize MD5 lookups by using rainbow table s with reduction functions.

None-the-less, MD5 is good. And you can get hardware acceleration for lots of hashing functions these days.

Rich, is it just me or are you totally geeking out right now? It's like an object of great mass and density has been given sufficient angular momentum to escape the gravitational pull of your shoulders :)

Amit:

I salt my passwords too. I don't think it promotes arbitrary guesses at passwords. The point is to avert dictionary attacks. For someone to crack the password they'd have to guess the specific formatting you're using to obfuscate the straight passwords that users pick. The only way hashes will be the same is if the exact format of the salt+password is guessed. But, I agree SHA512 is the way to go as far as one-way crypto algorithms are concerned.

Also, salting helps in situations where users pick really silly passwords like "password". If you really want to insist on strict passwords, not only will you have to educate users, but you'll have to require alphanumeric passwords (including special characters) be used.

Another thing I do is if a user forgets their password, I send a random alphanumeric password which allows them to login to the system, but upon entry, I redirect them to reset their password before being allowed to use the application. This helps in situations where people are sent a temporary password and eventually forget the temporary password because it has no particular meaning to them.

Jim McCoy:

It is also worth noting that MD5 slows down significantly on 64-bit hardware. Given the fact that, as others have noted, it is overkill for most of the tasks listed and inadequate for many of the others I think you screwed up on this one Rich. MD5 is "good enough" only if you are lazy or don't really know what you are doing (in terms of not really knowing your threat model or performance requirements, not in terms of being stupid. :)

For just about every task listed there is an alternative which is better, so md5 only wins at being a generalist and not as being particularly good for any specific purpose. There is a reason that cryptographers do not use md5 and there is a reason that people who care about performance and resource usage avoid cryptographic hashes if they do not need the crypto-secure properties...

carbon.earth:

I'm not convinced that either of these (below) is more secure than the other.

> hash = md5(s . 'xyzzy') [good point]
> hash = md5('xyzzy' . s)

I thought MD5's were produced based on the whole input and any collisions occur due to rounding.

If you find an example of two values that return the same MD5 hash I'm pretty certain the resulting hashes will differ once a salt is either prepended or appended to the original value.

perhaps the commenter before was mistaking your example of:
md5(s . 'xyzzy')
with this:
md5(s) . 'xyzzy'

MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. But if you make even a tiny change to the input string, you'll get a completely different output hash.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on August 16, 2007 11:15 AM.

The previous post in this blog was Let's Paxos for lunch....

The next post in this blog is Crypto vs. the working coder.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.33