For some time I had been looking for a mutual exclusion algorithm that satisfied my complete list of desirable properties. I finally found one--the N!-bit algorithm described in this paper. The algorithm is wildly impractical, requiring N! bits of storage for N processors, but practicality was not one of my requirements. So, I decided to publish a compendium of everything I knew about the theory of mutual exclusion.
The 3-bit algorithm described in this paper came about because of a visit by Michael Rabin. He is an advocate of probabilistic algorithms, and he claimed that a probabilistic solution to the mutual exclusion problem would be better than a deterministic one. I believe that it was during his brief visit that we came up with a probabilistic algorithm requiring just three bits of storage per processor. Probabilistic algorithms don't appeal to me. (This is a question of aesthetics, not practicality.) So later, I figured out how to remove the probability and turn it into a deterministic algorithm.
3N vs. N!
Some folks just aren't comfortable with probabilistic algorithms. Lamport here clearly knows what he is doing, but still has aesthetic problems with them.
In some people's minds, algorithms should be provably correct at all times and for all inputs (as with defect-free programming and formal methods). Probabilistic algorithms give up this property. There is always a chance that the algorithm will produce a false result. But this chance can be made as small as desired. If the chance of the software failing is made smaller than the chance of the hardware failing (or of the user spontaneously combusting, or whatever), there's little to worry about.
-- Bruce Schneier in Dr. Dobb's Journal
The common practical case I run into with coders is that they're unfamiliar with figuring out how big a hash they need to "not worry about" collisions. Here's the rule of thumb.
MD5 Quickie Tutorial
Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.
So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing about 2^(n/2) items for an n-bit hash, or 2^64 for MD5.
2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:
    18,446,744,073,709,551,616 = 2^64
               100,000,000,000 < 2^37
(Another way of putting this is that the expected number of collisions from hashing a set of 2^k strings down to m-bit values will be about 2^(2k-m).)
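To put numbers on the rule of thumb, here is a minimal sketch using the standard birthday approximation. The function names are made up for illustration. Counting colliding pairs exactly gives about half of the 2^(2k-m) figure above, which is the same order of magnitude.

```python
import math

def expected_collisions(num_items: int, hash_bits: int) -> float:
    """Approximate expected number of colliding pairs: n*(n-1)/2 / 2^bits."""
    return num_items * (num_items - 1) / 2 / 2 ** hash_bits

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision: 1 - exp(-expected pairs)."""
    return -math.expm1(-expected_collisions(num_items, hash_bits))

urls = 100_000_000_000  # 100 billion urls, as in the example above

print(expected_collisions(urls, 128))  # MD5: vanishingly small
print(expected_collisions(urls, 64))   # a 64-bit hash: hundreds of collisions
```

The second line is the reason a 64-bit hash is not enough for a web-sized set, while 128 bits leaves plenty of room.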
Other MD5 tips & tricks
- Unique ID generation
Say you want to create a set of fixed-sized IDs based on chunks of text -- urls, for example. Urls can be long, with 100+ bytes common. They're varying sizes too. But md5(url) is 16 bytes, consistently, and you're unlikely to ever have a collision, so it's safe to use the md5 as an ID for the URL.
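A minimal sketch of the idea in Python, using the standard hashlib module. The helper name url_id and the example urls are placeholders, not anything from a real system.

```python
import hashlib

def url_id(url: str) -> bytes:
    """Return a fixed-size 16-byte ID derived from the url text."""
    return hashlib.md5(url.encode("utf-8")).digest()

short = url_id("http://example.com/")
long = url_id("http://example.com/" + "a/" * 200)

# Both IDs are 16 bytes, no matter how long the url was.
print(len(short), len(long))
```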
- Checksums
Don't trust your disk or your OS to properly detect errors for you. The CRC and protocol checksums they use are weak and bad data can get delivered.
Instead, bring out an industrial strength checksum and protect your own data. MD5 your data before you stuff it onto the disk, check the MD5 when you read it.
    save_to_disk(data, md5(data))
    ...
    (data, stored_md5) = read_from_disk()
    if (md5(data) != stored_md5)
        read_error
This kind of paranoia is healthy for code -- your module doesn't have to trust the teetering stack of plates if it's doing its own end-to-end consistency check.
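Here is one way the end-to-end check might look in runnable form. This is a sketch under assumptions: the on-disk layout (16-byte digest followed by the data) and the file name are invented for the example.

```python
import hashlib
import os
import tempfile

DIGEST_LEN = 16  # MD5 digests are 16 bytes

def save_to_disk(path: str, data: bytes) -> None:
    """Write the MD5 digest followed by the data itself."""
    with open(path, "wb") as f:
        f.write(hashlib.md5(data).digest() + data)

def read_from_disk(path: str) -> bytes:
    """Read the record back, raising if the data no longer matches its digest."""
    with open(path, "rb") as f:
        blob = f.read()
    stored, data = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.md5(data).digest() != stored:
        raise IOError("read_error: checksum mismatch")
    return data

path = os.path.join(tempfile.mkdtemp(), "record.bin")
save_to_disk(path, b"important payload")
print(read_from_disk(path))
```

If a byte of the payload gets flipped anywhere between the write and the read, read_from_disk raises instead of handing back bad data.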
- Password security
Suppose you're writing a web app and you're going to have users log in. They sign up with an account name and a password. How do you store the password?
You could store the password in your database, "in the clear". But this should be avoided. If your site is hacked, someone could get a giant list of usernames and passwords.
So instead, store md5(password) in the database. When a user tries to log in, take the password they entered, md5 it, and then check it against what is in the database. The process can then forget the cleartext password they entered. If the site is hacked, the attacker gets a list of hashes rather than a readable list of passwords. Even employees are protected from casually seeing other people's passwords while debugging.
If you don't store the password, how can you email it to someone if they forget it? Instead of emailing the user their forgotten password, invent a new, random password, store the md5 of it in the database, and email the new random password to the user.
If a site can email you your original password, it's storing it in the clear in its database. Tsk, tsk.
- Hash table addressing
There are whole chapters of textbooks devoted to the pitfalls and difficulties of writing hash addressing algorithms. Because most of these algorithms are weak, they require you to rejigger your hash table size to be relatively prime to your original hash table size when you expand it.
Forget that nonsense. MD5 isn't a weak hash function and you don't need to worry about that stuff. MD5 your key and have your table size be a power of 2. As an engineer, your table sizes should be powers of 2 anyway. Leave the primes to the academics.
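A minimal sketch of power-of-2 addressing, assuming string keys. The function name bucket_index and the choice of the first 8 digest bytes are illustrative, not a standard recipe.

```python
import hashlib

def bucket_index(key: str, table_size: int) -> int:
    """Map a key to a slot in a table whose size is a power of 2."""
    if table_size <= 0 or table_size & (table_size - 1) != 0:
        raise ValueError("table_size must be a power of 2")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer and keep only the low bits.
    return int.from_bytes(digest[:8], "big") & (table_size - 1)

print(bucket_index("apple", 1024))
print(bucket_index("apple", 2048))
```

When the table doubles, the mask simply picks up one more bit of the hash, so a key either stays in its old slot or moves up by exactly the old table size.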
- Random number generation
The typical library RNG available isn't generally very good. For the same reason that you want your hashes to be randomly distributed, you want your random numbers to actually be random, and not to have some underlying mathematical structure showing through.
Having random numbers that can't be guessed or predicted can be surprisingly useful. MD5 based sequence numbers were a solution for the TCP sequence number guessing attacks.
I also recall some players of an old online game who broke the game's RNG, and could predict the outcome of upcoming battles. The library RNG was known, and its entire seed state was 32 bits, which was easy to plow through to find the seed the game was using. Solution: a stronger RNG, with more internal state, that can't be predicted.
Here is an md5-based RNG that I wrote some time ago.
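That generator isn't reproduced here. As a rough illustration of the general idea only, here is a hypothetical counter-style sketch where each 16-byte block is md5(seed + counter). The class name and layout are assumptions, and the output is only as unpredictable as the seed is secret.

```python
import hashlib

class Md5Rng:
    """Hypothetical counter-mode generator: block i is md5(seed + i)."""

    def __init__(self, seed: bytes):
        self.seed = seed
        self.counter = 0

    def next_bytes(self) -> bytes:
        """Return the next 16 pseudo-random bytes."""
        counter_bytes = self.counter.to_bytes(8, "big")
        self.counter += 1
        return hashlib.md5(self.seed + counter_bytes).digest()

    def next_u32(self) -> int:
        """Return a 32-bit unsigned integer taken from the next block."""
        return int.from_bytes(self.next_bytes()[:4], "big")

rng = Md5Rng(b"a long, secret seed")
print([rng.next_u32() for _ in range(3)])
```

The seed can be as long as you like, so the internal state is no longer limited to 32 bits the way the game's library RNG was.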
- What if you need more than 16 bytes?
You can use SHA1 or SHA256, which generate 160 and 256 bits of output, respectively. Or you can chain hashes together to get an arbitrary amount of output material:
    a = md5(s . '0')
    b = md5(s . '1')
Because md5 is cryptographically secure, this is safe. You can make as many unique 16 byte hashes from an input string as you want.
    md5('Rich Skrenta')  = 15ddc636 023977a2 22c3423d a5e8fbee
    md5('Rich Skrenta0') = 4343e346 b4036f80 7015847d cf983010
    md5('Rich Skrenta1') = da79412d c52c47b4 fa7848e4 54f89614
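The same chaining trick as a small Python sketch. The helper name expand is made up, and appending the decimal counter to the string mirrors the '0' and '1' suffixes above.

```python
import hashlib

def expand(s: str, num_blocks: int) -> bytes:
    """Concatenate md5(s + '0'), md5(s + '1'), ... into one longer output."""
    out = b""
    for i in range(num_blocks):
        out += hashlib.md5((s + str(i)).encode("utf-8")).digest()
    return out

material = expand("Rich Skrenta", 2)
print(len(material), material.hex())
```

Two blocks give 32 bytes of output; ask for more blocks if you need more material.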
- I heard MD5 was broken and you should use SHA
For cryptographic purposes, MD5 and SHA have both been broken such that a sophisticated attacker can create multiple documents that intentionally hash to the same value.
But for practical uses like hash tables, decent RNGs, and unique ID generation, these algorithms maintain their full utility. The alternatives considered are often non-secure CRCs or hashes anyway, so a cryptographic hash weakness is not a concern.
If you're concerned about some nefarious actor leaving data around designed to deliberately cause hash collisions in your algorithm, throw a secret salt onto the beginning of the material that you're hashing:
    hash = md5('xyzzy' . s)
- Isn't MD5 overkill?
Folks sometimes say MD5 is "overkill" for a lot of these applications. But it's good, cheap, strong, and it works. It's not going to cause you problems if you use it. You're not going to ever have to debug it or second guess it. If you have perf problems, and suspect MD5, and then go profile your code, it's not going to be MD5 that's causing your problems. You're going to find that it was something else.
- How fast is MD5?
About as fast as your disk or network transfer rate.
    Algorithm  Size (bits)   MB/s
    MD4        128          165.0
    MD5        128           98.8
    SHA-1      160           58.9
These are 2004 numbers from the perl Digest implementation.
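If you'd rather measure on your own machine, a rough timing sketch with Python's hashlib might look like this. The function name and the 16 MB test size are arbitrary choices, and the results will vary with hardware and library build.

```python
import hashlib
import time

def throughput_mb_per_s(algorithm: str, size_mb: int = 16) -> float:
    """Hash size_mb megabytes of zero bytes and return the observed MB/s."""
    chunk = bytes(1024 * 1024)  # one megabyte of zeros
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(size_mb):
        h.update(chunk)
    h.digest()
    elapsed = time.perf_counter() - start
    return size_mb / elapsed

for name in ("md5", "sha1", "sha256"):
    print(f"{name:8s} {throughput_mb_per_s(name):8.1f} MB/s")
```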
Be happy and love the MD5.