Hashing is a common occurrence in digital applications, whether it be for keeping passwords safe or for digital signatures, but what type of algorithm should you use and why?
What is a Hash Anyway?
Hashing is a one-way function meaning that anything that comes out of a hash function can't be used to work out what went in. This property derives from there being an infinite number of inputs that map to any given output.
A hash function is, in essence, like making a cake: once you've mixed the ingredients, you're not going to get the eggs back out again.
Why should we do it, and when?
Hashing can be used, in general, whenever a short represent of data is required. A common application of this type of hash is in a dictionary data structure where keys are hashed and indexed into an array based on that hash. Using the full key would run into issues when indexing on strings.
Another use of hashing is to verify the integrity of data. This use is found in applications ranging from the OpenZFS filesystem to web traffic. While I won't discuss the ins and outs of guaranteeing integrity with hashes here, I will cover how they work in general terms.
Integrity verification is done by hashing data and storing the hash nearby. When data is accessed, such as by opening a file in the ZFS example, it can be hashed again and checked against the previous hash to ensure it isn't corrupt. This same technique is used to verify that HTTPS traffic arrives without being tampered.
The final use case I'll cover is maintaining confidentiality through hashing. Because hashing data makes it unrecoverable, it can be used to transmit particular data without risking leaking it.
To further illustrate this point, as I feel the wording seems counterintuitive, transferring a secret key risks the compromise of said key. While hashing alone won't disarm the dangers of replay attacks, it can be used to prove you know the key.
For later reference, you could model a transmission with integrity checking as:
m|h = m + H(m)
m is the message,
h is the hash, and
H() is a hash function.
If I were to send a message and append to it the hash, you could verify the data arrived intact, but you can't verify it came from me.
To overcome this limitation, I can append a secret to the message before hashing it, provided we both know the secret. Again, I send you a message, but this time, I append the key before hashing giving me:
m|h = m + H(m + S)
If you repeat the process by appending the secret and hashing, and if those hashes match, then the data is from someone who knows the secret.
In real-world usage, it is arguably more common to see the hash encrypted with a private key to form a digital signature. With private key encryption, you can decrypt with the matching public key and verify the hash.
A similar use case is password hashing. Just like for transmitting a secret, password hashing is used to leak as little information as possible in the event of a database breach.
I saved this one until last because it needs to be handled slightly differently. Let me elaborate: in every previous example, a high-speed hash has been desirable, but that is not the case here.
For a password hash, it's all about trading attacker time for user patience. Many hash algorithms are built for speed, allowing for millions of hashes per second. While that is good for most applications, it allows an attack to process vasts swathes of the search space.
Although a good hash should still be resilient against this attack, especially when the password is salted to avoid rainbow attacks, it's an unneccessary risk.
A slow hash function changes the cost to the attacker. Even increasing the time per hash to 250 milliseconds limits an attacker to four hashes per second. Even when an attack is scaled up, it's still not even close to enough to exhausting the search space.
Furthermore, password hashing benefits greatly from adding a random variation to the input. I alluded to this point earlier when I talked about salting passwords. Unsalted passwords are much weaker to brute forcing because the attacker doesn't need to break your hash individually. Since all the passwords in the database were stored with the same function, guessing the input is more likely to find a hash that matches.
For example, once an attacker tries
MyAwesomeP@ssword123 and finds no matches, they can be sure that that password is not in the database.
However, when a salt is in play, they can only be use that that password and salt combination are not in the database.
This difference means that to exhaustively search for a given password, the attacker needs to try it paired with every listed salt.
Another benefit of salting is that, in the unlikely event the password is cracked, it restricts the damage. Because the salt makes the hash unique even when a second user uses the password, the attacker would need to begin the process from scratch to crack another password, further driving costs up for the poor soul.
A note on salting from NIST Special Publication 800-132:
The length of the randomly-generated portion of the salt shall be at least 128 bits.
We've gone over some uses of hashes, so let's look at the functions themselves.
As a side note: of the functions listed, only Scrypt and Argon2 would be suitable for password storage. Other functions similar to scrypt and Argon2 exist and would be sufficient too.
MD5 is an older hash algorithm and is broken, don't use it.
The SHA-1 family, first published in 1995, is a family of hash functions sharing a similar construction to MD5. Since 2015, it has been considered broken.
The SHA-2 family is the second iteration of the NIST Secure Hash Algorithms standard.
As of 2020, SHA-2 stands unbroken as a hash function.
Members of the SHA-2 family share a common naming pattern of SHA-XXX, where XXX is the length of the output in bits. For example, SHA-256, SHA-384, and SHA-512 are all members of the SHA-2 family.
Some care should be given to the use of some members of the family, however, as they can be weak to length extension, but that is beyond the scope of this article.
A Performance Note
On a 64-bit architecture, SHA-384 and SHA-512 perform considerably faster than other variants due to using a 64-bit internal state. This statement holds for truncated hashes as well, so in most libraries, SHA-512 with an output size of 256-bits will often outperform SHA-256.
Using a truncated hash function also offers better resistance to length extension attacks.
SHA-3, or Keccak, is a fairly new member of the Secure Hash Algorithms standard that offers considerably better length extension resistance than its predecessor, at the cost of lower performance.
The SHA-3 family consists of functions named SHA3-XXX, SHAKE128, and SHAKE256.
Scrypt is a key derivation function (KDF), which makes it good for password hashing and generating encryption keys from passwords.
Argon2 is also a key derivation function built to resist GPU cracking and side-channel attacks better.