Thinking Through Cryptographic Hash Collisions
Cryptographic Hash functions are fascinating. They take variable-length inputs (such as bytes of data) and return a fixed-length digest representing a unique input data signature. They are a type of one-way function that makes it easy to compute the hash digest but unfeasible to find the original input from its output. These are useful for many things, such as getting a “fingerprint” of data (you can use a hash to ensure that the contents of a file haven’t changed, for instance).
For instance, I can publish the hash digest of a file I’m sharing, and you can verify that the file you received also has the same hash, ensuring the data was not altered.
However, because hashes are a fixed length, only a finite number of hashes can be generated, while the number of possible inputs is practically infinite. Because hashes are a subset of all possible inputs, an endless number of potential “hash collisions” could occur.
Therefore, one of the requirements of a sound cryptographic hash function is that it must be unfeasible to generate a forged digest. If an attacker knew how to forge a signature, they could take my file, alter it in some nefarious way, along with their forged digest attack, and trick me into thinking the original file was unaltered.
A meaningful way to address this concern is through a colossal hash space.
What is a hash space?
A hash space is the total number of possible hashes for a given number of bits. For cryptographic hash functions, this is represented in bits of 2 to the power of n-bits. Each bit is a doubling of the previous value. For instance, a byte (8 bits) has a hash space of 256 values. That is, the 8 bits can represent any value from 00000000
to 11111111
.
As we add bits, our hash space grows exponentially.
When we reach 192 bits, the hash space has grown to 6.28e+57
(6277101735386680763835789423207666416102355444464034512896
) possible values. To put this into perspective, it is estimated that there are around 1.00e+20
(100000000000000000000
) grains of sand on Earth. Put another way, with a 192-bit hash, you could have 6.28e+37
(62771017353866807638357894232076664161
) hashes for every grain of sand on Earth.
Birthday Attacks
One of the things we care about in hashing tools is collision resistance. One type of collision is the Birthday Attack, named after the birthday problem in probability theory that demonstrates the counterintuitive fact that it takes only 23 randomly chosen people to have a 50% probability that two people have the same birthday even though there are 365 days a year. In terms of a hash space, with a hash space of 365 possible values, it only takes 23 randomly generated hashes to have a 50% chance of finding a collision.
To generalize this, we can calculate the number of attacks required to reach a percent chance (p) in hash space (H) by using this algorithm based on work in Introduction to Modern Cryptography (2005):
This is trivial to port into python
import math
def birthday_attack(bits, probability):
return math.sqrt(2 * (2**bits) * math.log(1 / (1 - probability)))
How many hashes would it take to get a 50% chance of a collision with an 8-bit hash?
18
>>> birthday_attack(8, 0.5)
18.83
What about a 128-bit hash (hash space: 3.40e+38
)?
2.17e+19
Interesting. That’s less than the number of grains of sand on earth.
What about a modern hashing algorithm like blake2, blake3, or sha3? For 256 bits, we have a hash space of 1.16e+77 (wow!) and a 50% probability at 4.01e+38 (this is greater than the entire hash space for 128-bit hashes!) The same holds true for 512-bit hashes. They have a hash space of 1.34e+154 with a 50% probability of 1.36e+77, larger than the entire hash space of 256-bit hashes.
SHA-1 and UUIDv5
SHA-1 is a 160-bit (20-byte) hashing function developed by the NSA in 1995. Though it hasn’t been considered secure against well-funded adversaries since 2005, it is still widely used in everything from git to UUIDv5.
In the case of UUIDv5, the implementation of SHA-1 is further constrained by the limitations of the UUID specification. UUID can represent a maximum of 128 bits (16 bytes), but the UUID specification uses an additional 6 bits for handling version and variant identifiers; this leaves UUIDv5 constrained to 122 bits for its hash space. This removes 38 bits (38 doublings of security) for use by UUIDv5. This would be an exciting area for research in finding real-world SHA-1 exploits for a reduced bit length of 122 bits, especially for UUIDs that might be generated from hashing user data.
Things start to get interesting regarding those 38 bits and the Birthday paradox.
A 122-bit hash with a 99.99999999999999% collision probability requires 1.98e+19
hashes to ensure a collision.
A 160-bit hash with 0.00000001% collision probability requires 1.71e+19.
In short, we are taking a 1 in 100 million event from a 160-bit hash space and turning it into an overwhelmingly likely event if we can reach the required threshold. While 1.98e19 hashes are high hash counts, modern hardware for calculating hashes in cryptocurrencies like ASIC and GPU makes it possible to hit this number of hashes. If you can process something like 1 Gh/s with a GPU, you have a 99.99999999999999% chance of finding a collision in less than two years. If you can access an ASIC that can process 72 Th/s, you could find a collision in 3 days.
Final Thought
It would make a great plot line for a Neal Stephenson novel: how interesting would it be if the entire cryptocurrency thing was one big social experiment to use market forces to accelerate hash collision technology for a certain three-letter agency?
Notable Links this Week
Fantastic and devastating CVE deep dive from LiveOverflow.
A part two on Fractal Zoom from Henry Segerman.
I grew up listening to Rage Against the Machine. This guy’s reaction video to hearing “killing in the name of” for the first time is priceless.