The probability of a hash collision

learnbyexample@programming.dev · 1 day ago

The probability of a hash collision

The2b@lemmy.vg · 1 day ago

This is valid on a single unlisted assumption: The hash function has equal distribution. If your has function ends by multipliying the has value by 4, for example, your number of possible boxes is 1/4th the otherwise expected value based on the size of the hash output

TehPers@beehaw.org · 15 hours ago

The distribution is super important here too. Hashing any value to zero (or h(x) = 0) is valid, but a terrible distribution. The challenge is getting real-world values hashed in a mostly uniform distribution to avoid collisions where possible.

Still, the contents of the article are useful even outside of hashing. It should just disclaim that the width of the output isn’t the only thing important in a hash function.

squaresinger@lemmy.world · 21 hours ago

Of course. That’s one of the basic requirements for something to be an actual hash function.

The2b@lemmy.vg · 16 hours ago

Not only is md5sum not proven to have equal dostribution, it is specifically known to not have equal distribution, only nearly equal distribution.

A hash function is any function that converts an arbitrary input size into a specific output size deterministically. No other requirements are there. A hash for a simple job could be just adding the ascii values together and give the output. Needless to say, that would not have an even distribution.

FrederikNJS@lemmy.zip · 15 hours ago

You are correct for regular hash functions, but a cryptographic hash function has stronger requirements.

MD5 was supposed be a cryptographic hash function, but it was found to be flawed all the way back in 1996, and has been discouraged ever since… Now it’s too weak to be used in a cryptographic setting, and too slow to be used in non-cryptographic settings.

This is why hashes like xxhash is considered a non-cryptographic hash function, while SHA-256 is considered a cryptographic hash function.

squaresinger@lemmy.world · 14 hours ago

This is about cryptographic hashing functions (to be fair, could have spelled that out in my prior comment, but in general when someone talks about anything security relevant in conjunction with hashing, they always mean cryptographic hashing functions).

MD5 is not a cryptographic hashing function for exactly these reasons.

Also, the example you gave in your original comment wasn’t actually about distribution but about symbol space.

By multiplying by four (and I guess you implicitly meant that the bit length of the hash stays the same thus dropping two bits of information) you are reducing the number of possible hashes by a factor of four, because now 3/4 of all potential hashes can’t happen any more. So sure, if your 64bit hash is actually only a 62bit hash that just includes two constant 0 bits, then of course you have to calculate the collision chance for 62bits and not 64bits.

But if all hashes are still possible, and only the distribution isn’t perfectly even (like is the case with MD5), then the average chance for collisions doesn’t change at all. You have some hashes where collisions are more likely, but they are perfectly balanced with hashes where collisions are less likely.

tyler@programming.dev · 9 hours ago

The article calls out hash functions and links to the relevant Wikipedia page, so I don’t think this is solely about cryptographic hash functions, though that seems to be what you were talking to the other user about.

squaresinger@lemmy.world · 6 hours ago

I see what you are saying. But if you aren’t using a cryptographic hash function then collisions don’t matter in your use case anyway, otherwise you’d be using a cryptographic hash function.

For example, you’d use a non-cryptographic hash function for a hashmap. While collisions aren’t exactly desireable in that use case, they also aren’t bad and in fact, the whole process is designed with them in mind. And it doesn’t matter at all that the distribution might not be perfect.

So when we are talking about a context where collisions matter, there’s no question whether you should use a cryptographic hash or not.

Colloidal@programming.dev · 23 hours ago

The assumption is there though.

Wouldn’t multiplying the hash simply relabel the hash sites, as hashes non divisible by the factor simply be not accessible/not exist?

The2b@lemmy.vg · 16 hours ago

The hashes not being there isn’t particularly relevant within a hash function outputting a specific size. If your hash function is always 64 bits for example, the fact that you have 3/4th of them not exist means you should be operating as if its a 16 bit hash, not a 64 bit hash. If you still do this math based on the 64 bits outputted (2^64 boxes) you’d arrive at very inaccurate numbers.

anton@lemmy.blahaj.zone · 6 hours ago

That would be 62 bit, not 16 bit.

Justin@lemmy.jlh.name · 1 day ago

deleted by creator