Re: backup manifests
From: Andres Freund
Subject: Re: backup manifests
Date:
Msg-id: 20200327195624.xthhd4xuwabvd3ou@alap3.anarazel.de
In reply to: Re: backup manifests (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-03-27 14:13:17 -0400, Robert Haas wrote:
> On Thu, Mar 26, 2020 at 4:44 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > I mean, the property that I care about is the one where it detects
> > > better than 999,999,999 errors out of every 1,000,000,000, regardless
> > > of input length.
> >
> > Throwing these kinds of things around I really don't think is useful.
>
> ...but I don't understand his reasoning, or yours.
>
> My reasoning for thinking that the number is accurate is that a 32-bit
> checksum has 2^32 possible results. If all of those results are
> equally probable, then the probability that two files with unequal
> contents produce the same result is 2^-32. This does assume that the
> hash function is perfect, which no hash function is, so the actual
> probability of a collision is likely higher. But if the hash function
> is pretty good, it shouldn't be all that much higher. Note that I am
> making no assumptions here about how many bits are different, nor am I
> making any assumption about the length of a file. I am simply saying
> that an n-bit checksum should detect a difference between two files
> with a probability of roughly 1-2^{-n}, modulo the imperfections of
> the hash function. I thought that this was a well-accepted fact that
> would produce little argument from anybody, and I'm confused that
> people seem to feel otherwise.

Well: crc32 is a terrible hash if you're looking for an even distribution of hashed values. That's not too surprising: its design goals included guaranteed error detection for certain message lengths, and error correction of single-bit errors. My understanding of the underlying math is spotty at best, but from what I understand that pretty directly implies less independence between source data and hash value than we'd want from a good hash function.
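(As a quick illustration of the guaranteed-detection side of that design tradeoff, here's a sketch using Python's zlib.crc32, which implements the same CRC-32 polynomial: any single-bit error is always caught, regardless of where it occurs.)

```python
import os
import random
import zlib

# CRC-32 guarantees detection of any single-bit error (and any error
# burst up to 32 bits), so flipping one bit must always change the
# checksum, no matter the position.
data = bytearray(os.urandom(1024))
base = zlib.crc32(bytes(data))

for _ in range(1000):
    mutated = bytearray(data)
    pos = random.randrange(len(mutated) * 8)
    mutated[pos // 8] ^= 1 << (pos % 8)  # flip one randomly chosen bit
    assert zlib.crc32(bytes(mutated)) != base
```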
Here's an smhasher result page for crc32 (at least the latter is crc32, afaict):
https://notabug.org/vaeringjar/smhasher/src/master/doc/crc32
https://notabug.org/vaeringjar/smhasher/src/master/doc/crc32_hw

And then compare that with something like xxhash, or even lookup3 (which I think is what our hash is a variant of):
https://notabug.org/vaeringjar/smhasher/src/master/doc/xxHash32
https://notabug.org/vaeringjar/smhasher/src/master/doc/lookup3

The birthday paradox doesn't apply (otherwise 32 bits would never be enough: there'd be a 50% chance of a collision at around 80k hashes), but I still wonder whether it matters that we're trying to detect errors in not one, but commonly tens of thousands to millions of files. But since we just need to detect one error to call the whole backup corrupt...

> One explanation that would make sense to me is if somebody said, well,
> the nature of this particular algorithm means that, although values
> are uniformly distributed in general, the kinds of errors that are
> likely to occur in practice are likely to cancel out. For instance, if
> you imagine trivial algorithms such as adding or xor-ing all the
> bytes, adding zero bytes doesn't change the answer, and neither do
> transpositions. However, CRC is, AIUI, designed to be resistant to
> such problems. Your remark about large blocks of zero bytes is
> interesting to me in this context, but in a quick search I couldn't
> find anything stating that CRC was weak for such use cases.

My main point was that CRC's error detection guarantees are pretty much irrelevant for us. While the right CRC will guarantee that all single 2-bit errors are detected, that's not a helpful property here: there rarely are single-bit errors, and the bursts are too long to benefit from any >2-bit guarantees. Nor are multiple failures rare once you hit a problem.
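(To put a number on the "around 80k hashes" figure: the standard birthday bound, assuming an ideal 32-bit hash, gives the point at which the probability of at least one collision reaches ~50%.)

```python
import math

# Birthday bound for an ideal n-bit hash: roughly sqrt(2 * 2^n * ln 2)
# random hashes for a ~50% chance of at least one collision.
n_bits = 32
k = math.sqrt(2 * (2 ** n_bits) * math.log(2))
# k comes out to about 77,000, i.e. the "around 80k hashes" above
```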
> The old thread about switching from 64-bit CRC to 32-bit CRC had a
> link to a page which has subsequently been moved to here:
>
> https://www.ece.unb.ca/tervo/ee4253/crc.shtml
>
> Down towards the bottom, it says:
>
> "In general, bit errors and bursts up to N-bits long will be detected
> for a P(x) of degree N. For arbitrary bit errors longer than N-bits,
> the odds are one in 2^{N} than a totally false bit pattern will
> nonetheless lead to a zero remainder."

That's still about a single contiguous sequence of bit errors, as far as I can tell. I.e. it doesn't hold for CRCs if you have two errors at different places.

Greetings,

Andres Freund
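PS: The "two errors at different places" point follows from CRC being linear over GF(2). A quick sketch (using Python's zlib.crc32; the identity below holds for equal-length inputs because the init/final-xor constants cancel): any error pattern whose CRC residue is zero goes undetected regardless of where its set bits sit, which is why the burst guarantees say nothing about multiple separated errors.

```python
import os
import zlib

# CRC linearity: for equal-length inputs a, b and an all-zero string of
# the same length, crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros). So the
# effect of an error pattern on the CRC depends only on the pattern
# itself, not on the data it corrupts.
a = os.urandom(64)
b = os.urandom(64)
zeros = bytes(64)
xored = bytes(x ^ y for x, y in zip(a, b))

assert zlib.crc32(xored) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
```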