Re: Reproducible collisions in jsonb_hash

From: Robert Haas
Subject: Re: Reproducible collisions in jsonb_hash
Date:
Msg-id: CA+Tgmob_=6+KfesBKkdeORZDnwXdY_wrYroSrLEKXOV3rruH0A@mail.gmail.com
In reply to: Re: Reproducible collisions in jsonb_hash  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Reproducible collisions in jsonb_hash  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Thu, May 12, 2022 at 9:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 2022-05-12 Th 07:02, Valeriy Meleshkin wrote:
> >> AFAICT it happens because when iterating over a jsonb the hash function
> >> makes no distinction between raw scalars and arrays (it doesn't inspect
> >> v.val.array.rawScalar)
>
> > It does look rather like a bug, but I'm unclear about the implications
> > of fixing it.
>
> Changing this hash algorithm would break existing hash indexes on jsonb
> columns.  Maybe there aren't any, but even if so I can't get very excited
> about changing this.  Hash algorithms always have collisions, and we have
> never made any promise that ours are cryptographically strong.

I might be missing something, but I don't know why "cryptographically
strong" is the relevant concept here. It seems like the question is
how likely it is that this would lead to queries having crappy
performance. In general we want hash functions to deliver different
values for different inputs so that we give ourselves the best
possible chance of spreading keys evenly across buckets. If for
example we hashed every possible JSON object to the same constant
value, everything we tried to do with this hash function would suck,
and the problem would be so severe that I think we'd have to fix it,
even though it would mean a compatibility break. Or even if we hashed
every JSON integer to the same value, that would be horrible, because
it's quite likely that you could have a column full of json objects
where ignoring the difference between one integer and another results
in a ton of duplicate hash values.
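Just to make that hypothetical concrete (this illustrates the degenerate
case imagined above, not the actual bug): jsonb_hash() is callable at the
SQL level, since the hash opclass support function is an ordinary catalog
function, so a quick throwaway sketch lets you eyeball the distribution.
On a healthy build the count below comes out close to 100000; a hash that
ignored integer values would collapse it to a handful:

    -- throwaway table, name is arbitrary
    create table ints_demo (doc jsonb);
    insert into ints_demo
        select jsonb_build_object('id', g) from generate_series(1, 100000) g;
    -- close to 100000 distinct values with a well-distributed hash
    select count(distinct jsonb_hash(doc)) from ints_demo;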

Here, that doesn't seem too likely. You could have a column that
contains 'tom' and ['tom'] and [['tom']] and [[['tom']]] and so forth
and they all get mapped onto the same bucket and you're sad. But
probably not. So I'd judge that while it was probably a mistake to
make the hash function work this way, it's not likely to cause serious
problems, and therefore we ought to maybe leave it alone for now, but
add a comment so that if we ever break backward-compatibility for any
other reason, we remember to fix this too.
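For anyone who wants to reproduce the collision itself, a minimal check
(again assuming direct calls to the SQL-level jsonb_hash()) is:

    -- the iterator presents a raw scalar as a one-element array, and the
    -- hash loop never consults the rawScalar flag, so on an affected
    -- build both columns come out equal
    select jsonb_hash('"tom"'::jsonb)   as scalar_hash,
           jsonb_hash('["tom"]'::jsonb) as array_hash;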

IOW, I think I mostly agree with your conclusion, but perhaps not
entirely with the reasoning.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


