Обсуждение: pgsql: Perform provider-specific initialization in new functions.
Perform provider-specific initialization in new functions. Reviewed-by: Andreas Karlsson Discussion: https://postgr.es/m/4548a168-62cd-457b-8d06-9ba7b985c477@proxel.se Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/1ba0782ce90cb4261098de59b49ae5cb2326566b Modified Files -------------- src/backend/utils/adt/Makefile | 1 + src/backend/utils/adt/meson.build | 1 + src/backend/utils/adt/pg_locale.c | 162 +++++------------------------- src/backend/utils/adt/pg_locale_builtin.c | 70 +++++++++++++ src/backend/utils/adt/pg_locale_icu.c | 97 +++++++++++++++++- src/backend/utils/adt/pg_locale_libc.c | 74 +++++++++++++- 6 files changed, 259 insertions(+), 146 deletions(-)
> On 3 Dec 2024, at 12:27, Jeff Davis <jdavis@postgresql.org> wrote: > > Perform provider-specific initialization in new functions. > > Reviewed-by: Andreas Karlsson > Discussion: https://postgr.es/m/4548a168-62cd-457b-8d06-9ba7b985c477@proxel.se Hi Jeff! I'm toying with my WAL compression patch. This test is segfaulting for me: src/test/recovery % PROVE_TESTS="t/001_stream_rep.pl" PGOPTIONS="-c wal_compression=lz4" make check in src/test/recovery/tmp_check/log/001_stream_rep_primary.log I observe: 2026-03-06 20:03:01.433 +05 [24424] 001_stream_rep.pl LOG: statement: GRANT pg_read_all_settings TO repl_role; 2026-03-06 20:03:01.441 +05 [24318] LOG: client backend (PID 24426) was terminated by signal 11: Segmentation fault: 11 2026-03-06 20:03:01.441 +05 [24318] LOG: terminating any other active server processes I do not expect it to pass, it would never pass with given arguments. But I do not expect it to segfault either. I bisected the problem, some tome ago this test would fail with something like "permission denied to change wal_compression". And that seems good. Later it became "Cannot find collation for ..." And after 1ba0782ce90cb4261098de59b49ae5cb2326566b it crashes. wal_compression is PGC_SUSET, so when a non-superuser sets it via the startup packet (PGC_BACKEND context), set_config_with_handle must call pg_parameter_aclcheck -> SearchSysCache1(PARAMETERACLNAME, ...) -> hashtext -> pg_newlocale_from_collation(DEFAULT_COLLATION_OID). I do not know if it's expected, so I decided to report this problem, just in case. I can suggest something in a line with attached, but it's kind of point-in-the-sky fix. Best regards, Andrey Borodin.
Вложения
On Fri, 2026-03-06 at 20:24 +0500, Andrey Borodin wrote:
> I'm toying with my WAL compression patch. This test is segfaulting
> for me
Thank you for the report!
> wal_compression is PGC_SUSET, so when a non-superuser sets it via the
> startup
> packet (PGC_BACKEND context), set_config_with_handle must call
> pg_parameter_aclcheck -> SearchSysCache1(PARAMETERACLNAME, ...) ->
> hashtext ->
> pg_newlocale_from_collation(DEFAULT_COLLATION_OID).
It seems like the real problem here is in catcache.c:texthashfast(),
which unconditionally passes DEFAULT_COLLATION_OID, despite the fact
that pg_parameter_acl.parname has collation "C".
namehashfast() uses C-like semantics, which is OK because all name
columns in the catalog have collation "C". But TEXT columns in the
catalog are about a mix of DEFAULT_COLLATION_OID and C_COLLATION_OID.
There are a few possible fixes:
1. Your fix addresses it, and would also add some safety against
other edge cases we haven't caught yet. The only time it would take
effect is for very early initialization, but there is nonzero risk of
inconsistency because the same value would get a different hash before
and after CheckMyDatabase().
2. We could hardcode texthashfunc() to use C_COLLATION_OID. That
wouldn't match the column collation, but it would avoid the crash, and
might technically still be fine: the default collation is always
deterministic, and all deterministic collations have the same equality
semantics as "C". Even if the proper hashtext() is used somewhere else,
then it uses "C" hashing semantics for all deterministic collations.
The problem here is that we'd like to allow the default collation to be
nondeterministic in the future (Peter has mentioned this a few times),
so relying on this assumption is fragile.
3. We could try to include collation information in the cachinfo or
somewhere and pass it down to find the right hash function. This feels
like a better fix, but there could be other areas we miss that are
using a catalog TEXT field with the default collation. Also it's more
invasive.
We could decide to do your approach for now (in master and
REL_18_STABLE), and then leave #3 for the future (master only).
Thoughts?
Regards,
Jeff Davis
Thanks for the swift reply! > On 7 Mar 2026, at 01:01, Jeff Davis <pgsql@j-davis.com> wrote: > > > 1. Your fix addresses it, and would also add some safety against > other edge cases we haven't caught yet. The only time it would take > effect is for very early initialization Maybe let's sprinkle with asserts like "I'm walsender"? Or at least "I'm not a user backend"? > , but there is nonzero risk of > inconsistency because the same value would get a different hash before > and after CheckMyDatabase(). Sounds scary, actually. I heard of several corruptions that started with bogus cache entries. > 2. We could hardcode texthashfunc() to use C_COLLATION_OID. That > wouldn't match the column collation, but it would avoid the crash, and > might technically still be fine: the default collation is always > deterministic, and all deterministic collations have the same equality > semantics as "C". Even if the proper hashtext() is used somewhere else, > then it uses "C" hashing semantics for all deterministic collations. > The problem here is that we'd like to allow the default collation to be > nondeterministic in the future (Peter has mentioned this a few times), > so relying on this assumption is fragile. > > 3. We could try to include collation information in the cachinfo or > somewhere and pass it down to find the > right hash function. IMO anything less cannot be correct in a long run. Allowing random hash function is a minefield. > This feels > like a better fix, but there could be other areas we miss that are > using a catalog TEXT field with the default collation. Also it's more > invasive. > > We could decide to do your approach for now (in master and > REL_18_STABLE), and then leave #3 for the future (master only). The approach sounds fine for me. Can we be sure hash function does not change after CheckMyDatabase()? Possible coredump by authenticated user does not sound like big issue. And I never saw any report about it in the wild. Best regards, Andrey Borodin.
On Sat, 2026-03-07 at 16:36 +0500, Andrey Borodin wrote:
> > 1. Your fix addresses it, and would also add some safety against
> > other edge cases we haven't caught yet. The only time it would take
> > effect is for very early initialization
>
> Maybe let's sprinkle with asserts like "I'm walsender"? Or at least
> "I'm not a user backend"?
That seems like excessive coupling that's hard to explain.
> > , but there is nonzero risk of
> > inconsistency because the same value would get a different hash
> > before
> > and after CheckMyDatabase().
>
> Sounds scary, actually. I heard of several corruptions that started
> with
> bogus cache entries.
Yeah, I'd prefer not take this approach.
> > 2. We could hardcode texthashfunc() to use C_COLLATION_OID. That
> > wouldn't match the column collation, but it would avoid the crash,
> > and
> > might technically still be fine: the default collation is always
> > deterministic, and all deterministic collations have the same
> > equality
> > semantics as "C". Even if the proper hashtext() is used somewhere
> > else,
> > then it uses "C" hashing semantics for all deterministic
> > collations.
> > The problem here is that we'd like to allow the default collation
> > to be
> > nondeterministic in the future (Peter has mentioned this a few
> > times),
> > so relying on this assumption is fragile.
Attached. I think this is the least-invasive patch to apply to master
now, because it doesn't change any assumptions.
The assumption "any deterministic collation will do" is still the same,
it just chooses C_COLLATION_OID rather than DEFAULT_COLLATION_OID. That
has two benefits:
1. Fixes your issue, because C_COLLATION_OID is always available.
2. Faster than a default collation based on libc or ICU.
Note that you may still have other problems trying to do interesting
things before CheckMyDatabase(), so I'm not necessarily endorsing that,
but this patch seems good regardless.
>
I don't see a reason to backport this, but if someone else does then I
could be convinced.
Thoughts?
Regards,
Jeff Davis
Вложения
On Mon, 2026-04-13 at 14:08 -0700, Jeff Davis wrote:
> That
> has two benefits:
>
> 1. Fixes your issue, because C_COLLATION_OID is always available.
> 2. Faster than a default collation based on libc or ICU.
I didn't detect any meaningful improvement here, probably because most
catalog cache lookups use columns with type NAME not TEXT.
But this still seems like a good idea on the grounds that, if we are
choosing an arbitrary deterministic collation for some internal
purpose, we should consistently choose the simplest and fastest one.
>
> I don't see a reason to backport this, but if someone else does then
> I
> could be convinced.
I plan to commit this soon.
I don't plan to backport unless someone sees a reason that it should be
backported (and if so, how far?).
Regards,
Jeff Davis
On Thu, 2026-04-16 at 11:42 -0700, Jeff Davis wrote:
> I plan to commit this soon.
>
> I don't plan to backport unless someone sees a reason that it should
> be
> backported (and if so, how far?).
Actually, this does need to be backported, a NULL pointer dereference
is easily reproducible on master and v18:
PGOPTIONS="-c zero_damaged_pages=on" \
pg_receivewal -D archive -U repl
On 17 the symptom is slightly different but the fix is the same.
I attached a new patch, and only the commit message is different, which
I plan to backport to 17.
There's another bug, though. Even with the patch applied, if you do the
same pg_receivewal command immediately after starting the server
(without any other connections), you get:
FATAL: cannot read pg_class without having selected a database
The path is similar: it's trying to do pg_parameter_aclcheck, but is
unable to open pg_parameter_acl at all because it can't read pg_class.
It seems to work if you connect another backend first, where it does
some initialization first, through I haven't worked out the details. I
think it goes back to when parameter ACLs were introduced in
a0ffa885e47, so CC Mark Dilger.
Regards,
Jeff Davis
Вложения
On Tue, 2026-04-21 at 21:48 -0700, Jeff Davis wrote: > I attached a new patch, and only the commit message is different, > which > I plan to backport to 17. Committed. > There's another bug, though. Even with the patch applied, if you do > the > same pg_receivewal command immediately after starting the server > (without any other connections), you get: > > FATAL: cannot read pg_class without having selected a database This is a separate issue so I moved the discussion here: https://www.postgresql.org/message-id/d8f8e11f06d692fff89e6be0f22732d30cf695a0.camel@j-davis.com Any other loose ends from this discussion? Regards, Jeff Davis