Обсуждение: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

Поиск
Список
Период
Сортировка

pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
pgstat: Bring up pgstat in BaseInit() to fix uninitialized use of pgstat by AV.

Previously pgstat_initialize() was called in InitPostgres() and
AuxiliaryProcessMain(). As it turns out there was at least one case where we
reported stats before pgstat_initialize() was called, see
AutoVacWorkerMain()'s intentionally early call to pgstat_report_autovac().

This turns out to not be a problem with the current pgstat implementation as
pgstat_initialize() only registers a shutdown callback. But in the shared
memory based stats implementation we are working towards pgstat_initialize()
has to do more work.

After b406478b87e BaseInit() is a central place where initialization shared by
normal backends and auxiliary backends can be put. Obviously BaseInit() is
called before InitPostgres() registers ShutdownPostgres. Previously
ShutdownPostgres was the first before_shmem_exit callback, now that's commonly
pgstats. That should be fine.

Previously pgstat_initialize() was not called in bootstrap mode, but there
does not appear to be a need for that. It's now done unconditionally.

To detect future issues like this, assertions are added to a few places
verifying that the pgstat subsystem is initialized and not yet shut down.

Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20210405092914.mmxqe7j56lsjfsej@alap3.anarazel.de
Discussion: https://postgr.es/m/20210802164124.ufo5buo4apl6yuvs@alap3.anarazel.de

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ee3f8d3d3aec0d7c961d6b398d31504bb272a450

Modified Files
--------------
src/backend/postmaster/auxprocess.c |  2 --
src/backend/postmaster/pgstat.c     | 53 +++++++++++++++++++++++++++++++++++--
src/backend/utils/init/postinit.c   | 23 +++++++++-------
3 files changed, 65 insertions(+), 13 deletions(-)


Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> pgstat: Bring up pgstat in BaseInit() to fix uninitialized use of pgstat by AV.

sifaka took exception to this, or at least I guess it was this one out
of the four you pushed at once:

TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown", File: "pgstat.c", Line: 4810, PID: 74447)
0   postgres                            0x0000000100e5a520 ExceptionalCondition + 124
1   postgres                            0x0000000100ca1dec pgstat_reset_counters + 0
2   postgres                            0x0000000100ca2548 pgstat_report_tempfile + 80
3   postgres                            0x0000000100d07ca4 FileClose + 536
4   postgres                            0x0000000100d09764 CleanupTempFiles + 160
5   postgres                            0x0000000100d0f8c0 proc_exit_prepare + 228
6   postgres                            0x0000000100d0f79c proc_exit + 24
7   postgres                            0x0000000100e5ae04 errfinish + 856
8   postgres                            0x0000000100f1a804 ProcessInterrupts.cold.9 + 88
9   postgres                            0x0000000100d36a98 ProcessInterrupts + 604
10  postgres                            0x0000000100cd93c4 sendDir + 1516
11  postgres                            0x0000000100cd83a0 perform_base_backup + 3592
12  postgres                            0x0000000100cd72c0 SendBaseBackup + 256
13  postgres                            0x0000000100ce6a48 exec_replication_command + 1852
14  postgres                            0x0000000100d39190 PostgresMain + 3260
15  postgres                            0x0000000100ca9c78 process_startup_packet_die + 0
16  postgres                            0x0000000100ca94ec ClosePostmasterPorts + 0
17  postgres                            0x0000000100ca6a0c PostmasterMain + 4584
18  postgres                            0x0000000100c0b798 help + 0
19  libdyld.dylib                       0x0000000184391430 start + 4

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-06 22:44:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > pgstat: Bring up pgstat in BaseInit() to fix uninitialized use of pgstat by AV.
> 
> sifaka took exception to this, or at least I guess it was this one out
> of the four you pushed at once:

Longfin as well. It's the assertions from ee3f8d3d3ae, but possibly only
exposed after fb2c5028e63.

Not sure why it's your two animals that report this issue, but not others? Why
would a backend doing SendBaseBackup() previously have allocated temp files?

Don't get me wrong - it's good that they surfaced the issue, and it's an issue
independent of the specific trigger.

Glad I added the assertions...


See also https://www.postgresql.org/message-id/20210803023612.iziacxk5syn2r4ut%40alap3.anarazel.de

> TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown", File: "pgstat.c", Line: 4810, PID: 74447)
> 0   postgres                            0x0000000100e5a520 ExceptionalCondition + 124
> 1   postgres                            0x0000000100ca1dec pgstat_reset_counters + 0
> 2   postgres                            0x0000000100ca2548 pgstat_report_tempfile + 80
> 3   postgres                            0x0000000100d07ca4 FileClose + 536
> 4   postgres                            0x0000000100d09764 CleanupTempFiles + 160
> 5   postgres                            0x0000000100d0f8c0 proc_exit_prepare + 228
> 6   postgres                            0x0000000100d0f79c proc_exit + 24
> 7   postgres                            0x0000000100e5ae04 errfinish + 856
> 8   postgres                            0x0000000100f1a804 ProcessInterrupts.cold.9 + 88
> 9   postgres                            0x0000000100d36a98 ProcessInterrupts + 604
> 10  postgres                            0x0000000100cd93c4 sendDir + 1516
> 11  postgres                            0x0000000100cd83a0 perform_base_backup + 3592
> 12  postgres                            0x0000000100cd72c0 SendBaseBackup + 256
> 13  postgres                            0x0000000100ce6a48 exec_replication_command + 1852
> 14  postgres                            0x0000000100d39190 PostgresMain + 3260
> 15  postgres                            0x0000000100ca9c78 process_startup_packet_die + 0
> 16  postgres                            0x0000000100ca94ec ClosePostmasterPorts + 0
> 17  postgres                            0x0000000100ca6a0c PostmasterMain + 4584
> 18  postgres                            0x0000000100c0b798 help + 0
> 19  libdyld.dylib                       0x0000000184391430 start + 4

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> Not sure why it's your two animals that report this issue, but not others? Why
> would a backend doing SendBaseBackup() previously have allocated temp files?

Guessing the common factor is "macOS", but that's just a guess.
I can poke into it tomorrow.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-06 23:13:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Not sure why it's your two animals that report this issue, but not others? Why
> > would a backend doing SendBaseBackup() previously have allocated temp files?
> 
> Guessing the common factor is "macOS", but that's just a guess.

Probably not a bad one...


> I can poke into it tomorrow.

Thanks! Might be interesting to run the pg_basebackup tests with
log_temp_files=0...

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
I wrote:
> Guessing the common factor is "macOS", but that's just a guess.
> I can poke into it tomorrow.

I did try it real quick on my Mac laptop, and that fails too.
Here's a more accurate backtrace, in case that helps.

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff2033e92e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff2036d5bd libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007fff202c2406 libsystem_c.dylib`abort + 125
    frame #3: 0x0000000108cdf87f postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>,
fileName=<unavailable>,lineNumber=<unavailable>) at assert.c:69:2 [opt] 
    frame #4: 0x0000000108b0f551 postgres`pgstat_send [inlined] pgstat_assert_is_up at pgstat.c:4810:2 [opt]
    frame #5: 0x0000000108b0f532 postgres`pgstat_send(msg=<unavailable>, len=<unavailable>) at pgstat.c:3032 [opt]
    frame #6: 0x0000000108b0fbef postgres`pgstat_report_tempfile(filesize=<unavailable>) at pgstat.c:1812:2 [opt]
    frame #7: 0x0000000108b7abfe postgres`FileClose [inlined]
ReportTemporaryFileUsage(path="base/pgsql_tmp/pgsql_tmp35840.0",size=0) at fd.c:1483:2 [opt] 
    frame #8: 0x0000000108b7abf6 postgres`FileClose(file=1) at fd.c:1987 [opt]
    frame #9: 0x0000000108b7c3b8 postgres`CleanupTempFiles(isCommit=false, isProcExit=true) at fd.c:0 [opt]
    frame #10: 0x0000000108b82661 postgres`proc_exit_prepare(code=1) at ipc.c:209:3 [opt]
    frame #11: 0x0000000108b8253d postgres`proc_exit(code=1) at ipc.c:107:2 [opt]
    frame #12: 0x0000000108ce0201 postgres`errfinish(filename=<unavailable>, lineno=<unavailable>,
funcname=<unavailable>)at elog.c:666:3 [opt] 
    frame #13: 0x0000000108db362b postgres`ProcessInterrupts.cold.9 at postgres.c:3222:3 [opt]
    frame #14: 0x0000000108baa2a4 postgres`ProcessInterrupts at postgres.c:3218:22 [opt]
    frame #15: 0x0000000108b49f4a postgres`sendDir(path=".", basepathlen=1, sizeonly=false,
tablespaces=0x00007fb47f0719c8,sendtblspclinks=true, manifest=<unavailable>, spcoid=0x0000000000000000) at
basebackup.c:1277:3[opt] 
    frame #16: 0x0000000108b48dc8 postgres`perform_base_backup(opt=0x00007ffee73ae0a0) at basebackup.c:432:5 [opt]
    frame #17: 0x0000000108b47bf0 postgres`SendBaseBackup(cmd=<unavailable>) at basebackup.c:949:2 [opt]
    frame #18: 0x0000000108b58055 postgres`exec_replication_command(cmd_string="BASE_BACKUP LABEL 'pg_basebackup base
backup'PROGRESS   NOWAIT    MANIFEST 'yes' ") at walsender.c:1625:4 [opt] 
    frame #19: 0x0000000108bac8aa postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>,
username=<unavailable>)at postgres.c:4484:12 [opt] 
    frame #20: 0x0000000108b178db postgres`BackendRun(port=<unavailable>) at postmaster.c:4519:2 [opt]
    frame #21: 0x0000000108b17178 postgres`ServerLoop [inlined] BackendStartup(port=<unavailable>) at
postmaster.c:4241:3[opt] 
    frame #22: 0x0000000108b17154 postgres`ServerLoop at postmaster.c:1758 [opt]
    frame #23: 0x0000000108b1421b postgres`PostmasterMain(argc=4, argv=0x00007fb47ec06640) at postmaster.c:1430:11
[opt]
    frame #24: 0x0000000108a6ac13 postgres`main(argc=<unavailable>, argv=<unavailable>) at main.c:199:3 [opt]
    frame #25: 0x00007fff20388f3d libdyld.dylib`start + 1

I see two core dumps with what seem to be this same trace after
running pg_basebackup's tests.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-06 23:22:36 -0400, Tom Lane wrote:
> I wrote:
> > Guessing the common factor is "macOS", but that's just a guess.
> > I can poke into it tomorrow.
>
> I did try it real quick on my Mac laptop, and that fails too.
> Here's a more accurate backtrace, in case that helps.

Thanks!

I now managed to reproduce it on a CI system (after some initial
failed attempts due to pg_upgrade tests failing due to path length issues):

https://api.cirrus-ci.com/v1/artifact/task/4863568911794176/log/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log

2021-08-06 21:18:40.096 PDT [22031] 010_pg_basebackup.pl LOG:  received replication command: BASE_BACKUP LABEL
'pg_basebackupbase backup' PROGRESS   NOWAIT    MANIFEST 'yes' 
 
2021-08-06 21:18:40.096 PDT [22031] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
2021-08-06 21:18:42.210 PDT [22031] 010_pg_basebackup.pl LOG:  could not send data to client: Broken pipe
2021-08-06 21:18:42.210 PDT [22031] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
2021-08-06 21:18:42.210 PDT [22031] 010_pg_basebackup.pl FATAL:  connection to client lost
2021-08-06 21:18:42.210 PDT [22031] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown", File: "pgstat.c", Line: 4810, PID: 22031)

vs what I locally get:
2021-08-06 20:52:06.163 PDT [1397252] 010_pg_basebackup.pl LOG:  received replication command: BASE_BACKUP LABEL
'pg_basebackupbase backup' PROGRESS   NOWAIT    MANIFEST 'yes' 
 
2021-08-06 20:52:06.163 PDT [1397252] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
2021-08-06 20:52:08.189 PDT [1397252] 010_pg_basebackup.pl LOG:  could not send data to client: Broken pipe
2021-08-06 20:52:08.189 PDT [1397252] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
2021-08-06 20:52:08.189 PDT [1397252] 010_pg_basebackup.pl ERROR:  base backup could not send data, aborting backup
2021-08-06 20:52:08.189 PDT [1397252] 010_pg_basebackup.pl STATEMENT:  BASE_BACKUP LABEL 'pg_basebackup base backup'
PROGRESS  NOWAIT    MANIFEST 'yes' 
 
2021-08-06 20:52:08.189 PDT [1397252] 010_pg_basebackup.pl FATAL:  connection to client lost

Note that OSX version doesn't have the "base backup could not send data,
aborting backup" message. I guess that is what leads to the difference in
behaviour:

The temp file is created by InitializeBackupManifest(). In the !OSX case, we
first abort via an ERROR, which triggers the cleanup via
WalSndResourceCleanup(). On OSX however, we immediately error out with FATAL
for some reason (timing? network buffering differences?), which will never
reach WalSndErrorCleanup(). Therefore the temp file only gets deleted during
proc_exit(), which triggers the issue...

Not yet really sure what the best way to deal with this is. Presumably this
issue would be fixed if AtProcExit_Files()/CleanupTempFiles() were scheduled
via before_shmem_exit(). And perhaps it's not too off to schedule
CleanupTempFiles() there - but it doesn't quite seem entirely right either.

I'd kinda like to avoid having to overhaul the process exit infrastructure as
a prerequisite to getting the shared memory stats patch in :(.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-06 21:49:52 -0700, Andres Freund wrote:
> The temp file is created by InitializeBackupManifest(). In the !OSX case, we
> first abort via an ERROR, which triggers the cleanup via
> WalSndResourceCleanup(). On OSX however, we immediately error out with FATAL
> for some reason (timing? network buffering differences?), which will never
> reach WalSndErrorCleanup(). Therefore the temp file only gets deleted during
> proc_exit(), which triggers the issue...
> 
> Not yet really sure what the best way to deal with this is. Presumably this
> issue would be fixed if AtProcExit_Files()/CleanupTempFiles() were scheduled
> via before_shmem_exit(). And perhaps it's not too off to schedule
> CleanupTempFiles() there - but it doesn't quite seem entirely right either.

Huh. I just noticed that AtProcExit_Files() is not even scheduled via
on_shmem_exit() but on_proc_exit().  That means that even before fb2c5028e63
we sent pgstat messages *well* after pgstat_shutdown_hook() already
ran. Crufty.

Just hacking in an earlier CleanupTempFiles() does "fix" the OSX issue:
https://cirrus-ci.com/task/5941265494704128?logs=macos_basebackup#L4

I'm inclined to leave things as-is until tomorrow to see if other things are
shaken loose and then either commit a bandaid along those lines or revert the
patch. Or something proper if we can figure it out till then.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2021-08-06 21:49:52 -0700, Andres Freund wrote:
>> Not yet really sure what the best way to deal with this is. Presumably this
>> issue would be fixed if AtProcExit_Files()/CleanupTempFiles() were scheduled
>> via before_shmem_exit(). And perhaps it's not too off to schedule
>> CleanupTempFiles() there - but it doesn't quite seem entirely right either.

> Huh. I just noticed that AtProcExit_Files() is not even scheduled via
> on_shmem_exit() but on_proc_exit().  That means that even before fb2c5028e63
> we sent pgstat messages *well* after pgstat_shutdown_hook() already
> ran. Crufty.

> Just hacking in an earlier CleanupTempFiles() does "fix" the OSX issue:
> https://cirrus-ci.com/task/5941265494704128?logs=macos_basebackup#L4

So if I have the lay of the land correctly:

1. Somebody decided it'd be a great idea for temp file cleanup to send
stats collector messages.

2. Temp file cleanup happens after shmem disconnection.

3. This accidentally worked, up to now, because stats transmission happens
via a separate socket not shared memory.

4. We can't keep both #1 and #2 if we'd like to switch to shmem-based
stats collection.

Intuitively it seems like temp file management should be a low-level,
backend-local function and therefore should be okay to run after
shmem disconnection.  I do not have a warm feeling about reversing that
module layering --- what's to stop someone from breaking things by
trying to use a temp file in their on_proc_exit or on_shmem_exit hook?
Maybe what needs to go overboard is point 1.

More generally, this points up the fact that we don't have a well-defined
module hierarchy that would help us understand what code can safely do which.
I'm not volunteering to design that, but maybe it needs to happen soon.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 11:43:07 -0400, Tom Lane wrote:
> So if I have the lay of the land correctly:
> 
> 1. Somebody decided it'd be a great idea for temp file cleanup to send
> stats collector messages.
> 
> 2. Temp file cleanup happens after shmem disconnection.
> 
> 3. This accidentally worked, up to now, because stats transmission happens
> via a separate socket not shared memory.
> 
> 4. We can't keep both #1 and #2 if we'd like to switch to shmem-based
> stats collection.

Sounds accurate to me.


> Intuitively it seems like temp file management should be a low-level,
> backend-local function and therefore should be okay to run after
> shmem disconnection.  I do not have a warm feeling about reversing that
> module layering --- what's to stop someone from breaking things by
> trying to use a temp file in their on_proc_exit or on_shmem_exit hook?

We could just add an assert preventing that from happening. It's hard to
believe that there could be a good reason to use temp files in those hook
points.

I'm somewhat inclined to split InitFileAccess() into two by separating out
InitTemporaryFileAccess() or such. InitFileAccess() would continue to happen
early and register a proc exit hook that errors out when there's a temp file
(as a backstop for non cassert builds). The new InitTemporaryFileAccess()
would happen a bit later and schedule CleanupTempFiles() to happen via
before_shmem_access(). And we add a Assert(!proc_exit_inprogress) to the
routines for opening a temp file.


> Maybe what needs to go overboard is point 1.

Keeping stats of temp files seems useful enough that I'm a bit hesitant to go
there. I guess we could just prevent pgstats_report_tempfile() from being
called when CleanupTempFiles() is called during proc exit, but that doesn't
seem great.


> More generally, this points up the fact that we don't have a well-defined
> module hierarchy that would help us understand what code can safely do which.
> I'm not volunteering to design that, but maybe it needs to happen soon.

I agree. Part of the reason for whacking around process startup (in both
pushed and still pending commits) was that previously it wasn't just poorly
defined, it differed significantly between platforms.  And I'm quite unhappy
with the vagueness in which we defined the meaning of the various shutdown
callbacks ([1]).

I suspect to even get to the point of doing a useful redesign of the module
hierarchy, we'd need to unify more of the process initialization between
EXEC_BACKEND and normal builds.

I've bitten by all this often enough to be motivated to propose
something. However I want to get the basics of the shared memory stats stuff
in first - it's a pain to keep it upated, and we'll need to find and solve all
of the issues it has anyway, even if we go for a redesign of module / startup
/ shutdown layering.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210803023612.iziacxk5syn2r4ut%40alap3.anarazel.de



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2021-08-07 11:43:07 -0400, Tom Lane wrote:
>> Intuitively it seems like temp file management should be a low-level,
>> backend-local function and therefore should be okay to run after
>> shmem disconnection.  I do not have a warm feeling about reversing that
>> module layering --- what's to stop someone from breaking things by
>> trying to use a temp file in their on_proc_exit or on_shmem_exit hook?

> We could just add an assert preventing that from happening. It's hard to
> believe that there could be a good reason to use temp files in those hook
> points.

The bigger picture here is that anyplace anybody ever wants to add stats
collection in suddenly becomes a "must run before shmem disconnection"
module.  I think that way madness lies --- we can't have the entire
backend shut down before shmem disconnection.  So I feel like there's
a serious design problem here, and it's not confined to temp files,
or at least it won't stay confined to temp files.  (There may indeed
be other problems already, that we just haven't had the good luck to
have buildfarm timing vagaries expose to us.)

Maybe the solution is to acknowledge that we might lose some events
during backend shutdown, and redefine the behavior as "we ignore
event reports after pgstat shutdown", not "we assert that there never
can be any such reports".

> I'm somewhat inclined to split InitFileAccess() into two by separating out
> InitTemporaryFileAccess() or such. InitFileAccess() would continue to happen
> early and register a proc exit hook that errors out when there's a temp file
> (as a backstop for non cassert builds). The new InitTemporaryFileAccess()
> would happen a bit later and schedule CleanupTempFiles() to happen via
> before_shmem_access(). And we add a Assert(!proc_exit_inprogress) to the
> routines for opening a temp file.

Maybe that would work, but after you multiply it by a bunch of different
scenarios in different modules, it's going to get less and less attractive.

> I've bitten by all this often enough to be motivated to propose
> something. However I want to get the basics of the shared memory stats stuff
> in first - it's a pain to keep it upated, and we'll need to find and solve all
> of the issues it has anyway, even if we go for a redesign of module / startup
> / shutdown layering.

Fair.  But I suggest that the first cut should look more like what
I suggest above, ie just be willing to lose events during shutdown.
The downsides of that are not so enormous that we should be willing
to undertake major klugery to avoid it before we've even got a
semi-working system.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 13:06:47 -0400, Tom Lane wrote:
> The bigger picture here is that anyplace anybody ever wants to add stats
> collection in suddenly becomes a "must run before shmem disconnection"
> module.  I think that way madness lies --- we can't have the entire
> backend shut down before shmem disconnection.  So I feel like there's
> a serious design problem here, and it's not confined to temp files,
> or at least it won't stay confined to temp files.  (There may indeed
> be other problems already, that we just haven't had the good luck to
> have buildfarm timing vagaries expose to us.)

I think more often it should not end up as "must run before shmem
disconnection" as a whole, but should be split into a portion running at that
point.


> Maybe the solution is to acknowledge that we might lose some events
> during backend shutdown, and redefine the behavior as "we ignore
> event reports after pgstat shutdown", not "we assert that there never
> can be any such reports".

I think it's fine to make such calls, but that it ought to reside in the stats
emitting modules. Only it can decide whether needing to emit stats during
shutdown is a rare edge case or a commonly expected path. E.g. the case of
parallel worker shutdown sending stats too late is something common enough to
be problematic, so I don't want to make it too hard to detect such cases.


> > I'm somewhat inclined to split InitFileAccess() into two by separating out
> > InitTemporaryFileAccess() or such. InitFileAccess() would continue to happen
> > early and register a proc exit hook that errors out when there's a temp file
> > (as a backstop for non cassert builds). The new InitTemporaryFileAccess()
> > would happen a bit later and schedule CleanupTempFiles() to happen via
> > before_shmem_access(). And we add a Assert(!proc_exit_inprogress) to the
> > routines for opening a temp file.
>
> Maybe that would work, but after you multiply it by a bunch of different
> scenarios in different modules, it's going to get less and less attractive.

I'm not quite convinced. Even if we had a nicer ordering / layering of
subsystems, we'd still have to deal with subsystems that don't fit nicely
because they have conflicting needs. And we'd still need detection of use of
subsystems that already have not been initialized / are already shut down.

One example of that is imo fd.c being a very low level module that might be
needed during shutdown of other modules but which also currently depends on
the stats subsystem for temp file management. Which doesn't really make sense,
because stats very well could depend on fd.c routines.

I think needing to split the current fd.c subsystem into a lower-level (file
access) and a higher level (temporary file management) module is precisely
what a better designed module layering system will *force* us to do.


> > I've bitten by all this often enough to be motivated to propose
> > something. However I want to get the basics of the shared memory stats stuff
> > in first - it's a pain to keep it upated, and we'll need to find and solve all
> > of the issues it has anyway, even if we go for a redesign of module / startup
> > / shutdown layering.
>
> Fair.  But I suggest that the first cut should look more like what
> I suggest above, ie just be willing to lose events during shutdown.
> The downsides of that are not so enormous that we should be willing
> to undertake major klugery to avoid it before we've even got a
> semi-working system.

I think that's more likely to hide bugs unfortunately. Consider fa91d4c91f2 -
I might not have found that if we had just ignored "too late" pgstats activity
in pgstats.c or fd.c, and that's not an edge case.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2021-08-07 13:06:47 -0400, Tom Lane wrote:
>> Fair.  But I suggest that the first cut should look more like what
>> I suggest above, ie just be willing to lose events during shutdown.
>> The downsides of that are not so enormous that we should be willing
>> to undertake major klugery to avoid it before we've even got a
>> semi-working system.

> I think that's more likely to hide bugs unfortunately. Consider fa91d4c91f2 -
> I might not have found that if we had just ignored "too late" pgstats activity
> in pgstats.c or fd.c, and that's not an edge case.

Depends what you want to define as a bug.  What I am not happy about
is the prospect of random assertion failures for the next six months
while you finish redesigning half of the system.  The rest of us
have work we want to get done, too.  I don't object to the idea of
making no-lost-events an end goal, but we are clearly not ready
for that today.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 13:37:16 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-08-07 13:06:47 -0400, Tom Lane wrote:
> >> Fair.  But I suggest that the first cut should look more like what
> >> I suggest above, ie just be willing to lose events during shutdown.
> >> The downsides of that are not so enormous that we should be willing
> >> to undertake major klugery to avoid it before we've even got a
> >> semi-working system.
> 
> > I think that's more likely to hide bugs unfortunately. Consider fa91d4c91f2 -
> > I might not have found that if we had just ignored "too late" pgstats activity
> > in pgstats.c or fd.c, and that's not an edge case.
> 
> Depends what you want to define as a bug.  What I am not happy about
> is the prospect of random assertion failures for the next six months
> while you finish redesigning half of the system.  The rest of us
> have work we want to get done, too.  I don't object to the idea of
> making no-lost-events an end goal, but we are clearly not ready
> for that today.

I don't know what to do about that. How would we even find these cases if they
aren't hit during regression tests on my machine (nor on a lot of others)?
Obviously I had ran the regression tests many times before pushing the earlier
changes.

The check for pgstat being up is in one central place and thus easily can be
turned into a warning if problems around the shutdown sequence become a
frequent issue on HEAD.  If you think it's better to turn that into a WARNING
now and then arm into an assert later, I can live with that as well, but I
don't think it'll lead to a better outcome.

The shared memory stats stuff isn't my personal project - it's
Horiguchi-san's. I picked it up because it seems like an important thing to
address and because it'd been in maybe a dozen CFs without a lot of
progress. It just turns out that there's a lot of prerequisite changes :(.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 09:48:50 -0700, Andres Freund wrote:
> I'm somewhat inclined to split InitFileAccess() into two by separating out
> InitTemporaryFileAccess() or such. InitFileAccess() would continue to happen
> early and register a proc exit hook that errors out when there's a temp file
> (as a backstop for non cassert builds). The new InitTemporaryFileAccess()
> would happen a bit later and schedule CleanupTempFiles() to happen via
> before_shmem_access(). And we add a Assert(!proc_exit_inprogress) to the
> routines for opening a temp file.

Attached is a patch showing how this could look like. Note that the PANIC
should likely not be that but a WARNING, but the PANIC more useful for running
some initial tests...

I'm not sure whether we'd want to continue having the proc exit hook? It seems
to me that asserts would provide a decent enough protection against
introducing new temp files during shutdown.

Alternatively we could make the asserts in OpenTemporaryFile et al
elog(ERROR)s, and be pretty certain that no temp files would be open too late?

Greetings,

Andres Freund

Вложения

Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2021-08-07 13:37:16 -0400, Tom Lane wrote:
>> Depends what you want to define as a bug.  What I am not happy about
>> is the prospect of random assertion failures for the next six months
>> while you finish redesigning half of the system.  The rest of us
>> have work we want to get done, too.  I don't object to the idea of
>> making no-lost-events an end goal, but we are clearly not ready
>> for that today.

> I don't know what to do about that. How would we even find these cases if they
> aren't hit during regression tests on my machine (nor on a lot of others)?

The regression tests really aren't that helpful for testing the problem
scenario here, which basically is SIGTERM'ing a query-in-progress.
I'm rather surprised that the buildfarm managed to exercise that at all.

You might try setting up a test scaffold that runs the core regression
tests and SIGINT's the postmaster, or alternatively SIGTERM's some
individual session, at random times partway through.  Obviously this
will make the regression tests report failure, but what to look for
is if anything dumps core on the way out.

            regards, tom lane



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 15:12:38 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-08-07 13:37:16 -0400, Tom Lane wrote:
> >> Depends what you want to define as a bug.  What I am not happy about
> >> is the prospect of random assertion failures for the next six months
> >> while you finish redesigning half of the system.  The rest of us
> >> have work we want to get done, too.  I don't object to the idea of
> >> making no-lost-events an end goal, but we are clearly not ready
> >> for that today.
> 
> > I don't know what to do about that. How would we even find these cases if they
> > aren't hit during regression tests on my machine (nor on a lot of others)?
> 
> The regression tests really aren't that helpful for testing the problem
> scenario here, which basically is SIGTERM'ing a query-in-progress.
> I'm rather surprised that the buildfarm managed to exercise that at all.

They're also not that helpful because this problem likely is unreachable for
any tempfiles other than the one in InitializeBackupManifest(). Pretty much
all, or even all, the other tempfiles are cleaned up either via transaction
and/or resowner cleanup.


I wonder if we should do something about WalSndResourceCleanup() not being
reached for FATALs? I think at least a note in WalSndResourceCleanup()
commenting on that fact seems like it might be a good idea?

It seems like it could eventually be a problem that the resowners added in
0d8c9c1210c4 aren't ever cleaned up in case of a FATAL error. Most resowner
cleanup actions are also backstopped with some form of on-exit hook, but I
don't think it's all - e.g. buffer pins aren't.

I guess I should start a thread about this on -hackers...


> You might try setting up a test scaffold that runs the core regression
> tests and SIGINT's the postmaster, or alternatively SIGTERM's some
> individual session, at random times partway through.  Obviously this
> will make the regression tests report failure, but what to look for
> is if anything dumps core on the way out.

Worth trying.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-07 12:01:31 -0700, Andres Freund wrote:
> Attached is a patch showing how this could look like. Note that the PANIC
> should likely not be that but a WARNING, but the PANIC more useful for running
> some initial tests...

I pushed a slightly evolved version of this. As the commit message noted, this
may not be the best approach, but we can revise after further discussion.


> I'm not sure whether we'd want to continue having the proc exit hook? It seems
> to me that asserts would provide a decent enough protection against
> introducing new temp files during shutdown.

> Alternatively we could make the asserts in OpenTemporaryFile et al
> elog(ERROR)s, and be pretty certain that no temp files would be open too late?

I ended up removing the proc exit hook and not converting the asserts to an
elog(). Happy to change either.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Robert Haas
Дата:
On Sat, Aug 7, 2021 at 11:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Intuitively it seems like temp file management should be a low-level,
> backend-local function and therefore should be okay to run after
> shmem disconnection.  I do not have a warm feeling about reversing that
> module layering --- what's to stop someone from breaking things by
> trying to use a temp file in their on_proc_exit or on_shmem_exit hook?
> Maybe what needs to go overboard is point 1.
>
> More generally, this points up the fact that we don't have a well-defined
> module hierarchy that would help us understand what code can safely do which.
> I'm not volunteering to design that, but maybe it needs to happen soon.

Yeah, I was quite surprised when I saw this commit, because my first
reaction was - why in the world would temporary file shutdown properly
precede DSM shutdown, given that temporary files are a low-level
mechanism? The explanation that we're trying to send statistics at
that point makes sense as far as it goes, but it seems to me that we
cannot be far from having a circular dependency. All we need is to
have DSM require the use of temporary files, and we'll end up needing
DSM shutdown to happen both before and after temporary file cleanup.

/me wonders idly about dynamic_shared_memory_type=file

I think that subsystems like "memory" and "files" really ought to be
the lowest-level things we have, and should be shut down last. Stuff
like "send a message to the stats collector" seems like a higher level
thing that may require those lower-level facilities in order to
operate, and must therefore be shut down first. Maybe some subsystems
need to be divided into upper and lower levels to make this work, or,
well, I don't know, something else. But I'm deeply suspicious that
lifting stuff like this to the front of the shutdown sequence is just
papering over the problem, and not actually solving it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Andres Freund
Дата:
Hi,

On 2021-08-09 11:46:30 -0400, Robert Haas wrote:
> I think that subsystems like "memory" and "files" really ought to be
> the lowest-level things we have, and should be shut down last.

I don't disagree with that - but there's a difference between having that as
an abstract goal, and having it a dependency of committing somebodies patch.


> Stuff like "send a message to the stats collector" seems like a higher level
> thing that may require those lower-level facilities in order to operate, and
> must therefore be shut down first.

Yep.


> Maybe some subsystems need to be divided into upper and lower levels to make
> this work, or, well, I don't know, something else.

That's what I ended up doing, right? There's now InitFileAccess() and
InitTemporaryFileAccess().


> But I'm deeply suspicious that lifting stuff like this to the front of the
> shutdown sequence is just papering over the problem, and not actually
> solving it.

If you have a concrete proposal that you think makes sense to tie shared
memory stats to, I'm happy to entertain it. One main motivator for b406478b87e
etc was to allow rejiggering things like this more easily.

Greetings,

Andres Freund



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Masahiko Sawada
Дата:
Hi,

On Sun, Aug 8, 2021 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-07 12:01:31 -0700, Andres Freund wrote:
> > Attached is a patch showing how this could look like. Note that the PANIC
> > should likely not be that but a WARNING, but the PANIC more useful for running
> > some initial tests...
>
> I pushed a slightly evolved version of this. As the commit message noted, this
> may not be the best approach, but we can revise after further discussion.

While testing streaming logical replication, I got another assertion
failure with the current HEAD (e2ce88b58f) when the apply worker
raised an error during applying spooled changes:

TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown",
File: "pgstat.c", Line: 4810, PID: 11084)
0   postgres                            0x000000010704fd9a
ExceptionalCondition + 234
1   postgres                            0x0000000106d41e62
pgstat_assert_is_up + 66
2   postgres                            0x0000000106d43854 pgstat_send + 20
3   postgres                            0x0000000106d4433e
pgstat_report_tempfile + 94
4   postgres                            0x0000000106df9519
ReportTemporaryFileUsage + 25
5   postgres                            0x0000000106df945a
PathNameDeleteTemporaryFile + 282
6   postgres                            0x0000000106df8c7e
unlink_if_exists_fname + 174
7   postgres                            0x0000000106df8b3a walkdir + 426
8   postgres                            0x0000000106df8982
PathNameDeleteTemporaryDir + 82
9   postgres                            0x0000000106dfe591
SharedFileSetDeleteAll + 113
10  postgres                            0x0000000106dfdf62
SharedFileSetDeleteOnProcExit + 66
11  postgres                            0x0000000106e05275
proc_exit_prepare + 325
12  postgres                            0x0000000106e050a3 proc_exit + 19
13  postgres                            0x0000000106d3ba99
StartBackgroundWorker + 649
14  postgres                            0x0000000106d54e85
do_start_bgworker + 613
15  postgres                            0x0000000106d4ef26
maybe_start_bgworkers + 486
16  postgres                            0x0000000106d4d767 sigusr1_handler + 631
17  libsystem_platform.dylib            0x00007fff736705fd _sigtramp + 29
18  ???                                 0x0000000000000000 0x0 + 0
19  postgres                            0x0000000106d4c990 PostmasterMain + 6640
20  postgres                            0x0000000106c24fa3 main + 819
21  libdyld.dylib                       0x00007fff73477cc9 start + 1

The apply worker registers SharedFileSetDeleteOnProcExit() when
creating a file set to serialize the changes. When it raises an error
due to conflict during applying the change, the callback eventually
reports the temp file statistics but pgstat already shut down,
resulting in this assertion failure.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Amit Kapila
Дата:
On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> The apply worker registers SharedFileSetDeleteOnProcExit() when
> creating a file set to serialize the changes. When it raises an error
> due to conflict during applying the change, the callback eventually
> reports the temp file statistics but pgstat already shut down,
> resulting in this assertion failure.
>

I think we can try to fix this by registering to clean up these files
via before_shmem_exit() as done by Andres in commit 675c945394.
Similar to that commit, we can change the function name
SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
register it via before_shmem_exit() instead of on_proc_exit(). Can you
try that and see if it fixes the issue for you unless you have better
ideas to try out?

-- 
With Regards,
Amit Kapila.



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Masahiko Sawada
Дата:
On Wed, Aug 11, 2021 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > The apply worker registers SharedFileSetDeleteOnProcExit() when
> > creating a file set to serialize the changes. When it raises an error
> > due to conflict during applying the change, the callback eventually
> > reports the temp file statistics but pgstat already shut down,
> > resulting in this assertion failure.
> >
>
> I think we can try to fix this by registering to clean up these files
> via before_shmem_exit() as done by Andres in commit 675c945394.
> Similar to that commit, we can change the function name
> SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
> register it via before_shmem_exit() instead of on_proc_exit(). Can you
> try that and see if it fixes the issue for you unless you have better
> ideas to try out?

It seems to me that moving the shared fileset cleanup to
before_shmem_exit() is the right approach to fix this problem. The
issue is fixed by the attached patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Вложения

Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Dilip Kumar
Дата:
On Thu, Aug 12, 2021 at 7:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 11, 2021 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > The apply worker registers SharedFileSetDeleteOnProcExit() when
> > > creating a file set to serialize the changes. When it raises an error
> > > due to conflict during applying the change, the callback eventually
> > > reports the temp file statistics but pgstat already shut down,
> > > resulting in this assertion failure.
> > >
> >
> > I think we can try to fix this by registering to clean up these files
> > via before_shmem_exit() as done by Andres in commit 675c945394.
> > Similar to that commit, we can change the function name
> > SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
> > register it via before_shmem_exit() instead of on_proc_exit(). Can you
> > try that and see if it fixes the issue for you unless you have better
> > ideas to try out?
>
> It seems to me that moving the shared fileset cleanup to
> before_shmem_exit() is the right approach to fix this problem. The
> issue is fixed by the attached patch.

+1, the fix makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Amit Kapila
Дата:
On Thu, Aug 12, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 7:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Aug 11, 2021 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > The apply worker registers SharedFileSetDeleteOnProcExit() when
> > > > creating a file set to serialize the changes. When it raises an error
> > > > due to conflict during applying the change, the callback eventually
> > > > reports the temp file statistics but pgstat already shut down,
> > > > resulting in this assertion failure.
> > > >
> > >
> > > I think we can try to fix this by registering to clean up these files
> > > via before_shmem_exit() as done by Andres in commit 675c945394.
> > > Similar to that commit, we can change the function name
> > > SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
> > > register it via before_shmem_exit() instead of on_proc_exit(). Can you
> > > try that and see if it fixes the issue for you unless you have better
> > > ideas to try out?
> >
> > It seems to me that moving the shared fileset cleanup to
> > before_shmem_exit() is the right approach to fix this problem. The
> > issue is fixed by the attached patch.
>
> +1, the fix makes sense to me.
>

I have also tested and fix works for me. The fix works because
pgstat_initialize() is called before we register clean up in
SharedFileSetInit(). I am not sure if we need an Assert to ensure that
and if so how we can do that? Any suggestions?

-- 
With Regards,
Amit Kapila.



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Masahiko Sawada
Дата:
On Thu, Aug 12, 2021 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 7:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Aug 11, 2021 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > The apply worker registers SharedFileSetDeleteOnProcExit() when
> > > > > creating a file set to serialize the changes. When it raises an error
> > > > > due to conflict during applying the change, the callback eventually
> > > > > reports the temp file statistics but pgstat already shut down,
> > > > > resulting in this assertion failure.
> > > > >
> > > >
> > > > I think we can try to fix this by registering to clean up these files
> > > > via before_shmem_exit() as done by Andres in commit 675c945394.
> > > > Similar to that commit, we can change the function name
> > > > SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
> > > > register it via before_shmem_exit() instead of on_proc_exit(). Can you
> > > > try that and see if it fixes the issue for you unless you have better
> > > > ideas to try out?
> > >
> > > It seems to me that moving the shared fileset cleanup to
> > > before_shmem_exit() is the right approach to fix this problem. The
> > > issue is fixed by the attached patch.
> >
> > +1, the fix makes sense to me.
> >
>
> I have also tested and fix works for me. The fix works because
> pgstat_initialize() is called before we register clean up in
> SharedFileSetInit(). I am not sure if we need an Assert to ensure that
> and if so how we can do that? Any suggestions?

I think that the assertion added by ee3f8d3d3ae ensures that
pgstat_initialize() is callbed before the callback for fileset cleanup
is registered, no?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Dilip Kumar
Дата:
On Thu, Aug 12, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 12, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Aug 12, 2021 at 7:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 11, 2021 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Tue, Aug 10, 2021 at 4:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > The apply worker registers SharedFileSetDeleteOnProcExit() when
> > > > > > creating a file set to serialize the changes. When it raises an error
> > > > > > due to conflict during applying the change, the callback eventually
> > > > > > reports the temp file statistics but pgstat already shut down,
> > > > > > resulting in this assertion failure.
> > > > > >
> > > > >
> > > > > I think we can try to fix this by registering to clean up these files
> > > > > via before_shmem_exit() as done by Andres in commit 675c945394.
> > > > > Similar to that commit, we can change the function name
> > > > > SharedFileSetDeleteOnProcExit to SharedFileSetDeleteOnShmExit and
> > > > > register it via before_shmem_exit() instead of on_proc_exit(). Can you
> > > > > try that and see if it fixes the issue for you unless you have better
> > > > > ideas to try out?
> > > >
> > > > It seems to me that moving the shared fileset cleanup to
> > > > before_shmem_exit() is the right approach to fix this problem. The
> > > > issue is fixed by the attached patch.
> > >
> > > +1, the fix makes sense to me.
> > >
> >
> > I have also tested and fix works for me. The fix works because
> > pgstat_initialize() is called before we register clean up in
> > SharedFileSetInit(). I am not sure if we need an Assert to ensure that
> > and if so how we can do that? Any suggestions?
>
> I think that the assertion added by ee3f8d3d3ae ensures that
> pgstat_initialize() is callbed before the callback for fileset cleanup
> is registered, no?

Right, it ensures that callback for fileset, is called after
pgstat_initialize() and before pgstat_shutdown.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o

От
Amit Kapila
Дата:
On Thu, Aug 12, 2021 at 1:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Aug 12, 2021 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I have also tested and fix works for me. The fix works because
> > pgstat_initialize() is called before we register clean up in
> > SharedFileSetInit(). I am not sure if we need an Assert to ensure that
> > and if so how we can do that? Any suggestions?
>
> I think that the assertion added by ee3f8d3d3ae ensures that
> pgstat_initialize() is callbed before the callback for fileset cleanup
> is registered, no?
>

I think I am missing something here, can you please explain?

-- 
With Regards,
Amit Kapila.