Обсуждение: How abnormal server shutdown could be detected by tests?

Поиск
Список
Период
Сортировка

How abnormal server shutdown could be detected by tests?

От
Alexander Lakhin
Дата:
Hello hackers,

While studying bug #18158, I've come to the conclusion that the existing
testing infrastructure is unable to detect abnormal situations. of some
kind.

Just a simple example:
With Assert(0) injected in walsender (say:
sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
), I observe the following:
$ make -s check -C src/test/recovery PROVE_TESTS="t/012*"

t/012_subtransactions.pl .. ok
All tests successful.

(The same with meson:
1/1 postgresql:recovery / recovery/012_subtransactions OK                3.24s   12 subtests passed)

But:
$ grep TRAP src/test/recovery/tmp_check/log/*
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 
2528, PID: 373729
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 
2528, PID: 373750
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 
2528, PID: 373821
src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line: 
2528, PID: 373786

src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
...
2023-12-09 03:23:20.210 UTC [375933] LOG:  server process (PID 375975) was terminated by signal 6: Aborted
2023-12-09 03:23:20.210 UTC [375933] DETAIL:  Failed process was running: START_REPLICATION 0/3000000 TIMELINE 3
2023-12-09 03:23:20.210 UTC [375933] LOG:  terminating any other active server processes
2023-12-09 03:23:20.210 UTC [375933] LOG:  abnormal database system shutdown
2023-12-09 03:23:20.211 UTC [375933] LOG:  database system is shut down
...

So the shutdown definitely was considered abnormal, but we missed that.

With log_min_messages = DEBUG3, I can see in the log:
2023-12-09 03:26:50.145 UTC [377844] LOG:  abnormal database system shutdown
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up orphaned dynamic shared memory with ID 2898643884
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up dynamic shared memory control segment with ID 3446966170
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(1): 2 callbacks to make
2023-12-09 03:26:50.146 UTC [377844] LOG:  database system is shut down
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  exit(1)
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(-1): 0 callbacks to make

The postmaster process exits with exit code 1, but pg_ctl can't get the
code and just reports that stop was completed successfully.

Should this be improved? And if yes, how it can be done?
Maybe postmaster shouldn't remove it's postmaster.pid when it exits
abnormally to let pg_ctl know of it?

Best regards,
Alexander



Re: How abnormal server shutdown could be detected by tests?

От
shveta malik
Дата:
On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello hackers,
>
> While studying bug #18158, I've come to the conclusion that the existing
> testing infrastructure is unable to detect abnormal situations. of some
> kind.
>
> Just a simple example:
> With Assert(0) injected in walsender (say:
> sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
> ), I observe the following:
> $ make -s check -C src/test/recovery PROVE_TESTS="t/012*"
>
> t/012_subtransactions.pl .. ok
> All tests successful.
>
> (The same with meson:
> 1/1 postgresql:recovery / recovery/012_subtransactions OK                3.24s   12 subtests passed)
>
> But:
> $ grep TRAP src/test/recovery/tmp_check/log/*
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373729
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373750
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373821
> src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373786
>
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
> ...
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  server process (PID 375975) was terminated by signal 6: Aborted
> 2023-12-09 03:23:20.210 UTC [375933] DETAIL:  Failed process was running: START_REPLICATION 0/3000000 TIMELINE 3
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  terminating any other active server processes
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  abnormal database system shutdown
> 2023-12-09 03:23:20.211 UTC [375933] LOG:  database system is shut down
> ...
>
> So the shutdown definitely was considered abnormal, but we missed that.
>
> With log_min_messages = DEBUG3, I can see in the log:
> 2023-12-09 03:26:50.145 UTC [377844] LOG:  abnormal database system shutdown
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up orphaned dynamic shared memory with ID 2898643884
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up dynamic shared memory control segment with ID 3446966170
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(1): 2 callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] LOG:  database system is shut down
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  exit(1)
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(-1): 0 callbacks to make
>
> The postmaster process exits with exit code 1, but pg_ctl can't get the
> code and just reports that stop was completed successfully.
>

For what it's worth, there is another thread which stated the similar problem:
https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us

> Should this be improved? And if yes, how it can be done?
> Maybe postmaster shouldn't remove it's postmaster.pid when it exits
> abnormally to let pg_ctl know of it?
>


thanks
Shveta



Re: How abnormal server shutdown could be detected by tests?

От
Alexander Lakhin
Дата:
Hello Shveta,

12.12.2023 11:44, shveta malik wrote:
>
>> The postmaster process exits with exit code 1, but pg_ctl can't get the
>> code and just reports that stop was completed successfully.
>>
> For what it's worth, there is another thread which stated the similar problem:
> https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us
>

Thank you for the reference!
So I refreshed a first part of the question Tom Lane raised before...

I've made a quick experiment with leaving postmaster.pid intact in case of
abnormal shutdown:
@@ -1113,6 +1113,7 @@ UnlinkLockFiles(int status, Datum arg)
      {
          char       *curfile = (char *) lfirst(l);

+if (strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0)
          unlink(curfile);
          /* Should we complain if the unlink fails? */
      }

and `make check-world` passed for me with no failure.
(In the meantime, the assertion failure forced as above is detected.)

Though there is a minor issue with a couple of tests. Namely,
003_recovery_targets.pl does the following:
# wait for the error message in the standby log
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
     $logfile = slurp_file($node_primary->logfile());
     $res = ($logfile =~
         qr/FATAL: .* recovery ended before configured recovery target was reached/);
     if ($res) {
         last;
     }
     usleep(100_000);
}
ok($res,
     'recovery end before target reached is a fatal error');

With postmaster.pid left after unclean shutdown, the test waits for 300
seconds by default and then completes successfully.

If rewrite that loop as follows:
# wait for the error message in the standby log
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
     $logfile = slurp_file($node_primary->logfile());
     $res = ($logfile =~
         qr/FATAL: .* recovery ended before configured recovery target was reached/);
     if ($res) {
         last;
     }
     usleep(100_000);
}
ok($res,
     'recovery end before target reached is a fatal error');

the test completes as quickly as before.
(standby.log is only 2kb, so rereading it isn't a big deal, IMO)

So maybe it's the way to go?

Another way I can think of is sending some signal to pg_ctl in case
postmaster terminates with status 0. Though I think it would complicate
things a little as it allows for three different states:
postmaster.pid preserved (in case postmaster killed with -9),
postmaster.pid removed and the signal received/not received.

Best regards,
Alexander