Thread: Apparent race condition in standby promotion


Apparent race condition in standby promotion

From
Tom Lane
Date:
I'm looking at buildfarm member tern's recent failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08

The perl script thinks it's started the postmaster and sent a promote
signal, then it's waited 90 seconds for the postmaster to come out of
standby:

### Starting node "standby"
# Running: pg_ctl -w -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log start
 
waiting for server to start.... done
server started
# Postmaster PID for node "standby" is 14025120
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tbl1 values ('in master,before promotion')
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO trunc_tbl values ('in master,before promotion')
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c INSERT INTO tail_tbl SELECT g, 'in master,before promotion: ' || g FROM generate_series(1, 10000) g
 
# Running: psql -q --no-psqlrc -d port=59882 host=/tmp/hr4cDmbIX0 dbname=postgres -c CHECKPOINT
### Promoting node "standby"
# Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log promote
 
server promoting
# 
Timed out while waiting for promotion of standby at RewindTest.pm line 166.

But look at what's in the standby's log:

LOG:  database system was interrupted; last known up at 2016-04-25 01:58:39 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/2000C00
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
LOG:  statement: SELECT NOT pg_is_in_recovery()
... 88 occurrences of that removed ...
LOG:  statement: SELECT NOT pg_is_in_recovery()
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command

The standby server didn't notice the promote request until AFTER
the perl script had probed 90 times, once a second, for promotion
to finish.  What's up with that?
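
(For concreteness, that probing amounts to the loop sketched below; the real
code is Perl in RewindTest.pm, so this libpq rendering is only a behavioural
equivalent, not the test's actual implementation.)

    /*
     * Behavioural sketch of the test's wait loop (the real test is Perl in
     * RewindTest.pm): poll once per second, up to 90 times, for the standby
     * to report that it has left recovery.
     */
    #include <string.h>
    #include <unistd.h>
    #include "libpq-fe.h"

    static int
    wait_for_promotion(const char *conninfo)
    {
        PGconn *conn = PQconnectdb(conninfo);
        int     i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            return -1;
        }
        for (i = 0; i < 90; i++)
        {
            PGresult *res = PQexec(conn, "SELECT NOT pg_is_in_recovery()");
            int       done = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                              strcmp(PQgetvalue(res, 0, 0), "t") == 0);

            PQclear(res);
            if (done)
            {
                PQfinish(conn);
                return 0;       /* promotion observed */
            }
            sleep(1);
        }
        PQfinish(conn);
        return -1;              /* timed out, as in the buildfarm failure */
    }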

One theory is that the postmaster handles promotion requests improperly.
Its only consideration of the problem is in sigusr1_handler:
    if (CheckPromoteSignal() && StartupPID != 0 &&
        (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
         pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
    {
        /* Tell startup process to finish recovery */
        signal_child(StartupPID, SIGUSR2);
    }

This means that if pg_ctl sends a SIGUSR1 after creating a promotion
trigger file, and the system state is not such that the signal can be
forwarded on to the startup process at that very instant, the signal is
simply forgotten.  We will reconsider sending it to the startup process
only when the postmaster next gets a SIGUSR1, which might not be for a
long time.  Now, I do not immediately see a way for this to happen: the
postmaster should create the startup process before it first enables
interrupts, and that list of states seems to cover every possibility.
But it seems fragile as heck.  Maybe we need to check for promotion in
PostmasterStateMachine or related places, not in (or not just in) the
signal handler.
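
Just to illustrate the shape of what I mean (a sketch only, not a tested
patch, and the placement is my assumption), PostmasterStateMachine could
simply repeat the handler's test, so that a request which arrived at an
inconvenient moment gets retried on the next pass through the state machine:

    /*
     * Hypothetical sketch, not a tested patch: re-check for a pending
     * promote request each time the state machine runs, assuming that
     * CheckPromoteSignal() merely tests for the trigger file and does
     * not consume it.
     */
    if (CheckPromoteSignal() && StartupPID != 0 &&
        (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
         pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
    {
        /* Tell startup process to finish recovery */
        signal_child(StartupPID, SIGUSR2);
    }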

The other theory is that the startup process received the SIGUSR2
but failed to act on it for a long time.  It checks for that only
in CheckForStandbyTrigger(), and the calls to that function are
in assorted rather unprincipled places, which makes it hard to
convince oneself that such a call will always happen soon.
This example sure looks like no such call happened.
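
For anyone not steeped in that code, the hazard is the usual
handler-sets-flag, loop-polls-flag pattern.  The names below are
hypothetical, not the actual startup.c identifiers, but they show why a long
gap between polls translates directly into a long promotion delay:

    #include <signal.h>
    #include <stdbool.h>

    /*
     * Generic illustration only; the identifiers are hypothetical, not
     * PostgreSQL's.  A signal handler can do no real work, so promotion
     * is noticed only when the replay loop next polls the flag.
     */
    static volatile sig_atomic_t promote_requested = 0;

    static void
    sigusr2_handler(int signo)
    {
        (void) signo;
        promote_requested = 1;      /* just record the request */
    }

    static bool
    check_for_standby_trigger(void)
    {
        /* If replay spends minutes inside one record, we never get here. */
        return promote_requested != 0;
    }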

We've seen previous examples of the same thing, eg
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
Noah raised the test's timeout from 30 seconds to 90 in response
to that failure, but it looks to me like this was a mis-diagnosis
of the problem.  What I suspect is happening is that if the signal
arrives at the startup process at the right time, probably around
the time it's opening up the streaming connection from the primary,
it doesn't get handled promptly.

I'm not planning to pursue this further, but someone who's more
familiar with the WAL-receiving logic in startup.c ought to.
        regards, tom lane



Re: Apparent race condition in standby promotion

From
Noah Misch
Date:
On Mon, Apr 25, 2016 at 02:09:27PM -0400, Tom Lane wrote:
> I'm looking at buildfarm member tern's recent failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08

> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log promote
> server promoting
> #
> Timed out while waiting for promotion of standby at RewindTest.pm line 166.
>
> But look at what's in the standby's log:
>
> LOG:  database system was interrupted; last known up at 2016-04-25 01:58:39 UTC
> LOG:  entering standby mode
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000C00
> LOG:  database system is ready to accept read only connections
> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> ... 88 occurrences of that removed ...
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> LOG:  received promote request
> FATAL:  terminating walreceiver process due to administrator command
>
> The standby server didn't notice the promote request until AFTER
> the perl script had probed 90 times, once a second, for promotion
> to finish.  What's up with that?

This resembles symptoms I studied last September.

> The other theory is that the startup process received the SIGUSR2
> but failed to act on it for a long time.  It checks for that only
> in CheckForStandbyTrigger(), and the calls to that function are
> in assorted rather unprincipled places, which makes it hard to
> convince oneself that such a call will always happen soon.

Right.  In every case, I caught the startup process taking too long in a
single XLOG_DBASE_CREATE replay.  I don't suspect a bug; that is a costly
record to replay, especially given the machine's slow filesystem metadata
operations.  I could reproduce such symptoms reliably, on any system, with the
attached demonstration patch.  While we could hack things until the message
prints earlier, that won't make promotion finish earlier.
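
The technique is nothing fancier than stalling that replay path; a minimal
sketch of the idea (hypothetical, and not the exact patch attached to this
message) would be to drop something like this into the XLOG_DBASE_CREATE
branch of dbase_redo():

    /*
     * Hypothetical sketch of the delay-injection idea (not the attached
     * patch): make the startup process spend a long time inside a single
     * XLOG_DBASE_CREATE record, so a promote request cannot be noticed
     * until that record's replay finishes.
     */
    if (info == XLOG_DBASE_CREATE)
        pg_usleep(120 * 1000000L);  /* pretend the directory copy took two minutes */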

> We've seen previous examples of the same thing, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
> Noah raised the test's timeout from 30 seconds to 90 in response
> to that failure, but it looks to me like this was a mis-diagnosis
> of the problem.

I can understand you guessing the 2016-02-09 failure inspired that change, but
it was either a coincidence or merely reminded me to commit an already-written
change.  I had performed burn-in runs, with timeouts raised to 900s, totalling
25177 executions of src/bin/pg_rewind/t/002_databases.pl.  The longest promote
duration among those runs was 43s, so I doubled that quantity and rounded up.

For the 2016-04-25 failure, note this bit of 001_basic_master.log:

LOG:  statement: CHECKPOINT
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
... 39 duplicate lines removed ...
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
LOG:  received immediate shutdown request

That's 41s just to copy WAL.  The slowest of those 002_databases.pl runs used
7s, and the worst I've seen in 2579 runs of 001_basic.pl is 22s.  The failed
run likely coincided with exceptional system load.

If this happens again anytime soon, I should triple (or so) that 90s timeout.

Attachments