Discussion: pg_rewind test race condition..?

pg_rewind test race condition..?

From:
Stephen Frost
Date:
Heikki,
 Not sure if anyone else is seeing this, but I'm getting regression test failures when running the pg_rewind tests
pretty consistently with 'make check'.  Specifically with "basic remote", I'm getting:
 

source and target cluster are on the same timeline
Failure, exiting
 in regress_log/pg_rewind_log_basic_remote.
 If I throw a "sleep(5);" into t/001_basic.pl before the call to RewindTest::run_pg_rewind($test_mode); then everything
works fine.
 
 ./configure --silent --prefix="$INSTALL" --with-openssl --with-tcl --with-tclconfig=/usr/lib/tcl8.6 --with-perl
--enable-debug --enable-cassert --enable-tap-tests --with-gssapi
 
 Full unsuccessful pg_rewind_log_basic_remote output:
-----------------------------------------------------------------
waiting for server to start....LOG:  database system was shut down at 2015-04-28 13:46:34 EDT
LOG:  database system is ready to accept connectionsdone
server started
waiting for server to start....LOG:  database system was interrupted; last known up at 2015-04-28 13:46:35 EDT
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/20000F8
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1done
server started
server promoting
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 0/30EB410
LOG:  redo done at 0/30EB3A0
LOG:  last completed transaction was at log time 2015-04-28 13:46:37.540952-04
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
waiting for server to shut down....LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
LOG:  database system is shut downdone
server stopped

source and target cluster are on the same timeline
Failure, exiting
waiting for server to start....LOG:  database system was shut down at 2015-04-28 13:46:39 EDT
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/3311438
LOG:  invalid record length at 0/3311438
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/30EB410.
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
LOG:  restarted WAL streaming at 0/3000000 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1 at 0/30EB410.
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438done
server started
LOG:  received immediate shutdown request
LOG:  received immediate shutdown request
-----------------------------------------------------------------
 Full successful pg_rewind_log_basic_remote output:

-----------------------------------------------------------------
waiting for server to start....LOG:  database system was shut down at 2015-04-28 13:54:04 EDT
LOG:  database system is ready to accept connectionsdone
server started
waiting for server to start....LOG:  database system was interrupted; last known up at 2015-04-28 13:54:05 EDT
LOG:  entering standby mode
LOG:  redo starts at 0/2000028
LOG:  consistent recovery state reached at 0/20000F8
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3000000 on timeline 1done
server started
server promoting
LOG:  received promote request
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 0/30EB410
LOG:  redo done at 0/30EB3A0
LOG:  last completed transaction was at log time 2015-04-28 13:54:07.547533-04
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
waiting for server to shut down....LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  shutting down
LOG:  database system is shut downdone
server stopped
The servers diverged at WAL position 0/30EB410 on timeline 1.
Rewinding from last common checkpoint at 0/30EB3A0 on timeline 1
Done!
waiting for server to start....LOG:  database system was interrupted while in recovery at log time 2015-04-28 13:54:08
EDT
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier
recovery target.
 
LOG:  entering standby mode
LOG:  redo starts at 0/30EB368
LOG:  consistent recovery state reached at 0/3113788
LOG:  invalid record length at 0/3113788
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3000000 on timeline 2done
server started
LOG:  received immediate shutdown request
LOG:  unexpected EOF on standby connection
LOG:  received immediate shutdown request
-----------------------------------------------------------------
Thanks!
    Stephen

Re: pg_rewind test race condition..?

From:
Heikki Linnakangas
Date:
On 04/28/2015 11:02 AM, Stephen Frost wrote:
> Heikki,
>
>    Not sure if anyone else is seeing this, but I'm getting regression
>    test failures when running the pg_rewind tests pretty consistently
>    with 'make check'.  Specifically with "basic remote", I'm getting:
>
> source and target cluster are on the same timeline
> Failure, exiting
>
>    in regress_log/pg_rewind_log_basic_remote.
>
>    If I throw a "sleep(5);" into t/001_basic.pl before the call to
>    RewindTest::run_pg_rewind($test_mode); then everything works fine.

The problem seems to be that when the standby is promoted, it's a 
so-called "fast promotion", where it writes an end-of-recovery record 
and starts accepting queries before creating a real checkpoint. 
pg_rewind looks at the TLI in the latest checkpoint, as it's in the 
control file, but that isn't updated until the checkpoint completes. I 
don't see it on my laptop normally, but I can reproduce it if I insert a 
"sleep(5)" in StartupXLog, just before it requests the checkpoint:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7173,7 +7173,10 @@ StartupXLOG(void)
      * than is appropriate now that we're not in standby mode anymore.
      */
     if (fast_promoted)
+    {
+        sleep(5);
         RequestCheckpoint(CHECKPOINT_FORCE);
+    }
 }
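
To see why pg_rewind then reports "source and target cluster are on the same timeline", here is a simplified, self-contained sketch (not the actual pg_rewind source; the structs are trimmed down to the one field that matters): the tool bases that decision solely on the TLI recorded in each cluster's latest checkpoint, and a fast promotion leaves that field at the old value until the first checkpoint on the new timeline has completed.

/*
 * Simplified sketch, not the real pg_rewind code.  Field names mirror
 * pg_control, but everything else is stripped away.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

typedef struct
{
    TimeLineID  ThisTimeLineID;     /* TLI of the latest checkpoint */
} CheckPoint;

typedef struct
{
    CheckPoint  checkPointCopy;     /* copy of the latest checkpoint record */
} ControlFileData;

int
main(void)
{
    /* old master: latest checkpoint is still on timeline 1 */
    ControlFileData ControlFile_target = {{ 1 }};

    /*
     * promoted standby: already running on timeline 2, but its control
     * file is not rewritten until the post-promotion checkpoint finishes
     */
    ControlFileData ControlFile_source = {{ 1 }};

    if (ControlFile_source.checkPointCopy.ThisTimeLineID ==
        ControlFile_target.checkPointCopy.ThisTimeLineID)
    {
        fprintf(stderr, "source and target cluster are on the same timeline\n");
        exit(1);
    }
    return 0;
}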

The simplest fix would be to force a checkpoint in the regression test, 
before running pg_rewind. It's a bit of a cop out, since you'd still get 
the same issue when you tried to do the same thing in the real world. It 
should be rare in practice - you'd not normally run pg_rewind 
immediately after promoting the standby - but a better error message at 
least would be nice..
- Heikki




Re: pg_rewind test race condition..?

From:
Stephen Frost
Date:
Heikki,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery
> record and starts accepting queries before creating a real
> checkpoint. pg_rewind looks at the TLI in the latest checkpoint, as
> it's in the control file, but that isn't updated until the
> checkpoint completes. I don't see it on my laptop normally, but I
> can reproduce it if I insert a "sleep(5)" in StartupXLog, just
> before it requests the checkpoint:

Ah, interesting.

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>       * than is appropriate now that we're not in standby mode anymore.
>       */
>      if (fast_promoted)
> +    {
> +        sleep(5);
>          RequestCheckpoint(CHECKPOINT_FORCE);
> +    }
>  }
>
> The simplest fix would be to force a checkpoint in the regression
> test, before running pg_rewind. It's a bit of a cop out, since you'd
> still get the same issue when you tried to do the same thing in the
> real world. It should be rare in practice - you'd not normally run
> pg_rewind immediately after promoting the standby - but a better
> error message at least would be nice..

Forcing a checkpoint in the regression tests and then providing a better
error message sounds reasonable to me.  I agree that it's very unlikely
to happen in the real world, even when you're bouncing between systems
for upgrades, etc, you're unlikely to do it fast enough for this issue
to exhibit itself, and a better error message would help any users who
manage to run into this (perhaps during their own testing).

Another thought would be to provide an option to pg_rewind to have it do
an explicit checkpoint before it reads the control file..  I'm not
against having it simply always do it as I don't see pg_rewind being a
commonly run thing, but I know some environments have very heavy
checkpoints and that might not be ideal.
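
Roughly what I have in mind, as a sketch only (not a patch; where this would hook into pg_rewind and how the option would be spelled are left open, only the libpq calls are real API): before reading the source server's control file, run an explicit CHECKPOINT there so pg_control already carries the post-promotion timeline.

/*
 * Sketch of a "checkpoint the source first" step for pg_rewind.
 * CHECKPOINT requires superuser, which pg_rewind needs anyway.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
checkpoint_source_server(const char *connstr)
{
    PGconn     *conn = PQconnectdb(connstr);
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "could not connect to source server: %s",
                PQerrorMessage(conn));
        PQfinish(conn);
        exit(1);
    }

    res = PQexec(conn, "CHECKPOINT");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "CHECKPOINT on source server failed: %s",
                PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
}

int
main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <source-connstring>\n", argv[0]);
        return 1;
    }
    checkpoint_source_server(argv[1]);
    return 0;
}
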
Thanks!
    Stephen

Re: pg_rewind test race condition..?

From:
Heikki Linnakangas
Date:
On 04/29/2015 06:03 AM, Stephen Frost wrote:
> * Heikki Linnakangas (hlinnaka@iki.fi) wrote:
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>>        * than is appropriate now that we're not in standby mode anymore.
>>        */
>>       if (fast_promoted)
>> +    {
>> +        sleep(5);
>>           RequestCheckpoint(CHECKPOINT_FORCE);
>> +    }
>>   }
>>
>> The simplest fix would be to force a checkpoint in the regression
>> test, before running pg_rewind. It's a bit of a cop out, since you'd
>> still get the same issue when you tried to do the same thing in the
>> real world. It should be rare in practice - you'd not normally run
>> pg_rewind immediately after promoting the standby - but a better
>> error message at least would be nice..
>
> Forcing a checkpoint in the regression tests and then providing a better
> error message sounds reasonable to me.  I agree that it's very unlikely
> to happen in the real world, even when you're bouncing between systems
> for upgrades, etc, you're unlikely to do it fast enough for this issue
> to exhibit itself, and a better error message would help any users who
> manage to run into this (perhaps during their own testing).

I've committed this simple fix for now.
- Heikki




Re: pg_rewind test race condition..?

From:
Stephen Frost
Date:
Heikki,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> On 04/29/2015 06:03 AM, Stephen Frost wrote:
> >Forcing a checkpoint in the regression tests and then providing a better
> >error message sounds reasonable to me.  I agree that it's very unlikely
> >to happen in the real world, even when you're bouncing between systems
> >for upgrades, etc, you're unlikely to do it fast enough for this issue
> >to exhibit itself, and a better error message would help any users who
> >manage to run into this (perhaps during their own testing).
>
> I've committed this simple fix for now.

Looks good, I'm no longer seeing the failures.
Thanks!
    Stephen

Re: pg_rewind test race condition..?

From:
Oleksii Kliukin
Date:
Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a 
> so-called "fast promotion", where it writes an end-of-recovery record 
> and starts accepting queries before creating a real checkpoint. 
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the 
> control file, but that isn't updated until the checkpoint completes. I 
> don't see it on my laptop normally, but I can reproduce it if I insert a 
> "sleep(5)" in StartupXLog, just before it requests the checkpoint:
> 
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>        * than is appropriate now that we're not in standby mode anymore.
>        */
>       if (fast_promoted)
> +       {
> +               sleep(5);
>           RequestCheckpoint(CHECKPOINT_FORCE);
> +       }
>   }
> 
> The simplest fix would be to force a checkpoint in the regression test, 
> before running pg_rewind. It's a bit of a cop out, since you'd still get 
> the same issue when you tried to do the same thing in the real world. It 
> should be rare in practice - you'd not normally run pg_rewind 
> immediately after promoting the standby - but a better error message at 
> least would be nice..

I think we came across this issue in production. We run a Python daemon
called 'Patroni' (https://github.com/zalando/patroni) in order to
automate failovers. It stops the current master before promotion of one
of the replicas, and runs pg_rewind on the former master to make it
"compatible" with the new one. Naturally, since pg_rewind is launched
automatically, it is called right after detecting that the new master is
running.

What we get sometimes is "source and target cluster are on the same
timeline". Per Heikki's explanation, it happens when pg_rewind connects
to the new master after promotion but before the first checkpoint on the
new timeline. Since "fast promotion" is the only option for pg_ctl
promote, I think it might be beneficial to add an option for pg_rewind
to issue a checkpoint (I worked around this problem by adding a
checkpoint call to Patroni; another way would be to create the
'fallback_promote' file and do the 'slow' promote without relying on
pg_ctl promote functionality).
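
For reference, here is a rough sketch of that 'fallback_promote' route (just an illustration of the mechanism, not what Patroni currently does; reading postmaster.pid and the error handling are simplified): create the fallback_promote file instead of the promote file that pg_ctl promote would create, then signal the postmaster, so the startup process writes a full end-of-recovery checkpoint before the server starts accepting connections and pg_control already shows the new timeline.

/*
 * Sketch: trigger a "slow" promotion without pg_ctl promote.
 */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>

int
main(int argc, char **argv)
{
    char    path[4096];
    FILE   *fp;
    long    pid;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <data-directory>\n", argv[0]);
        return 1;
    }

    /* create the fallback_promote trigger file */
    snprintf(path, sizeof(path), "%s/fallback_promote", argv[1]);
    if ((fp = fopen(path, "w")) == NULL)
    {
        perror("creating fallback_promote");
        return 1;
    }
    fclose(fp);

    /* the postmaster PID is on the first line of postmaster.pid */
    snprintf(path, sizeof(path), "%s/postmaster.pid", argv[1]);
    if ((fp = fopen(path, "r")) == NULL || fscanf(fp, "%ld", &pid) != 1)
    {
        perror("reading postmaster.pid");
        return 1;
    }
    fclose(fp);

    /* tell the postmaster to check for the promotion trigger */
    if (kill((pid_t) pid, SIGUSR1) != 0)
    {
        perror("kill(SIGUSR1)");
        return 1;
    }
    return 0;
}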

There is another weird case I cannot explain easily. If I do a single
promotion/rewind while executing an insert that takes quite some time
(insert into test select id from generate_series(1, 10000000) id), and
that insert is actually interrupted by the Patroni promotion (which does
a checkpoint and pg_ctl -mf stop on the former master, and pg_ctl
promote on the replica candidate), I'm getting the following
(postgresql0 is the rewound node; note that the latest checkpoint
location there is lower than the prior checkpoint one):

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:16 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/6020580
Min recovery ending loc's timeline:   2
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

Furthermore, on the newly promoted master the latest checkpoint timeline
ID is different:

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:07 2015
Latest checkpoint location:           0/6000138
Prior checkpoint location:            0/2000060
Latest checkpoint's REDO location:    0/60000C8
Latest checkpoint's REDO WAL file:    000000020000000000000006
Latest checkpoint's TimeLineID:       2
Latest checkpoint's PrevTimeLineID:   2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  944
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:05 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

If I do the second round of promotion/rewind, this time waiting until
the insert completes, I'm getting this (note that the target timeline
history goes ahead of the source one):

2015-12-08 17:21:01,991 INFO: running pg_rewind from user=postgres
host=127.0.0.1 port=5432 dbname=postgres sslmode=prefer sslcompression=1
fetched file "global/pg_control", length 8192
Source timeline history:
1: 0/0 - 0/0
Target timeline history:
1: 0/0 - 0/6000098
2: 0/6000098 - 0/0
servers diverged at WAL position 0/6000098 on timeline 1
rewinding from last common checkpoint at 0/6000028 on timeline 1
....
HINT:  If this has occurred more than once some data might be corrupted
and you might need to choose an earlier recovery target.
LOG:  restored log file "00000002.history" from archive
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/00000004.history: No such file or directory
LOG:  entering standby mode
LOG:  restored log file "00000003.history" from archive
cp: ../wal_archive/000000030000000000000006: No such file or directory
LOG:  restored log file "000000020000000000000006" from archive
DEBUG:  got WAL segment from archive
DEBUG:  checkpoint record is at 0/6000028
FATAL:  requested timeline 3 does not contain minimum recovery point
0/A01D180 on timeline 1

again, pg_controldata after the second promote:

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in production
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A0000C8
Min recovery ending loc's timeline:   3
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

$ pg_controldata data/postgresql1
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec  8 17:20:51 2015
Latest checkpoint location:           0/6000028
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/6000028
Latest checkpoint's REDO WAL file:    000000010000000000000006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/948
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/A01D180
Min recovery ending loc's timeline:   1
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0


This is with the master branch of PostgreSQL. In order to produce the
timeline comparison output above, I've applied a small fix to pg_rewind:

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ab82f0..a3ae8a4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -468,9 +468,9 @@ getTimelineHistory(ControlFileData *controlFile, int *nentries)
                        Assert(false);
 
                /*
-                * Print the target timeline history.
+                * Print the timeline history.
                 */
-               for (i = 0; i < targetNentries; i++)
+               for (i = 0; i < *nentries; i++)
                {
                        TimeLineHistoryEntry *entry;


--
Oleksii



Re: pg_rewind test race condition..?

From:
Michael Paquier
Date:
On Wed, Apr 29, 2015 at 10:03 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Another thought would be to provide an option to pg_rewind to have it do
> an explicit checkpoint before it reads the control file..  I'm not
> against having it simply always do it as I don't see pg_rewind being a
> commonly run thing, but I know some environments have very heavy
> checkpoints and that might not be ideal.

Waking an old thread. It seems to me that such an option would still
be useful when rewinding using a stream, so that an inconsistent control
file is not read. The default should definitely be to not issue this
checkpoint, but in some environments I believe that this could really
matter, say two failovers followed by two rewinds done consecutively.
Thoughts?
-- 
Michael