Thread: BUG: Former primary node might get stuck when started as a standby

BUG: Former primary node might get stuck when started as a standby

From: Alexander Lakhin
Date:
Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node that acted as a primary becomes a standby, as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not come online due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached.
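Its core is roughly the following (a simplified sketch only; the backup step,
the node handles, and the exact background activity are my shorthand here, the
attached file has the real details):

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# node1 starts out as the primary, node2 as its streaming standby.
my $node1 = PostgreSQL::Test::Cluster->new('node1');
$node1->init(allows_streaming => 1);
$node1->start;
$node1->backup('bkp');

my $node2 = PostgreSQL::Test::Cluster->new('node2');
$node2->init_from_backup($node1, 'bkp', has_streaming => 1);
$node2->start;

$node1->safe_psql('postgres', 'CREATE TABLE t(i int, s text)');
# ... some WAL-logged activity, running concurrently with the switchover ...

# Swap the roles without pg_rewind (the $primary/$standby lines quoted above).
$node1->teardown_node;             # immediate shutdown of the old primary
$node2->promote;                   # node2 forks a new timeline

$node1->enable_streaming($node2);  # point the old primary at the new one
$node1->start;                     # occasionally never reaches consistency

ok(1, 'placeholder; the real assertions live in the attached test');
done_testing();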
I put it in src/test/recovery/t, run as follows:
for i in `seq 100`; do
    echo "iteration $i"
    timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break
done
and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr  0.00 sys +  4.20 cusr  4.75 csys =  8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0:  - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG:  xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG:  server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG:  forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40:  - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG:  postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG:  received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68:  - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG:  xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG:  database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG:  fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG:  started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL:  End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG:  new timeline 20 forked off current database system timeline 19 before current recovery point 0/304DC68

In this case, node1 wrote record 0/304DC68 to its WAL but sent node2 only
up to record 0/304DBF0. node2, once promoted to primary, forked the next
timeline from that point, but when node1 was later started as a standby, it
first replayed 0/304DC68 from its local WAL and then could not switch to
the new timeline, which starts from the earlier position.
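For reference, those extra records can be seen with pg_waldump over node1's
pg_wal; a rough sketch of such a check, continuing the test sketch above
(LSNs taken from the log, option spellings as in the pg_waldump docs):

# Manual check, not part of the attached test: dump the records node1 still
# has beyond the last position node2 received on timeline 19.
my ($stdout, $stderr) = run_command(
	[ 'pg_waldump',
	  '--path',     $node1->data_dir . '/pg_wal',
	  '--timeline', '19',
	  '--start',    '0/304DBF0',
	  '--end',      '0/304DC68' ]);
note($stdout);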

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander
Attachments

Re: BUG: Former primary node might get stuck when started as a standby

From: Aleksander Alekseev
Date:
Hi,

> it might not come online due to the error:
> new timeline N forked off current database system timeline M before current recovery point X/X
> [...]
> In this case, node1 wrote record 0/304DC68 to its WAL but sent node2 only
> up to record 0/304DBF0. node2, once promoted to primary, forked the next
> timeline from that point, but when node1 was later started as a standby, it
> first replayed 0/304DC68 from its local WAL and then could not switch to
> the new timeline, which starts from the earlier position.

Unless I'm missing something, this is just the right behavior of the system.

node1 has no way of knowing the history of node1/node2/nodeN
promotion. It sees that it has more data and/or an inconsistent timeline
compared with another node and refuses to proceed until a DBA
intervenes. What else can node1 do, drop the data? That's not how
things are done in Postgres :) What if this is very important data
and node2 was promoted mistakenly, either manually or by a buggy
script?

It's been a while since I seriously played with replication, but if
memory serves, a proper way to switch node1 to replica mode would be
to use pg_rewind on it first.
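In terms of the test sketch upthread, that would be something along these
lines before restarting node1 (a sketch only; it assumes node1 satisfies
pg_rewind's requirements, i.e. a clean shutdown state and wal_log_hints = on
or data checksums):

# Rewind the old primary against the promoted node2 instead of starting it
# as-is; afterwards it can follow node2's timeline as a standby.
command_ok(
	[ 'pg_rewind',
	  '--target-pgdata', $node1->data_dir,
	  '--source-server', $node2->connstr('postgres') ],
	'pg_rewind old primary against the new one');

$node1->enable_streaming($node2);
$node1->start;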

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might get stuck when started as a standby

From: Alexander Lakhin
Date:
Hi Aleksander,

19.01.2024 14:45, Aleksander Alekseev wrote:
>
>> it might not come online due to the error:
>> new timeline N forked off current database system timeline M before current recovery point X/X
>> [...]
>> In this case, node1 wrote record 0/304DC68 to its WAL but sent node2 only
>> up to record 0/304DBF0. node2, once promoted to primary, forked the next
>> timeline from that point, but when node1 was later started as a standby, it
>> first replayed 0/304DC68 from its local WAL and then could not switch to
>> the new timeline, which starts from the earlier position.
> Unless I'm missing something, this is just the right behavior of the system.

Thank you for the answer!

> node1 has no way of knowing the history of node1/node2/nodeN
> promotion. It sees that it has more data and/or an inconsistent timeline
> compared with another node and refuses to proceed until a DBA
> intervenes.

But node1 knows that it's a standby now and it's expected to get all the
WAL records from the primary, doesn't it?
Maybe it could replay as few records as possible from its own WAL before
requesting records from the authoritative source...
Or is this supposed to be more performance-efficient (not on the first
restart, but on later ones)?

>   What else can node1 do, drop the data? That's not how
> things are done in Postgres :)

In case no other options exist (that is, this behavior really is correct
and the only one possible), maybe the server should just stop?
Can a DBA intervene somehow to make the server proceed without stopping it?

> It's been a while since I seriously played with replication, but if
> memory serves, a proper way to switch node1 to replica mode would be
> to use pg_rewind on it first.

Perhaps that's true in general, but as we can see, this scenario works
just fine without those extra records being replayed. Moreover, existing
tests rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in
fact, my investigation of this issue was prompted by a test failure).

Best regards,
Alexander



Re: BUG: Former primary node might get stuck when started as a standby

From: Aleksander Alekseev
Date:
Hi,

> But node1 knows that it's a standby now and it's expected to get all the
> WAL records from the primary, doesn't it?

Yes, but node1 doesn't know whether it has always been a standby or not.
What if node1 was always a standby, node2 was the primary, then node2 died
and node3 is the new primary? If node1 sees an inconsistency in the WAL
records, it should report it and stop doing anything, since it doesn't
have all the information needed to resolve the inconsistencies in all
the possible cases. Only the DBA has this information.

> > It's been a while since I seriously played with replication, but if
> > memory serves, a proper way to switch node1 to replica mode would be
> > to use pg_rewind on it first.
>
> Perhaps that's true in general, but as we can see, this scenario works
> just fine without those extra records being replayed. Moreover, existing
> tests rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in
> fact, my investigation of this issue was prompted by a test failure).

I suggest focusing on particular flaky tests then and how to fix them.

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might get stuck when started as a standby

From: Alexander Lakhin
Date:
Hi Aleksander,

[ I'm writing this off-list to minimize noise, but we can continue the discussion in -hackers, if you wish ]

22.01.2024 14:00, Aleksander Alekseev wrote:

> Hi,
>
>> But node1 knows that it's a standby now and it's expected to get all the
>> WAL records from the primary, doesn't it?
> Yes, but node1 doesn't know whether it has always been a standby or not.
> What if node1 was always a standby, node2 was the primary, then node2 died
> and node3 is the new primary?

Excuse me, but I still can't understand what could go wrong in this case.
Let's suppose node1 has WAL with the following contents before start:
CPLOC | TL1R1 | TL1R2 | TL1R3 |

while node2's WAL contains:
TL1R1 | TL2R1 | TL2R2 | ...

where CPLOC is a checkpoint location and TLxRy is record y on timeline x.

I assume that requesting all WAL records from node2, without redoing the
local records, would be the right thing to do.

And even in the situation you propose:
CPLOC | TL2R5 | TL2R6 | TL2R7 |

while node3's WAL contains:
TL2R5 | TL3R1 | TL3R2 | ...

I see no issue with applying records from node3...

> If node1 sees an inconsistency in the WAL
> records, it should report it and stop doing anything, since it doesn't
> have all the information needed to resolve the inconsistencies in all
> the possible cases. Only the DBA has this information.

I still wonder what can be considered an inconsistency in this situation.
Isn't it exactly the redo of all the local WAL records that creates the
inconsistency here?
For me, it's a question of an authoritative source, and if we had such a
source, we should trust its records only.

Or, in other words, what if the record TL1R3, which node1 wrote to its
WAL but didn't send to node2, happened to have an incorrect checksum (due
to a partial write, for example)?
If I understand correctly, node1 would just stop redoing WAL at that
position, receive all the following records from node2, and move forward
without reporting the inconsistency (the extra WAL record).

Best regards,
Alexander