Re: BF member drongo doesn't like 035_standby_logical_decoding.pl
| От | Andres Freund | 
|---|---|
| Тема | Re: BF member drongo doesn't like 035_standby_logical_decoding.pl | 
| Дата | |
| Msg-id | j3dy36nv37uieq54xm2zoi5qzi7gfzohuf24vr5nqtusfxd2dj@xbuc2so4rrfx обсуждение исходный текст | 
| Ответ на | Re: BF member drongo doesn't like 035_standby_logical_decoding.pl (Andres Freund <andres@anarazel.de>) | 
| Список | pgsql-hackers | 
Hi, On 2025-01-24 14:23:26 -0500, Andres Freund wrote: > On 2025-01-24 11:42:15 -0500, Tom Lane wrote: > There's a failure before this one: > [05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class > > > > This has been happening for some time, in all three branches where > > that test script exists. The oldest failure that looks like that in > > the v16 branch is > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35 > > Hm. > > 2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at0/4000180 > 2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions. > 2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot"LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1') > ... > 2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot" > 2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747. > 2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected:0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0 > ... > 2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed,0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680 > 2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680 > ... > 2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0 > > There is no further message from 1612 and no intervening message by 3984 > (startup) between invalidating row_removal_inactiveslot and reaching the end > of the WAL. > > The regress_log does have: > > Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary > done > > > I.e. this looks to not be a failure of the test to find the right message or > the stats about the terminatio nnot being flushed, but that > row_removal_activeslot isn't getting terminated. And due to the replay_lsn > check on the primary, we know that replication must have progressed past the > point of the conflict. So the problem doesn't appear to be that we're waiting > for termination, but that somehow we determined that termination is not > necessary. > > Unfortunately with the logs available I can't really pinpoint why. I think we > need to > a) increase the postgres log level for the standby > b) add some explicit debug logging to report the contents of > pg_replication_slots > > I hope the issue is that somehow we didn't need to terminate, rather than > missing the need to terminate and just continuing on with replay. But I can't > think of a scenario in which there wouldn't be a need to terminate... <thinks> Hm, maybe I'm missing something, but isn't it possible for the active slot to actually progress decoding past the conflict point? It's an active slot, with the consumer running in the background, so all that needs to happen for that is that logical decoding progresses past the conflict point. That requires there be some reference to a newer xid to be in the WAL, but there's nothing preventing that afaict? In fact, I now saw this comment: # Note that pg_current_snapshot() is used to get the horizon. It does # not generate a Transaction/COMMIT WAL record, decreasing the risk of # seeing a xl_running_xacts that would advance an active replication slot's # catalog_xmin. Advancing the active replication slot's catalog_xmin # would break some tests that expect the active slot to conflict with # the catalog xmin horizon. Which seems precisely what's happening here? If that's the issue, I think we need to find a way to block logical decoding from making forward progress during the test. The easiest way would be to stop pg_recvlogical and emit a bunch of changes, so that the backend is stalled sending out data. But that'd require a hard to predict amount of data to be emitted, which isn't great. But perhaps we could do something smarter, by starting a session on the primary that acquires an access exclusive lock on a relation that logical decoding will need to access? The tricky bit likely would be that it'd somehow need to *not* prevent VACUUM on the primary. If we could trigger VACUUM in a transaction on the primary this would be easy, but we can't. Greetings, Andres
В списке pgsql-hackers по дате отправления: