Test 031_recovery_conflict fails when a conflict counted twice
От | Alexander Lakhin |
---|---|
Тема | Test 031_recovery_conflict fails when a conflict counted twice |
Дата | |
Msg-id | 153ad43d-6ea2-6281-7781-dfbc1370f060@gmail.com обсуждение исходный текст |
Ответы |
Re: Test 031_recovery_conflict fails when a conflict counted twice
|
Список | pgsql-hackers |
Hello hackers, As a recent buildfarm test failure on olingo (which builds postgres with -O1 and address sanitizer) [1] shows: ... [23:12:02.127](0.166s) not ok 6 - snapshot conflict: stats show conflict on standby [23:12:02.130](0.003s) # Failed test 'snapshot conflict: stats show conflict on standby' # at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332. [23:12:02.130](0.000s) # got: '2' # expected: '1' ... [23:12:06.848](1.291s) not ok 17 - 5 recovery conflicts shown in pg_stat_database [23:12:06.887](0.040s) # Failed test '5 recovery conflicts shown in pg_stat_database' # at /home/bf/bf-build/olingo/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 286. [23:12:06.887](0.000s) # got: '6' # expected: '5' Waiting for replication conn standby's replay_lsn to pass 0/3459160 on primary done ... pgsql.build/testrun/recovery/031_recovery_conflict/log/031_recovery_conflict_standby.log: 2024-05-15 23:12:01.616 UTC [1299981][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor; 2024-05-15 23:12:01.617 UTC [1299981][client backend][2/2:0] LOG: statement: ; 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] LOG: recovery still waiting after 15.289 ms: recovery conflict on snapshot 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] DETAIL: Conflicting process: 1299981. 2024-05-15 23:12:01.910 UTC [1297595][startup][34/0:0] CONTEXT: WAL redo at 0/344F468 for Heap2/PRUNE_VACUUM_SCAN: snapshotConflictHorizon: 746, isCatalogRel: F, nplans: 2, nredirected: 18, ndead: 0, nunused: 0, plans: [{ xmax: 0, infomask: 2816, infomask2: 2, ntuples: 2, offsets: [21, 22] }, { xmax: 0, infomask: 11008, infomask2: 32770, ntuples: 18, offsets: [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] }], redirected: [23->41, 24->42, 25->43, 26->44, 27->45, 28->46, 29->47, 30->48, 31->49, 32->50, 33->51, 34->52, 35->53, 36->54, 37->55, 38->56, 39->57, 40->58]; blkref #0: rel 1663/16385/16386, blk 0 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] DETAIL: User query might have needed to see row versions that must be removed. 2024-05-15 23:12:01.959 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command. vvv an important difference from a successful test run 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] LOG: could not send data to client: Broken pipe 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] FATAL: terminating connection due to conflict with recovery 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] DETAIL: User query might have needed to see row versions that must be removed. 2024-05-15 23:12:01.966 UTC [1299981][client backend][2/2:0] HINT: In a moment you should be able to reconnect to the database and repeat your command. ^^^ test 031_recovery_conflict may fail in the following scenario: 031_recovery_conflict.pl: executes a query, which produces a conflict: ## RECOVERY CONFLICT 2: Snapshot conflict ... $psql_standby->query_safe(...) startup process: detects a snapshot conflict and sends PROCSIG_RECOVERY_CONFLICT_SNAPSHOT: (ResolveRecoveryConflictWithVirtualXIDs -> CancelVirtualTransaction) to the client backend client backend: receives and processes the signal: HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt -> ProcessInterrupts -> ProcessRecoveryConflictInterrupts -> ProcessRecoveryConflictInterrupt, reports the recovery conflict: pgstat_report_recovery_conflict(reason); and reports the error: ereport(FATAL, ... "terminating connection due to conflict with recovery" ...) sends the message to the server log: errfinish -> EmitErrorReport -> send_message_to_server_log 031_recovery_conflict.pl: # finds the message in the log and resets psql connection: check_conflict_log( "User query might have needed to see row versions that must be removed"); $psql_standby->reconnect_and_clear(); startup process: keeps sending PROCSIG_RECOVERY_CONFLICT_SNAPSHOT to the client backend in a loop inside ResolveRecoveryConflictWithVirtualXIDs client backend: tries to send the message to the client: send_message_to_frontend -> socket_flush -> internal_flush_buffer, gets an error (EPIPE) from secure_write, and calls ereport(COMMERROR, (errcode_for_socket_access(), errmsg("could not send data to client: %m"))); receives the following PROCSIG_RECOVERY_CONFLICT_SNAPSHOT signal and processes it the same way: HandleRecoveryConflictInterrupt; ProcessClientReadInterrupt -> ProcessInterrupts -> ProcessRecoveryConflictInterrupts -> ProcessRecoveryConflictInterrupt, reports the recovery conflict: pgstat_report_recovery_conflict(reason); // now the conflict is counted twice and reports the error: ereport(FATAL, ... "terminating connection due to conflict with recovery" ...) sends the message to the server log: errfinish -> EmitErrorReport -> send_message_to_server_log 031_recovery_conflict.pl: calls check_conflict_stat("snapshot"); and gets 2 instead of 1. The patch adding delays to reproduce the issue is attached. With the patch applied, I run the test (against an "-O0" build) in a loop: for i in `seq 20`; do echo "I $i"; make check -s -C \ src/test/recovery/ PROVE_TESTS="t/031*"; grep ' not ok 6 ' \ src/test/recovery/tmp_check/log/regress_log_031_recovery_conflict && break; done and get exactly the same failure as on olingo: I 1 # +++ tap check in src/test/recovery +++ t/031_recovery_conflict.pl .. 6/? # Failed test 'snapshot conflict: stats show conflict on standby' # at t/031_recovery_conflict.pl line 333. # got: '2' # expected: '1' t/031_recovery_conflict.pl .. 13/? # Failed test '5 recovery conflicts shown in pg_stat_database' # at t/031_recovery_conflict.pl line 287. # got: '6' # expected: '5' # Looks like you failed 2 tests of 18. t/031_recovery_conflict.pl .. Dubious, test returned 2 (wstat 512, 0x200) Failed 2/18 subtests Test Summary Report ------------------- t/031_recovery_conflict.pl (Wstat: 512 Tests: 18 Failed: 2) Failed tests: 6, 17 Non-zero exit status: 2 (Similar failures can be seen with other sections of 031_recovery_conflict.) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-05-15%2023%3A03%3A30 Best regards, Alexander
Вложения
В списке pgsql-hackers по дате отправления: