Re: Changing the state of data checksums in a running cluster
From | Tomas Vondra |
---|---|
Subject | Re: Changing the state of data checksums in a running cluster |
Date | |
Msg-id | 47d946a3-c1c7-421d-a2b1-6a51cc329e6c@vondra.me |
In response to | Re: Changing the state of data checksums in a running cluster (Daniel Gustafsson <daniel@yesql.se>) |
List | pgsql-hackers |
On 8/25/25 20:32, Daniel Gustafsson wrote:
>> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
>
>> This happens quite regularly, it's not hard to hit. But I've only seen
>> it to happen on a FSM, and only right after immediate shutdown. I don't
>> think that's quite expected.
>>
>> I believe the built-in TAP tests (with injection points) can't catch
>> this, because there's no concurrent activity while flipping checksums
>> on/off. It'd be good to do something like that, by running pgbench in
>> the background, or something like that.
>
> In searching for this bug I opted for implementing a version of the stress
> tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
> version. It's gated behind PG_TEST_EXTRA since it's clearly not something
> which can be enabled by default (if this goes in this need to be re-done to
> provide two levels IMO, but during testing this is more convenient). I'm
> curious to see which improvements you can think to make it stress the code to
> the breaking point.
>

I think this TAP looks very nice, but there's a couple issues with it.
See the attached patch fixing those.

1) I think test_checksums should be in src/test/modules/Makefile?

2) The test_checksums/Makefile didn't seem to work for me, I was getting

    Makefile:23: *** recipe commences before first target. Stop.

Because there was a missing "\" so I had to fix that. And then it was
complaining about Makefile.global or something, so I fixed that by
cargo-culting what other Makefiles in test modules do. Now it seems to
work for me. I guess you're on meson?

3) I'm no perl expert, but AFAICS the test wasn't really running the
pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
that's only for initialization. The clusters had max_connections=10, but
the pgbench was using "-c 10", so I was getting "too many connections".
It was not setting "$pgbench_running = 1" so the other loops were
getting "too many connections" too. Another thing is I'm not sure it's
OK to pass '' to IPC::Run::start, I think it'll take it as an argument,
confusing pgbench.

With these changes it runs for me, and I even saw some

    LOG: page verification failed

in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
while - a couple minutes, maybe? I think I saw it at

    t/006_concurrent_pgbench.pl .. 427/?

or something like that.

I think the bash version did a couple things differently, which might
make the failures more frequent (but it's just a wild guess). In
particular, I think the script restarts the two nodes independently,
while the TAP always stops both primary and standby, in this order. I
think it'd be useful to restart one or both.

The other thing is the bash script added some random delays/sleep, which
increases the test duration, but it also means generating somewhat
random amounts of data, etc. It also randomized some other stuff (scale,
client count, ...). But that can wait.

regards

-- 
Tomas Vondra
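
The pgbench invocation problems listed in point 3 can be illustrated with a small, language-neutral sketch. This is not the code from the attached patch (the real test is Perl and uses IPC::Run); the function name `build_pgbench_cmd` and its parameters are hypothetical. It only shows the three checks discussed: no "-q" for a benchmark run, a client count strictly below max_connections, and no empty-string arguments in the argv.

```python
def build_pgbench_cmd(dbname, clients, max_connections, duration_s, extra_args=()):
    """Build an argv list for a pgbench benchmark run (hypothetical helper).

    Assumptions made for illustration only:
    - the run is a benchmark, not an initialization, so "-q" is not passed;
    - the test itself needs spare connection slots, so the client count
      must stay strictly below max_connections.
    """
    if clients >= max_connections:
        raise ValueError(
            f"-c {clients} leaves no free slots with max_connections={max_connections}"
        )

    # -n skips the initial vacuum, -c sets the client count, -T the duration.
    argv = ["pgbench", "-n", "-c", str(clients), "-T", str(duration_s)]
    argv += list(extra_args)
    argv.append(dbname)

    # An empty string would be passed to pgbench as a real (empty) argument,
    # so drop any that slipped in via optional parameters.
    return [arg for arg in argv if arg != ""]


if __name__ == "__main__":
    print(build_pgbench_cmd("postgres", 5, 10, 30, extra_args=["", "-P", "1"]))
```

With max_connections=10, a call using 10 clients raises an error up front instead of failing later with "too many connections".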