Re: Changing the state of data checksums in a running cluster

From Tomas Vondra
Subject Re: Changing the state of data checksums in a running cluster
Msg-id 47d946a3-c1c7-421d-a2b1-6a51cc329e6c@vondra.me
In reply to Re: Changing the state of data checksums in a running cluster  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On 8/25/25 20:32, Daniel Gustafsson wrote:
>> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> This happens quite regularly, it's not hard to hit. But I've only seen
>> it to happen on a FSM, and only right after immediate shutdown. I don't
>> think that's quite expected.
>>
>> I believe the built-in TAP tests (with injection points) can't catch
>> this, because there's no concurrent activity while flipping checksums
>> on/off. It'd be good to do something like that, by running pgbench in
>> the background, or something like that.
> 
> In searching for this bug I opted for implementing a version of the stress
> tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
> version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
> which can be enabled by default (if this goes in this need to be re-done to
> provide two levels IMO, but during testing this is more convenient).  I'm
> curious to see which improvements you can think to make it stress the code to
> the breaking point.
> 

I think this TAP test looks very nice, but there are a couple of issues
with it. See the attached patch fixing those.

1) I think test_checksums should be listed in src/test/modules/Makefile?

2) The test_checksums/Makefile didn't work for me; I was getting

Makefile:23: *** recipe commences before first target.  Stop.

That was caused by a missing "\" (line continuation), so I fixed that.
Then it complained about Makefile.global or something, which I fixed by
cargo-culting what the Makefiles in other test modules do. Now it seems
to work for me. I guess you're on meson?

3) I'm no Perl expert, but AFAICS the test wasn't really running
pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
that option only applies to initialization. The clusters had
max_connections=10 while pgbench was using "-c 10", so I was getting
"too many connections". It also wasn't setting "$pgbench_running = 1",
so the other loops were getting "too many connections" too. Another
thing: I'm not sure it's OK to pass '' to IPC::Run::start. I think
it'll be taken as an argument, confusing pgbench.
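
To illustrate the concern about '' in a list-form command, here is a minimal sketch in Python rather than Perl. It uses subprocess only as an analogy; whether IPC::Run::start treats an empty string the same way is an assumption, not something verified here. The helper name count_args is made up for the example.

```python
import subprocess
import sys


def count_args(extra_args):
    """Run a child process and return how many arguments it received."""
    # The child simply prints the number of arguments after its own name.
    child = "import sys; print(len(sys.argv) - 1)"
    result = subprocess.run(
        [sys.executable, "-c", child, *extra_args],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# An empty string in the argument list is still delivered to the child
# as a (zero-length) argument; it is not silently dropped.
with_empty = count_args(["-c", "5", ""])
without_empty = count_args(["-c", "5"])
print(with_empty, without_empty)
```

If the same holds for the TAP test, filtering out empty entries before starting pgbench would avoid the stray argument.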

With these changes it runs for me, and I even saw some

   LOG: page verification failed

in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
while - a couple of minutes, maybe? I think I saw it at

    t/006_concurrent_pgbench.pl .. 427/?

or something like that. I think the bash version did a couple of things
differently, which might make the failures more frequent (but that's
just a wild guess).

In particular, I think the bash script restarts the two nodes
independently, while the TAP test always stops both primary and standby,
in that order. I think it'd be useful to restart just one of them, or
both.

The other thing is that the bash script added some random delays/sleeps,
which increases the test duration but also means generating somewhat
random amounts of data, etc. It also randomized some other stuff (scale,
client count, ...). But that can wait.
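
A rough sketch of what such per-iteration randomization could look like, in Python for brevity rather than as TAP code. Everything here is hypothetical: the function name plan_iteration, the value ranges, and the choice of stop modes are assumptions for illustration, not what the bash script or the TAP test actually does. The client count is capped below the max_connections=10 mentioned above.

```python
import random


def plan_iteration(rng):
    """Pick randomized settings for one stress-test iteration (illustrative)."""
    # Restart only the primary, only the standby, or both in either order.
    restart = rng.choice([
        ["primary"],
        ["standby"],
        ["primary", "standby"],
        ["standby", "primary"],
    ])
    return {
        "restart": restart,
        # pg_ctl stop modes; which ones to exercise is an assumption.
        "stop_mode": rng.choice(["fast", "immediate"]),
        # Random delay before the restart, in seconds (assumed range).
        "sleep_seconds": rng.uniform(1.0, 10.0),
        # Keep clients below max_connections so other sessions still fit.
        "clients": rng.randint(1, 8),
        # pgbench scale factor (assumed range).
        "scale": rng.randint(1, 10),
    }


# A fixed seed makes a failing sequence reproducible.
rng = random.Random(42)
plans = [plan_iteration(rng) for _ in range(5)]
for plan in plans:
    print(plan)
```

Logging the seed with each run would make it possible to replay a sequence that triggered a failure.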


regards

-- 
Tomas Vondra
