Re: stats.sql might fail due to shared buffers also used by parallel tests

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема Re: stats.sql might fail due to shared buffers also used by parallel tests
Дата
Msg-id e9d78260-d401-406b-9e3a-9f939f27359b@gmail.com
обсуждение исходный текст
Ответ на Re: stats.sql might fail due to shared buffers also used by parallel tests  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
Список pgsql-hackers
Hello Shlok,

03.07.2025 09:54, Shlok Kyal wrote:
> I have also encountered a similar buildfarm failure [1].
>
> | 1/1 + subscription                              142 ms FAIL
> 1/1 postgresql:regress / regress/regress ERROR           284.85s   exit status 1
>
> diff --strip-trailing-cr -U3
> c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
> c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out
> 2025-06-28 20:13:02 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> 2025-06-28 20:35:21 +0900
> @@ -70,7 +70,7 @@
>   SELECT :'prev_stats_reset' < stats_reset FROM
> pg_stat_subscription_stats WHERE subname = 'regress_testsub';
>    ?column?
>   ----------
> - t
> + f
>   (1 row)

Thank you for your attention to this! Yes, I also noticed that failure,
but could not reproduce it yet.

> So according to me, I suspect the following causes
> 1) The time difference between 'prev_stats_reset' and current
> 'stats_reset' value is less than 1 microseconds.
> 'stats_reset' is of type 'timestamp with time zone' and the content of
> it is like: '2025-06-30 21:01:07.925253+05:30'. So if the time
> difference between 'prev_stats_reset' and current 'stats_reset' is
> less than 1 microseconds. The query 'SELECT :'prev_stats_reset' <
> stats_reset FROM pg_stat_subscription_stats WHERE subname =
> 'regress_testsub'' might return 'false' instead of 'true'.
> But I was not able to reproduce such a scenario after multiple
> testing. Even in high end machines, it takes at least a few
> microseconds. Also there are multiple cases where we did similar
> timestamp comparison in 'stats.sql' as well. And, I didn't find any
> other failure related to such case. So, I feel this is not possible.

Did you try that on Windows (hamerkop is a Windows animal)? IIUC,
GetCurrentTimestamp() -> gettimeofday() implemented on Windows via
GetSystemTimePreciseAsFileTime(), and it has 100ns resolution,
according to [1].

I tried it on a Windows VM and saw the following numbers:
SELECT :'prev_stats_reset' < stats_reset, :'prev_stats_reset', stats_reset FROM pg_stat_subscription_stats WHERE
subname
 
= 'regress_testsub';
  ?column? |              ?column?               | stats_reset
----------+-------------------------------------+-------------------------------------
  t        | Fri Jul 04 11:46:10.224323 2025 PDT | Fri Jul 04 11:46:10.224975 2025 PDT
  (1 row)

that is, the difference is about 600 ns — if I could make the VM 6x faster
or run the test on a bare metal, maybe I'd get the test failed.

Although the overall duration of the subscription doesn't differ much:
My VM:
# parallel group (2 tests):  subscription publication
ok 164       + publication                               806 ms
ok 165       + subscription                              210 ms
vs
hamerkop:
# parallel group (2 tests):  subscription publication
ok 164       + publication                              2383 ms
not ok 165   + subscription                              142 ms

So probably the failure is caused by something else.

> 2) pg_stat_reset_subscription_stats(oid) function did not reset the stats.
> We have a shared hash 'pgStatLocal.shared_hash'. If the entry
> reference (for the subscription) is not found while executing
> 'pg_stat_reset_subscription_stats(oid)'. It  may not be able to reset
> the stats. Maybe somehow this shared hash is getting dropped..
> Also, it could be failing due to the same reason as Alexander has

I don't think 2) is relevant here, because shared buffers shouldn't affect
subscription's statistics.

[1] https://stackoverflow.com/questions/4444090/measuring-time-with-a-resolution-of-microseconds-in-c

Best regards.
Alexander



В списке pgsql-hackers по дате отправления: