Discussion: Re: stats.sql might fail due to shared buffers also used by parallel tests
On Mon, 6 Jan 2025 at 12:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> Hello hackers,
>
> I'd like to share my investigation of one mysterious stats.sql failure
> that occurred in December: [1]. The diff from the failure is:
> --- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/stats.out 2024-09-18 19:31:14.665516500 +0000
> +++ C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-17 09:57:08.944588500 +0000
> @@ -1291,7 +1291,7 @@
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
>
> 027_stream_regress_primary.log contains:
> 2024-12-17 09:57:06.778 UTC [8568:109] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation'
> 2024-12-17 09:57:06.779 UTC [8568:110] pg_regress/stats LOG: statement: CREATE TABLE test_io_shared(a int);
> 2024-12-17 09:57:06.780 UTC [8568:111] pg_regress/stats LOG: statement: INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> ...
> 2024-12-17 09:57:06.782 UTC [8568:115] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.794 UTC [3664:40] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.856 UTC [3664:41] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.062 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1875 kB, estimate=52682 kB; lsn=0/14A2F410, redo lsn=0/14A2F3B8
> 2024-12-17 09:57:06.857 UTC [8568:116] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.857 UTC [3664:42] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.859 UTC [3664:43] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=47414 kB; lsn=0/14A2F4E0, redo lsn=0/14A2F488
> 2024-12-17 09:57:06.859 UTC [8568:117] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation'
> 2024-12-17 09:57:06.860 UTC [8568:118] pg_regress/stats LOG: statement: SELECT 77693 > 77693;
>
> The corresponding test case (added by 10a082bf7, dated 2023-02-11) is:
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
>   FROM pg_stat_io
>   WHERE object = 'relation' \gset io_sum_shared_before_
> ...
> CREATE TABLE test_io_shared(a int);
> INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> ...
> CHECKPOINT;
> CHECKPOINT;
> ...
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
>   FROM pg_stat_get_backend_io(pg_backend_pid())
>   WHERE object = 'relation' \gset my_io_sum_shared_after_
> SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
>
> So, as we can see from the log above, both checkpoints wrote 0 buffers;
> moreover, no other process/backend wrote any buffers.
>
> After a series of experiments, I found that this situation is possible when
> another backend steals shared buffers from the checkpointer without making
> new buffers dirty, and then delays flushing its statistics.
>
> Please find attached a reproducer for this issue: I've added two sleeps to
> stats.sql and reduced parallel_schedule, but I believe it's a legal change
> which doesn't affect the test case essentially.
>
> With bufchurn.sql running concurrently, and TEMP_CONFIG containing
> shared_buffers = 1MB (as in the 027_stream_regress test in question), I can
> see exactly the same failure:
> $ echo 'shared_buffers = 1MB' >/tmp/extra.config; TEMP_CONFIG=/tmp/extra.config make -s check
> # +++ regress check in src/test/regress +++
> # initializing database system by copying initdb template
> # using temp instance on port 65312 with PID 129700
> ok 1 - test_setup 356 ms
> ok 2 - create_index 1002 ms
> # parallel group (2 tests): stats bufchurn
> not ok 3 + stats 21176 ms
> ...
>
> diff -U3 .../src/test/regress/expected/stats.out .../src/test/regress/results/stats.out
> --- .../src/test/regress/expected/stats.out 2025-01-06 08:03:07.720275980 +0200
> +++ .../src/test/regress/results/stats.out 2025-01-06 08:15:34.602170721 +0200
> @@ -1321,7 +1321,7 @@
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
> ...
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-17%2008%3A59%3A44

Hi Hackers,

I have also encountered a similar buildfarm failure [1].
| 1/1 + subscription 142 ms FAIL
1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1

diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
--- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out 2025-06-28 20:13:02 +0900
+++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out 2025-06-28 20:35:21 +0900
@@ -70,7 +70,7 @@
 SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
  ?column?
 ----------
- t
+ f
 (1 row)

The logs around this statement do not show anything unexpected:
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION regress_testsub CONNECTION 'testconn' PUBLICATION testpub;
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription ERROR: publication name "foo" used more than once
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION foo, testpub, foo WITH (connect = false);
2025-06-28 20:35:20.858 JST client backend[6092] pg_regress/subscription WARNING: subscription was created, but is not connected
2025-06-28 20:35:20.858 JST client backend[6092] pg_regress/subscription HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
2025-06-28 20:35:20.861 JST client backend[6092] pg_regress/subscription ERROR: subscription "regress_testsub" already exists

I looked at the test in subscription.sql. The statements are:
-- Reset the stats again and check if the new reset_stats is updated.
SELECT stats_reset as prev_stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub' \gset
SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription WHERE subname = 'regress_testsub';
SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';

I suspect the following causes:

1) The time difference between 'prev_stats_reset' and the current
'stats_reset' value is less than 1 microsecond. 'stats_reset' is of type
'timestamp with time zone', with contents like
'2025-06-30 21:01:07.925253+05:30'. So if the difference between
'prev_stats_reset' and the current 'stats_reset' is less than 1 microsecond,
the query "SELECT :'prev_stats_reset' < stats_reset FROM
pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might return
'false' instead of 'true'.
But I was not able to reproduce such a scenario after multiple attempts.
Even on high-end machines, the reset takes at least a few microseconds.
Also, there are multiple similar timestamp comparisons in 'stats.sql',
and I didn't find any other failure of this kind. So I feel this is not
the cause.

2) The pg_stat_reset_subscription_stats(oid) function did not reset the
stats. We have a shared hash 'pgStatLocal.shared_hash'; if the entry
reference (for the subscription) is not found while executing
'pg_stat_reset_subscription_stats(oid)', it may not be able to reset the
stats. Maybe somehow this shared hash is getting dropped.

Also, it could be failing for the same reason as the case Alexander has
reproduced elsewhere [2].

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-06-28%2011%3A02%3A30
[2]: https://www.postgresql.org/message-id/fe0391a8-dfa9-41c3-bf1c-7ea058e40f30%40gmail.com

Thanks and Regards,
Shlok Kyal
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Shlok,

03.07.2025 09:54, Shlok Kyal wrote:
> I have also encountered a similar buildfarm failure [1].
>
> | 1/1 + subscription 142 ms FAIL
> 1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1
>
> diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out 2025-06-28 20:13:02 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out 2025-06-28 20:35:21 +0900
> @@ -70,7 +70,7 @@
>  SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
>   ?column?
>  ----------
> - t
> + f
>  (1 row)

Thank you for your attention to this! Yes, I also noticed that failure,
but could not reproduce it yet.

> So according to me, I suspect the following causes
> 1) The time difference between 'prev_stats_reset' and current
> 'stats_reset' value is less than 1 microsecond. 'stats_reset' is of type
> 'timestamp with time zone' and its contents are like
> '2025-06-30 21:01:07.925253+05:30'. So if the time difference between
> 'prev_stats_reset' and current 'stats_reset' is less than 1 microsecond,
> the query "SELECT :'prev_stats_reset' < stats_reset FROM
> pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might
> return 'false' instead of 'true'.
> But I was not able to reproduce such a scenario after multiple attempts.
> Even on high-end machines, it takes at least a few microseconds. Also
> there are multiple similar timestamp comparisons in 'stats.sql', and I
> didn't find any other failure related to such a case. So I feel this is
> not possible.

Did you try that on Windows (hamerkop is a Windows animal)?
IIUC, GetCurrentTimestamp() -> gettimeofday() is implemented on Windows via
GetSystemTimePreciseAsFileTime(), and it has 100 ns resolution, according
to [1]. I tried it on a Windows VM and saw the following numbers:
SELECT :'prev_stats_reset' < stats_reset, :'prev_stats_reset', stats_reset
  FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
 ?column? |              ?column?               |             stats_reset
----------+-------------------------------------+-------------------------------------
 t        | Fri Jul 04 11:46:10.224323 2025 PDT | Fri Jul 04 11:46:10.224975 2025 PDT
(1 row)

that is, the difference is about 650 µs; only on a machine fast enough for
both reads to land within the same microsecond would the test fail.
Although the overall duration of the subscription test doesn't differ much:

My VM:
# parallel group (2 tests): subscription publication
ok 164 + publication 806 ms
ok 165 + subscription 210 ms

vs hamerkop:
# parallel group (2 tests): subscription publication
ok 164 + publication 2383 ms
not ok 165 + subscription 142 ms

So probably the failure is caused by something else.

> 2) pg_stat_reset_subscription_stats(oid) function did not reset the stats.
> We have a shared hash 'pgStatLocal.shared_hash'. If the entry reference
> (for the subscription) is not found while executing
> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset the
> stats. Maybe somehow this shared hash is getting dropped.
> Also, it could be failing due to the same reason as Alexander has

I don't think 2) is relevant here, because shared buffers shouldn't affect
subscription's statistics.

[1] https://stackoverflow.com/questions/4444090/measuring-time-with-a-resolution-of-microseconds-in-c

Best regards,
Alexander
RE: stats.sql might fail due to shared buffers also used by parallel tests
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,

>> So according to me, I suspect the following causes
>> 1) The time difference between 'prev_stats_reset' and current
>> 'stats_reset' value is less than 1 microsecond. 'stats_reset' is of
>> type 'timestamp with time zone' and its contents are like
>> '2025-06-30 21:01:07.925253+05:30'. So if the time difference between
>> 'prev_stats_reset' and current 'stats_reset' is less than 1
>> microsecond, the query "SELECT :'prev_stats_reset' < stats_reset FROM
>> pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might
>> return 'false' instead of 'true'.
>> But I was not able to reproduce such a scenario after multiple
>> attempts. Even on high-end machines, it takes at least a few
>> microseconds. Also there are multiple similar timestamp comparisons in
>> 'stats.sql', and I didn't find any other failure related to such a
>> case. So I feel this is not possible.
> Did you try that on Windows (hamerkop is a Windows animal)? IIUC,
> GetCurrentTimestamp() -> gettimeofday() is implemented on Windows via
> GetSystemTimePreciseAsFileTime(), and it has 100 ns resolution,

Hmm. I'm not familiar with the Windows environment, but I have some doubts
about this. GetSystemTimePreciseAsFileTime() returns a FILETIME structure,
which represents UTC time in 100-nanosecond intervals [1]; the Stack
Overflow post seems to refer to that. However, the documentation for
GetSystemTimePreciseAsFileTime() says that its resolution is < 1 us [2].
Also, the MS doc [3] does not say that GetSystemTimePreciseAsFileTime()
returns monotonically increasing values; another API,
QueryPerformanceCounter(), seems to have that property. A somewhat older
document [4] also raised the possibility:

```
Consecutive calls may return the same result. The call time is less than
the smallest increment of the system time. The granularity is in the
sub-microsecond regime. The function may be used for time measurements but
some care has to be taken: Time differences may be ZERO.
```

Also, what if the system clock is modified during the test via NTP?

>> 2) pg_stat_reset_subscription_stats(oid) function did not reset the
>> stats. We have a shared hash 'pgStatLocal.shared_hash'. If the entry
>> reference (for the subscription) is not found while executing
>> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset
>> the stats. Maybe somehow this shared hash is getting dropped.
>> Also, it could be failing due to the same reason as Alexander has
> I don't think 2) is relevant here, because shared buffers shouldn't
> affect subscription's statistics.

To confirm: we do not consider the possibility that pgstat_get_entry_ref()
returns NULL, right?

[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime
[2]: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime
[3]: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps
[4]: http://www.windowstimestamp.com/description#:~:text=2.1.4.2.%C2%A0%C2%A0Desktop%20Applications%3A%20GetSystemTimePreciseAsFileTime()

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san,
Thank you for your attention to this!
15.07.2025 10:33, Hayato Kuroda (Fujitsu) wrote:
> GetSystemTimePreciseAsFileTime() returns a FILETIME structure, which
> represents UTC time in 100-nanosecond intervals [1]; the Stack Overflow
> post seems to refer to that. However, the documentation for
> GetSystemTimePreciseAsFileTime() says that its resolution is < 1 us [2].
> Also, the MS doc [3] does not say that GetSystemTimePreciseAsFileTime()
> returns monotonically increasing values; another API,
> QueryPerformanceCounter(), seems to have that property. A somewhat older
> document [4] also raised the possibility:
> ```
> Consecutive calls may return the same result. The call time is less than
> the smallest increment of the system time. The granularity is in the
> sub-microsecond regime. The function may be used for time measurements
> but some care has to be taken: Time differences may be ZERO.
> ```
> Also, what if the system clock is modified during the test via NTP?
Yeah, I made a simple test for GetSystemTimePreciseAsFileTime() and
confirmed that in my VM it provides sub-microsecond precision. Regarding
NTP, I think the second failure of this ilk [1] makes this cause close to
impossible. (Can't wait for the third one to gather more information.)
>>> 2) pg_stat_reset_subscription_stats(oid) function did not reset the
>>> stats. We have a shared hash 'pgStatLocal.shared_hash'. If the entry
>>> reference (for the subscription) is not found while executing
>>> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset
>>> the stats. Maybe somehow this shared hash is getting dropped.
>>> Also, it could be failing due to the same reason as Alexander has
>> I don't think 2) is relevant here, because shared buffers shouldn't
>> affect subscription's statistics.
> To confirm: we do not consider the possibility that
> pgstat_get_entry_ref() returns NULL, right?
I've held a simple experiment with a modification like this:
@@ -1078,6 +1078,7 @@ pgstat_reset_entry(PgStat_Kind kind, Oid dboid, uint64 objid, TimestampTz ts)
Assert(!pgstat_get_kind_info(kind)->fixed_amount);
entry_ref = pgstat_get_entry_ref(kind, dboid, objid, false, NULL);
+if (rand() % 3 == 0) entry_ref = NULL;
if (!entry_ref || entry_ref->shared_entry->dropped)
and got several failures like:
--- .../postgresql/src/test/regress/expected/subscription.out 2025-04-25 10:27:32.851554400 -0700
+++ .../postgresql/build/testrun/regress/regress/results/subscription.out 2025-07-20 00:05:05.667903300 -0700
@@ -56,7 +56,7 @@
SELECT subname, stats_reset IS NULL stats_reset_is_null FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
subname | stats_reset_is_null
-----------------+---------------------
- regress_testsub | f
+ regress_testsub | t
(1 row)
-- Reset the stats again and check if the new reset_stats is updated.
@@ -68,11 +68,9 @@
(1 row)
SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
- ?column?
-----------
- t
-(1 row)
-
+ERROR: syntax error at or near ":"
+LINE 1: SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscr...
+
--- .../postgresql/src/test/regress/expected/stats.out 2025-04-25 10:27:36.930322500 -0700
+++ .../postgresql/build/testrun/regress/regress/results/stats.out 2025-07-20 00:05:19.579864900 -0700
@@ -1720,7 +1720,7 @@
SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
?column?
----------
- t
+ f
(1 row)
...
Thus, if there is some issue with pgstat_get_entry_ref(), then it should
be specific to subscriptions and come out in that place only (given the
information we have now).
So I still suspect some peculiarity of Windows or of this particular animal.
Nagata-san, could you please share the configuration of hamerkop? If it's
running inside VM, what virtualization software is used?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-09%2011%3A02%3A23
Best regards.
Alexander
Hello Alexander,

> Nagata-san, could you please share the configuration of hamerkop? If it's
> running inside VM, what virtualization software is used?
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-09%2011%3A02%3A23

I am not the person in charge of it, so I added the responsible email
address to the CC list. Takatsuka-san, could you please take care of this?

Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san,

20.07.2025 11:00, Alexander Lakhin wrote:
> Yeah, I made a simple test for GetSystemTimePreciseAsFileTime() and
> confirmed that in my VM it provides sub-microsecond precision. Regarding
> NTP, I think the second failure of this ilk [1] makes this cause close to
> impossible. (Can't wait for the third one to gather more information.)

And here it is [1]:
diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out
--- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out 2025-07-22 20:08:30 +0900
+++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out 2025-07-22 20:30:47 +0900
@@ -3729,7 +3729,7 @@
 name |pg_stat_get_function_calls|total_above_zero|self_above_zero
 --------------+--------------------------+----------------+---------------
-test_stat_func| 1|t |t
+test_stat_func| 1|f |f
 (1 row)

Not related to subscriptions this time, but still related to pg_stat and
time measurement.

There was the same-looking failure on caiman, which is running on
continuously updated Fedora: [2]. I could not reproduce that one either,
but suspected the OS kernel: it had
6.14.0-0.rc3.20250219git6537cfb395f3.31.fc43.x86_64 at the moment of the
failure and updated to 6.14.0-0.rc4.36.fc43.x86_64 on 2025-02-27. There
have been no failures of this kind since then.

So I think we could observe such anomalies if, say, the OS kernel can't
read the system clock in time (stalls for a millisecond when accessing it)...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-22%2011%3A02%3A15
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2025-02-23%2006%3A54%3A54

Best regards,
Alexander
RE: stats.sql might fail due to shared buffers also used by parallel tests
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,

> And here it is [1]:
> diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out 2025-07-22 20:08:30 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out 2025-07-22 20:30:47 +0900
> @@ -3729,7 +3729,7 @@
>  name |pg_stat_get_function_calls|total_above_zero|self_above_zero
>  --------------+--------------------------+----------------+---------------
> -test_stat_func| 1|t |t
> +test_stat_func| 1|f |f
>  (1 row)
>
> Not related to subscriptions this time, but still related to pg_stat and
> time measurement.

It looks to me like we measured the execution time of the function in
milliseconds, but it came out as zero, right?

> So I think we could observe such anomalies if, say, the OS kernel can't
> read the system clock in time (stalls for a millisecond when accessing it)...

I feel the same. But if so, how should we fix the tests? Must we remove
everything which assumes that time is monotonic?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san and Takatsuka-san,
24.07.2025 03:49, TAKATSUKA Haruka wrote:
> {snip}
>> Maybe you could try tools.syncTime = "0" by any chance?
> It has been tools.syncTime = "0" already. I confirmed the following GUI
> setting. ...

23.07.2025 09:15, Hayato Kuroda (Fujitsu) wrote:
> It looks to me like we measured the execution time of the function in
> milliseconds, but it came out as zero, right?

Yes, my understanding is the same.

>> So I think we could observe such anomalies if, say, the OS kernel can't
>> read the system clock in time (stalls for a millisecond when accessing
>> it)...
> I feel the same. But if so, how should we fix the tests? Must we remove
> everything which assumes that time is monotonic?
From what Takatsuka-san shared on hamerkop's configuration, I still
suspect there could be some platform specifics there. I've found another
interesting reading on the subject, which describes effects of CPU
pressure and mentions other low-level parameters, e. g.
monitor_control.virtual_rdtsc: [1].
Probably there could be some experiments performed there to measure the
maximum timer resolution (e. g. with a simple program attached).
I also observed a failure of pg_stat_statements on ARMv7 device in the past:
--- .../contrib/pg_stat_statements/expected/entry_timestamp.out 2024-04-11 07:20:32.563588101 +0300
+++ .../contrib/pg_stat_statements/results/entry_timestamp.out 2024-04-15 11:16:00.217396694 +0300
@@ -45,7 +45,7 @@
WHERE query LIKE '%STMTTS%';
total | minmax_plan_zero | minmax_exec_zero | minmax_stats_since_after_ref | stats_since_after_ref
-------+------------------+------------------+------------------------------+-----------------------
- 2 | 0 | 0 | 0 | 0
+ 2 | 0 | 1 | 0 | 0
(1 row)
with clocksource = 32k_counter, which gave me a maximum resolution of
0.030517 s.
So if we choose to fix the tests, it's not clear to me what the lowest
acceptable timer resolution should be.
[1] https://www.vmware.com/docs/vmware_timekeeping
Best regards,
Alexander