Discussion: Re: stats.sql might fail due to shared buffers also used by parallel tests
On Mon, 6 Jan 2025 at 12:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> Hello hackers,
>
> I'd like to share my investigation of one mysterious stats.sql failure
> that occurred in December: [1]. The diff from the failure is:
> --- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/stats.out 2024-09-18 19:31:14.665516500 +0000
> +++ C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-17 09:57:08.944588500 +0000
> @@ -1291,7 +1291,7 @@
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
>
> 027_stream_regress_primary.log contains:
> 2024-12-17 09:57:06.778 UTC [8568:109] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation'
> 2024-12-17 09:57:06.779 UTC [8568:110] pg_regress/stats LOG: statement: CREATE TABLE test_io_shared(a int);
> 2024-12-17 09:57:06.780 UTC [8568:111] pg_regress/stats LOG: statement: INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> ...
> 2024-12-17 09:57:06.782 UTC [8568:115] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.794 UTC [3664:40] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.856 UTC [3664:41] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.062 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1875 kB, estimate=52682 kB; lsn=0/14A2F410, redo lsn=0/14A2F3B8
> 2024-12-17 09:57:06.857 UTC [8568:116] pg_regress/stats LOG: statement: CHECKPOINT;
> 2024-12-17 09:57:06.857 UTC [3664:42] LOG: checkpoint starting: immediate force wait
> 2024-12-17 09:57:06.859 UTC [3664:43] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=47414 kB; lsn=0/14A2F4E0, redo lsn=0/14A2F488
> 2024-12-17 09:57:06.859 UTC [8568:117] pg_regress/stats LOG: statement: SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE object = 'relation'
> 2024-12-17 09:57:06.860 UTC [8568:118] pg_regress/stats LOG: statement: SELECT 77693 > 77693;
>
> The corresponding test case (added by 10a082bf7, dated 2023-02-11) is:
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
>   FROM pg_stat_io
>   WHERE object = 'relation' \gset io_sum_shared_before_
> ...
> CREATE TABLE test_io_shared(a int);
> INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
> ...
> CHECKPOINT;
> CHECKPOINT;
> ...
> SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
>   FROM pg_stat_get_backend_io(pg_backend_pid())
>   WHERE object = 'relation' \gset my_io_sum_shared_after_
> SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
>
> So, as we can see from the log above, both checkpoints wrote 0 buffers;
> moreover, no other process/backend wrote any buffers.
>
> After a series of experiments, I found that this situation is possible when
> another backend steals shared buffers from the checkpointer without making
> new buffers dirty, and then delays flushing its statistics.
>
> Please find attached a reproducer for this issue: I've added two sleeps to
> stats.sql and reduced parallel_schedule, but I believe it's a legal change
> which doesn't affect the test case essentially.
>
> With bufchurn.sql running concurrently, and TEMP_CONFIG containing
> shared_buffers = 1MB (as in the 027_stream_regress test in question), I can
> see exactly the same failure:
> $ echo 'shared_buffers = 1MB' >/tmp/extra.config; TEMP_CONFIG=/tmp/extra.config make -s check
> # +++ regress check in src/test/regress +++
> # initializing database system by copying initdb template
> # using temp instance on port 65312 with PID 129700
> ok 1 - test_setup 356 ms
> ok 2 - create_index 1002 ms
> # parallel group (2 tests): stats bufchurn
> not ok 3 + stats 21176 ms
> ...
>
> diff -U3 .../src/test/regress/expected/stats.out .../src/test/regress/results/stats.out
> --- .../src/test/regress/expected/stats.out 2025-01-06 08:03:07.720275980 +0200
> +++ .../src/test/regress/results/stats.out 2025-01-06 08:15:34.602170721 +0200
> @@ -1321,7 +1321,7 @@
>  SELECT :io_sum_shared_after_writes > :io_sum_shared_before_writes;
>   ?column?
>  ----------
> - t
> + f
>  (1 row)
> ...
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-17%2008%3A59%3A44

Hi Hackers,

I have also encountered a similar buildfarm failure [1].
| 1/1 + subscription 142 ms FAIL
1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1

diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
--- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out 2025-06-28 20:13:02 +0900
+++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out 2025-06-28 20:35:21 +0900
@@ -70,7 +70,7 @@
 SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
  ?column?
 ----------
- t
+ f
 (1 row)

The logs around this statement do not show anything unexpected:
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION regress_testsub CONNECTION 'testconn' PUBLICATION testpub;
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription ERROR: publication name "foo" used more than once
2025-06-28 20:35:20.857 JST client backend[6092] pg_regress/subscription STATEMENT: CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION foo, testpub, foo WITH (connect = false);
2025-06-28 20:35:20.858 JST client backend[6092] pg_regress/subscription WARNING: subscription was created, but is not connected
2025-06-28 20:35:20.858 JST client backend[6092] pg_regress/subscription HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
2025-06-28 20:35:20.861 JST client backend[6092] pg_regress/subscription ERROR: subscription "regress_testsub" already exists

I looked at the test in subscription.sql. The statements are:
-- Reset the stats again and check if the new reset_stats is updated.
SELECT stats_reset as prev_stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub' \gset
SELECT pg_stat_reset_subscription_stats(oid) FROM pg_subscription WHERE subname = 'regress_testsub';
SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';

I suspect the following causes:

1) The time difference between 'prev_stats_reset' and the current
'stats_reset' value is less than 1 microsecond. 'stats_reset' is of type
'timestamp with time zone', with contents like
'2025-06-30 21:01:07.925253+05:30'. So if the difference between
'prev_stats_reset' and the current 'stats_reset' is less than 1 microsecond,
the query "SELECT :'prev_stats_reset' < stats_reset FROM
pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might return
'false' instead of 'true'.
But I was not able to reproduce such a scenario after multiple attempts.
Even on high-end machines, the reset takes at least a few microseconds.
Also, there are multiple similar timestamp comparisons in 'stats.sql',
and I didn't find any other failure of this kind. So I feel this is not
the cause.

2) The pg_stat_reset_subscription_stats(oid) function did not reset the
stats. We have a shared hash 'pgStatLocal.shared_hash'; if the entry
reference (for the subscription) is not found while executing
'pg_stat_reset_subscription_stats(oid)', it may not be able to reset the
stats. Maybe somehow this shared hash is getting dropped.

Also, it could be failing for the same reason as the case Alexander has
reproduced elsewhere [2].

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-06-28%2011%3A02%3A30
[2]: https://www.postgresql.org/message-id/fe0391a8-dfa9-41c3-bf1c-7ea058e40f30%40gmail.com

Thanks and Regards,
Shlok Kyal
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Shlok,

03.07.2025 09:54, Shlok Kyal wrote:
> I have also encountered a similar buildfarm failure [1].
>
> | 1/1 + subscription 142 ms FAIL
> 1/1 postgresql:regress / regress/regress ERROR 284.85s exit status 1
>
> diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/regress/expected/subscription.out 2025-06-28 20:13:02 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/regress/regress/results/subscription.out 2025-06-28 20:35:21 +0900
> @@ -70,7 +70,7 @@
>  SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
>   ?column?
>  ----------
> - t
> + f
>  (1 row)

Thank you for your attention to this! Yes, I also noticed that failure,
but could not reproduce it yet.

> So according to me, I suspect the following causes
> 1) The time difference between 'prev_stats_reset' and current
> 'stats_reset' value is less than 1 microsecond. 'stats_reset' is of type
> 'timestamp with time zone' and its contents are like
> '2025-06-30 21:01:07.925253+05:30'. So if the time difference between
> 'prev_stats_reset' and current 'stats_reset' is less than 1 microsecond,
> the query "SELECT :'prev_stats_reset' < stats_reset FROM
> pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might
> return 'false' instead of 'true'.
> But I was not able to reproduce such a scenario after multiple attempts.
> Even on high-end machines, it takes at least a few microseconds. Also
> there are multiple similar timestamp comparisons in 'stats.sql', and I
> didn't find any other failure related to such a case. So I feel this is
> not possible.

Did you try that on Windows (hamerkop is a Windows animal)?
IIUC, GetCurrentTimestamp() -> gettimeofday() is implemented on Windows via
GetSystemTimePreciseAsFileTime(), and it has 100 ns resolution, according
to [1]. I tried it on a Windows VM and saw the following numbers:
SELECT :'prev_stats_reset' < stats_reset, :'prev_stats_reset', stats_reset
  FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
 ?column? |              ?column?               |             stats_reset
----------+-------------------------------------+-------------------------------------
 t        | Fri Jul 04 11:46:10.224323 2025 PDT | Fri Jul 04 11:46:10.224975 2025 PDT
(1 row)

that is, the difference is about 650 µs; only on a machine fast enough for
both reads to land within the same microsecond would the test fail.
Although the overall duration of the subscription test doesn't differ much:

My VM:
# parallel group (2 tests): subscription publication
ok 164 + publication 806 ms
ok 165 + subscription 210 ms

vs hamerkop:
# parallel group (2 tests): subscription publication
ok 164 + publication 2383 ms
not ok 165 + subscription 142 ms

So probably the failure is caused by something else.

> 2) pg_stat_reset_subscription_stats(oid) function did not reset the stats.
> We have a shared hash 'pgStatLocal.shared_hash'. If the entry reference
> (for the subscription) is not found while executing
> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset the
> stats. Maybe somehow this shared hash is getting dropped.
> Also, it could be failing due to the same reason as Alexander has

I don't think 2) is relevant here, because shared buffers shouldn't affect
subscription's statistics.

[1] https://stackoverflow.com/questions/4444090/measuring-time-with-a-resolution-of-microseconds-in-c

Best regards,
Alexander
RE: stats.sql might fail due to shared buffers also used by parallel tests
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,

>> So according to me, I suspect the following causes
>> 1) The time difference between 'prev_stats_reset' and current
>> 'stats_reset' value is less than 1 microsecond. 'stats_reset' is of
>> type 'timestamp with time zone' and its contents are like
>> '2025-06-30 21:01:07.925253+05:30'. So if the time difference between
>> 'prev_stats_reset' and current 'stats_reset' is less than 1
>> microsecond, the query "SELECT :'prev_stats_reset' < stats_reset FROM
>> pg_stat_subscription_stats WHERE subname = 'regress_testsub'" might
>> return 'false' instead of 'true'.
>> But I was not able to reproduce such a scenario after multiple
>> attempts. Even on high-end machines, it takes at least a few
>> microseconds. Also there are multiple similar timestamp comparisons in
>> 'stats.sql', and I didn't find any other failure related to such a
>> case. So I feel this is not possible.
> Did you try that on Windows (hamerkop is a Windows animal)? IIUC,
> GetCurrentTimestamp() -> gettimeofday() is implemented on Windows via
> GetSystemTimePreciseAsFileTime(), and it has 100 ns resolution,

Hmm. I'm not familiar with the Windows environment, but I have some doubts
about this. GetSystemTimePreciseAsFileTime() returns a FILETIME structure,
which represents UTC time in 100-nanosecond intervals [1]; the Stack
Overflow post seems to refer to that. However, the documentation for
GetSystemTimePreciseAsFileTime() says that its resolution is < 1 us [2].
Also, the MS doc [3] does not say that GetSystemTimePreciseAsFileTime()
returns monotonically increasing values; another API,
QueryPerformanceCounter(), seems to have that property. A somewhat older
document [4] also raised the possibility:

```
Consecutive calls may return the same result. The call time is less than
the smallest increment of the system time. The granularity is in the
sub-microsecond regime. The function may be used for time measurements but
some care has to be taken: Time differences may be ZERO.
```

Also, what if the system clock is modified during the test via NTP?

>> 2) pg_stat_reset_subscription_stats(oid) function did not reset the
>> stats. We have a shared hash 'pgStatLocal.shared_hash'. If the entry
>> reference (for the subscription) is not found while executing
>> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset
>> the stats. Maybe somehow this shared hash is getting dropped.
>> Also, it could be failing due to the same reason as Alexander has
> I don't think 2) is relevant here, because shared buffers shouldn't
> affect subscription's statistics.

To confirm: we do not consider the possibility that pgstat_get_entry_ref()
returns NULL, right?

[1]: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime
[2]: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemtimepreciseasfiletime
[3]: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps
[4]: http://www.windowstimestamp.com/description#:~:text=2.1.4.2.%C2%A0%C2%A0Desktop%20Applications%3A%20GetSystemTimePreciseAsFileTime()

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san,
Thank you for your attention to this!
15.07.2025 10:33, Hayato Kuroda (Fujitsu) wrote:
> GetSystemTimePreciseAsFileTime() returns a FILETIME structure, which
> represents UTC time in 100-nanosecond intervals [1]; the Stack Overflow
> post seems to refer to that. However, the documentation for
> GetSystemTimePreciseAsFileTime() says that its resolution is < 1 us [2].
> Also, the MS doc [3] does not say that GetSystemTimePreciseAsFileTime()
> returns monotonically increasing values; another API,
> QueryPerformanceCounter(), seems to have that property. A somewhat older
> document [4] also raised the possibility:
> ```
> Consecutive calls may return the same result. The call time is less than
> the smallest increment of the system time. The granularity is in the
> sub-microsecond regime. The function may be used for time measurements
> but some care has to be taken: Time differences may be ZERO.
> ```
> Also, what if the system clock is modified during the test via NTP?
Yeah, I made a simple test for GetSystemTimePreciseAsFileTime() and
confirmed that in my VM it provides sub-microsecond precision. Regarding
NTP, I think the second failure of this ilk [1] makes this cause close to
impossible. (Can't wait for the third one to gather more information.)
>>> 2) pg_stat_reset_subscription_stats(oid) function did not reset the
>>> stats. We have a shared hash 'pgStatLocal.shared_hash'. If the entry
>>> reference (for the subscription) is not found while executing
>>> 'pg_stat_reset_subscription_stats(oid)', it may not be able to reset
>>> the stats. Maybe somehow this shared hash is getting dropped.
>>> Also, it could be failing due to the same reason as Alexander has
>> I don't think 2) is relevant here, because shared buffers shouldn't
>> affect subscription's statistics.
> To confirm: we do not consider the possibility that
> pgstat_get_entry_ref() returns NULL, right?
I've held a simple experiment with a modification like this:
@@ -1078,6 +1078,7 @@ pgstat_reset_entry(PgStat_Kind kind, Oid dboid, uint64 objid, TimestampTz ts)
Assert(!pgstat_get_kind_info(kind)->fixed_amount);
entry_ref = pgstat_get_entry_ref(kind, dboid, objid, false, NULL);
+if (rand() % 3 == 0) entry_ref = NULL;
if (!entry_ref || entry_ref->shared_entry->dropped)
and got several failures like:
--- .../postgresql/src/test/regress/expected/subscription.out 2025-04-25 10:27:32.851554400 -0700
+++ .../postgresql/build/testrun/regress/regress/results/subscription.out 2025-07-20 00:05:05.667903300 -0700
@@ -56,7 +56,7 @@
SELECT subname, stats_reset IS NULL stats_reset_is_null FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
subname | stats_reset_is_null
-----------------+---------------------
- regress_testsub | f
+ regress_testsub | t
(1 row)
-- Reset the stats again and check if the new reset_stats is updated.
@@ -68,11 +68,9 @@
(1 row)
SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscription_stats WHERE subname = 'regress_testsub';
- ?column?
-----------
- t
-(1 row)
-
+ERROR: syntax error at or near ":"
+LINE 1: SELECT :'prev_stats_reset' < stats_reset FROM pg_stat_subscr...
+
--- .../postgresql/src/test/regress/expected/stats.out 2025-04-25 10:27:36.930322500 -0700
+++ .../postgresql/build/testrun/regress/regress/results/stats.out 2025-07-20 00:05:19.579864900 -0700
@@ -1720,7 +1720,7 @@
SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
?column?
----------
- t
+ f
(1 row)
...
Thus, if there is some issue with pgstat_get_entry_ref(), then it should
be specific to subscriptions and come out in that place only (given the
information we have now).
So I still suspect some peculiarity of Windows or of this particular animal.
Nagata-san, could you please share the configuration of hamerkop? If it's
running inside VM, what virtualization software is used?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-09%2011%3A02%3A23
Best regards.
Alexander
Hello Alexander,

> Nagata-san, could you please share the configuration of hamerkop? If it's
> running inside VM, what virtualization software is used?
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-09%2011%3A02%3A23

I am not the person in charge of it, so I added the responsible email
address to the CC list. Takatsuka-san, could you please take care of this?

Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san,

20.07.2025 11:00, Alexander Lakhin wrote:
> Yeah, I made a simple test for GetSystemTimePreciseAsFileTime() and
> confirmed that in my VM it provides sub-microsecond precision. Regarding
> NTP, I think the second failure of this ilk [1] makes this cause close to
> impossible. (Can't wait for the third one to gather more information.)

And here it is [1]:
diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out
--- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out 2025-07-22 20:08:30 +0900
+++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out 2025-07-22 20:30:47 +0900
@@ -3729,7 +3729,7 @@
 name |pg_stat_get_function_calls|total_above_zero|self_above_zero
 --------------+--------------------------+----------------+---------------
-test_stat_func| 1|t |t
+test_stat_func| 1|f |f
 (1 row)

Not related to subscriptions this time, but still related to pg_stat and
time measurement.

There was the same-looking failure on caiman, which is running on
continuously updated Fedora: [2]. I could not reproduce that one either,
but suspected the OS kernel: it had
6.14.0-0.rc3.20250219git6537cfb395f3.31.fc43.x86_64 at the moment of the
failure and updated to 6.14.0-0.rc4.36.fc43.x86_64 on 2025-02-27. There
have been no failures of this kind since then.

So I think we could observe such anomalies if, say, the OS kernel can't
read the system clock in time (stalls for a millisecond when accessing it)...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2025-07-22%2011%3A02%3A15
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=caiman&dt=2025-02-23%2006%3A54%3A54

Best regards,
Alexander
RE: stats.sql might fail due to shared buffers also used by parallel tests
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,

> And here it is [1]:
> diff --strip-trailing-cr -U3 c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out
> --- c:/build-farm-local/buildroot/HEAD/pgsql/src/test/isolation/expected/stats.out 2025-07-22 20:08:30 +0900
> +++ c:/build-farm-local/buildroot/HEAD/pgsql.build/testrun/isolation/isolation/results/stats.out 2025-07-22 20:30:47 +0900
> @@ -3729,7 +3729,7 @@
>  name |pg_stat_get_function_calls|total_above_zero|self_above_zero
>  --------------+--------------------------+----------------+---------------
> -test_stat_func| 1|t |t
> +test_stat_func| 1|f |f
>  (1 row)
>
> Not related to subscriptions this time, but still related to pg_stat and
> time measurement.

It looks to me like we measured the execution time of the function in
milliseconds, but it came out as zero, right?

> So I think we could observe such anomalies if, say, the OS kernel can't
> read the system clock in time (stalls for a millisecond when accessing it)...

I feel the same. But if so, how should we fix the tests? Must we remove
everything which assumes that time is monotonic?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: stats.sql might fail due to shared buffers also used by parallel tests
From: Alexander Lakhin
Hello Kuroda-san and Takatsuka-san,
24.07.2025 03:49, TAKATSUKA Haruka wrote:
> {snip}
>> Maybe you could try tools.syncTime = "0" by any chance?
> It has been tools.syncTime = "0" already. I confirmed the following GUI
> setting. ...

23.07.2025 09:15, Hayato Kuroda (Fujitsu) wrote:
> It looks to me like we measured the execution time of the function in
> milliseconds, but it came out as zero, right?

Yes, my understanding is the same.

>> So I think we could observe such anomalies if, say, the OS kernel can't
>> read the system clock in time (stalls for a millisecond when accessing
>> it)...
> I feel the same. But if so, how should we fix the tests? Must we remove
> everything which assumes that time is monotonic?
From what Takatsuka-san shared on hamerkop's configuration, I still
suspect there could be some platform specifics there. I've found another
interesting reading on the subject, which describes effects of CPU
pressure and mentions other low-level parameters, e. g.
monitor_control.virtual_rdtsc: [1].
Probably there could be some experiments performed there to measure the
maximum timer resolution (e. g. with a simple program attached).
I also observed a failure of pg_stat_statements on ARMv7 device in the past:
--- .../contrib/pg_stat_statements/expected/entry_timestamp.out 2024-04-11 07:20:32.563588101 +0300
+++ .../contrib/pg_stat_statements/results/entry_timestamp.out 2024-04-15 11:16:00.217396694 +0300
@@ -45,7 +45,7 @@
WHERE query LIKE '%STMTTS%';
total | minmax_plan_zero | minmax_exec_zero | minmax_stats_since_after_ref | stats_since_after_ref
-------+------------------+------------------+------------------------------+-----------------------
- 2 | 0 | 0 | 0 | 0
+ 2 | 0 | 1 | 0 | 0
(1 row)
with clocksource = 32k_counter, which gave me a maximum resolution of
0.030517 s.
So if we choose to fix the tests, it's not clear to me what the lowest
acceptable timer resolution should be.
[1] https://www.vmware.com/docs/vmware_timekeeping
Best regards,
Alexander