Обсуждение: fairywren timeout failures on the pg_amcheck/005_opclass_damage test
Hello hackers, Please take a look at today's failure [1], that raises several questions at once: 117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT 3001.48s exit status 1 180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT 3001.43s exit status 1 Ok: 227 Expected Fail: 0 Fail: 0 Unexpected Pass: 0 Skipped: 15 Timeout: 2 But looking at the previous test run [2], marked 'OK', I can see almost the same: 115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT 3001.54s exit status 1 176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT 3001.26s exit status 1 Ok: 227 Expected Fail: 0 Fail: 0 Unexpected Pass: 0 Skipped: 15 Timeout: 2 So it's not clear to me, why is the latter test run considered failed unlike the former? As to the 005_opclass_damage failure itself, we can find successful test runs with duration close to 3000s: [3] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK 2894.75s 10 subtests passed [4] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK 2941.73s 10 subtests passed (The average duration across 35 successful runs in June was around 1300s; but in July only 5 test runs were successful.) The other question is: why is 005_opclass_damage taking so much time there? Looking at the last three runs in REL_17_STABLE, I see: 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK 22.80s 10 subtests passed 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK 19.60s 10 subtests passed 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK 6.09s 10 subtests passed I suppose, the most significant factor here is 'HEAD' => [ 'debug_parallel_query = regress' ] in the buildfarm client's config. Indeed, it affects timing of the test very much, judging by what I'm seeing locally, on Linux (I guess, process-related overhead might be higher on Windows): $ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" [11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.07 csys = 0.17 CPU) $ echo "debug_parallel_query = regress" >/tmp/extra.config $ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" [11:12:46] t/005_opclass_damage.pl .. ok 40854 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.10 csys = 0.20 CPU) (Thus we can see 30x duration increase.) It's worth to note that such increase is rather new (introduced by 5ae208720); in REL_16_STABLE I'm seeing: $ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" [11:18:52] t/005_opclass_damage.pl .. ok 1453 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.11 csys = 0.93 CPU) $ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" [11:19:18] t/005_opclass_damage.pl .. ok 8032 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.11 csys = 0.93 CPU) So maybe at least this test should be improved for testing with debug_parallel_query enabled, if such active use of parallel workers by pg_amcheck can't be an issue to end users? [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-25%2003%3A23%3A19 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-23%2023%3A01%3A55 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-05-02%2019%3A03%3A08 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-04-30%2014%3A03%3A06 Best regards, Alexander
On 2024-07-25 Th 8:00 AM, Alexander Lakhin wrote: > Hello hackers, > > Please take a look at today's failure [1], that raises several questions > at once: > 117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade > TIMEOUT 3001.48s exit status 1 > 180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage > TIMEOUT 3001.43s exit status 1 > > Ok: 227 > Expected Fail: 0 > Fail: 0 > Unexpected Pass: 0 > Skipped: 15 > Timeout: 2 > > But looking at the previous test run [2], marked 'OK', I can see almost > the same: > 115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade > TIMEOUT 3001.54s exit status 1 > 176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage > TIMEOUT 3001.26s exit status 1 > > Ok: 227 > Expected Fail: 0 > Fail: 0 > Unexpected Pass: 0 > Skipped: 15 > Timeout: 2 > > So it's not clear to me, why is the latter test run considered failed > unlike the former? Yesterday I changed the way we detect failure in the buildfarm client, and pushed that to fairywren (and a few more of my animals). Previously it did not consider a timeout to be a failure, but now it does. I'm going to release that soon. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
25.07.2024 15:00, Alexander Lakhin wrote: > > The other question is: why is 005_opclass_damage taking so much time there? > ... > $ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" > [11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.07 csys = 0.17 CPU) > > $ echo "debug_parallel_query = regress" >/tmp/extra.config > $ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" > [11:12:46] t/005_opclass_damage.pl .. ok 40854 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.10 csys = 0.20 CPU) > > ... > So maybe at least this test should be improved for testing with > debug_parallel_query enabled, if such active use of parallel workers by > pg_amcheck can't be an issue to end users? > When running this test with "log_min_messages = DEBUG2" in my extra.config, I see thousands of the following messages in the test log: 2024-07-26 09:32:54.544 UTC [2572189:46] DEBUG: postmaster received pmsignal signal 2024-07-26 09:32:54.544 UTC [2572189:47] DEBUG: registering background worker "parallel worker for PID 2572197" 2024-07-26 09:32:54.544 UTC [2572189:48] DEBUG: starting background worker process "parallel worker for PID 2572197" 2024-07-26 09:32:54.547 UTC [2572189:49] DEBUG: unregistering background worker "parallel worker for PID 2572197" 2024-07-26 09:32:54.547 UTC [2572189:50] DEBUG: background worker "parallel worker" (PID 2572205) exited with exit code0 2024-07-26 09:32:54.547 UTC [2572189:51] DEBUG: postmaster received pmsignal signal 2024-07-26 09:32:54.547 UTC [2572189:52] DEBUG: registering background worker "parallel worker for PID 2572197" 2024-07-26 09:32:54.547 UTC [2572189:53] DEBUG: starting background worker process "parallel worker for PID 2572197" 2024-07-26 09:32:54.549 UTC [2572189:54] DEBUG: unregistering background worker "parallel worker for PID 2572197" 2024-07-26 09:32:54.549 UTC [2572189:55] DEBUG: background worker "parallel worker" (PID 2572206) exited with exit code0 ... grep ' registering background worker' \ src/bin/pg_amcheck/tmp_check/log/005_opclass_damage_test.log | wc -l 15669 So this test launches more than 15000 processes when debug_parallel_query is enabled. As far as I can see, this is happening because of the "PARALLEL UNSAFE" marking is ignored when the function is called by CREATE INDEX/amcheck. Namely, with a function defined as: CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$ SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$; SELECT int4_asc_cmp(1, 2); executed without parallel workers. Whilst when it's used by an index: CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS ... OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4); INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs); CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops); launches 1000 parallel workers. (This is reminiscent of bug #18314.) One way to workaround this is to disable debug_parallel_query in the test and another I find possible is to set max_parallel_workers = 0. Best regards, Alexander
On 2024-07-26 Fr 7:00 AM, Alexander Lakhin wrote: > 25.07.2024 15:00, Alexander Lakhin wrote: > >> >> The other question is: why is 005_opclass_damage taking so much time >> there? >> ... >> $ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" >> PROVE_FLAGS="--timer" >> [11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 >> sys + 0.10 cusr 0.07 csys = 0.17 CPU) >> >> $ echo "debug_parallel_query = regress" >/tmp/extra.config >> $ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ >> PROVE_TESTS="t/005*" PROVE_FLAGS="--timer" >> [11:12:46] t/005_opclass_damage.pl .. ok 40854 ms ( 0.00 usr 0.00 >> sys + 0.10 cusr 0.10 csys = 0.20 CPU) >> >> ... >> So maybe at least this test should be improved for testing with >> debug_parallel_query enabled, if such active use of parallel workers by >> pg_amcheck can't be an issue to end users? >> > > When running this test with "log_min_messages = DEBUG2" in my > extra.config, > I see thousands of the following messages in the test log: > 2024-07-26 09:32:54.544 UTC [2572189:46] DEBUG: postmaster received > pmsignal signal > 2024-07-26 09:32:54.544 UTC [2572189:47] DEBUG: registering > background worker "parallel worker for PID 2572197" > 2024-07-26 09:32:54.544 UTC [2572189:48] DEBUG: starting background > worker process "parallel worker for PID 2572197" > 2024-07-26 09:32:54.547 UTC [2572189:49] DEBUG: unregistering > background worker "parallel worker for PID 2572197" > 2024-07-26 09:32:54.547 UTC [2572189:50] DEBUG: background worker > "parallel worker" (PID 2572205) exited with exit code 0 > 2024-07-26 09:32:54.547 UTC [2572189:51] DEBUG: postmaster received > pmsignal signal > 2024-07-26 09:32:54.547 UTC [2572189:52] DEBUG: registering > background worker "parallel worker for PID 2572197" > 2024-07-26 09:32:54.547 UTC [2572189:53] DEBUG: starting background > worker process "parallel worker for PID 2572197" > 2024-07-26 09:32:54.549 UTC [2572189:54] DEBUG: unregistering > background worker "parallel worker for PID 2572197" > 2024-07-26 09:32:54.549 UTC [2572189:55] DEBUG: background worker > "parallel worker" (PID 2572206) exited with exit code 0 > ... > > grep ' registering background worker' \ > src/bin/pg_amcheck/tmp_check/log/005_opclass_damage_test.log | wc -l > 15669 > > So this test launches more than 15000 processes when debug_parallel_query > is enabled. > > As far as I can see, this is happening because of the "PARALLEL UNSAFE" > marking is ignored when the function is called by CREATE INDEX/amcheck. > > Namely, with a function defined as: > CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE > sql AS $$ > SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 > END; $$; > > SELECT int4_asc_cmp(1, 2); > executed without parallel workers. Whilst when it's used by an index: > CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS > ... > OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4); > > INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs); > > CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops); > launches 1000 parallel workers. > > (This is reminiscent of bug #18314.) > > One way to workaround this is to disable debug_parallel_query in the test > and another I find possible is to set max_parallel_workers = 0. > > But wouldn't either of those just be masking the problem? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
26.07.2024 15:41, Andrew Dunstan wrote: > >> >> One way to workaround this is to disable debug_parallel_query in the test >> and another I find possible is to set max_parallel_workers = 0. >> >> > > But wouldn't either of those just be masking the problem? > Yes, I'm inclined to consider this behavior a problem (what if the table contained 1M rows?), that's why I called those solutions workarounds. Of course, there are parallel_setup_cost and parallel_tuple_cost parameters, which can prevent this from happening in the wild, but still... Best regards. Alexander