recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)

Поиск
Список
Период
Сортировка
От Alexander Lakhin
Тема recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
Дата
Msg-id 95ca84ca-39b4-f6aa-260f-da5f73d05a90@gmail.com
обсуждение исходный текст
Ответы Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
Список pgsql-hackers
Hello hackers,

This week dodo started failing on the 008_fsm_truncation test sporadically
due to pg_ctl timeout. For example, [1] (REL_14_STABLE):
### Starting node "standby"
# Running: pg_ctl -D 

/media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_008_fsm_truncation_standby_data/pgdata

-l 

/media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/log/008_fsm_truncation_standby.log

-o --cluster-name=standby start
waiting for server to 

start...........................................................................................................................

stopped waiting
pg_ctl: server did not start in time
# pg_ctl start failed; logfile:
2024-06-19 21:27:30.843 ACST [13244:1] LOG:  starting PostgreSQL 14.12 on armv7l-unknown-linux-gnueabihf, compiled by 
gcc (GCC) 15.0.0 20240617 (experimental), 32-bit
2024-06-19 21:27:31.768 ACST [13244:2] LOG:  listening on Unix socket 
"/media/pi/250gb/proj/bf2/v17/buildroot/tmp/vLgcHgvc7O/.s.PGSQL.50013"
2024-06-19 21:27:35.055 ACST [13246:1] LOG:  database system was interrupted; last known up at 2024-06-19 21:26:55
ACST

A successful run between two failures [2]:
2024-06-21 05:17:43.102 ACST [18033:1] LOG:  database system was interrupted; last known up at 2024-06-21 05:17:31
ACST
2024-06-21 05:18:06.718 ACST [18033:2] LOG:  entering standby mode
(That is, that start took around 20 seconds.)

We can also find longer "successful" duration at [3]:
008_fsm_truncation_standby.log:
2024-06-19 23:11:13.854 ACST [26202:1] LOG:  database system was interrupted; last known up at 2024-06-19 23:11:02
ACST
2024-06-19 23:12:07.115 ACST [26202:2] LOG:  entering standby mode
(That start took almost a minute.)

So it doesn't seem impossible for this operation to last for more than two
minutes.

The facts that SyncDataDirectory() is executed between these two messages
logged, 008_fsm_truncation is the only test which turns fsync on, and we
see no such failures in newer branches (because of a7f417107), make me
suspect that dodo is slow on fsync.

I managed to reproduce similar fsync degradation (and reached 40 seconds
duration of this start operation) on a slow armv7 device with a SD card,
which getting significantly slower after many test runs without executing
fstrim periodically.

So maybe fstrim could help dodo too...

On the other hand, backporting a7f417107 could fix the issue too, but I'm
afraid we'll still see other tests (027_stream_regress) failing like [4].
When similar failures were observed on Andres Freund's animals, Andres
came to conclusion that they were caused by fsync too ([5]).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-19%2010%3A15%3A08
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2018%3A30%3A53
[3] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-06-19%2012%3A30%3A51&stg=recovery-check
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-21%2018%3A31%3A11
[5] https://www.postgresql.org/message-id/20240321063953.oyfojyq3wbc77xxb%40awork3.anarazel.de

Best regards,
Alexander



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Langote
Дата:
Сообщение: Re: ON ERROR in json_query and the like
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: New standby_slot_names GUC in PG 17