Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
От | Shayon Mukherjee |
---|---|
Тема | Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch |
Дата | |
Msg-id | CANqtF-q7x-aw84OtiTZ-VkLuafmutzgr-Lk3LeeiKs+zFm2Mng@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch (Sami Imseih <samimseih@gmail.com>) |
Список | pgsql-hackers |
Thank you for engaging into the conversation and sharing your thoughts Sami and Robert.
On Thu, May 8, 2025 at 2:57 PM Sami Imseih <samimseih@gmail.com> wrote:
> actually determine you are in this situation, or how bad the situation
> was, in order to know that setting truncate off would help? To that
That's definitely a sound idea and the way I was able to replicate and learn myself was by doing a PATCH that would just print the # of attempts it was doing. That said, just to add more color - I was coming at this from a point of view where truncation would be on a "best effort" basis and the trade off here would be less availability risk at the cost of missed truncations. My first instinct was that, having vacuum_truncate set to ON by default on busy systems that have a lot of churn can catch a lot of developers and PostgreSQL by surprise (sure did to me :D), so the best effort and "fail fast" behavior could help, however at the same time I agree that it doesn't solve either of the problems really well. Another idea I had around this section was to expose an attribute/guc/setting like VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES with a default of `1` or something, that users could optionally configure, perhaps that could be a middleground, but I am myself not fully sold on the idea either (?).
Adding counters for this area is not a bad idea in general, as this hits
customers particularly hard on hot standbys when the truncate does actually
occur on the primary.
What about adding cumulative counters ( per table and pg_stat_database
) such as:
pages_vac_truncated - # of pages truncated by vacuum
vac_truncate_conflicts - # of time truncate was skipped due to conflict
vac_truncate_suspended - # of times the truncate was suspended.
The difference between conflict and suspended is conflict causes vacuum
to skip the truncate phase while suspended causes vacuum to retry the
phase continually.
Would vac_truncate_retries or something similar be more immediately clear? If so, maybe something like
- vac_truncate_interruptions - # of time truncate was skipped due to conflict and not retried
- vac_truncate_retries - # of times the truncate was suspended and retried
The argument against adding these counters is that the views,
pg_stat_all_tables/
pg_stat_all_database are becoming super-wide, so maybe we need to think
about inventing a new view for vacuum related counter metrics. This seems like
a good discussion for v19.
+1 for this. I'd even be happy to even just start showing the attempts as part of INFO or VERBOSE on conflicts & suspensions, without keeping track, but it is useful statistical data.
Thanks
Shayon
В списке pgsql-hackers по дате отправления: