Debugging sudden increase in planning CPU time (possibly due to get_actual_variable_range)
From: Core Studios Inc.
Subject: Debugging sudden increase in planning CPU time (possibly due to get_actual_variable_range)
Date:
Msg-id: b0f47540-2915-4537-90a4-4896183bb027@gmail.com
List: pgsql-hackers
Hello hackers,
We have a hot table (1.4k QPS) with 150M records, 55 GB in size. Recently, we noticed a sudden spike in CPU utilization (from 20% to 80%) right after an ANALYZE operation finished on that table:
automatic analyze of table "cars" system usage: CPU: user: 1.50 s, system: 0.55 s, elapsed: 25.47 s
The CPU spike correlated perfectly with an increase in `index_rows_read` and `index_rows_fetched` (as reported by Datadog) for a b-tree index on a timestamp column, and we narrowed it down to a specific query. However, no matter how many different ways we tried, we could not find any query plan that actually used that index. The query is the following:
explain (buffers, analyze)
SELECT "cars".*
FROM "cars"
WHERE ("cars"."user_id" = 'xxx')
  AND ("cars"."brand_id" IN ('xxx'))
  AND (replacement IS NULL)
  AND ("cars"."status" = 'active')
  AND ("cars"."sold_at" > NOW())
  AND ("cars"."bought_at" > NOW())
  AND (timeout = 0
       OR EXTRACT(EPOCH FROM (NOW() - GREATEST(updated_at, sold_at))) <= timeout);
The index is a b-tree index on `bought_at` (timestamp without time zone), named `index_cars_on_bought_at`. The query plan looks like this:
--------------------------------------------------------------------------------------------------------------
Index Scan using index_cars_on_brand_id_include_sold_at_bought_at on cars (cost=0.70..7.58 rows=1 width=508) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: ((brand_id)::text = 'xxx'::text)
Filter: ((replacement IS NULL) AND ((user_id)::text = 'xxx'::text) AND ((status)::text = 'active'::text) AND (sold_at > now()) AND (bought_at > now()) AND ((timeout = 0) OR (date_part('epoch'::text, (now() - (GREATEST(updated_at, sold_at))::timestamp with time zone)) <= (timeout)::double precision)))
Rows Removed by Filter: 4
Buffers: shared hit=10
Planning:
Buffers: shared hit=6311 dirtied=4
Planning Time: 9.989 ms
Execution Time: 0.073 ms
(9 rows)
The `Planning: Buffers: shared hit=6311` and `Planning Time: 9.989 ms` stood out to me. If we remove the `bought_at > NOW()` predicate, the plan looks like this:
Index Scan using index_cars_on_brand_id_include_sold_at_bought_at on cars (cost=0.70..7.58 rows=1 width=508) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: ((brand_id)::text = 'xxx'::text)
Filter: ((replacement IS NULL) AND ((user_id)::text = 'xxx'::text) AND ((status)::text = 'active'::text) AND (sold_at > now()) AND ((timeout = 0) OR (date_part('epoch'::text, (now() - (GREATEST(updated_at, sold_at))::timestamp with time zone)) <= (timeout)::double precision)))
Rows Removed by Filter: 4
Buffers: shared hit=10
Planning Time: 0.107 ms
Execution Time: 0.055 ms
(7 rows)
This makes me hypothesize that the increased planning time is spent by the planner scanning through `index_cars_on_bought_at`. Eventually I ran into a similar report[1] which mentions get_actual_variable_range() as the offender. Indeed, after I performed a VACUUM on the table, planning time and CPU utilization reverted to normal levels. Unfortunately, since this is a managed database, we don't have access to attach to the machine and grab a CPU profile.
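As far as I understand, get_actual_variable_range() effectively walks the index from one end until it hits the first visible tuple, so a probe like the one below (illustrative only, not one of the plans above) approximates the work the planner has to do; with a pile of dead tuples at the low end of `index_cars_on_bought_at`, it ends up touching a lot of buffers:
-- Illustrative probe: roughly the index walk the planner performs when it needs
-- the actual minimum of bought_at -- an ordered scan from the low end of
-- index_cars_on_bought_at until the first visible tuple is found.
EXPLAIN (BUFFERS, ANALYZE)
SELECT bought_at FROM cars ORDER BY bought_at ASC LIMIT 1;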
Periodically (every 30 min) we delete, in batches of 50k, all rows for which `bought_at > NOW()`. This means we mostly delete from the beginning of the index, which (if I understand correctly) would cause a problem when the planner tries to compute the `min` of the `bought_at` column, which it does if `NOW()` happens to fall at the very first point of the histogram. Is my understanding correct?
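Roughly, each cleanup batch looks like this (a simplified sketch; `id` stands in for our primary key, and the exact batching mechanics are omitted):
-- Simplified sketch of one cleanup batch; "id" stands in for the primary key,
-- and the predicate matches the pattern described above.
DELETE FROM cars
WHERE id IN (
    SELECT id
    FROM cars
    WHERE bought_at > NOW()
    LIMIT 50000
);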
I know this pattern creates dead tuples that might accumulate and cause slowdowns due to table bloat. However, given that we have been doing this for the past 4 years and never encountered a similar problem, I'm also looking to understand what might have changed _now_ for this to be triggered (assuming this is actually the issue). Table size and query volume did not change significantly (apart from the usual gradual growth of the past years).
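In case it's relevant, dead-tuple accumulation and the vacuum/analyze timestamps for the table can be checked with something like the following (a sketch using standard pg_stat_user_tables columns, not actual output from our system):
-- Sketch: dead-tuple counts and last (auto)vacuum/analyze times for the table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'cars';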
Thanks in advance!
P.S. The instance has 96 vCPUs, runs on GCP Cloud SQL and is on Postgres v13.20.