Discussion: BUG #17693: Slow performance: Much slower queries on pg_stat_all_tables since 13.4


BUG #17693: Slow performance: Much slower queries on pg_stat_all_tables since 13.4

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17693
Logged by:          Alessandro Jeanteur
Email address:      aless.jeant@gmail.com
PostgreSQL version: 13.7
Operating system:   RDS
Description:

We did a point upgrade through RDS's managed service which went from 13.4 to
13.7. (13.9 is unavailable and we have yet to upgrade to 13.8 - however I
couldn't find any documentation or issue history reporting similar symptoms
in the past year of discussions here)

This created a weird bug which is causing a lot of strain on our systems,
because we currently run a lot of queries over the pg_stat_all_tables, in
order to choose candidate tables most appropriate to vacuum preemptively in
a production system (we have a lot of active tables, order of magnitude 100k
tables)

Before this, queries would take <<1 second, now they're always ~5 seconds,
regardless of what we are filtering on.
EXPLAIN ANALYZE reports an odd fixed cost of roughly 5 seconds (not reflected
in EXPLAIN's estimated plan costs) at the last layer of the query on this
view, the GroupAggregate.
Similarly, when running the same query through the equivalent functions
(pg_stat_get_mod_since_analyze, pg_stat_get_dead_tuples,
pg_stat_get_live_tuples), we always run into this obscure, huge fixed cost.
What is it, and how can we get rid of it, presumably returning to the
previous performance characteristics?

example without a filter, limit 1:
```
EXPLAIN ANALYZE SELECT n_mod_since_analyze, n_live_tup, n_dead_tup FROM pg_stat_all_tables LIMIT 1;

Limit  (cost=2.80..4.15 rows=1 width=24) (actual time=5912.121..5912.125 rows=1 loops=1)
  ->  Subquery Scan on pg_stat_all_tables  (cost=2.80..1515002.24 rows=1119834 width=24) (actual time=5912.120..5912.124 rows=1 loops=1)
        ->  GroupAggregate  (cost=2.80..1503803.90 rows=1119834 width=292) (actual time=5912.118..5912.121 rows=1 loops=1)
              Group Key: c.oid, n.nspname, c.relname
              ->  Incremental Sort  (cost=2.80..1475808.05 rows=1119834 width=132) (actual time=0.122..0.124 rows=3 loops=1)
                    Sort Key: c.oid, n.nspname, c.relname
                    Presorted Key: c.oid
                    Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 33kB  Peak Memory: 33kB
                    ->  Nested Loop Left Join  (cost=1.52..1425415.52 rows=1119834 width=132) (actual time=0.021..0.106 rows=34 loops=1)
                          ->  Merge Left Join  (cost=1.11..940324.82 rows=1119834 width=72) (actual time=0.016..0.044 rows=34 loops=1)
                                Merge Cond: (c.oid = i.indrelid)
                                ->  Index Scan using pg_class_oid_index on pg_class c  (cost=0.56..700419.45 rows=1119834 width=72) (actual time=0.009..0.025 rows=19 loops=1)
                                      Filter: (relkind = ANY ('{r,t,m}'::"char"[]))
                                      Rows Removed by Filter: 14
                                ->  Index Only Scan using pg_index_indrelid_index on pg_index i  (cost=0.55..227140.96 rows=1659053 width=4) (actual time=0.006..0.010 rows=34 loops=1)
                                      Heap Fetches: 0
                          ->  Index Scan using pg_namespace_oid_index on pg_namespace n  (cost=0.41..0.43 rows=1 width=68) (actual time=0.001..0.001 rows=1 loops=34)
                                Index Cond: (oid = c.relnamespace)
Planning Time: 1.024 ms
Execution Time: 5912.239 ms
```
(with OID unique key filter, the only difference is in the GroupAggregate
plan, where (cost=25.72..25.83 rows=4 width=292))
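For context, the kind of candidate-selection query described above can be sketched roughly as follows (the columns are real pg_stat_all_tables columns, but the threshold and ordering are illustrative, not our exact query):

```sql
-- Illustrative sketch only: rank preemptive-vacuum candidates by dead tuples.
SELECT schemaname, relname, n_dead_tup, n_live_tup, n_mod_since_analyze
FROM pg_stat_all_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 100;
```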

Thank you in advance,
Best,


Re: BUG #17693: Slow performance: Much slower queries on pg_stat_all_tables since 13.4

From:
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> We did a point upgrade through RDS's managed service which went from 13.4 to
> 13.7. (13.9 is unavailable and we have yet to upgrade to 13.8 - however I
> couldn't find any documentation or issue history reporting similar symptoms
> in the past year of discussions here)

> This created a weird bug which is causing a lot of strain on our systems,
> because we currently run a lot of queries over the pg_stat_all_tables, in
> order to choose candidate tables most appropriate to vacuum preemptively in
> a production system (we have a lot of active tables, order of magnitude 100k
> tables)

> Before this, queries would take <<1 second, now they're always ~5 seconds,
> regardless of what we are filtering on.

RDS is not Postgres; the underlying storage engine is completely
different, as a consequence of which the Postgres layer is pretty
heavily modified (or so we in the community assume, having never
seen any of their source code).  I think you need to take this up
with Amazon.

Just to check, I did create a database with 100K tables in community
Postgres 13.9, and I didn't see any odd behavior with selecting from
pg_stat_all_tables.

            regards, tom lane



Re: BUG #17693: Slow performance: Much slower queries on pg_stat_all_tables since 13.4

From:
Greg Stark
Date:
On Wed, 23 Nov 2022 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> RDS is not Postgres; the underlying storage engine is completely
> different, as a consequence of which the Postgres layer is pretty
> heavily modified (or so we in the community assume, having never
> seen any of their source code).  I think you need to take this up
> with Amazon.

That sounds more like Aurora than RDS. RDS is pretty stock. I think
for the most part the Amazon changes there are with security policies
so that you can do some super-user actions with limitations. I'm not
aware of significant storage changes.

That said, this 5s delay does seem pretty odd. The plan itself runs fairly
fast despite the large number of tables and indexes, and the planner produces
a "fast start" plan that the LIMIT can cut short effectively.

The delay only seems to kick in at the upper levels which makes it
seem like one or more of the pg_stat_get_*() functions is being
delayed. Perhaps RDS has some monitoring-related changes in this area.
They do have some changes to integrate metrics from postgres into
their monitoring stack.

It's also possible there's something more mundane blocking these
functions. Postgres 15 will have a more efficient mechanism for
communicating this data but Postgres 14 and prior use a file on disk
which some people find becomes a bottleneck. I wouldn't expect it to
manifest like this though.


> Just to check, I did create a database with 100K tables in community
> Postgres 13.9, and I didn't see any odd behavior with selecting from
> pg_stat_all_tables.

Note that he also has about 1.7M indexes... :)

-- 
greg



Re: BUG #17693: Slow performance: Much slower queries on pg_stat_all_tables since 13.4

From:
Andres Freund
Date:
Hi,

On 2022-11-23 14:13:56 -0500, Greg Stark wrote:
> That said, this 5s delay does seem pretty odd.

Could just be a network configuration issue. In < 15, getting the stats
collector to write out a stats file requires sending a ping message via UDP,
which obviously can get lost.

How many schema objects are in that database? I'd try to use pg_ls_dir() etc
to see how large the stats files are - but you might not be permitted to do so
in RDS.
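Something along these lines should show the file sizes, assuming the default
stats_temp_directory and that RDS permits these admin functions:

```sql
-- Sketch: list per-database stats files and their sizes (pre-v15 layout).
SELECT name, (pg_stat_file('pg_stat_tmp/' || name)).size AS bytes
FROM pg_ls_dir('pg_stat_tmp') AS name
WHERE name LIKE '%.stat';
```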

If the file is large, the reads for it could end up being a significant source
of overall IO and you're just seeing the effects of running into disk
throughput limits.

If you're querying pg_stat* very frequently, it might be worth using explicit
transactions, to avoid each query getting a new snapshot of the stats.
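That is, something like this (the SELECTs are placeholders for whatever
monitoring queries you run):

```sql
BEGIN;
-- The first stats access in a transaction takes a snapshot of the
-- statistics; subsequent reads in the same transaction reuse that
-- snapshot instead of re-reading the stats file each time.
SELECT relname, n_dead_tup FROM pg_stat_all_tables WHERE schemaname = 'public';
SELECT relname, n_live_tup FROM pg_stat_all_tables WHERE schemaname = 'public';
COMMIT;
```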


> > Just to check, I did create a database with 100K tables in community
> > Postgres 13.9, and I didn't see any odd behavior with selecting from
> > pg_stat_all_tables.
> 
> Note that he also has about 1.7M indexes... :)

I assume that's based on the IOS cost estimate on pg_index_indrelid_index?

Greetings,

Andres Freund