How to deal with analyze gathering irrelevant stats

Поиск
Список
Период
Сортировка
От Rémi Chatenay
Тема How to deal with analyze gathering irrelevant stats
Дата
Msg-id CAA==dd=gp5c4oQe8ffOu7iPUrgikA508robTnUV=AXx+o5EJfQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: How to deal with analyze gathering irrelevant stats  (Justin Pryzby <pryzby@telsasoft.com>)
Список pgsql-performance

Hi,

Thanks in advance for your help. I'm putting as much context and details as possible, but let me know if you have any questions.

What?

We are experiencing some slow queries due to the query planner using an incorrect index. It is using an unoptimized index because the stats are computed during the night when the data is not the same as during the day.

Context

We have a table conversations like that

|id|status|user_id|

and 2 indexes:

CREATE INDEX index_conversations_on_user_id_and_status ON public.conversations USING btree (user_id, status);

CREATE INDEX index_conversations_on_status ON public.conversations USING btree (status)

The slow query is the following: 

SELECT id FROM conversations WHERE status = 'in_progress' AND user_id = 123

We expect the query planner to use the index_conversations_on_user_id_and_status but it sometimes uses the other one.

What's happening ?

There are hundreds of conversations with a status 'in_progress' at a given time during the day but virtually none during the night.

So when the analyze is run during the night, PG then thinks that using the index_conversations_on_status will return almost no rows and so it uses this index instead of the combined one.

When the analyze is run during the day, PG correctly uses the right index (index_conversations_on_user_id_and_status)

[With an analyze run during the day]

Limit (cost=0.43..8.45 rows=1 width=8) (actual time=1.666..1.666 rows=0 loops=1)

-> Index Scan using index_conversations_on_user_id_and_status on conversations (cost=0.43..8.45 rows=1 width=8) (actual_time=1.665..1.665 rows:0 loops:1)

Index Cond: ((user_id = 123) AND ((status)::text = 'in_progress'::text))

Filter: (id <> 1)

Planning Time: 8.642 ms

Execution Time: 1.693 ms

[With an analyze run during the night]

Limit (cost=0.43..8.46 rows=1 width=8) (actual time=272.812..272.812 rows=0 loops=1)

-> Index Scan using index_conversations_on_status on conversations (cost=0.43..8.46 rows=1 width=8) (actual_time=272.812..272.812 rows:0 loops:1)

Index Cond: ((status)::text = 'in_progress'::text))

Filter: (id <> 1) AND (user_id = 123)

Rows Removed by Filter: 559

Planning Time: 0.133 ms

Execution Time: 272.886 ms

The question

We currently run a manual weekly vacuum analyze during the night. I'm wondering what are our possible solutions. One is to manually run the analyze during the day but is there a way to tell PG to run the auto analyze at a given time of the day for example ? I guess we are not the first ones to have data patterns that differ between when the analyze is run and the query is run.

Config

Postgres version: 11

Table Metadata

SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='conversations';

  relname   | relpages | reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size

-------------+----------+------------+---------------+---------+----------+----------------+------------+---------------

conversations |   930265 | 7.3366e+06 |        902732 | r       |       16 | f              |            |    7622991872

Maintenance Setup

We have manual vacuum analyze every week during the night.

GUC Settings

Unsure what's necessary...
    "autovacuum_analyze_threshold"      = "50"
"autovacuum_max_workers" = "3",
"autovacuum_naptime" = "60"
"autovacuum_vacuum_threshold" = "50"

Statistics: n_distinct, MCV, histogram

SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, inherited, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, correlation FROM pg_stats WHERE attname='status' AND tablename='conversations' ORDER BY 1 DESC;

frac_mcv |  tablename  | attname | inherited | null_frac | n_distinct | n_mcv | n_hist | correlation

----------+-------------+---------+-----------+-----------+------------+-------+--------+-------------

0.999967 | conversations | status  | f         |         0 |          6 |     5 |        |    0.967121

В списке pgsql-performance по дате отправления:

Предыдущее
От: Don Seiler
Дата:
Сообщение: Re: High COMMIT times
Следующее
От: Justin Pryzby
Дата:
Сообщение: Re: How to deal with analyze gathering irrelevant stats