Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)

Поиск
Список
Период
Сортировка
От Danny Shemesh
Тема Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
Дата
Msg-id CAFZC=QozuGJ3HZVh-vBJJvQbZBOmOJ6Cb-ZKCk3E98Enpu=SYg@mail.gmail.com
обсуждение исходный текст
Ответы Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)  (Bruce Momjian <bruce@momjian.us>)
Список pgsql-general
Hey everyone,

I'm working on improving gathered statistics on several large tables (2TB / 500M records);
I've created extended stats on correlated columns, ran analyze, compared the pre and post explains, and it seems as though the extended statistics are ignored - 
the estimation doesn't change much, and I can accurately derive the est. rows from multiplying the single-column freqs from pg_stats with the supplied values.


Context:
- pg 12.8
- relation w/ three columns of interest - user (text), row_type (text), deleted_at (timestamp).
- different users have different distributions for row_type
- deleted_at is set if said record is marked for deletion - stays marked for a long retention period, around 95% of the rows have a value, and around 5% have null
- we query and operate on the 5% of records with deleted_at = null, most of our indexes are partial on that as well

I'll give a simplified example with two users, one being extremely dominant w/ 99.9% of the data; the next holds much less, but still accounts to around 200k rows.

From pg_stats:
attname                | user
n_distinct             | 2.0
most_common_vals       | {A,B}
most_common_freqs      | [0.9996333, 0.00036666667]

attname                | row_type
n_distinct             | 4.0
most_common_vals       | {A,B}
most_common_freqs      | [0.9968, 0.0025333334]

attname                | deleted_at
n_distinct             | 20761.0
null_frac              | 0.043133333
<values / bounds are timestamps, omitted for brevity>

Querying before extended stats:
$> explain select 1 from my_rel where user ='B and row_type = 'A' and deleted_at is null;
Index Only Scan using idx_user_row_type_where_deleted_at_is_null on my_rel  (cost=0.69..2851.78 rows=8213 width=4)
Index Cond: ((user = 'B') AND (row_type = 'B'))


The number is derived from: reltuples*user_b_freq*row_type_a_freq*deleted_at_null_frac = 520982816*0.00036666667*0.9968*0.043133333 = 8213.26586

When explain-analyzing, I get - (actual time=0.051..72.503 rows=174954 loops=1)
So in this specific case, the estimation is off by around 20x (note that this is a simplified case just to showcase the symptom).

I then create extended stats - I've tried to add them on all three columns in all combinations, and in pairs in all combinations - all leading to the same result,
I'll only showcase the three column variant for brevity:

$> create statistics s1 on user, row_type, deleted_at from my_rel;
$> analyze my_rel;
$> explain select 1 from my_rel where user ='B' and row_type = 'A' and deleted_at is null;
Index Only Scan using idx_user_row_type_where_deleted_at_is_null on my_rel  (cost=0.69..1560.01 rows=4491 width=4)

After analyzing, pg_stats contain different values, as the large table is sampled - I have user_b_freq = 0.0002, row_type_a_freq = 0.99686664, deleted_at_null_frac = 0.043233335,
thus the new calculation is: 520982816*0.0002*0.99686664*0.043233335 = 4490.64987.

Now it's off by around 40x - and it seems to still only consider the single column distributions.

Is there anything I'm missing ? I thought that maybe in my case, due to the high null fractions of deleted_at, the extended stats aren't used, but couldn't find an obvious hint from the code that would suggest that.


Appreciate your time !
Danny

В списке pgsql-general по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: function currtid2() in SQL and ESQL/C to get the new CTID of a row
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)