Re: row estimate for partial index

Поиск

Список

Период

Сортировка

От	Harmen
Тема	Re: row estimate for partial index
Дата	16 января 2023 г. 11:55:23
Msg-id	20230116085523.GK91568@arp.lijzij.de обсуждение исходный текст
Ответ на	Re: row estimate for partial index (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: row estimate for partial index
Список	pgsql-general

Дерево обсуждения

On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> Harmen <harmen@lijzij.de> writes:
> > Works well enough. However, we now have an org_id which has > 10% of the rows,
> > but only a handful rows where "deleted is null" matches (so the org has a lot
> > of "deleted" contacts). The planner doesn't like this and it falls back to a
> > full table scan for the above query.
> 
> > I've added a dedicated index just for that org_id, to see if that helps:
> >     "org123" btree (id) WHERE deleted IS NULL AND org_id = 123
> > The planner seems to use it now, however the row estimate is way off:
> 
> Yeah, so that indicates that it isn't producing a good selectivity
> estimate for the combination of those two conditions: it will assume
> the org_id and deleted columns are independent, which per your statements
> they are not.
> 
> If you are running a reasonably recent PG version you should be able to
> fix that by setting up "extended statistics" on that pair of columns:
> 
> https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED
> 
> (I might be wrong, but I think that will help even when one of
> the troublesome conditions is a null-check.  If it doesn't, then
> we have something to improve there ...)

Thanks for your explanation, Tom.
I've setup a local test scenario, where I then add a "dependencies" stat, but
that doesn't give a better plan, unfortunately.


This is my test table (I use a boolean field for "deleted" to keep this test
case as simple as possible. In my real case this is a "timestamptz null"
field):

  DROP table if exists contactsbool;
  CREATE table contactsbool (id int not null, org_id int not null, deleted boolean not null, firstname text);
  CREATE index contactsbool_orgs on contactsbool (org_id, id) where not deleted;


Testdata has a very low number of "orgs", and one org has almost only deleted
contacts:

  WITH ids as (select * from generate_series(0, 10000000)) insert into contactsbool select ids.generate_series,
mod(ids.generate_series,7), false, 'hello world' from ids;
 
  UPDATE contactsbool set deleted = true where id > 100 and org_id = 5;
  ANALYZE contactsbool;


Now the new stats:

  CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
  CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;
  ANALYZE contactsbool;


harmen=> explain (analyze) select id, firstname from contactsbool where org_id = 5 and not deleted order by id;
                                                                QUERY PLAN
                 
 

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather Merge  (cost=181983.91..299104.42 rows=1003820 width=16) (actual time=448.244..454.770 rows=14 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=180983.88..182238.66 rows=501910 width=16) (actual time=413.761..413.762 rows=5 loops=3)
         Sort Key: id
         Sort Method: quicksort  Memory: 25kB
         Worker 0:  Sort Method: quicksort  Memory: 25kB
         Worker 1:  Sort Method: quicksort  Memory: 25kB
         ->  Parallel Seq Scan on contactsbool  (cost=0.00..124881.86 rows=501910 width=16) (actual
time=267.318..413.673rows=5 loops=3)
 
               Filter: ((NOT deleted) AND (org_id = 5))
               Rows Removed by Filter: 3333329
 Planning Time: 0.565 ms
 JIT:
   Functions: 12
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 2.444 ms, Inlining 0.000 ms, Optimization 1.163 ms,
Emission 13.288 ms, Total 16.895 ms
 Execution Time: 456.498 ms
(17 rows)


The "rows=501910" is what I don't expect. I expect/want/hope the plan to use
the contactsbool_orgs index.
(If I really (hard) delete the "deleted" contacts everything works perfectly
for all orgs.)

Any ideas?
Thanks again,
Harmen

В списке pgsql-general по дате отправления:

Предыдущее

От: HECTOR INGERTO
Дата: 16 января 2023 г., 11:41:19
Сообщение: RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?

Следующее

От: Fred Habash
Дата: 16 января 2023 г., 15:48:24
Сообщение: Why is a Read-only Table Gets Autovacuumed "to prevent wraparound"

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: row estimate for partial index

Предыдущее

Следующее