Seq scan instead of index scan querying single row from primary key on large table
| От | James Coleman | 
|---|---|
| Тема | Seq scan instead of index scan querying single row from primary key on large table | 
| Дата | |
| Msg-id | CAAaqYe8uY_NoG-SmDzCyzkcggkDs5cN2stiJxvdZXVThZM+XQA@mail.gmail.com обсуждение исходный текст | 
| Ответы | Re: Seq scan instead of index scan querying single row from primary key on large table | 
| Список | pgsql-hackers | 
Hi all,
I've been optimizing queries for a long time, and I don't think I've
ever seen something more surprising to me than this -- sufficiently so
that I wanted to ask if others thought it implied a bug. It's possible
my mental model for the planner is broken in some significant way, or
that I'm overlooking something obvious, so please let me know if
that's the case too.
We have a query like this (all tables renamed to be generic):
SELECT DISTINCT "objects"."pk"
FROM "objects"
LEFT OUTER JOIN "objects" "allowed_objects_objects" ON
"allowed_objects_objects"."pk" = "objects"."allowed_object_fk" AND
"objects"."allowed_object_fk" IS NOT NULL
LEFT OUTER JOIN "facilitated_object_metadata" ON
"facilitated_object_metadata"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_audits" ON "object_audits"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "objects" "action_objects_objects" ON
"action_objects_objects"."allowed_object_fk" IS NOT NULL AND
"action_objects_objects"."allowed_object_fk" = "objects"."pk"
LEFT OUTER JOIN "objects" "return_objects_objects" ON
"return_objects_objects"."returned_object_fk" IS NOT NULL AND
"return_objects_objects"."returned_object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_taxs" ON "object_taxs"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_configs" ON "object_configs"."object_fk" =
"objects"."pk"
LEFT OUTER JOIN "object_statuses" ON "object_statuses"."object_fk" =
"objects"."pk"
LEFT OUTER JOIN "object_edits" ON "object_edits"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "audit_answers" ON "audit_answers"."requester_type" =
'Transaction' AND "audit_answers"."requester_fk" = "objects"."pk"
WHERE "objects"."pk" = 131690144
LIMIT 1;
This is ORM generated, so granted there's some slight oddity with the
"DISTINCT pk" and then the "LIMIT 1".
The plan generated by the planner changed suddenly one morning this
week, and in a very surprising way: the innermost scan (of "objects")
started choosing a seq scan, despite the cost from that node being
very high and an index scan being possible -- it's the primary key and
we're restricting on a single value, so intuitively we know, and
indeed the planner knows, that there will only be a single row
returned.
Here's the surprising plan:
Limit  (cost=3.42..4.00 rows=1 width=8)
  ->  Limit  (cost=3.42..4.00 rows=1 width=8)
        ->  Nested Loop Left Join  (cost=3.42..7939067.28 rows=13777920 width=8)
              ->  Nested Loop Left Join  (cost=2.86..7766839.95
rows=1059840 width=8)
                    ->  Nested Loop Left Join  (cost=2.44..7753589.13
rows=211968 width=8)
                          ->  Nested Loop Left Join
(cost=1.86..7750902.24 rows=138 width=8)
                                ->  Nested Loop Left Join
(cost=1.29..7750895.86 rows=1 width=8)
                                      ->  Nested Loop Left Join
(cost=1.29..7750895.85 rows=1 width=8)
                                            ->  Nested Loop Left Join
(cost=0.86..7750893.39 rows=1 width=8)
                                                  ->  Nested Loop Left
Join  (cost=0.43..7750890.93 rows=1 width=8)
                                                        ->  Seq Scan
on objects  (cost=0.00..7750888.47 rows=1 width=16)
                                                              Filter:
(pk = 131690144)
                                                        ->  Index Only
Scan using index_facilitated_object_metadata_on_object_fk on
facilitated_object_metadata  (cost=0.43..2.45 rows=1 width=8)
                                                              Index
Cond: (object_fk = 131690144)
                                                  ->  Index Only Scan
using index_objects_on_allowed_object_fk_not_null on objects
action_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                        Index Cond:
(allowed_object_fk = 131690144)
                                            ->  Index Only Scan using
index_objects_on_returned_object_fk_not_null on objects
return_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                  Index Cond:
(returned_object_fk = 131690144)
                                      ->  Seq Scan on object_taxs
(cost=0.00..0.00 rows=1 width=8)
                                            Filter: (object_fk = 131690144)
                                ->  Index Only Scan using
index_object_audits_on_object_fk on object_audits  (cost=0.57..5.00
rows=138 width=8)
                                      Index Cond: (object_fk = 131690144)
                          ->  Materialize  (cost=0.57..41.13 rows=1536 width=8)
                                ->  Index Only Scan using
index_object_configs_on_object_id on object_configs  (cost=0.57..33.45
rows=1536 width=8)
                                      Index Cond: (object_fk = 131690144)
                    ->  Materialize  (cost=0.42..2.84 rows=5 width=8)
                          ->  Index Only Scan using
index_adjustment_responses_on_response_fk on object_edits
(cost=0.42..2.81 rows=5 width=8)
                                Index Cond: (object_fk = 131690144)
              ->  Materialize  (cost=0.56..3.36 rows=13 width=8)
                    ->  Index Only Scan using
index_audit_answers_on_requester_type_and_fk on audit_answers
(cost=0.56..3.30 rows=13 width=8)
                          Index Cond: ((requester_type =
'Object'::bt_object_or_alternate_object) AND (requester_fk =
131690144))
Note the innermost table scan:
Seq Scan on objects  (cost=0.00..7750888.47 rows=1 width=16)
   Filter: (pk = 131690144)
If I set enable_seqscan = off, then I get the old plan:
Limit  (cost=4.12..4.13 rows=1 width=8)
   ->  Limit  (cost=4.12..4.13 rows=1 width=8)
         ->  Nested Loop Left Join  (cost=4.12..188183.54 rows=13777920 width=8)
               ->  Nested Loop Left Join  (cost=3.55..15956.20
rows=1059840 width=8)
                     ->  Nested Loop Left Join  (cost=3.13..2705.38
rows=211968 width=8)
                           ->  Nested Loop Left Join
(cost=2.55..18.49 rows=138 width=8)
                                 ->  Nested Loop Left Join
(cost=1.98..12.11 rows=1 width=8)
                                       ->  Nested Loop Left Join
(cost=1.86..9.96 rows=1 width=8)
                                             ->  Nested Loop Left Join
 (cost=1.43..7.50 rows=1 width=8)
                                                   ->  Nested Loop
Left Join  (cost=0.99..5.04 rows=1 width=8)
                                                         ->  Index
Scan using objects_pkey on objects  (cost=0.57..2.58 rows=1 width=16)
                                                               Index
Cond: (pk = 131690144)
                                                         ->  Index
Only Scan using index_facilitated_object_metadata_on_object_fk on
facilitated_object_metadata  (cost=0.43..2.45 rows=1 width=8)
                                                               Index
Cond: (object_fk = 131690144)
                                                   ->  Index Only Scan
using index_objects_on_allowed_object_fk_not_null on objects
action_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                         Index Cond:
(allowed_object_fk = 131690144)
                                             ->  Index Only Scan using
index_objects_on_returned_object_fk_not_null on objects
return_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                   Index Cond:
(returned_object_fk = 131690144)
                                       ->  Index Only Scan using
index_object_taxs_on_object_fk on object_taxs  (cost=0.12..2.14 rows=1
width=8)
                                             Index Cond: (object_fk = 131690144)
                                 ->  Index Only Scan using
index_object_audits_on_object_fk on object_audits  (cost=0.57..5.00
rows=138 width=8)
                                       Index Cond: (object_fk = 131690144)
                           ->  Materialize  (cost=0.57..41.13 rows=1536 width=8)
                                 ->  Index Only Scan using
index_object_configs_on_object_id on object_configs  (cost=0.57..33.45
rows=1536 width=8)
                                       Index Cond: (object_fk = 131690144)
                     ->  Materialize  (cost=0.42..2.84 rows=5 width=8)
                           ->  Index Only Scan using
index_adjustment_responses_on_response_fk on object_edits
(cost=0.42..2.81 rows=5 width=8)
                                 Index Cond: (object_fk = 131690144)
               ->  Materialize  (cost=0.56..3.36 rows=13 width=8)
                     ->  Index Only Scan using
index_audit_answers_on_requester_type_and_fk on audit_answers
(cost=0.56..3.30 rows=13 width=8)
                           Index Cond: ((requester_type =
'Object'::bt_object_or_alternate_object) AND (requester_fk =
131690144))
Notice the innermost table scan is the expect index scan:
 ->  Index Scan using objects_pkey on objects  (cost=0.57..2.58 rows=1 width=16)
    Index Cond: (pk = 131690144)
The join order stays the same between the two plans. The scan on
object_taxs changes from a seq scan to an index scan (the table is
empty so neither would be a problem) obviously because of disabled
sequence scans.
Analyzing object_audits during the incident did *not* change the plan,
but a later auto-analyze of that same table seemed to change the plan
back. That being said, we're now back at the bad plan (without setting
enable_seqscan = off). I don't understand why that analyze should
really affect anything at all given that join order stays the same
between good/bad plans and given that I'd expect the planner to
strongly prefer the index scan given its far lower cost for the number
of estimated tuples. Analyzing objects doesn't help. Removing the
DISTINCT does not fix the plan.
Removing the LIMIT 1 does fix the plan and removes the double LIMIT
node in the plan. The second LIMIT node is implicit from the DISTINCT,
and it doesn't seem to me that a LIMIT node should change the
cost/path chosen of an identical limit node inside it, and indeed the
cost seems to be the same at both.
The only explanation I can come up with is that the startup cost for
the seq scan is 0 while for the index scan is 0.57, and the LIMIT at
the top level is causing the planner to care about startup cost.
Assuming that's true I think the early return cost multiplication of
the LIMIT is being applied very naively on the seq scan node. Or
perhaps the issue is that the startup cost for a single tuple on a seq
scan like this shouldn't really have a startup cost of 0 -- that cost
is presumably for tuples being returned _without_ having applied the
filter. That seems slightly odd to me, because the cost of getting the
first row out of that node -- in my naive view thinking about it for
all of 5 seconds -- should be calculated based on applying the filter
(and thus the likelihood that that filter matches right away). If we
did that then this path would never win. But that 0.00 startup cost
for the seq scan with a filter shows up in PG14 and PG11 also, not
just PG16, so that's not something that's changed.
To recap: the estimation of rows is correct, the estimated high
(total) cost of the seq scan is correct, but the seq scan is chosen
over the index scan anyway for a plan that returns a single "random"
row based on the primary key.
Am I right to be surprised here?
James Coleman
		
	В списке pgsql-hackers по дате отправления: