Обсуждение: Plan uses wrong index, preferring to scan pkey index instead

От:
Yuri Kunde Schlesner
Дата:

Hi all,
 
Excuse me if I made any mistakes, as this is my first time posting to a mailing list.
 
I'm a user of Quassel, a IRC client that uses postgres a backing store for IRC logs and am running into heavy intermittent performance problems. I've tracked it down to a query that takes a very long time (around 4 minutes) to complete when its data isn't cached.
 
This is the layout of the table being queried and EXPLAIN ANALYZE result for the problematic query:
 
quassel=> \d backlog
                                        Table "public.backlog"
  Column   |            Type             |                          Modifiers
-----------+-----------------------------+-------------------------------------------------------------
 messageid | integer                     | not null default nextval('backlog_messageid_seq'::regclass)
 time      | timestamp without time zone | not null
 bufferid  | integer                     | not null
 type      | integer                     | not null
 flags     | integer                     | not null
 senderid  | integer                     | not null
 message   | text                        |
Indexes:
    "backlog_pkey" PRIMARY KEY, btree (messageid)
    "backlog_bufferid_idx" btree (bufferid, messageid DESC)
Foreign-key constraints:
    "backlog_bufferid_fkey" FOREIGN KEY (bufferid) REFERENCES buffer(bufferid) ON DELETE CASCADE
    "backlog_senderid_fkey" FOREIGN KEY (senderid) REFERENCES sender(senderid) ON DELETE SET NULL
 
quassel=> explain (analyze, buffers) SELECT messageid, time,  type, flags, sender, message
FROM backlog
LEFT JOIN sender ON backlog.senderid = sender.senderid
WHERE bufferid = 39
ORDER BY messageid DESC LIMIT 10;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..37.78 rows=10 width=102) (actual time=154410.353..154410.424 rows=10 loops=1)
   Buffers: shared hit=13952 read=19244
   ->  Nested Loop Left Join  (cost=0.72..145800.61 rows=39345 width=102) (actual time=154410.350..154410.414 rows=10 loops=1)
         Buffers: shared hit=13952 read=19244
         ->  Index Scan Backward using backlog_pkey on backlog  (cost=0.43..63830.21 rows=39345 width=62) (actual time=154410.327..154410.341 rows=10 loops=1)
               Filter: (bufferid = 39)
               Rows Removed by Filter: 1248320
               Buffers: shared hit=13921 read=19244
         ->  Index Scan using sender_pkey on sender  (cost=0.29..2.07 rows=1 width=48) (actual time=0.005..0.005 rows=1 loops=10)
               Index Cond: (backlog.senderid = senderid)
               Buffers: shared hit=31
 Total runtime: 154410.477 ms
(12 rows)
 
This plan is consistently chosen, even after ANALYZEing and REINDEXing the table. It looks like Postgres is opting to do a sequential scan of the backlog_pkey index, filtering rows by bufferid, instead of directly using the backlog_bufferid_idx index that directly maps to the operation being made by the query. I was advised on IRC to try dropping the backlog_pkey index to force Postgres to use the correct one, and that uses a better plan:
 
quassel=> begin;
BEGIN
quassel=> alter table backlog drop constraint backlog_pkey;
ALTER TABLE
quassel=> explain analyze SELECT messageid, time,  type, flags, sender, message
FROM backlog
LEFT JOIN sender ON backlog.senderid = sender.senderid
WHERE bufferid = 39
ORDER BY messageid DESC LIMIT 10;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..40.50 rows=10 width=102) (actual time=63.826..162.134 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.72..156518.91 rows=39345 width=102) (actual time=63.823..162.126 rows=10 loops=1)
         ->  Index Scan using backlog_bufferid_idx on backlog  (cost=0.43..74548.51 rows=39345 width=62) (actual time=63.798..63.814 rows=10 loops=1)
               Index Cond: (bufferid = 39)
         ->  Index Scan using sender_pkey on sender  (cost=0.29..2.07 rows=1 width=48) (actual time=8.532..9.825 rows=1 loops=10)
               Index Cond: (backlog.senderid = senderid)
 Total runtime: 162.377 ms
(7 rows)
 
quassel=> rollback;
ROLLBACK
 
(This query was also run with empty caches.) bufferid=39 in particular has this issue because it hasn't had any messages posted to for a long time, so scanning backlog upwards will take a long time to gather 10 messages from it. In contrast, most other bufferid's have their messages interleaved on the last entries of backlog. I believe this might be throwing Postgres' estimates off.
 
Does anyone know if there's any tweaking I can do in Postgres so that it uses the appropriate plan?
 
Info about my setup:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit
Arch Linux, PostgreSQL installed from the official repositories, running inside a Xen HVM VPS.
Connecting to PostgreSQL using psql via UNIX socket.
Changed options: (locale-related ones omitted)
  listen_addresses = 
  max_stack_depth = 2MB
  shared_buffers = 256MB (Issue is also present with default value)
Total RAM: 1GB
 
 
Thanks,
--yuriks
От:
Tom Lane
Дата:

Yuri Kunde Schlesner <> writes:
> Does anyone know if there's any tweaking I can do in Postgres so that it
> uses the appropriate plan?

I suspect that the reason the planner likes the backlog_pkey is that it's
almost perfectly correlated with table order, which greatly reduces the
number of table fetches that need to happen over the course of a indexscan
compared to using the less-well-correlated bufferid+messageid index.
So that way is estimated to be cheaper than using the less-correlated
index ... and that may even be true except for outlier bufferid values
with no recent messages.

You could try fooling around with the planner cost parameters
(particularly random_page_cost) to see if that changes the decision;
but it's usually a bad idea to alter cost parameters on the basis of
tweaking a single query, and even more so for tweaking an outlier
case of a single query.

What I think might be a workable solution, assuming you can stand a little
downtime to do it, is to CLUSTER the table on the bufferid+messageid
index.  This would reverse the correlation advantage and thereby solve
your problem.  Now, ordinarily CLUSTER is only a temporary solution
because the cluster-induced ordering degrades over time.  But I think it
would likely be a very long time until you accumulate so many new messages
that the table as a whole looks well-correlated on messageid alone.

            regards, tom lane


От:
Yuri Kunde Schlesner
Дата:

On Sun, Nov 16, 2014, at 03:18 PM, Tom Lane wrote:
> I suspect that the reason the planner likes the backlog_pkey is that it's
> almost perfectly correlated with table order, which greatly reduces the
> number of table fetches that need to happen over the course of a
> indexscan
> compared to using the less-well-correlated bufferid+messageid index.
> So that way is estimated to be cheaper than using the less-correlated
> index ... and that may even be true except for outlier bufferid values
> with no recent messages.
Indeed, and I can imagine that this is more advantageous in the general
case, as I described in my last message. The problem is that the
variance is too high, with a 500x slowdown between the best and the
worst cases for that plan.

> What I think might be a workable solution, assuming you can stand a
> little
> downtime to do it, is to CLUSTER the table on the bufferid+messageid
> index.  This would reverse the correlation advantage and thereby solve
> your problem.  Now, ordinarily CLUSTER is only a temporary solution
> because the cluster-induced ordering degrades over time.  But I think it
> would likely be a very long time until you accumulate so many new
> messages
> that the table as a whole looks well-correlated on messageid alone.
I tried this and it seems to have solved my problem! The better plan is
consistently chosen now, and it's as fast as former plan on the fast
cases, and much faster on the slow case. I will continue monitoring the
DB to see if it eventually switches back to the former scheme, and if it
does I can just include a re-cluster on my maintenance schedule. Thanks
so much for the suggestion.