Discussion: Plan uses wrong index, preferring to scan pkey index instead
Hi all,
Excuse me if I made any mistakes, as this is my first time posting to a mailing list.
I'm a user of Quassel, an IRC client that uses Postgres as a backing store for IRC logs, and am running into heavy intermittent performance problems. I've tracked it down to a query that takes a very long time (around 4 minutes) to complete when its data isn't cached.
This is the layout of the table being queried and EXPLAIN ANALYZE result for the problematic query:
quassel=> \d backlog
                                      Table "public.backlog"
  Column   |            Type             |                          Modifiers
-----------+-----------------------------+-------------------------------------------------------------
 messageid | integer                     | not null default nextval('backlog_messageid_seq'::regclass)
 time      | timestamp without time zone | not null
 bufferid  | integer                     | not null
 type      | integer                     | not null
 flags     | integer                     | not null
 senderid  | integer                     | not null
 message   | text                        |
Indexes:
    "backlog_pkey" PRIMARY KEY, btree (messageid)
    "backlog_bufferid_idx" btree (bufferid, messageid DESC)
Foreign-key constraints:
    "backlog_bufferid_fkey" FOREIGN KEY (bufferid) REFERENCES buffer(bufferid) ON DELETE CASCADE
    "backlog_senderid_fkey" FOREIGN KEY (senderid) REFERENCES sender(senderid) ON DELETE SET NULL
quassel=> explain (analyze, buffers) SELECT messageid, time, type, flags, sender, message
FROM backlog
LEFT JOIN sender ON backlog.senderid = sender.senderid
WHERE bufferid = 39
ORDER BY messageid DESC LIMIT 10;
                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..37.78 rows=10 width=102) (actual time=154410.353..154410.424 rows=10 loops=1)
   Buffers: shared hit=13952 read=19244
   ->  Nested Loop Left Join  (cost=0.72..145800.61 rows=39345 width=102) (actual time=154410.350..154410.414 rows=10 loops=1)
         Buffers: shared hit=13952 read=19244
         ->  Index Scan Backward using backlog_pkey on backlog  (cost=0.43..63830.21 rows=39345 width=62) (actual time=154410.327..154410.341 rows=10 loops=1)
               Filter: (bufferid = 39)
               Rows Removed by Filter: 1248320
               Buffers: shared hit=13921 read=19244
         ->  Index Scan using sender_pkey on sender  (cost=0.29..2.07 rows=1 width=48) (actual time=0.005..0.005 rows=1 loops=10)
               Index Cond: (backlog.senderid = senderid)
               Buffers: shared hit=31
 Total runtime: 154410.477 ms
(12 rows)
This plan is consistently chosen, even after ANALYZEing and REINDEXing the table. It looks like Postgres is opting to scan the backlog_pkey index backward, filtering rows by bufferid, instead of using the backlog_bufferid_idx index, which directly matches the query's WHERE and ORDER BY clauses. I was advised on IRC to try dropping the backlog_pkey index to force Postgres to use the other one, and that produces a much better plan:
quassel=> begin;
BEGIN
quassel=> alter table backlog drop constraint backlog_pkey;
ALTER TABLE
quassel=> explain analyze SELECT messageid, time, type, flags, sender, message
FROM backlog
LEFT JOIN sender ON backlog.senderid = sender.senderid
WHERE bufferid = 39
ORDER BY messageid DESC LIMIT 10;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..40.50 rows=10 width=102) (actual time=63.826..162.134 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.72..156518.91 rows=39345 width=102) (actual time=63.823..162.126 rows=10 loops=1)
         ->  Index Scan using backlog_bufferid_idx on backlog  (cost=0.43..74548.51 rows=39345 width=62) (actual time=63.798..63.814 rows=10 loops=1)
               Index Cond: (bufferid = 39)
         ->  Index Scan using sender_pkey on sender  (cost=0.29..2.07 rows=1 width=48) (actual time=8.532..9.825 rows=1 loops=10)
               Index Cond: (backlog.senderid = senderid)
 Total runtime: 162.377 ms
(7 rows)
quassel=> rollback;
ROLLBACK
(This query was also run with empty caches.) bufferid = 39 in particular has this issue because it hasn't had any messages posted to it for a long time, so scanning backlog backward takes a long time to gather 10 messages for it. In contrast, most other bufferids have their messages interleaved among the last entries of backlog. I believe this might be throwing Postgres's estimates off.
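The per-column statistics the planner works from, including how well each column tracks the table's physical row order, can be inspected via the pg_stats view; a quick check looks something like this (a sketch; run as a user with read access to the table):

```sql
-- Inspect planner statistics for the backlog table.
-- "correlation" ranges from -1 to 1; a value near 1 for messageid would
-- mean the table is stored in nearly perfect messageid order.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'backlog';
```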
Does anyone know if there's any tweaking I can do in Postgres so that it uses the appropriate plan?
Info about my setup:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit
Arch Linux, PostgreSQL installed from the official repositories, running inside a Xen HVM VPS.
Connecting to PostgreSQL using psql via UNIX socket.
Changed options: (locale-related ones omitted)
listen_addresses =
max_stack_depth = 2MB
shared_buffers = 256MB (Issue is also present with default value)
Total RAM: 1GB
Thanks,
--yuriks
Yuri Kunde Schlesner <yuriks+lists@yuriks.net> writes:
> Does anyone know if there's any tweaking I can do in Postgres so that it
> uses the appropriate plan?

I suspect that the reason the planner likes backlog_pkey is that it's
almost perfectly correlated with table order, which greatly reduces the
number of table fetches that need to happen over the course of an index
scan compared to using the less-well-correlated bufferid+messageid index.
So that way is estimated to be cheaper than using the less-correlated
index ... and that may even be true, except for outlier bufferid values
with no recent messages.

You could try fooling around with the planner cost parameters
(particularly random_page_cost) to see if that changes the decision; but
it's usually a bad idea to alter cost parameters on the basis of tweaking
a single query, and even more so for tweaking an outlier case of a single
query.

What I think might be a workable solution, assuming you can stand a
little downtime to do it, is to CLUSTER the table on the
bufferid+messageid index. This would reverse the correlation advantage
and thereby solve your problem. Now, ordinarily CLUSTER is only a
temporary solution because the cluster-induced ordering degrades over
time. But I think it would likely be a very long time until you
accumulate so many new messages that the table as a whole looks
well-correlated on messageid alone.

regards, tom lane
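Sketched in SQL, the suggested approach would be the following (index name taken from the \d output above; note that CLUSTER holds an ACCESS EXCLUSIVE lock on the table while it runs, hence the downtime caveat):

```sql
-- Physically rewrite backlog in (bufferid, messageid DESC) order.
-- The table is locked exclusively for the duration of the rewrite.
CLUSTER backlog USING backlog_bufferid_idx;

-- Refresh planner statistics so they reflect the new physical ordering.
ANALYZE backlog;
```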
On Sun, Nov 16, 2014, at 03:18 PM, Tom Lane wrote:
> I suspect that the reason the planner likes backlog_pkey is that it's
> almost perfectly correlated with table order, which greatly reduces the
> number of table fetches that need to happen over the course of an index
> scan compared to using the less-well-correlated bufferid+messageid
> index. So that way is estimated to be cheaper than using the
> less-correlated index ... and that may even be true, except for outlier
> bufferid values with no recent messages.

Indeed, and I can imagine that this is more advantageous in the general
case, as I described in my last message. The problem is that the variance
is too high, with a 500x slowdown between the best and the worst cases
for that plan.

> What I think might be a workable solution, assuming you can stand a
> little downtime to do it, is to CLUSTER the table on the
> bufferid+messageid index. This would reverse the correlation advantage
> and thereby solve your problem. Now, ordinarily CLUSTER is only a
> temporary solution because the cluster-induced ordering degrades over
> time. But I think it would likely be a very long time until you
> accumulate so many new messages that the table as a whole looks
> well-correlated on messageid alone.

I tried this and it seems to have solved my problem! The better plan is
consistently chosen now; it's as fast as the former plan in the fast
cases, and much faster in the slow case. I will continue monitoring the
DB to see if it eventually switches back to the former scheme, and if it
does I can just include a re-cluster in my maintenance schedule.

Thanks so much for the suggestion.
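Should a periodic re-cluster become necessary, it can be kept simple: once a table has been clustered, Postgres remembers which index was used, and a bare CLUSTER reuses it. A maintenance job could be a sketch like:

```sql
-- Re-cluster backlog using the index recorded by the previous
-- CLUSTER ... USING run (tracked via pg_index.indisclustered),
-- then refresh the statistics.
CLUSTER backlog;
ANALYZE backlog;
```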