Discussion: BUG #11500: PRIMARY KEY index not being used
The following bug has been logged on the website:

Bug reference:      11500
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.1.12
Operating system:   Linux
Description:

Hi,

We've been observing a performance problem where a PRIMARY KEY index is
not being used.  The problem looks like this:

pg2=#* explain analyze select * from events where eventid = 132685185 and processed = 0;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_events_processed on events  (cost=0.00..7.73 rows=1 width=106) (actual time=31.808..31.808 rows=0 loops=1)
   Index Cond: (processed = 0)
   Filter: (eventid = 132685185)
 Total runtime: 31.852 ms
(4 rows)

pg2=#* explain analyze select * from events where eventid = 132685185 and processed+0 = 0;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using events_pkey on events  (cost=0.00..12.38 rows=1 width=106) (actual time=0.071..0.071 rows=0 loops=1)
   Index Cond: (eventid = 132685185)
   Filter: ((processed + 0) = 0)
 Total runtime: 0.109 ms
(4 rows)

I'm guessing that's happening because the index on processed is smaller
(7GB, relpages=900880 vs 3.7GB, relpages=478225).  The statistics say that
there are no rows where processed=0 (and it's not far from the truth), but
it's still a risky plan compared to the PK lookup.  The index
index_events_processed is an index on events(processed), which should
probably be a partial index on WHERE processed = 0, but I thought I'd
report this plan anyway.

Any thoughts?
On Fri, Sep 26, 2014 at 11:02 AM, <marko@joh.to> wrote:
> The statistics say that
> there are no rows where processed=0 (and it's not far from the truth), but
> it's still a risky plan compared to the PK lookup.

> Any thoughts?

PostgreSQL 9.0 introduced this optimization for greater/less operators:

> When looking up statistics for greater/less-than comparisons, if the
> comparison value is in the first or last histogram bucket, use an index
> (if available) to fetch the current actual column minimum or maximum.
> This greatly improves the accuracy of estimates for comparison values
> near the ends of the data range, particularly if the range is constantly
> changing due to addition of new data.

Not sure whether it's a good idea or a bad idea, but perhaps a solution
is to expand this to equality lookups too?

Does using "WHERE processed <= 0" work around the problem? (Assuming
you don't have any negative numbers in this column.)

> The index
> index_events_processed is an index on events(processed), which should
> probably be a partial index on WHERE processed = 0, but I thought I'd
> report this plan anyway.

I guess you would still have this problem, unless your new index
contains the eventid column.

Regards,
Marti
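As a concrete illustration of Marti's last point, here is a minimal sketch
of a partial index that keeps eventid as the key column; the index name and
exact column types are assumptions, since the original schema was not posted:

-- Hypothetical sketch: a partial index keyed on eventid, limited to the
-- unprocessed rows.  A query with "eventid = ... AND processed = 0" can
-- probe it directly instead of scanning every processed = 0 entry.
CREATE INDEX index_events_unprocessed
    ON events (eventid)
    WHERE processed = 0;

Because the equality condition is on the key column, the planner can use
this index for the point lookup while the partial predicate keeps it small.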
On 9/26/14 2:02 PM, Marti Raudsepp wrote:
> On Fri, Sep 26, 2014 at 11:02 AM, <marko@joh.to> wrote:
>> The statistics say that
>> there are no rows where processed=0 (and it's not far from the truth), but
>> it's still a risky plan compared to the PK lookup.
>
>> Any thoughts?
>
> PostgreSQL 9.0 introduced this optimization for greater/less operators:
>
>> When looking up statistics for greater/less-than comparisons, if the
>> comparison value is in the first or last histogram bucket, use an index
>> (if available) to fetch the current actual column minimum or maximum.
>> This greatly improves the accuracy of estimates for comparison values
>> near the ends of the data range, particularly if the range is constantly
>> changing due to addition of new data.
>
> Not sure whether it's a good idea or a bad idea, but perhaps a solution
> is to expand this to equality lookups too?

I'm not sure that's the right idea to be honest.  The problem is that
the planner is taking a risk by using an index which could contain
(theoretically) any number of matching rows, instead of using the
primary key which is guaranteed to only contain 0 or 1 rows.  Sure,
peeking into the index to see that there are indeed some processed=0
rows would probably discourage the planner from using it, but why
bother?

> Does using "WHERE processed <= 0" work around the problem? (Assuming
> you don't have any negative numbers in this column).

I unfortunately already dropped the problematic index, so I can't
answer that.


.marko
marko@joh.to writes:
> PostgreSQL version: 9.1.12
> We've been observing a performance problem where a PRIMARY KEY index is not
> being used. The problem looks like this:

In 9.2 and again in 9.3, we significantly changed the planner's modeling
of index access costs, with a view towards making better choices when
there are multiple plausible indexes to use.  If you can reproduce this
misbehavior in 9.3 or later it would be worth looking into; but we're
not going to change 9.1's estimation rules at this late date.  People
running on back branches tend to want plan stability, in my experience.

			regards, tom lane
Hi,

Resurrecting since this has been causing us issues again, though this
time on a different index.

Given a schema similar to this:

CREATE TYPE order_state AS ENUM ('INITIAL', 'WAIT_EVENT', 'DONE');

CREATE TABLE orders(
    order_id bigserial PRIMARY KEY,
    state order_state NOT NULL DEFAULT 'INITIAL'
);

-- there are other indexed columns here, but nothing really reads
-- through the entire index
CREATE INDEX orders_wait_event_idx ON orders ((1)) WHERE state = 'WAIT_EVENT';

where 80% of rows go through WAIT_EVENT before DONE.  Then we have a
frequent query like this:

SELECT ..
FROM orders
WHERE
    order_id = $1 AND
    state = 'WAIT_EVENT';

which almost always uses the primary key.  But sometimes, perhaps after
an autovacuum or something, something changes and postgres decides to
start serving that query through the orders_wait_event_idx index.  Now
those queries need to first suffer through this:

Index Scan using orders_wait_event_idx on orders  (cost=0.54..94812.85 rows=85043 width=1223) (actual time=0.166..7199.020 rows=84535 loops=1)
  Buffers: shared hit=15676 read=91962 dirtied=1988
Planning:
  Buffers: shared hit=807 read=11 dirtied=1
Planning Time: 4.634 ms
Execution Time: 7204.117 ms

and subsequent executions also take ~100ms, instead of the ~0.05ms
through the PRIMARY KEY.

Just testing locally, if I add some data:

INSERT INTO orders (state) SELECT 'DONE' FROM generate_series(1, 65536 * 12);
ANALYZE orders;

I get:

EXPLAIN (ANALYZE, BUFFERS FALSE) SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using orders_wait_event_idx on orders  (cost=0.12..8.14 rows=1 width=4) (actual time=0.003..0.003 rows=0.00 loops=1)
   Filter: (order_id = 1)
   Index Searches: 1
 Planning Time: 0.278 ms
 Execution Time: 0.035 ms
(5 rows)

which just seems like an insane gamble to take compared to reading the
primary key index.  If you're right, you save fractions of a
millisecond, but if you're wrong, it could be the ten seconds like
we've been seeing in production.

We've been seeing this on 16.9 in prod, and with the code here I see
the planner hitting the casino on latest master as well.

Let me know what you think.


.m
On Fri, 2025-10-03 at 09:23 +0300, Marko Tiikkaja wrote:
> CREATE TYPE order_state AS ENUM ('INITIAL', 'WAIT_EVENT', 'DONE');
>
> CREATE TABLE orders(
>     order_id bigserial PRIMARY KEY,
>     state order_state NOT NULL DEFAULT 'INITIAL'
> );
>
> -- there are other indexed columns here, but nothing really reads
> -- through the entire index
> CREATE INDEX orders_wait_event_idx ON orders ((1)) WHERE state = 'WAIT_EVENT';
>
> where 80% of rows go through WAIT_EVENT before DONE.  Then we have a
> frequent query like this:
>
> SELECT ..
> FROM orders
> WHERE
>     order_id = $1 AND
>     state = 'WAIT_EVENT';
>
> which almost always uses the primary key.  But sometimes, perhaps
> after an autovacuum or something, something changes and postgres
> decides to start serving that query through the orders_wait_event_idx
> index.  Now those queries need to first suffer through this:
>
> Index Scan using orders_wait_event_idx on orders  (cost=0.54..94812.85 rows=85043 width=1223) (actual time=0.166..7199.020 rows=84535 loops=1)
>   Buffers: shared hit=15676 read=91962 dirtied=1988
> Planning:
>   Buffers: shared hit=807 read=11 dirtied=1
> Planning Time: 4.634 ms
> Execution Time: 7204.117 ms
>
> which just seems like an insane gamble to take compared to reading the
> primary key index.  If you're right, you save fractions of a
> millisecond, but if you're wrong, it could be the ten seconds like
> we've been seeing in production.
>
> We've been seeing this on 16.9 in prod, and with the code here I see
> the planner hitting the casino on latest master as well.
>
> Let me know what you think.

Did you check if the optimizer statistics are up to date?

The index you created is only useful if only a very small percentage of
the rows in the table match the WHERE condition.  It may be that the
optimizer chooses the index by mistake: deduplication of identical index
keys will render the index rather small, and PostgreSQL prefers the
smaller index if it thinks that both indexes will do the task equally
well.

But I think that the root of the problem is that you created an index
that you don't want to get used.  I understand that you want it for some
other statement, but is there really nothing better to index than the
constant 1?

Yours,
Laurenz Albe
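A minimal sketch of what Laurenz's closing question hints at, using the
schema from Marko's repro; whether this actually fits depends on what the
other statement that needs the index does, which is not shown in the thread:

-- Sketch: index a real column instead of the constant 1, so the partial
-- index can also serve the point lookup on order_id.
DROP INDEX orders_wait_event_idx;
CREATE INDEX orders_wait_event_idx
    ON orders (order_id)
    WHERE state = 'WAIT_EVENT';

With order_id as the key, a plan that picks this index for the problem
query can at least apply "order_id = $1" as an index condition rather
than a filter over every WAIT_EVENT row.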
On Fri, Oct 3, 2025 at 9:53 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The index you created is only useful if only a very small percentage of
> the rows in the table match the WHERE condition.  It may be that the
> optimizer chooses the index by mistake: deduplication of identical index
> keys will render the index rather small, and PostgreSQL prefers the smaller
> index if it thinks that both indexes will do the task equally well.

I didn't want to include too much in the small repro that I had, but as
I alluded to in the comment, the index has real data in it.
Deduplication plays no real part here.  But yes, the index is still
physically smaller because it indexes a small subset of the table.

> Did you check if the optimizer statistics are up to date?

What I'm trying to say is that I don't think there is any data you could
put in the stats tables to justify gambling on this index.  But feel
free to try out my example yourself.  This is a bit more like what the
production data looks like:

INSERT INTO orders (state)
SELECT
    CASE WHEN random() <= 0.8 THEN order_state 'WAIT_EVENT'
         ELSE order_state 'DONE' END
FROM generate_series(1, 65536 * 12);
UPDATE orders SET state = 'DONE' WHERE state = 'WAIT_EVENT';
ANALYZE orders;

=# EXPLAIN ANALYZE SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using orders_wait_event_idx on orders  (cost=0.38..8.39 rows=1 width=4) (actual time=54.650..54.651 rows=0.00 loops=1)
   Filter: (order_id = 1)
   Index Searches: 1
   Buffers: shared hit=5239
 Planning:
   Buffers: shared hit=30
 Planning Time: 1.221 ms
 Execution Time: 54.682 ms
(8 rows)

=# EXPLAIN ANALYZE SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using orders_wait_event_idx on orders  (cost=0.38..8.39 rows=1 width=4) (actual time=0.459..0.459 rows=0.00 loops=1)
   Filter: (order_id = 1)
   Index Searches: 1
   Buffers: shared hit=495
 Planning Time: 0.091 ms
 Execution Time: 0.476 ms
(6 rows)

=# EXPLAIN ANALYZE SELECT 1 FROM orders WHERE order_id = 1 AND state::text = 'WAIT_EVENT';
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using orders_pkey on orders  (cost=0.42..8.45 rows=1 width=4) (actual time=0.039..0.039 rows=0.00 loops=1)
   Index Cond: (order_id = 1)
   Filter: ((state)::text = 'WAIT_EVENT'::text)
   Rows Removed by Filter: 1
   Index Searches: 1
   Buffers: shared hit=10
 Planning:
   Buffers: shared hit=99
 Planning Time: 2.864 ms
 Execution Time: 0.077 ms
(10 rows)


.m
Marko Tiikkaja <marko@joh.to> writes:
> ... we have a frequent query like this:

> SELECT ..
> FROM orders
> WHERE
>     order_id = $1 AND
>     state = 'WAIT_EVENT';

> which almost always uses the primary key.  But sometimes, perhaps
> after an autovacuum or something, something changes and postgres
> decides to start serving that query through the orders_wait_event_idx
> index.

You haven't given us a lot to go on: no reproducible test case, no clear
description of what triggers the issue, not even the complete EXPLAIN
output for the problem query.  But it's hard to believe that the planner
would estimate a probe on a unique index as costing more than 94812.85
units, which is what this fragment seems to suggest.

If you look at eqsel() you will observe that the presence of a unique
index overrides any information from statistics, so there is no "casino"
behavior here IMO: it should realize that "order_id = $1" selects a
single row no matter what.

I'm wondering if it didn't use that index because it *couldn't* use that
index, for who-knows-what reason.  There are operations that will leave
an index marked as invalid and hence unsafe to query on.

Anyway, -EINSUFFICIENTINFORMATION.

			regards, tom lane
On Fri, Oct 3, 2025 at 16:31 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> ... we have a frequent query like this:
>> SELECT ..
>> FROM orders
>> WHERE
>>     order_id = $1 AND
>>     state = 'WAIT_EVENT';
>> which almost always uses the primary key.  But sometimes, perhaps
>> after an autovacuum or something, something changes and postgres
>> decides to start serving that query through the orders_wait_event_idx
>> index.
>
> You haven't given us a lot to go on: no reproducible test case,

I've provided two.  Both make the planner look bad.

> no clear description of what triggers the issue,

..

> not even the complete
> EXPLAIN output for the problem query.  But it's hard to believe that
> the planner would estimate a probe on a unique index as costing more
> than 94812.85 units, which is what this fragment seems to suggest.

That was an after-the-fact demonstration of how expensive gambling on
the index can be.

> If you look at eqsel() you will observe that the presence of a
> unique index overrides any information from statistics, so there
> is no "casino" behavior here IMO: it should realize that
> "order_id = $1" selects a single row no matter what.

Then explain what's going on in the test case.  I'll be at the slot
machine with the planner.


.m
On Sat, 4 Oct 2025 at 15:40, Marko Tiikkaja <marko@joh.to> wrote:
>
> On Fri, Oct 3, 2025 at 16:31 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You haven't given us a lot to go on: no reproducible test case,
>
> I've provided two. Both make the planner look bad.

I expect what might be happening here is that ANALYZE runs when there
are no state = 'WAIT_EVENT' rows and thinks using the index on that
column is a better idea than the primary key index.  Both selectivities
will be clamped at 1 row, so the index with fewer pages is likely to
win.  Then the problem happens after lots of state = 'WAIT_EVENT' rows
have been added, but not enough to trigger an auto-analyze.  In that
case, the planner would still think there's 1 row, but there could be
any number of rows.

create table t (a bigint primary key, b int not null);
create index on t (b);
insert into t select x, x%10+1 from generate_Series(1,1000000)x;
analyze t;

explain select * from t where a=1 and b=0;
                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using t_b_idx on t  (cost=0.42..8.40 rows=1 width=12)
   Index Cond: (b = 0)
   Filter: (a = 1)
(3 rows)

Just to see the cost comparison between the two, drop the index on "b":

drop index t_b_idx;
explain select * from t where a=1 and b=0;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.42..8.45 rows=1 width=12)   <- a bit more costly.
   Index Cond: (a = 1)
   Filter: (b = 0)
(3 rows)

Basically, the planner isn't very good at taking risk into account when
it comes to this sort of thing.  Our cost model doesn't have a risk
factor.  We have added a few things into add_path() over the years to
try to do something better in these scenarios.  They often help.
However, there's none for this scenario.  The attached patch adds one.
I'm not proposing we do this, but if you try my example above with it
applied, you'll see the t_pkey index is used.  I feel the hack is too
fragile to be a proper fix.

You might be able to work around the issue by adjusting the query.
Ordinarily, you could do something like "WHERE indexed_col + 0 =
<value>" to bypass using an index on that column (assuming the column
is an integer type).  I can't off the top of my head think of a way to
do that with an enum column.

Another idea for how you could fix this is to adjust the
orders_wait_event_idx index somehow.  Maybe make it non-partial and
index the enum column and INCLUDE another column to make the index
wider and more costly to scan.  Or maybe move it to another tablespace
with higher page costs.

David
Attachments
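For illustration, rough sketches of the two index-side workarounds David
mentions; the index name, the INCLUDEd column, and the tablespace name and
path are all made up for the example:

-- Sketch 1: a non-partial index on the enum column, padded with an
-- INCLUDEd column so the index is wider and costlier to scan.
CREATE INDEX orders_state_idx
    ON orders (state)
    INCLUDE (order_id);

-- Sketch 2: keep the partial index but place it in a tablespace whose
-- page costs are set higher, making plans that scan it look expensive.
CREATE TABLESPACE slow_ts LOCATION '/mnt/slow_disk/pgdata';  -- hypothetical path
ALTER TABLESPACE slow_ts SET (random_page_cost = 10, seq_page_cost = 10);
ALTER INDEX orders_wait_event_idx SET TABLESPACE slow_ts;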
On Fri, Oct 3, 2025 at 8:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I can't off the top of my head think of a way to
> do that with an enum column.
Casting the enum to text seems to work:
EXPLAIN SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';
QUERY PLAN
------------------------------------------------------------------------------------
Index Scan using orders_wait_event_idx on orders (cost=0.12..2.34 rows=1 width=4)
Filter: (order_id = 1)
(2 rows)
EXPLAIN SELECT 1 FROM orders WHERE order_id = 1 AND state::text = 'WAIT_EVENT';
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using orders_pkey on orders (cost=0.42..2.65 rows=1 width=4)
Index Cond: (order_id = 1)
Filter: ((state)::text = 'WAIT_EVENT'::text)
(3 rows)
David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 4 Oct 2025 at 15:40, Marko Tiikkaja <marko@joh.to> wrote:
>> On Fri, Oct 3, 2025 at 16:31 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You haven't given us a lot to go on: no reproducible test case,

>> I've provided two. Both make the planner look bad.

You've provided *no* reproducible test case that makes the planner look
bad.  The filled-in test case has two possibilities that are both pretty
cheap and the planner knows they are pretty cheap, so it hardly matters
which one it takes.  You showed us a fragment of a case where it chose a
very expensive scan that it shouldn't have, but no useful information
about how to reproduce that.

> I expect what might be happening here is that ANALYZE runs when there
> are no state = 'WAIT_EVENT' rows and thinks using the index on that
> column is a better idea than the primary key index.

Marko's fragment at [1] shows

Index Scan using orders_wait_event_idx on orders  (cost=0.54..94812.85 rows=85043 width=1223) (actual time=0.166..7199.020 rows=84535 loops=1)
  Buffers: shared hit=15676 read=91962 dirtied=1988
Planning:
  Buffers: shared hit=807 read=11 dirtied=1
Planning Time: 4.634 ms
Execution Time: 7204.117 ms

That's a pretty darn accurate rowcount estimate, so "out of date stats"
doesn't seem to be the problem.  I don't think it could possibly have
believed that scanning the pkey index would fetch that many rows, or
cost that much.  So why didn't it pick that index?  No way to tell from
this amount of info, but I'm suspecting a moving part that we don't
know about.

(Hmmm ... why is there not a "Filter: (order_id = $1)" line here?)

			regards, tom lane

[1] https://www.postgresql.org/message-id/CAL9smLB1szUHLMfpN19FKiCHRCs4WvfjqXbxKaCUjmDzEtT%3Dng%40mail.gmail.com
On Sun, 5 Oct 2025 at 06:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Index Scan using orders_wait_event_idx on orders
> (cost=0.54..94812.85 rows=85043 width=1223) (actual
> time=0.166..7199.020 rows=84535 loops=1)

> That's a pretty darn accurate rowcount estimate, so "out of date
> stats" doesn't seem to be the problem.  I don't think it could
> possibly have believed that scanning the pkey index would fetch that
> many rows, or cost that much.  So why didn't it pick that index?
> No way to tell from this amount of info, but I'm suspecting a
> moving part that we don't know about.
>
> (Hmmm ... why is there not a "Filter: (order_id = $1)" line here?)

Marko did mention:

On Sat, 4 Oct 2025 at 15:40, Marko Tiikkaja <marko@joh.to> wrote:
> That was an after-the-fact demonstration of how expensive gambling on the index can be.

I assumed since the EXPLAIN didn't match the query that the EXPLAIN
output was fabricated afterwards from the server without the "order_id"
qual to try to illustrate the index that was used and the row numbers
that index had to visit.  It would be good to get confirmation of that
from Marko.

Otherwise, indeed, something very strange is going on if the planner
produced the above plan for the given query.

David
On Sun, Oct 5, 2025 at 3:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Marko did mention:
>
> On Sat, 4 Oct 2025 at 15:40, Marko Tiikkaja <marko@joh.to> wrote:
> > That was an after-the-fact demonstration of how expensive gambling on the index can be.
>
> I assumed since the EXPLAIN didn't match the query that the EXPLAIN
> output was fabricated afterwards from the server without the
> "order_id" qual to try to illustrate the index that was used and the
> row numbers that index had to visit.  It would be good to get
> confirmation of that from Marko.

That's exactly it.

I can't believe I actually have to do this, but run:

CREATE TYPE order_state AS ENUM ('INITIAL', 'WAIT_EVENT', 'DONE');

CREATE TABLE orders(
    order_id bigserial PRIMARY KEY,
    state order_state NOT NULL DEFAULT 'INITIAL'
);

CREATE INDEX orders_wait_event_idx ON orders ((1)) WHERE state = 'WAIT_EVENT';

INSERT INTO orders (state)
SELECT
    CASE WHEN random() <= 0.8 THEN order_state 'WAIT_EVENT'
         ELSE order_state 'DONE' END
FROM generate_series(1, 65536 * 12);
UPDATE orders SET state = 'DONE' WHERE state = 'WAIT_EVENT';
ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS) SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';

and I get:

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using orders_wait_event_idx on orders  (cost=0.38..1.99 rows=1 width=4) (actual time=22.482..22.482 rows=0 loops=1)
   Filter: (order_id = 1)
   Buffers: shared hit=4746
 Planning:
   Buffers: shared hit=18
 Planning Time: 0.085 ms
 Execution Time: 22.488 ms
(7 rows)


.m
On Sat, Oct 4, 2025 at 8:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Sat, 4 Oct 2025 at 15:40, Marko Tiikkaja <marko@joh.to> wrote:
> >> On Fri, Oct 3, 2025 at 16:31 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> You haven't given us a lot to go on: no reproducible test case,
>
> >> I've provided two. Both make the planner look bad.
>
> You've provided *no* reproducible test case that makes the planner
> look bad.  The filled-in test case has two possibilities that are
> both pretty cheap and the planner knows they are pretty cheap, so
> it hardly matters which one it takes.  You showed us a fragment
> of a case where it chose a very expensive scan that it shouldn't
> have, but no useful information about how to reproduce that.

Hardly matters?  I'd say 54.682 ms vs. 0.077 ms is a big deal.
Especially because the planner doesn't seem to have any idea what the
upper bound on the first one could be.


.m
Marko Tiikkaja <marko@joh.to> writes:
> I can't believe I actually have to do this, but run:

That's not what was asked for.  Your original report showed a case where
the planner chose orders_wait_event_idx despite estimating that it would
have to scan 80K-plus rows.  That's the case where I don't understand
how it could think that the primary key index would be even more
expensive.

I'm not particularly concerned about the case you show here, because if
you run the EXPLAIN a second time it gets a lot cheaper.  I believe the
reason is that the first time is visiting a bunch of just-deleted rows
and so it has to stop and update their hint bits, both in the heap and
the index.  That is not a cost that the planner can reasonably predict
in advance, and even if we could it's not really fair to blame it on
the choice of index.  The pkey alternative only visits one row and
therefore only updates one hint bit, but we'd have to update the rest
sooner or later.  So IMO the speed difference is largely illusory
because it arises from deferring maintenance.

You can demonstrate this by changing the "ANALYZE orders" in the test
case to "VACUUM ANALYZE orders": the EXPLAIN still chooses
orders_wait_event_idx, but now it's fast because all that work got done
by VACUUM.

			regards, tom lane
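To make the suggested variation concrete, the only change to Marko's repro
script above is the statement that follows the UPDATE:

-- Instead of plain ANALYZE, let VACUUM do the deferred maintenance
-- (hint bits, dead index entries) before the query runs:
VACUUM ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS)
SELECT 1 FROM orders WHERE order_id = 1 AND state = 'WAIT_EVENT';
-- Per Tom's observation, this still picks orders_wait_event_idx, but the
-- first execution no longer pays for the deferred cleanup work.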
On Sun, Oct 5, 2025 at 9:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not particularly concerned about the case you show here,
> because if you run the EXPLAIN a second time it gets a lot
> cheaper.

Sure, it gets cheaper, but it's still a bad plan, and significantly
slower than the obvious one.

> I believe the reason is that the first time is
> visiting a bunch of just-deleted rows and so it has to stop
> and update their hint bits, both in the heap and the index.
> That is not a cost that the planner can reasonably predict
> in advance, and even if we could it's not really fair to
> blame it on the choice of index.  The pkey alternative only
> visits one row and therefore only updates one hint bit, but
> we'd have to update the rest sooner or later.  So IMO the
> speed difference is largely illusory because it arises from
> deferring maintenance.

On the production server the query still consistently took ~100ms even
after all that maintenance had been done.  Compared to <0.05ms when
going through the primary key.

But it sounds like you're saying the planner is working as expected
here, so I'll just drop the issue.  I've already implemented a
workaround in production via a no-op function.


.m
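Marko does not show the workaround itself; the following is only a guess at
what such a no-op wrapper could look like, consistent with David's reading
of it later in the thread (the function name is invented):

-- Hypothetical no-op wrapper.  Being a non-inlined plpgsql function, it
-- hides the enum comparison from the planner, so the partial index's
-- predicate can no longer be proven and the primary key is used instead.
CREATE FUNCTION order_state_noop(s order_state) RETURNS order_state
    LANGUAGE plpgsql STABLE
    AS $$ BEGIN RETURN s; END; $$;

-- The problem query would then be written as:
--   SELECT .. FROM orders
--   WHERE order_id = $1 AND order_state_noop(state) = 'WAIT_EVENT';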
On Mon, 6 Oct 2025 at 23:39, Marko Tiikkaja <marko@joh.to> wrote:
> On the production server the query still consistently took ~100ms even
> after all that maintenance had been done.  Compared to <0.05ms when
> going through the primary key.

This part surprises me.  Is this on a standby server or primary?  Are
you certain that no rows are matching the state = 'WAIT_EVENT' qual?

David
On Mon, Oct 6, 2025 at 2:35 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Mon, 6 Oct 2025 at 23:39, Marko Tiikkaja <marko@joh.to> wrote:
> > On the production server the query still consistently took ~100ms even
> > after all that maintenance had been done.  Compared to <0.05ms when
> > going through the primary key.
>
> This part surprises me.  Is this on a standby server or primary?

It's a primary.

> Are
> you certain that no rows are matching the state = 'WAIT_EVENT' qual?

There are, around 80k rows, but it fluctuates quite a bit.

I spent some time working on my test program to gather more data, so
I'll give this one more go.  All of these are from production, so the
schema is slightly different.

Here's what the plan looks like when things are good:

 Index Scan using orders_pkey on orders  (cost=0.58..2.20 rows=1 width=4) (actual time=0.172..0.172 rows=0 loops=1)
   Index Cond: (orderid = '7000000000'::bigint)
   Filter: (orderstatusid = 3)
   Buffers: shared hit=3 read=1
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.140 ms
 Execution Time: 0.178 ms

Then sometimes auto-analyze just doesn't sample enough pages which are
WAIT_EVENT (orderstatusid = 3), so the plan changes into this:

 Index Scan using index_orders_wait_event on orders  (cost=0.50..2.09 rows=1 width=4) (actual time=9512.921..9512.921 rows=0 loops=1)
   Filter: (orderid = '7000000000'::bigint)
   Rows Removed by Filter: 88520
   Buffers: shared hit=161591 read=6066 dirtied=139 written=125
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.139 ms
 Execution Time: 9512.934 ms

It's only this slow for the first execution, then the following
executions take between 50ms and 100ms.  My program isn't intelligent
enough to catch it, but if I reproduce an equivalent manually, it looks
like:

 Index Scan using index_orders_wait_event on orders  (actual time=89.004..89.004 rows=0 loops=1)
   Filter: ((orderid + 1) = '7000000000'::bigint)
   Rows Removed by Filter: 83280
   Buffers: shared hit=91856
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.198 ms
 Execution Time: 89.014 ms
(8 rows)

So still multiple orders of magnitude slower than the PRIMARY KEY plan.

Some time later another auto-analyze happens to see enough of these
(orderstatusid = 3) rows, and the plan changes back to the PRIMARY KEY
one.

The table is around 600 GB in size, so having a VACUUM running through
the table 24/7 to combat this isn't feasible.  But it doesn't look like
it would help too much, as the partial index is constantly around
300 MB in size, even after an autovacuum.  If there was a VACUUM INDEX,
that probably could be running 24/7, but as far as I know there isn't a
way to do just one index like that.  REINDEXing now and then, and trying
to get more aggressive micro-vacuuming on the index by running SELECTs
that scan through it, could also work; not sure.

In short, if:

  1) there was no index bloat
  2) the statistics were perfectly accurate for the pre-existing data
  3) the distribution of data couldn't change

one could save up to 0.178 ms for the first execution or around 0.03 ms
for subsequent ones.  But all three of those are a risk, and sometimes
we end up losing 9512.934 ms for the first execution or around 89 ms
for subsequent ones.

If there's more info I can provide without too much effort I'll be
happy to.


.m
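On the "REINDEXing now and then" idea: since the production servers are on
16.9, a single index can be rebuilt on its own without blocking writes.
This is only a sketch of that maintenance step, not something discussed
further in the thread:

-- Rebuild just the problematic partial index, without taking locks that
-- block concurrent inserts/updates on the table:
REINDEX INDEX CONCURRENTLY index_orders_wait_event;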
On Sat, 11 Oct 2025 at 03:53, Marko Tiikkaja <marko@joh.to> wrote:
> On Mon, Oct 6, 2025 at 2:35 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Are
> > you certain that no rows are matching the state = 'WAIT_EVENT' qual?
>
> There are, around 80k rows, but it fluctuates quite a bit.

Thanks for confirming that.  That is subtly different from what you
indicated in the last report, and it does seem like my suspicions in [1]
were correct after all.

Unfortunately, you've hit a limitation in our cost model and we can't
really do much about that as a bug fix as it comes down to design.

The no-op function you've added (which I assume is a stable plpgsql
function to return the input order_state enum) is an adequate workaround
to stop the wait_event index being used for the problem query.

David

[1] https://postgr.es/m/CAApHDvp=+EN2o=8fD47nyKKOSiQbLdYGLfUb=2obLWwQxUjyfg@mail.gmail.com