Обсуждение: Poor man's partitioned index .... not being used?
Hi all, look at this short story please: foo=# CREATE TABLE Test(id int NOT NULL PRIMARY KEY); CREATE TABLE foo=# INSERT INTO test SELECT row_number() OVER() FROM pg_class a CROSS JOIN pg_class b; INSERT 0 388129 foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934'; QUERY PLAN ---------------------------------------------------------------------------Index Only Scan using test_pkey on test (cost=0.42..8.44 rows=1 width=4) Index Cond: (id = 8934) (2 rows) foo=# ALTER TABLE Test DROP CONSTRAINT Test_pkey; ALTER TABLE foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934'; QUERY PLAN -------------------------------------------------------Seq Scan on test (cost=0.00..6569.61 rows=1 width=4) Filter: (id = 8934) (2 rows) foo=# SELECT max(id)/2 FROM Test;?column? ---------- 194064 (1 row) foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE id < 194064; CREATE INDEX foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE id >= 194064; CREATE INDEX foo=# ANALYZE Test; ANALYZE foo=# EXPLAIN SELECT * FROM Test WHERE id = 8934; QUERY PLAN --------------------------------------------------------------------------Index Only Scan using test_pk0 on test (cost=0.42..8.44 rows=1 width=4) Index Cond: (id = 8934) (2 rows) foo=# DROP INDEX Test_pk0; DROP INDEX foo=# DROP INDEX Test_pk1; DROP INDEX foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0; CREATE INDEX foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1; CREATE INDEX foo=# ANALYZE Test; ANALYZE foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934'; QUERY PLAN -------------------------------------------------------Seq Scan on test (cost=0.00..6569.61 rows=1 width=4) Filter: (id = 8934) (2 rows) Why is that index never used? PS: there is a performance question behind this, big table, heavily used index, the hope was that with this simple scheme of partitioning just the index one might distribute the load better. I know, if the load really is so big, why not partition the entire table. But just for hecks, why not this way? regards, -Gunther
On Thu, 21 Mar 2019 at 15:51, Gunther <raj@gusw.net> wrote: > foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0; > CREATE INDEX > foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1; > CREATE INDEX > foo=# ANALYZE Test; > ANALYZE > foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934'; > QUERY PLAN > ------------------------------------------------------- > Seq Scan on test (cost=0.00..6569.61 rows=1 width=4) > Filter: (id = 8934) > (2 rows) > > Why is that index never used? When the planner looks at partial indexes to see if they'll suit the scan, the code that does the matching (predicate_implied_by()) simply does not go to that much trouble to determine if it matches. If you look at operator_predicate_proof() you'll see it requires the expression on at least one side of the OpExpr to match your predicate. Yours matches on neither side since "id" is wrapped up in a mod() function call. Certainly, predicate_implied_by() is by no means finished, new smarts have been added to it over the years to allow it to prove more cases, but each time something is added we still need to carefully weigh up the additional overhead of the new code vs. possible benefits. It may be possible to do something with immutable functions found in the expr but someone doing so might have a hard time proving that it's always safe to do so. For example, arg 2 of your mod() call is a Const. If it had been another Var then it wouldn't be safe to use. What other unsafe cases are there? Is there a way we can always identify unsafe cases during planning? ... are the sorts of questions someone implementing this would be faced with. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>>>>> "Gunther" == Gunther <raj@gusw.net> writes: Gunther> foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0; Gunther> CREATE INDEX Gunther> foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934'; Gunther> QUERY PLAN Gunther> ------------------------------------------------------- Gunther> Seq Scan on test (cost=0.00..6569.61 rows=1 width=4) Gunther> Filter: (id = 8934) Gunther> (2 rows) Gunther> Why is that index never used? Because the expression mod(id,2) does not appear in the query, and there is no logic in the implication prover to prove that (mod(id,2) = 0) is implied by (id = 8934). If you did WHERE mod(id,2) = mod(8934,2) AND id = 8934 then the index would likely be used - because the prover can then treat mod(id,2) as an atom (call it X), constant-fold mod(8934,2) to 0 because mod() is immutable, and then observe that (X = 0) proves that (X = 0). Pretty much the only simple implications that the prover can currently deduce are: - identical immutable subexpressions are equivalent - strict operator expressions imply scalar IS NOT NULL - (A op1 B) implies (B op2 A) if op2 is op1's declared commutator - Btree semantics: if <, <=, =, >=, > are all members of a btree opfamily, and <> is the declared negator of =, then implications like (X < A) and (A <= B) implies (X < B) can be deduced. -- Andrew (irc:RhodiumToad)
Thanks David Rowley and Andrew Gierth.
If you did WHERE mod(id,2) = mod(8934,2) AND id = 8934 then the index would likely be used - because the prover can then treat mod(id,2) as an atom (call it X), constant-fold mod(8934,2) to 0 because mod() is immutable, and then observe that (X = 0) proves that (X = 0).
foo=# EXPLAIN SELECT * FROM Test WHERE mod(id,2) = mod(8934,2) AND id = 8934; QUERY PLAN --------------------------------------------------------------------------Index Only Scan using test_pk0 on test (cost=0.42..8.44 rows=1 width=4) Index Cond: (id = 8934) (2 rows)
Yes indeed! It's being used that way! Interesting. Only that we can't use it if id was a variable? Hmm ...
foo=# PREPARE testplan(int) AS foo-# SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1; PREPARE foo=# EXPLAIN EXECUTE testplan(8934); QUERY PLAN --------------------------------------------------------------------------Index Only Scan using test_pk0 on test (cost=0.42..8.44 rows=1 width=4) Index Cond: (id = 8934) (2 rows)
That's quite alright actually. Now the questions is, could we use this in a nested loop query plan? That's where I think it can't work:
foo=# CREATE TABLE Test2 AS SELECT * FROM Test WHERE random() < 0.01 ORDER BY id DESC; SELECT 3730 integrator=# ANALYZE Test2; ANALYZE foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10; QUERY PLAN -----------------------------------------------------------------------------Limit (cost=110.25..135.67 rows=10 width=8) -> Hash Right Join (cost=110.25..9591.02 rows=3730 width=8) Hash Cond: ((mod(b.id, 2) = mod(a.id, 2)) AND (b.id = a.id)) -> Seq Scan on test b (cost=0.00..5599.29 rows=388129 width=4) -> Hash (cost=54.30..54.30 rows=3730 width=4) -> Seq Scan on test2 a (cost=0.00..54.30 rows=3730 width=4) (6 rows) foo=# SET enable_hashjoin TO off; SET foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10; QUERY PLAN --------------------------------------------------------------------------------Limit (cost=47214.73..47227.86 rows=10 width=8) -> Merge Right Join (cost=47214.73..52113.16 rows=3730 width=8) Merge Cond: (((mod(b.id, 2)) = (mod(a.id, 2))) AND (b.id = a.id)) -> Sort (cost=46939.15..47909.47 rows=388129 width=4) Sort Key: (mod(b.id, 2)), b.id -> Seq Scan on test b (cost=0.00..5599.29 rows=388129 width=4) -> Sort (cost=275.58..284.91 rows=3730 width=4) Sort Key: (mod(a.id, 2)), a.id -> Seq Scan on test2 a (cost=0.00..54.30 rows=3730 width=4) (9 rows) foo=# SET enable_mergejoin TO off; SET foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10; QUERY PLAN --------------------------------------------------------------------------------Limit (cost=0.00..102516.78 rows=10 width=8) -> Nested Loop Left Join (cost=0.00..38238760.24 rows=3730 width=8) Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2))) -> Seq Scan on test2 a (cost=0.00..54.30 rows=3730 width=4) -> Materialize (cost=0.00..9056.93 rows=388129 width=4) -> Seq Scan on test b (cost=0.00..5599.29 rows=388129 width=4) (6 rows)
It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to the index query for the nested loop.
Notably the partitioned table approach should do that, but it has a different expression for the partition. No mod function but MODULUS and REMAINDER.
I wonder if there was a way of marking such expressions as safe in the query, like suggesting a certain evaluation order, i.e.,
SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10;
It's OK though. It just goes to show that in a case like this, it is best to just go with the partitioned table anyway.
regards,
-Gunther
On Fri, 22 Mar 2019 at 07:57, Gunther <raj@gusw.net> wrote: > foo=# PREPARE testplan(int) AS > foo-# SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1; > PREPARE > foo=# EXPLAIN EXECUTE testplan(8934); > QUERY PLAN > -------------------------------------------------------------------------- > Index Only Scan using test_pk0 on test (cost=0.42..8.44 rows=1 width=4) > Index Cond: (id = 8934) > (2 rows) > > That's quite alright actually. Now the questions is, could we use this in a nested loop query plan? That's where I thinkit can't work: Not really. In that case, the parameters were replaced with the specified values (a.k.a custom plan). That happens for the first 5 executions of a prepared statement, and in this case likely the planner will continue to use the custom plan since the generic plan won't know that the partial index is okay to use and the plan costs would likely go up enough that the custom plan would continue to be favoured. > foo=# SET enable_mergejoin TO off; > SET > foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10; > QUERY PLAN > -------------------------------------------------------------------------------- > Limit (cost=0.00..102516.78 rows=10 width=8) > -> Nested Loop Left Join (cost=0.00..38238760.24 rows=3730 width=8) > Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2))) > -> Seq Scan on test2 a (cost=0.00..54.30 rows=3730 width=4) > -> Materialize (cost=0.00..9056.93 rows=388129 width=4) > -> Seq Scan on test b (cost=0.00..5599.29 rows=388129 width=4) > (6 rows) > > It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to the index query for the nested loop. Whether partial indexes can be used are not is determined using only quals that can be applied at the scan level. In this case your qual is a join qual, and since no other qual exists that can be evaluated at the scan level where the index can be used, then it's not considered. In any case, nothing there guarantees that one of your indexes will match all records. For it to work, both of you indexes would have to be scanned. It's not clear why you think that would be any better than scanning just one index. I imagine it would only ever be a win if you could eliminate one of the index scans with some qual that guarantees that the index can't contain any records matching your query. > I wonder if there was a way of marking such expressions as safe in the query, like suggesting a certain evaluation order,i.e., > > SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10; > > It's OK though. It just goes to show that in a case like this, it is best to just go with the partitioned table anyway. It sounds like you might want something like partition-wise join that exists in PG11. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services