Обсуждение: How to hint 2 coulms IS NOT DISTINCT FROM each other
Hi
I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' .
The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals.
Example:
CREATE TABLE a (
id INTEGER PRIMARY KEY,
key VARCHAR,
value VARCHAR
);
CREATE INDEX ON a (key);
INSERT INTO a
VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz');
CREATE VIEW view_a AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
);
CREATE VIEW view_a_eq AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key
);
CREATE VIEW view_a_distinct AS (
SELECT table_a.id,
table_a.key,
table_a.value,
table_b.key as b_key
FROM a AS table_a
JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key IS NOT DISTINCT FROM table_b.key
);
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.42 rows=4 width=100)
Hash Cond: (table_b.id = table_a.id)
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
We only get index scan on table_a
EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test';
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop (cost=8.36..25.53 rows=1 width=100)
Join Filter: (table_a.id = table_b.id)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
-> Materialize (cost=4.18..12.66 rows=4 width=36)
-> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
We get index scan on both tables and the where clause is pushed all the way down
EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test';
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=12.69..34.43 rows=1 width=100)
Hash Cond: (table_b.id = table_a.id)
Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text))
-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)
-> Hash (cost=12.64..12.64 rows=4 width=68)
-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)
Recheck Cond: ((key)::text = 'test'::text)
-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)
Index Cond: ((key)::text = 'test'::text)
Same as the first example
In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables.
It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls.
SELECT * FROM view_a_eq WHERE key IS NULL;
id | key | value | b_key
----+-----+-------+-------
(0 rows)
Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: > Hi > > I was wondering if there is a way to hint that two columns in two different > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if > table_a.key = 'test' THEN table_b.key = 'test' . > > The equals operator already does this but it does not handle NULLS very well > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and > doesn't establish the same inference rules as equals. The whole idea behing Postgres' query planner is that you don't have to use any hints. Late model versions of postgres handle nulls fine, but nulls are never "equal" to anything else. I.e. where xxx is null works with indexes. Where x=y does not, since null <> null. Suggestion for getting help, put a large-ish aka production sized amount of data into your db, run your queries with explain analyze and feed them to https://explain.depesz.com/ and post the links here along with the slow queries. A lot of times the fix is non-obvious if you're coming from another db with a different set of troubleshooting skills for slow queries.
Scott Marlowe <scott.marlowe@gmail.com> writes: > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >> I was wondering if there is a way to hint that two columns in two different >> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if >> table_a.key = 'test' THEN table_b.key = 'test' . >> >> The equals operator already does this but it does not handle NULLS very well >> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and >> doesn't establish the same inference rules as equals. > The whole idea behing Postgres' query planner is that you don't have > to use any hints. Late model versions of postgres handle nulls fine, > but nulls are never "equal" to anything else. I.e. where xxx is null > works with indexes. Where x=y does not, since null <> null. The bigger picture here is that if you've designed a data representation that requires that a null be considered "equal to" another null, you're really going to be fighting against the basic semantics of SQL. You'd be best off to rethink the representation. We've not seen enough info about your requirements to suggest just how, though. regards, tom lane
On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Scott Marlowe <scott.marlowe@gmail.com> writes: >> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >>> I was wondering if there is a way to hint that two columns in two different >>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if >>> table_a.key = 'test' THEN table_b.key = 'test' . >>> >>> The equals operator already does this but it does not handle NULLS very well >>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and >>> doesn't establish the same inference rules as equals. > >> The whole idea behing Postgres' query planner is that you don't have >> to use any hints. Late model versions of postgres handle nulls fine, >> but nulls are never "equal" to anything else. I.e. where xxx is null >> works with indexes. Where x=y does not, since null <> null. > > The bigger picture here is that if you've designed a data representation > that requires that a null be considered "equal to" another null, you're > really going to be fighting against the basic semantics of SQL. You'd > be best off to rethink the representation. We've not seen enough info > about your requirements to suggest just how, though. Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be optimized"? It ought to be, at least in some cases. Internally indexes handle nulls so you should be able to implement them to satisfy those kinds of scans. I guess that's an easy thing to say though. merlin
On Fri, Oct 28, 2016 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Scott Marlowe <scott.marlowe@gmail.com> writes: >>> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >>>> I was wondering if there is a way to hint that two columns in two different >>>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if >>>> table_a.key = 'test' THEN table_b.key = 'test' . >>>> >>>> The equals operator already does this but it does not handle NULLS very well >>>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and >>>> doesn't establish the same inference rules as equals. >> >>> The whole idea behing Postgres' query planner is that you don't have >>> to use any hints. Late model versions of postgres handle nulls fine, >>> but nulls are never "equal" to anything else. I.e. where xxx is null >>> works with indexes. Where x=y does not, since null <> null. >> >> The bigger picture here is that if you've designed a data representation >> that requires that a null be considered "equal to" another null, you're >> really going to be fighting against the basic semantics of SQL. You'd >> be best off to rethink the representation. We've not seen enough info >> about your requirements to suggest just how, though. > > Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be > optimized"? It ought to be, at least in some cases. Internally > indexes handle nulls so you should be able to implement them to > satisfy those kinds of scans. I guess that's an easy thing to say > though. hm. ISTM "a IS NOT DISTINCT FROM b"...could be rewritten into ((a IS NULL AND b IS NULL) OR (a = b)) ... which is an indexable expression. merlin
On 10/28/2016 9:29 AM, Kim Rose Carlsen wrote:
JOIN a AS table_bON table_a.id = table_b.idAND table_a.key = table_b.key
Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.
-- john r pierce, recycling bits in santa cruz
> The bigger picture here is that if you've designed a data representation
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL. You'd
> be best off to rethink the representation. We've not seen enough info
> about your requirements to suggest just how, though.
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL. You'd
> be best off to rethink the representation. We've not seen enough info
> about your requirements to suggest just how, though.
Sometimes I do wake up in the night scared and afraid that I have used NULLs the wrong way
my whole life. I usually use NULLs to denote a value has not been provided. In my dreams I fear
I should have used empty string instead but if that is true, then I no longer know what is right and
wrong :).
In this specific case, its a resource management table. If a resource is not allocated to any
customers, then customer_id is set to null else the customer_id is set.
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, October 28, 2016 8:17:01 PM
To: Scott Marlowe
Cc: Kim Rose Carlsen; pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
Sent: Friday, October 28, 2016 8:17:01 PM
To: Scott Marlowe
Cc: Kim Rose Carlsen; pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.
> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.
The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
regards, tom lane
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.
> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.
The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.
regards, tom lane
>> JOIN a AS table_b
>> ON table_a.id = table_b.id
>> AND table_a.key = table_b.key
> Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.
Its not as much as for using the index, but to be able to push the where clause inside both JOINED tables. The = operator already does this. It gives the planner the option to join the table in using either id = id or key = key. It can deduce that if I have a WHERE condition with table_a.key = 'Something', then table_b.key must also be 'Something'. It can then decide to filter table_b on key. When using IS NOT DISTINCT FROM, the planner is not considering the same options. Now its like it doesn't know table_a.key is same the same as table_b.key.
I would somehow expect the IS NOT DISTINCT FROM operator to do the same. As it establish the same rules.. If a = 'test' and a IS NOT DISTINCT FROM b then b = 'test' also
-
Kim Carlsen
> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> > Hi
> >
> > I was wondering if there is a way to hint that two columns in two different
> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> > table_a.key = 'test' THEN table_b.key = 'test' .
> >
> > The equals operator already does this but it does not handle NULLS very well
> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> > doesn't establish the same inference rules as equals.
>
> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.
>
> Suggestion for getting help, put a large-ish aka production sized
> amount of data into your db, run your queries with explain analyze and
> feed them to https://explain.depesz.com/ and post the links here along
> with the slow queries. A lot of times the fix is non-obvious if you're
> coming from another db with a different set of troubleshooting skills
> for slow queries.
> >
> > I was wondering if there is a way to hint that two columns in two different
> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> > table_a.key = 'test' THEN table_b.key = 'test' .
> >
> > The equals operator already does this but it does not handle NULLS very well
> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> > doesn't establish the same inference rules as equals.
>
> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.
>
> Suggestion for getting help, put a large-ish aka production sized
> amount of data into your db, run your queries with explain analyze and
> feed them to https://explain.depesz.com/ and post the links here along
> with the slow queries. A lot of times the fix is non-obvious if you're
> coming from another db with a different set of troubleshooting skills
> for slow queries.
The problem is how to reduce the problem into its core, without introducing
all the unnecessary.
Maybe simplifying the problem, also makes it impossible to say where I go
wrong. It might be that I try to push too much logic into the SQL layer
and Im adding too many layers of abstraction to accomplish what I want.
So let me try and elaborate a little more.
I have couple a tables describing resources (circuits) and allocation
of resources to customers and products.
First layer is a view called view_circuit. This view (left) join any table
the circuit table reference through a foreign key (it gives exactly the same
rows and columns as circuit table + some extra information like customer_id).
Second layer is 2 views
1) a view describing if the circuit is active or inactive, lets call it
view_circuit_product_main
2) a view describing line_speed about the circuit, lets call it
view_circuit_product
These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))
if this has any relevance.
Third layer
Next step is to add a view that tells both (joins the two views together
on circuit_id). lets call the new view view_circuit_with_status
This view is defined as
CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);
SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
Since customer_id is exposed through view_circuit the planner assumes view_circuit.customer_id = 1 and from there attempts to join
view_circuit_product_main and view_circuit_product using circuit_id.
This is not running optimal.
However if we change our query to allow the inference rule to take place, the query is executed very fast.
SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND s_customer_id = 1 AND p_customer_id = 1;
If a circuit is not assigned to any customers customer_id is set to NULL. This is the reason I can't use = operator. If I do use = then I can't find circuit which are unassigned, but the query do run effective.
I can see this still ends up being quite abstract, but the point is it would be quite beneficial if IS NOT DISTINCT used the same rules as = operator.
I have attached the 2 query plans
Bad plan: https://explain.depesz.com/s/SZN
Good plan: https://explain.depesz.com/s/61Ro
-
Kim Carlsen
Do you use potatoes for long posts here?
On Fri, Oct 28, 2016 at 2:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote: >> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: > >> > Hi >> > >> > I was wondering if there is a way to hint that two columns in two >> > different >> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume >> > if >> > table_a.key = 'test' THEN table_b.key = 'test' . >> > >> > The equals operator already does this but it does not handle NULLS very >> > well >> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and >> > doesn't establish the same inference rules as equals. >> >> The whole idea behing Postgres' query planner is that you don't have >> to use any hints. Late model versions of postgres handle nulls fine, >> but nulls are never "equal" to anything else. I.e. where xxx is null >> works with indexes. Where x=y does not, since null <> null. >> >> Suggestion for getting help, put a large-ish aka production sized >> amount of data into your db, run your queries with explain analyze and >> feed them to https://explain.depesz.com/ and post the links here along >> with the slow queries. A lot of times the fix is non-obvious if you're >> coming from another db with a different set of troubleshooting skills >> for slow queries. > > The problem is how to reduce the problem into its core, without introducing > all the unnecessary. > > Maybe simplifying the problem, also makes it impossible to say where I go > wrong. It might be that I try to push too much logic into the SQL layer > and Im adding too many layers of abstraction to accomplish what I want. > So let me try and elaborate a little more. > > I have couple a tables describing resources (circuits) and allocation > of resources to customers and products. > > First layer is a view called view_circuit. This view (left) join any table > the circuit table reference through a foreign key (it gives exactly the same > rows and columns as circuit table + some extra information like > customer_id). > > Second layer is 2 views > 1) a view describing if the circuit is active or inactive, lets call it > view_circuit_product_main > 2) a view describing line_speed about the circuit, lets call it > view_circuit_product > > These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...)) > if this has any relevance. > > Third layer > Next step is to add a view that tells both (joins the two views together > on circuit_id). lets call the new view view_circuit_with_status > > This view is defined as > > CREATE VIEW view_circuit_with_status AS ( > SELECT r.*, > s.circuit_status, > s.customer_id AS s_customer_id, > p.line_speed, > p.customer_id AS p_customer_id > FROM view_circuit r > JOIN view_circuit_product_main s > ON r.circuit_id = s.circuit_id > AND r.customer_id IS NOT DISTINCT FROM s.customer_id > JOIN view_circuit_product p > ON r.circuit_id = p.circuit_id > AND r.customer_id IS NOT DISTINCT FROM s.customer_id > ); > > SELECT * FROM view_circuit_with_status WHERE customer_id = 1; > > Since customer_id is exposed through view_circuit the planner assumes > view_circuit.customer_id = 1 and from there attempts to join > view_circuit_product_main and view_circuit_product using circuit_id. > This is not running optimal. > > However if we change our query to allow the inference rule to take place, > the query is executed very fast. > > SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND > s_customer_id = 1 AND p_customer_id = 1; > > If a circuit is not assigned to any customers customer_id is set to NULL. > This is the reason I can't use = operator. If I do use = then I can't find > circuit which are unassigned, but the query do run effective. > > I can see this still ends up being quite abstract, but the point is it would > be quite beneficial if IS NOT DISTINCT used the same rules as = operator. > > I have attached the 2 query plans > > Bad plan: https://explain.depesz.com/s/SZN > Good plan: https://explain.depesz.com/s/61Ro try this :-D create or replace function indf(anyelement, anyelement) returns anyelement as $$ select $1 = $2 or ($1 is null and $2 is null); $$ language sql; CREATE VIEW view_circuit_with_status AS ( SELECT r.*, s.circuit_status, s.customer_id AS s_customer_id, p.line_speed, p.customer_id AS p_customer_id FROM view_circuit r JOIN view_circuit_product_main s ON r.circuit_id = s.circuit_id AND indf(r.customer_id, s.customer_id) JOIN view_circuit_product p ON r.circuit_id = p.circuit_id AND indf(r.customer_id, s.customer_id) merlin
CREATE VIEW view_circuit_with_status AS (SELECT r.*,s.circuit_status,s.customer_id AS s_customer_id,p.line_speed,p.customer_id AS p_customer_idFROM view_circuit rJOIN view_circuit_product_main s ON r.circuit_id = s.circuit_idAND r.customer_id IS NOT DISTINCT FROM s.customer_idJOIN view_circuit_product pON r.circuit_id = p.circuit_idAND r.customer_id IS NOT DISTINCT FROM s.customer_id);SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
A couple of random thoughts:
1.
TABLE circuit (circuit_id)
TABLE customer (customer_id)
TABLE circuit_customer (circuit_id, customer_id)
2.
INSERT INTO customer (0, 'Customer number standing in for customer not assigned');
While IS DISTINCT FROM could very well use some more love I surmise the reason it hasn't gotten it is because most data model set things up so that it never needs to be used.
David J.
> try this :-D
> create or replace function indf(anyelement, anyelement) returns anyelement as
> $$
> select $1 = $2 or ($1 is null and $2 is null);
> $$ language sql;
>
> CREATE VIEW view_circuit_with_status AS (
> SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND indf(r.customer_id, s.customer_id)
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id
> AND indf(r.customer_id, s.customer_id)
>
> merlin
> $$
> select $1 = $2 or ($1 is null and $2 is null);
> $$ language sql;
>
> CREATE VIEW view_circuit_with_status AS (
> SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND indf(r.customer_id, s.customer_id)
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id
> AND indf(r.customer_id, s.customer_id)
>
> merlin
This doesn't do much good. This doesn't tell the planner that the 3 customer_ids are actually of same value, and it therefore can't filter them as it sees fit.
On Sat, Oct 29, 2016 at 6:55 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >> try this :-D > >> create or replace function indf(anyelement, anyelement) returns anyelement >> as >> $$ >> select $1 = $2 or ($1 is null and $2 is null); >> $$ language sql; >> >> CREATE VIEW view_circuit_with_status AS ( >> SELECT r.*, >> s.circuit_status, >> s.customer_id AS s_customer_id, >> p.line_speed, >> p.customer_id AS p_customer_id >> FROM view_circuit r >> JOIN view_circuit_product_main s >> ON r.circuit_id = s.circuit_id >> AND indf(r.customer_id, s.customer_id) >> JOIN view_circuit_product p >> ON r.circuit_id = p.circuit_id >> AND indf(r.customer_id, s.customer_id) >> >> merlin > > This doesn't do much good. This doesn't tell the planner that the 3 > customer_ids are actually of same value, and it therefore can't filter them > as it sees fit. You do know you can index on a function, and the planner then keeps stats on it when you run analyze right?
>> This doesn't do much good. This doesn't tell the planner that the 3
>> customer_ids are actually of same value, and it therefore can't filter them
>> as it sees fit.
> You do know you can index on a function, and the planner then keeps
> stats on it when you run analyze right?
>> as it sees fit.
> You do know you can index on a function, and the planner then keeps
> stats on it when you run analyze right?
Yes, but I don't think it will make any difference. I don't think I can solve this with
an index lookup. I think my savior is the inference that the 2 columns are of
same value and the planner are free to choose which order to do the filter and join
with this extra information.
I have tried creating a function called
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains customer_id. The only thing I get from is the planner now only knows how to compare customer_id, but it still doesn't know that they are of same value, only I know that and I want to declare it for the planner.
I could probably rewrite the whole view in one query, and then fix it with a proper index. But I think I will loose alot of readability.
I could also change the structure to save an explicit state, instead of a calculated state. But then I get some redundancy I need to make sure always stays the same.
In the end one of these will probably be the solution.
I guess the question is more or less,
why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are there any alternatives? And a plausible use case for when it would be useful.
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >>> This doesn't do much good. This doesn't tell the planner that the 3 > >>> customer_ids are actually of same value, and it therefore can't filter >>> them >>> as it sees fit. > >> You do know you can index on a function, and the planner then keeps >> stats on it when you run analyze right? > > Yes, but I don't think it will make any difference. I don't think I can > solve this with > an index lookup. I think my savior is the inference that the 2 columns are > of > same value and the planner are free to choose which order to do the filter > and join > with this extra information. > > I have tried creating a function called > zero_if_null(int) : int that just select COALESCE($1, 0) > and adding a index on (zero_if_null(customer_id)) on table that contains > customer_id. The only thing I get from is the planner now only knows how to > compare customer_id, but it still doesn't know that they are of same value, > only I know that and I want to declare it for the planner. > > I could probably rewrite the whole view in one query, and then fix it with a > proper index. But I think I will loose alot of readability. > > I could also change the structure to save an explicit state, instead of a > calculated state. But then I get some redundancy I need to make sure always > stays the same. > > In the end one of these will probably be the solution. > > I guess the question is more or less, > > why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are > there any alternatives? And a plausible use case for when it would be > useful. Well, the *behavior* is mandated by the sql standard. Our implementation is slow however. I'm surprised the attached function didn't help, it can be inlined and I was able to get bitmap or which is pretty good. As I said upthread I think INDF could theoretically run as fast as equality -- it just doesn't today. As your joins are written you could probably convert this by reserving a customer_id to the work that you're trying to do with null, say, 0, or -1. This is a somewhat dubious practice but seems a better fit for your use case. I don't think INDF is good in this usage. merlin
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> > I have tried creating a function called
> > zero_if_null(int) : int that just select COALESCE($1, 0)
> > and adding a index on (zero_if_null(customer_id)) on table that contains
> > customer_id. The only thing I get from is the planner now only knows how to
> > compare customer_id, but it still doesn't know that they are of same value,
> > only I know that and I want to declare it for the planner.
> > zero_if_null(int) : int that just select COALESCE($1, 0)
> > and adding a index on (zero_if_null(customer_id)) on table that contains
> > customer_id. The only thing I get from is the planner now only knows how to
> > compare customer_id, but it still doesn't know that they are of same value,
> > only I know that and I want to declare it for the planner.
> Well, the *behavior* is mandated by the sql standard. Our
> implementation is slow however.
Sorry I'm not following, what behavior is mandated by the sql standard?
> I'm surprised the attached function
> didn't help, it can be inlined and I was able to get bitmap or which> is pretty good. As I said upthread I think INDF could theoretically
> run as fast as equality -- it just doesn't today.
It might be harsh to say that it doesn't help at all. I does half the running time,
but I need it to run an order of magnitude faster. Here is the plan with the
empty_if_null (customer_id is actually varchar)
https://explain.depesz.com/s/M1LV with empty_if_null + functional index
https://explain.depesz.com/s/eOL with indf
> As your joins are written you could probably convert this by reserving
> a customer_id to the work that you're trying to do with null, say, 0,
> or -1. This is a somewhat dubious practice but seems a better fit for
> your use case. I don't think INDF is good in this usage.
>
> merlin
> a customer_id to the work that you're trying to do with null, say, 0,
> or -1. This is a somewhat dubious practice but seems a better fit for
> your use case. I don't think INDF is good in this usage.
>
> merlin
This will work well, I think.
But I'm not sure I can mentally accept an unfilled value should not be
null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
work well with this.
It might raise another problem, that the nulls are generated through LEFT
JOINS where now rows are defined. Then the 0 or -1 value need to be
a computed value. Won't this throw of index lookups? (I might be
more confused in this area).
On Mon, Oct 31, 2016 at 9:28 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: > On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: > >> > I have tried creating a function called >> > zero_if_null(int) : int that just select COALESCE($1, 0) >> > and adding a index on (zero_if_null(customer_id)) on table that contains >> > customer_id. The only thing I get from is the planner now only knows how >> > to >> > compare customer_id, but it still doesn't know that they are of same >> > value, >> > only I know that and I want to declare it for the planner. > > >> Well, the *behavior* is mandated by the sql standard. Our >> implementation is slow however. > > Sorry I'm not following, what behavior is mandated by the sql standard? The semantics of IS DISTINCT FROM, basically, equality with special consideration for nulls. > But I'm not sure I can mentally accept an unfilled value should not be > null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will > work well with this. Yeah, this is a dubious tactic and I would normally only consider using it for surrogate identifiers. > It might raise another problem, that the nulls are generated through LEFT > JOINS where now rows are defined. Then the 0 or -1 value need to be > a computed value. Won't this throw of index lookups? (I might be > more confused in this area). Not following this. BTW, if you want a fast plan over the current data without consideration of aesthetics, try this: CREATE VIEW view_circuit_with_status AS ( SELECT r.*, s.circuit_status, s.customer_id AS s_customer_id, p.line_speed, p.customer_id AS p_customer_id FROM view_circuit r JOIN view_circuit_product_main s ON r.circuit_id = s.circuit_id AND r.customer_id, s.customer_id JOIN view_circuit_product p ON r.circuit_id = p.circuit_id AND r.customer_id, s.customer_id UNION ALL SELECT r.*, s.circuit_status, s.customer_id AS s_customer_id, p.line_speed, p.customer_id AS p_customer_id FROM view_circuit r JOIN view_circuit_product_main s ON r.circuit_id = s.circuit_id AND r.customer_id IS NULL AND s.customer_id IS NULL JOIN view_circuit_product p ON r.circuit_id = p.circuit_id
>> It might raise another problem, that the nulls are generated through LEFT
>> JOINS where no rows are defined. Then the 0 or -1 value need to be
>> a computed value. Won't this throw off index lookups? (I might be
>> more confused in this area).
>
>Not following this.
>> a computed value. Won't this throw off index lookups? (I might be
>> more confused in this area).
>
>Not following this.
The nulls are generated by something like this
SELECT c.circuit_id,
cc.customer_id
FROM circuit AS c
LEFT JOIN circuit_customer AS cc
ON c.circuit_id = cc.circuit_id
To make a magic '0' customer we would be required to use
COALESCE(cc.customer_id, '0')
I dont think the optimizer will do anything clever with the '0' we have
computed from null.
I could ofc. by default assign all unassigned circuits to '0' in
circuit_customer. I'm not a fan though.
>BTW, if you want a fast plan over the current
>data without consideration of aesthetics, try this:
>
>CREATE VIEW view_circuit_with_status AS (
> SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND r.customer_id, s.customer_id
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id
> AND r.customer_id, s.customer_id
> UNION ALL SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND r.customer_id IS NULL
> AND s.customer_id IS NULL
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id>
>data without consideration of aesthetics, try this:
>
>CREATE VIEW view_circuit_with_status AS (
> SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND r.customer_id, s.customer_id
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id
> AND r.customer_id, s.customer_id
> UNION ALL SELECT r.*,
> s.circuit_status,
> s.customer_id AS s_customer_id,
> p.line_speed,
> p.customer_id AS p_customer_id
> FROM view_circuit r
> JOIN view_circuit_product_main s
> ON r.circuit_id = s.circuit_id
> AND r.customer_id IS NULL
> AND s.customer_id IS NULL
> JOIN view_circuit_product p
> ON r.circuit_id = p.circuit_id>
I will have to figure something out, but this specific case is still problematic
since we would like to filter this view using different criteria's, like circuit_no,
products or customers.
But with all these detours, I assume that a change to IS NOT DISTINCT FROM,
is difficult or not wanted?
On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >>> It might raise another problem, that the nulls are generated through LEFT > >>> JOINS where no rows are defined. Then the 0 or -1 value need to be >>> a computed value. Won't this throw off index lookups? (I might be >>> more confused in this area). >> >>Not following this. > > The nulls are generated by something like this > SELECT c.circuit_id, > cc.customer_id > FROM circuit AS c > LEFT JOIN circuit_customer AS cc > ON c.circuit_id = cc.circuit_id > > To make a magic '0' customer we would be required to use > COALESCE(cc.customer_id, '0') > I dont think the optimizer will do anything clever with the '0' we have > computed from null. It would if you explicitly indexed it as such; CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0')); > I could ofc. by default assign all unassigned circuits to '0' in > circuit_customer. I'm not a fan though. hm, why not? null generally means 'unknown' and that's why it fails any equality test. >>BTW, if you want a fast plan over the current >>data without consideration of aesthetics, try this: >> >>CREATE VIEW view_circuit_with_status AS ( >> SELECT r.*, >> s.circuit_status, >> s.customer_id AS s_customer_id, >> p.line_speed, >> p.customer_id AS p_customer_id >> FROM view_circuit r >> JOIN view_circuit_product_main s >> ON r.circuit_id = s.circuit_id >> AND r.customer_id, s.customer_id >> JOIN view_circuit_product p >> ON r.circuit_id = p.circuit_id >> AND r.customer_id, s.customer_id >> UNION ALL SELECT r.*, >> s.circuit_status, >> s.customer_id AS s_customer_id, >> p.line_speed, >> p.customer_id AS p_customer_id >> FROM view_circuit r >> JOIN view_circuit_product_main s >> ON r.circuit_id = s.circuit_id >> AND r.customer_id IS NULL >> AND s.customer_id IS NULL >> JOIN view_circuit_product p >> ON r.circuit_id = p.circuit_id> > > I will have to figure something out, but this specific case is still > problematic > since we would like to filter this view using different criteria's, like > circuit_no, > products or customers. the above is logically equivalent to IS NOT DISTINCT FROM; you should be able to query it as you would have done the original view. > But with all these detours, I assume that a change to IS NOT DISTINCT FROM, > is difficult or not wanted? Well, not exactly. In your case you are trying to treat null as a specific value and pass it through join operations. TBH, this is a pretty dubious approach: null is not supposed to be equal to anything and any join vs null should come up empty -- logically at least. INDF works around this of course but it's not a recommended approach (my usage is generally restricted to, "has this value changed since yesterday? etc"). I'm not an expert backend structures for indexing and optimization but I know enough to suspect that optimizing INDF might cause implementation headaches in various places, as do other irregular syntactical approaches in SQL. I think minimally optimizing INDF would require converting it to an operator on par with '=' which is a pretty large infrastructure change for an edge optimization case. The fact that there are solid optimization strategies already on the table (UNION ALL, expr index COALESCE()) does not help. merlin
On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote: > On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >> The nulls are generated by something like this >> SELECT c.circuit_id, >> cc.customer_id >> FROM circuit AS c >> LEFT JOIN circuit_customer AS cc >> ON c.circuit_id = cc.circuit_id >> >> To make a magic '0' customer we would be required to use >> COALESCE(cc.customer_id, '0') >> I dont think the optimizer will do anything clever with the '0' we have >> computed from null. > > It would if you explicitly indexed it as such; > CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0')); Merlin, it's a LEFT JOIN. There probably are no NULLs in the circuit_customer.customer_id column, so that COALESCE isn't going to achieve anything at all. I haven't been following this particular discussion in detail, so unfortunately I can't contribute more than that remark at the moment. -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest.
On Fri, Nov 4, 2016 at 9:38 AM, Alban Hertroys <haramrae@gmail.com> wrote: > On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote: >>> The nulls are generated by something like this >>> SELECT c.circuit_id, >>> cc.customer_id >>> FROM circuit AS c >>> LEFT JOIN circuit_customer AS cc >>> ON c.circuit_id = cc.circuit_id >>> >>> To make a magic '0' customer we would be required to use >>> COALESCE(cc.customer_id, '0') >>> I dont think the optimizer will do anything clever with the '0' we have >>> computed from null. >> >> It would if you explicitly indexed it as such; >> CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0')); > > Merlin, it's a LEFT JOIN. There probably are no NULLs in the > circuit_customer.customer_id column, so that COALESCE isn't going to > achieve anything at all. Hang on -- upthread the context was inner join, and the gripe was join fast with '=', slow with INDF. When he said the nulls were 'generated', I didn't follow that they were part of the original query. If the nulls are generated along with the query, sure, an index won't help. I maintain my earlier point; with respect to the original query, to get from performance of INDF to =, you have three options: a) expr index the nulls (assuming they are physically stored) b) convert to ((a = b) or a is null and b is null) which can help with a bitmap or plan c) covert to union all equivalent of "b" merlin
> Hang on -- upthread the context was inner join, and the gripe was join
> fast with '=', slow with INDF. When he said the nulls were
> 'generated', I didn't follow that they were part of the original
> query. If the nulls are generated along with the query, sure, an
> index won't help.
>
> I maintain my earlier point; with respect to the original query, to
> get from performance of INDF to =, you have three options:
> a) expr index the nulls (assuming they are physically stored)
> b) convert to ((a = b) or a is null and b is null) which can help with
> a bitmap or plan
> c) covert to union all equivalent of "b"
>
> merlin
> 'generated', I didn't follow that they were part of the original
> query. If the nulls are generated along with the query, sure, an
> index won't help.
>
> I maintain my earlier point; with respect to the original query, to
> get from performance of INDF to =, you have three options:
> a) expr index the nulls (assuming they are physically stored)
> b) convert to ((a = b) or a is null and b is null) which can help with
> a bitmap or plan
> c) covert to union all equivalent of "b"
>
> merlin
a) and b) would be workaround that would run an order of magnitude slower. The query
starts with a full table scan of a large table. If the planner had started elsewhere it could
have reduced the result to 1-2 rows from the start. It won't choose this plan without the help
from =.
c) could be a acceptable workaround, but it would clutter up if you would want more
than one column to be IS NOT DISTINCT FROM. You end up with 2^n unions to simulate
IS NOT DISTINCT FROM.
Without knowing the work required, I will still argue that having IS NOT DISTINCT FROM
use the same transitive rules as equality, would be a better approach.
With fear of talking about things I know little(nothing) of, I think the description of EquivalenceClasses
in postgres/src/backend/optimizer/README, should be extended to also include EquivalenceClasses
of IS NOT DISTINCT FROM.