Обсуждение: How to hint 2 coulms IS NOT DISTINCT FROM each other

Поиск

Список

Период

Сортировка

How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

28 октября 2016 г., 19:29:34

I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' .

The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals.

Example:

CREATE TABLE a (

id INTEGER PRIMARY KEY,

key VARCHAR,

value VARCHAR

);

CREATE INDEX ON a (key);

INSERT INTO a

VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz');

CREATE VIEW view_a AS (

SELECT table_a.id,

table_a.key,

table_a.value,

table_b.key as b_key

FROM a AS table_a

JOIN a AS table_b

ON table_a.id = table_b.id

);

CREATE VIEW view_a_eq AS (

SELECT table_a.id,

table_a.key,

table_a.value,

table_b.key as b_key

FROM a AS table_a

JOIN a AS table_b

ON table_a.id = table_b.id

AND table_a.key = table_b.key

);

CREATE VIEW view_a_distinct AS (

SELECT table_a.id,

table_a.key,

table_a.value,

table_b.key as b_key

FROM a AS table_a

JOIN a AS table_b

ON table_a.id = table_b.id

AND table_a.key IS NOT DISTINCT FROM table_b.key

);

EXPLAIN SELECT * FROM view_a WHERE key = 'test';

QUERY PLAN

------------------------------------------------------------------------------------

Hash Join (cost=12.69..34.42 rows=4 width=100)

Hash Cond: (table_b.id = table_a.id)

-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)

-> Hash (cost=12.64..12.64 rows=4 width=68)

-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)

Recheck Cond: ((key)::text = 'test'::text)

-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)

Index Cond: ((key)::text = 'test'::text)

We only get index scan on table_a

EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test';

QUERY PLAN

------------------------------------------------------------------------------------

Nested Loop (cost=8.36..25.53 rows=1 width=100)

Join Filter: (table_a.id = table_b.id)

-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)

Recheck Cond: ((key)::text = 'test'::text)

-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)

Index Cond: ((key)::text = 'test'::text)

-> Materialize (cost=4.18..12.66 rows=4 width=36)

-> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36)

Recheck Cond: ((key)::text = 'test'::text)

-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)

Index Cond: ((key)::text = 'test'::text)

We get index scan on both tables and the where clause is pushed all the way down

EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test';

QUERY PLAN

------------------------------------------------------------------------------------

Hash Join (cost=12.69..34.43 rows=1 width=100)

Hash Cond: (table_b.id = table_a.id)

Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text))

-> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36)

-> Hash (cost=12.64..12.64 rows=4 width=68)

-> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68)

Recheck Cond: ((key)::text = 'test'::text)

-> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0)

Index Cond: ((key)::text = 'test'::text)

Same as the first example

In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables.

It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls.

SELECT * FROM view_a_eq WHERE key IS NULL;

id | key | value | b_key

----+-----+-------+-------

(0 rows)

Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Scott Marlowe

Дата:

28 октября 2016 г., 20:36:50

On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> Hi
>
> I was wondering if there is a way to hint that two columns in two different
> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> table_a.key = 'test' THEN table_b.key = 'test' .
>
> The equals operator already does this but it does not handle NULLS very well
> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> doesn't establish the same inference rules as equals.

The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.

Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
for slow queries.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Tom Lane

Дата:

28 октября 2016 г., 21:17:10

Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.

> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.

The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL.  You'd
be best off to rethink the representation.  We've not seen enough info
about your requirements to suggest just how, though.

            regards, tom lane

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

28 октября 2016 г., 21:20:22

On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> I was wondering if there is a way to hint that two columns in two different
>>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>>> table_a.key = 'test' THEN table_b.key = 'test' .
>>>
>>> The equals operator already does this but it does not handle NULLS very well
>>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>>> doesn't establish the same inference rules as equals.
>
>> The whole idea behing Postgres' query planner is that you don't have
>> to use any hints. Late model versions of postgres handle nulls fine,
>> but nulls are never "equal" to anything else. I.e. where xxx is null
>> works with indexes. Where x=y does not, since null <> null.
>
> The bigger picture here is that if you've designed a data representation
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL.  You'd
> be best off to rethink the representation.  We've not seen enough info
> about your requirements to suggest just how, though.

Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
optimized"?  It ought to be, at least in some cases. Internally
indexes handle nulls so you should be able to implement them to
satisfy those kinds of scans.  I guess that's an easy thing to say
though.

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

28 октября 2016 г., 21:28:08

On Fri, Oct 28, 2016 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>>> I was wondering if there is a way to hint that two columns in two different
>>>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>>>> table_a.key = 'test' THEN table_b.key = 'test' .
>>>>
>>>> The equals operator already does this but it does not handle NULLS very well
>>>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>>>> doesn't establish the same inference rules as equals.
>>
>>> The whole idea behing Postgres' query planner is that you don't have
>>> to use any hints. Late model versions of postgres handle nulls fine,
>>> but nulls are never "equal" to anything else. I.e. where xxx is null
>>> works with indexes. Where x=y does not, since null <> null.
>>
>> The bigger picture here is that if you've designed a data representation
>> that requires that a null be considered "equal to" another null, you're
>> really going to be fighting against the basic semantics of SQL.  You'd
>> be best off to rethink the representation.  We've not seen enough info
>> about your requirements to suggest just how, though.
>
> Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
> optimized"?  It ought to be, at least in some cases. Internally
> indexes handle nulls so you should be able to implement them to
> satisfy those kinds of scans.  I guess that's an easy thing to say
> though.

hm. ISTM "a IS NOT DISTINCT FROM b"...could be rewritten into

((a IS NULL AND b IS NULL) OR (a = b)) ... which is an indexable expression.

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

John R Pierce

Дата:

28 октября 2016 г., 21:28:42

On 10/28/2016 9:29 AM, Kim Rose Carlsen wrote:

JOIN a AS table_b
ON table_a.id = table_b.id
AND table_a.key = table_b.key

Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.

-- 
john r pierce, recycling bits in santa cruz

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

28 октября 2016 г., 21:39:36

> The bigger picture here is that if you've designed a data representation
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL. You'd
> be best off to rethink the representation. We've not seen enough info
> about your requirements to suggest just how, though.

Sometimes I do wake up in the night scared and afraid that I have used NULLs the wrong way

my whole life. I usually use NULLs to denote a value has not been provided. In my dreams I fear

I should have used empty string instead but if that is true, then I no longer know what is right and

wrong :).

In this specific case, its a resource management table. If a resource is not allocated to any

customers, then customer_id is set to null else the customer_id is set.

From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, October 28, 2016 8:17:01 PM
To: Scott Marlowe
Cc: Kim Rose Carlsen; pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other

Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.

> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.

The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL. You'd
be best off to rethink the representation. We've not seen enough info
about your requirements to suggest just how, though.

regards, tom lane

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

28 октября 2016 г., 21:53:56

>> JOIN a AS table_b

>> ON table_a.id = table_b.id

>> AND table_a.key = table_b.key

> Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.

Its not as much as for using the index, but to be able to push the where clause inside both JOINED tables.

The = operator already does this. It gives the planner the option to join the table in using either id = id or key = key. It can deduce that if I have a WHERE condition with table_a.key = 'Something', then table_b.key must also be 'Something'. It can then decide to filter table_b on key. When using IS NOT DISTINCT FROM, the planner is not considering the same options. Now its like it doesn't know table_a.key is same the same as table_b.key.

I would somehow expect the IS NOT DISTINCT FROM operator to do the same. As it establish the same rules.. If a = 'test' and a IS NOT DISTINCT FROM b then b = 'test' also

Kim Carlsen

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

28 октября 2016 г., 22:54:25

> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:

> > Hi
> >
> > I was wondering if there is a way to hint that two columns in two different
> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> > table_a.key = 'test' THEN table_b.key = 'test' .
> >
> > The equals operator already does this but it does not handle NULLS very well
> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> > doesn't establish the same inference rules as equals.
>
> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.
>
> Suggestion for getting help, put a large-ish aka production sized
> amount of data into your db, run your queries with explain analyze and
> feed them to https://explain.depesz.com/ and post the links here along
> with the slow queries. A lot of times the fix is non-obvious if you're
> coming from another db with a different set of troubleshooting skills
> for slow queries.

The problem is how to reduce the problem into its core, without introducing

all the unnecessary.

Maybe simplifying the problem, also makes it impossible to say where I go

wrong. It might be that I try to push too much logic into the SQL layer

and Im adding too many layers of abstraction to accomplish what I want.

So let me try and elaborate a little more.

I have couple a tables describing resources (circuits) and allocation

of resources to customers and products.

First layer is a view called view_circuit. This view (left) join any table

the circuit table reference through a foreign key (it gives exactly the same

rows and columns as circuit table + some extra information like customer_id).

Second layer is 2 views

1) a view describing if the circuit is active or inactive, lets call it

view_circuit_product_main

2) a view describing line_speed about the circuit, lets call it

view_circuit_product

These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))

if this has any relevance.

Third layer

Next step is to add a view that tells both (joins the two views together

on circuit_id). lets call the new view view_circuit_with_status

This view is defined as

CREATE VIEW view_circuit_with_status AS (

SELECT r.*,

s.circuit_status,

s.customer_id AS s_customer_id,

p.line_speed,

p.customer_id AS p_customer_id

FROM view_circuit r

JOIN view_circuit_product_main s

ON r.circuit_id = s.circuit_id

AND r.customer_id IS NOT DISTINCT FROM s.customer_id

JOIN view_circuit_product p

ON r.circuit_id = p.circuit_id

AND r.customer_id IS NOT DISTINCT FROM s.customer_id

);

SELECT * FROM view_circuit_with_status WHERE customer_id = 1;

Since customer_id is exposed through view_circuit the planner assumes view_circuit.customer_id = 1 and from there attempts to join

view_circuit_product_main and view_circuit_product using circuit_id.

This is not running optimal.

However if we change our query to allow the inference rule to take place, the query is executed very fast.

SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND s_customer_id = 1 AND p_customer_id = 1;

If a circuit is not assigned to any customers customer_id is set to NULL. This is the reason I can't use = operator. If I do use = then I can't find circuit which are unassigned, but the query do run effective.

I can see this still ends up being quite abstract, but the point is it would be quite beneficial if IS NOT DISTINCT used the same rules as = operator.

I have attached the 2 query plans

Bad plan: https://explain.depesz.com/s/SZN

Good plan: https://explain.depesz.com/s/61Ro

Kim Carlsen

Do you use potatoes for long posts here?

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

28 октября 2016 г., 23:05:27

On Fri, Oct 28, 2016 at 2:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>
>> > Hi
>> >
>> > I was wondering if there is a way to hint that two columns in two
>> > different
>> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume
>> > if
>> > table_a.key = 'test' THEN table_b.key = 'test' .
>> >
>> > The equals operator already does this but it does not handle NULLS very
>> > well
>> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> > doesn't establish the same inference rules as equals.
>>
>> The whole idea behing Postgres' query planner is that you don't have
>> to use any hints. Late model versions of postgres handle nulls fine,
>> but nulls are never "equal" to anything else. I.e. where xxx is null
>> works with indexes. Where x=y does not, since null <> null.
>>
>> Suggestion for getting help, put a large-ish aka production sized
>> amount of data into your db, run your queries with explain analyze and
>> feed them to https://explain.depesz.com/ and post the links here along
>> with the slow queries. A lot of times the fix is non-obvious if you're
>> coming from another db with a different set of troubleshooting skills
>> for slow queries.
>
> The problem is how to reduce the problem into its core, without introducing
> all the unnecessary.
>
> Maybe simplifying the problem, also makes it impossible to say where I go
> wrong. It might be that I try to push too much logic into the SQL layer
> and Im adding too many layers of abstraction to accomplish what I want.
> So let me try and elaborate a little more.
>
> I have couple a tables describing resources (circuits) and allocation
> of resources to customers and products.
>
> First layer is a view called view_circuit. This view (left) join any table
> the circuit table reference through a foreign key (it gives exactly the same
> rows and columns as circuit table + some extra information like
> customer_id).
>
> Second layer is 2 views
> 1) a view describing if the circuit is active or inactive, lets call it
>    view_circuit_product_main
> 2) a view describing line_speed about the circuit, lets call it
>    view_circuit_product
>
> These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))
> if this has any relevance.
>
> Third layer
> Next step is to add a view that tells both (joins the two views together
> on circuit_id). lets call the new view view_circuit_with_status
>
> This view is defined as
>
> CREATE VIEW view_circuit_with_status AS (
>    SELECT r.*,
>           s.circuit_status,
>           s.customer_id AS s_customer_id,
>           p.line_speed,
>           p.customer_id AS p_customer_id
>      FROM view_circuit r
>      JOIN view_circuit_product_main s
>        ON r.circuit_id = s.circuit_id
>       AND r.customer_id IS NOT DISTINCT FROM s.customer_id
>      JOIN view_circuit_product p
>        ON r.circuit_id = p.circuit_id
>       AND r.customer_id IS NOT DISTINCT FROM s.customer_id
> );
>
> SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
>
> Since customer_id is exposed through view_circuit the planner assumes
> view_circuit.customer_id = 1 and from there attempts to join
> view_circuit_product_main and view_circuit_product using circuit_id.
> This is not running optimal.
>
> However if we change our query to allow the inference rule to take place,
> the query is executed very fast.
>
> SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND
> s_customer_id = 1 AND p_customer_id = 1;
>
> If a circuit is not assigned to any customers customer_id is set to NULL.
> This is the reason I can't use = operator. If I do use = then I can't find
> circuit which are unassigned, but the query do run effective.
>
> I can see this still ends up being quite abstract, but the point is it would
> be quite beneficial if IS NOT DISTINCT used the same rules as = operator.
>
> I have attached the 2 query plans
>
> Bad plan: https://explain.depesz.com/s/SZN
> Good plan: https://explain.depesz.com/s/61Ro

try this :-D
create or replace function indf(anyelement, anyelement) returns anyelement as
$$
  select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND indf(r.customer_id, s.customer_id)
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND indf(r.customer_id, s.customer_id)

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

"David G. Johnston"

Дата:

28 октября 2016 г., 23:07:28

On Fri, Oct 28, 2016 at 12:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:

CREATE VIEW view_circuit_with_status AS (
SELECT r.*,
s.circuit_status,
s.customer_id AS s_customer_id,
p.line_speed,
p.customer_id AS p_customer_id
FROM view_circuit r
JOIN view_circuit_product_main s
ON r.circuit_id = s.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
JOIN view_circuit_product p
ON r.circuit_id = p.circuit_id
AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);

SELECT * FROM view_circuit_with_status WHERE customer_id = 1;

A couple of random thoughts:

TABLE circuit (circuit_id)

TABLE customer (customer_id)

TABLE circuit_customer (circuit_id, customer_id)

INSERT INTO customer (0, 'Customer number standing in for customer not assigned');

While IS DISTINCT FROM could very well use some more love I surmise the reason it hasn't gotten it is because most data model set things up so that it never needs to be used.

David J.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

29 октября 2016 г., 15:55:39

> try this :-D

> create or replace function indf(anyelement, anyelement) returns anyelement as
> $$
>   select $1 = $2 or ($1 is null and $2 is null);
> $$ language sql;
>
> CREATE VIEW view_circuit_with_status AS (
>    SELECT r.*,
>           s.circuit_status,
>           s.customer_id AS s_customer_id,
>           p.line_speed,
>           p.customer_id AS p_customer_id
>      FROM view_circuit r
>      JOIN view_circuit_product_main s
>        ON r.circuit_id = s.circuit_id
>       AND indf(r.customer_id, s.customer_id)
>      JOIN view_circuit_product p
>        ON r.circuit_id = p.circuit_id
>       AND indf(r.customer_id, s.customer_id)
>
> merlin

This doesn't do much good. This doesn't tell the planner that the 3 customer_ids are actually of same value, and it therefore can't filter them as it sees fit.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Scott Marlowe

Дата:

29 октября 2016 г., 16:02:17

On Sat, Oct 29, 2016 at 6:55 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> try this :-D
>
>> create or replace function indf(anyelement, anyelement) returns anyelement
>> as
>> $$
>>   select $1 = $2 or ($1 is null and $2 is null);
>> $$ language sql;
>>
>> CREATE VIEW view_circuit_with_status AS (
>>    SELECT r.*,
>>           s.circuit_status,
>>           s.customer_id AS s_customer_id,
>>           p.line_speed,
>>           p.customer_id AS p_customer_id
>>      FROM view_circuit r
>>      JOIN view_circuit_product_main s
>>        ON r.circuit_id = s.circuit_id
>>       AND indf(r.customer_id, s.customer_id)
>>      JOIN view_circuit_product p
>>        ON r.circuit_id = p.circuit_id
>>       AND indf(r.customer_id, s.customer_id)
>>
>> merlin
>
> This doesn't do much good. This doesn't tell the planner that the 3
> customer_ids are actually of same value, and it therefore can't filter them
> as it sees fit.

You do know you can index on a function, and the planner then keeps
stats on it when you run analyze right?

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

29 октября 2016 г., 16:27:47

>> This doesn't do much good. This doesn't tell the planner that the 3

>> customer_ids are actually of same value, and it therefore can't filter them
>> as it sees fit.

> You do know you can index on a function, and the planner then keeps
> stats on it when you run analyze right?

Yes, but I don't think it will make any difference. I don't think I can solve this with

an index lookup. I think my savior is the inference that the 2 columns are of

same value and the planner are free to choose which order to do the filter and join

with this extra information.

I have tried creating a function called

zero_if_null(int) : int that just select COALESCE($1, 0)

and adding a index on (zero_if_null(customer_id)) on table that contains customer_id. The only thing I get from is the planner now only knows how to compare customer_id, but it still doesn't know that they are of same value, only I know that and I want to declare it for the planner.

I could probably rewrite the whole view in one query, and then fix it with a proper index. But I think I will loose alot of readability.

I could also change the structure to save an explicit state, instead of a calculated state. But then I get some redundancy I need to make sure always stays the same.

In the end one of these will probably be the solution.

I guess the question is more or less,

why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are there any alternatives? And a plausible use case for when it would be useful.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

31 октября 2016 г., 16:19:22

On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> This doesn't do much good. This doesn't tell the planner that the 3
>
>>> customer_ids are actually of same value, and it therefore can't filter
>>> them
>>> as it sees fit.
>
>> You do know you can index on a function, and the planner then keeps
>> stats on it when you run analyze right?
>
> Yes, but I don't think it will make any difference. I don't think I can
> solve this with
> an index lookup. I think my savior is the inference that the 2 columns are
> of
> same value and the planner are free to choose which order to do the filter
> and join
> with this extra information.
>
> I have tried creating a function called
> zero_if_null(int) : int that just select COALESCE($1, 0)
> and adding a index on (zero_if_null(customer_id)) on table that contains
> customer_id. The only thing I get from is the planner now only knows how to
> compare customer_id, but it still doesn't know that they are of same value,
> only I know that and I want to declare it for the planner.
>
> I could probably rewrite the whole view in one query, and then fix it with a
> proper index. But I think I will loose alot of readability.
>
> I could also change the structure to save an explicit state, instead of a
> calculated state. But then I get some redundancy I need to make sure always
> stays the same.
>
> In the end one of these will probably be the solution.
>
> I guess the question is more or less,
>
> why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are
> there any alternatives? And a plausible use case for when it would be
> useful.

Well, the *behavior* is mandated by the sql standard.  Our
implementation is slow however.  I'm surprised the attached function
didn't help, it can be inlined and I was able to get bitmap or which
is pretty good.  As I said upthread I think INDF could theoretically
run as fast as equality -- it just doesn't today.

As your joins are written you could probably convert this by reserving
a customer_id to the work that you're trying to do with null, say, 0,
or -1.  This is a somewhat dubious practice but seems a better fit for
your use case.  I don't think INDF is good in this usage.

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

31 октября 2016 г., 17:28:20

On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:

> > I have tried creating a function called
> > zero_if_null(int) : int that just select COALESCE($1, 0)
> > and adding a index on (zero_if_null(customer_id)) on table that contains
> > customer_id. The only thing I get from is the planner now only knows how to
> > compare customer_id, but it still doesn't know that they are of same value,
> > only I know that and I want to declare it for the planner.

> Well, the *behavior* is mandated by the sql standard. Our
> implementation is slow however.

Sorry I'm not following, what behavior is mandated by the sql standard?

> I'm surprised the attached function

> didn't help, it can be inlined and I was able to get bitmap or which
> is pretty good. As I said upthread I think INDF could theoretically
> run as fast as equality -- it just doesn't today.

It might be harsh to say that it doesn't help at all. I does half the running time,

but I need it to run an order of magnitude faster. Here is the plan with the

empty_if_null (customer_id is actually varchar)

https://explain.depesz.com/s/M1LV with empty_if_null + functional index

https://explain.depesz.com/s/eOL with indf

> As your joins are written you could probably convert this by reserving
> a customer_id to the work that you're trying to do with null, say, 0,
> or -1. This is a somewhat dubious practice but seems a better fit for
> your use case. I don't think INDF is good in this usage.
>
> merlin

This will work well, I think.

But I'm not sure I can mentally accept an unfilled value should not be

null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will

work well with this.

It might raise another problem, that the nulls are generated through LEFT

JOINS where now rows are defined. Then the 0 or -1 value need to be

a computed value. Won't this throw of index lookups? (I might be

more confused in this area).

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

01 ноября 2016 г., 16:13:44

On Mon, Oct 31, 2016 at 9:28 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>
>> > I have tried creating a function called
>> > zero_if_null(int) : int that just select COALESCE($1, 0)
>> > and adding a index on (zero_if_null(customer_id)) on table that contains
>> > customer_id. The only thing I get from is the planner now only knows how
>> > to
>> > compare customer_id, but it still doesn't know that they are of same
>> > value,
>> > only I know that and I want to declare it for the planner.
>
>
>> Well, the *behavior* is mandated by the sql standard.  Our
>> implementation is slow however.
>
> Sorry I'm not following, what behavior is mandated by the sql standard?

The semantics of IS DISTINCT FROM, basically, equality with special
consideration for nulls.

> But I'm not sure I can mentally accept an unfilled value should not be
> null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
> work well with this.

Yeah, this is a dubious tactic and I would normally only consider
using it for surrogate identifiers.

> It might raise another problem, that the nulls are generated through LEFT
> JOINS where now rows are defined. Then the 0 or -1 value need to be
> a computed value. Won't this throw of index lookups? (I might be
> more confused in this area).

Not following this.  BTW, if you want a fast plan over the current
data without consideration of aesthetics, try this:

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id, s.customer_id
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND r.customer_id, s.customer_id
  UNION ALL SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id IS NULL
      AND  s.customer_id IS NULL
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

04 ноября 2016 г., 16:08:41

>> It might raise another problem, that the nulls are generated through LEFT

>> JOINS where no rows are defined. Then the 0 or -1 value need to be
>> a computed value. Won't this throw off index lookups? (I might be
>> more confused in this area).
>
>Not following this.

The nulls are generated by something like this

SELECT c.circuit_id,

cc.customer_id

FROM circuit AS c

LEFT JOIN circuit_customer AS cc

ON c.circuit_id = cc.circuit_id

To make a magic '0' customer we would be required to use

COALESCE(cc.customer_id, '0')

I dont think the optimizer will do anything clever with the '0' we have

computed from null.

I could ofc. by default assign all unassigned circuits to '0' in

circuit_customer. I'm not a fan though.

>BTW, if you want a fast plan over the current
>data without consideration of aesthetics, try this:
>
>CREATE VIEW view_circuit_with_status AS (
> SELECT r.*,
>          s.circuit_status,
>          s.customer_id AS s_customer_id,
>          p.line_speed,
>          p.customer_id AS p_customer_id
>     FROM view_circuit r
>     JOIN view_circuit_product_main s
>      ON r.circuit_id = s.circuit_id
>      AND r.customer_id, s.customer_id
>     JOIN view_circuit_product p
>       ON r.circuit_id = p.circuit_id
>      AND r.customer_id, s.customer_id
> UNION ALL SELECT r.*,
>          s.circuit_status,
>          s.customer_id AS s_customer_id,
>          p.line_speed,
>          p.customer_id AS p_customer_id
>     FROM view_circuit r
>     JOIN view_circuit_product_main s
>       ON r.circuit_id = s.circuit_id
>      AND r.customer_id IS NULL
>      AND s.customer_id IS NULL
>     JOIN view_circuit_product p
>       ON r.circuit_id = p.circuit_id>

I will have to figure something out, but this specific case is still problematic

since we would like to filter this view using different criteria's, like circuit_no,

products or customers.

But with all these detours, I assume that a change to IS NOT DISTINCT FROM,

is difficult or not wanted?

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

04 ноября 2016 г., 16:41:16

On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> It might raise another problem, that the nulls are generated through LEFT
>
>>> JOINS where no rows are defined. Then the 0 or -1 value need to be
>>> a computed value. Won't this throw off index lookups? (I might be
>>> more confused in this area).
>>
>>Not following this.
>
> The nulls are generated by something like this
>     SELECT c.circuit_id,
>                    cc.customer_id
>        FROM circuit AS c
> LEFT JOIN circuit_customer AS cc
>              ON c.circuit_id = cc.circuit_id
>
> To make a magic '0' customer we would be required to use
>   COALESCE(cc.customer_id, '0')
> I dont think the optimizer will do anything clever with the '0' we have
> computed from null.

It would if you explicitly indexed it as such;
CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));

> I could ofc. by default assign all unassigned circuits to '0' in
> circuit_customer. I'm not a fan though.

hm, why not?  null generally means 'unknown' and that's why it fails
any equality test.

>>BTW, if you want a fast plan over the current
>>data without consideration of aesthetics, try this:
>>
>>CREATE VIEW view_circuit_with_status AS (
>>    SELECT r.*,
>>          s.circuit_status,
>>          s.customer_id AS s_customer_id,
>>          p.line_speed,
>>          p.customer_id AS p_customer_id
>>     FROM view_circuit r
>>     JOIN view_circuit_product_main s
>>      ON r.circuit_id = s.circuit_id
>>      AND r.customer_id, s.customer_id
>>     JOIN view_circuit_product p
>>       ON r.circuit_id = p.circuit_id
>>      AND r.customer_id, s.customer_id
>>  UNION ALL SELECT r.*,
>>          s.circuit_status,
>>          s.customer_id AS s_customer_id,
>>          p.line_speed,
>>          p.customer_id AS p_customer_id
>>     FROM view_circuit r
>>     JOIN view_circuit_product_main s
>>       ON r.circuit_id = s.circuit_id
>>      AND r.customer_id IS NULL
>>      AND  s.customer_id IS NULL
>>     JOIN view_circuit_product p
>>       ON r.circuit_id = p.circuit_id>
>
> I will have to figure something out, but this specific case is still
> problematic
> since we would like to filter this view using different criteria's, like
> circuit_no,
> products or customers.

the above is logically equivalent to IS NOT DISTINCT FROM; you should
be able to query it as you would have done the original view.

> But with all these detours, I assume that a change to IS NOT DISTINCT FROM,
> is difficult or not wanted?

Well, not exactly.  In your case you are trying to treat null as a
specific value and pass it through join operations.

TBH, this is a pretty dubious approach: null is not supposed to be
equal to anything and any join vs null should come up empty --
logically at least.  INDF works around this of course but it's not a
recommended approach (my usage is generally restricted to, "has this
value changed since yesterday? etc").

I'm not an expert backend structures for indexing and optimization but
I know enough to suspect that optimizing INDF might cause
implementation headaches in various places, as do other irregular
syntactical approaches in SQL.  I think minimally optimizing INDF
would require converting it to an operator on par with '=' which is a
pretty large infrastructure change for an edge optimization case.
The fact that there are solid optimization strategies already on the
table (UNION ALL, expr index COALESCE()) does not help.

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Alban Hertroys

Дата:

04 ноября 2016 г., 17:38:28

On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> The nulls are generated by something like this
>>     SELECT c.circuit_id,
>>                    cc.customer_id
>>        FROM circuit AS c
>> LEFT JOIN circuit_customer AS cc
>>              ON c.circuit_id = cc.circuit_id
>>
>> To make a magic '0' customer we would be required to use
>>   COALESCE(cc.customer_id, '0')
>> I dont think the optimizer will do anything clever with the '0' we have
>> computed from null.
>
> It would if you explicitly indexed it as such;
> CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));

Merlin, it's a LEFT JOIN. There probably are no NULLs in the
circuit_customer.customer_id column, so that COALESCE isn't going to
achieve anything at all.

I haven't been following this particular discussion in detail, so
unfortunately I can't contribute more than that remark at the moment.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Merlin Moncure

Дата:

07 ноября 2016 г., 17:23:00

On Fri, Nov 4, 2016 at 9:38 AM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> The nulls are generated by something like this
>>>     SELECT c.circuit_id,
>>>                    cc.customer_id
>>>        FROM circuit AS c
>>> LEFT JOIN circuit_customer AS cc
>>>              ON c.circuit_id = cc.circuit_id
>>>
>>> To make a magic '0' customer we would be required to use
>>>   COALESCE(cc.customer_id, '0')
>>> I dont think the optimizer will do anything clever with the '0' we have
>>> computed from null.
>>
>> It would if you explicitly indexed it as such;
>> CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));
>
> Merlin, it's a LEFT JOIN. There probably are no NULLs in the
> circuit_customer.customer_id column, so that COALESCE isn't going to
> achieve anything at all.

Hang on -- upthread the context was inner join, and the gripe was join
fast with '=', slow with INDF.  When he said the nulls were
'generated', I didn't follow that they were part of the original
query.  If the nulls are generated along with the query, sure, an
index won't help.

I maintain my earlier point; with respect to the original query, to
get from performance of INDF to =, you have three options:
a) expr index the nulls  (assuming they are physically stored)
b) convert to ((a = b) or a is null and b is null) which can help with
a bitmap or plan
c) covert to union all equivalent of "b"

merlin

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От

Kim Rose Carlsen

Дата:

10 ноября 2016 г., 21:40:43

> Hang on -- upthread the context was inner join, and the gripe was join

> fast with '=', slow with INDF. When he said the nulls were
> 'generated', I didn't follow that they were part of the original
> query. If the nulls are generated along with the query, sure, an
> index won't help.
>
> I maintain my earlier point; with respect to the original query, to
> get from performance of INDF to =, you have three options:
> a) expr index the nulls (assuming they are physically stored)
> b) convert to ((a = b) or a is null and b is null) which can help with
> a bitmap or plan
> c) covert to union all equivalent of "b"
>
> merlin

a) and b) would be workaround that would run an order of magnitude slower. The query

starts with a full table scan of a large table. If the planner had started elsewhere it could

have reduced the result to 1-2 rows from the start. It won't choose this plan without the help

from =.

c) could be a acceptable workaround, but it would clutter up if you would want more

than one column to be IS NOT DISTINCT FROM. You end up with 2^n unions to simulate

IS NOT DISTINCT FROM.

Without knowing the work required, I will still argue that having IS NOT DISTINCT FROM

use the same transitive rules as equality, would be a better approach.

With fear of talking about things I know little(nothing) of, I think the description of EquivalenceClasses

in postgres/src/backend/optimizer/README, should be extended to also include EquivalenceClasses

of IS NOT DISTINCT FROM.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: How to hint 2 coulms IS NOT DISTINCT FROM each other