Обсуждение: How to hint 2 coulms IS NOT DISTINCT FROM each other

Поиск
Список
Период
Сортировка

How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

Hi

I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' .

The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals.

Example:

CREATE TABLE a (
  id INTEGER PRIMARY KEY,
  key VARCHAR,
  value VARCHAR
 
);

CREATE INDEX ON a (key);

INSERT INTO a 
  VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz');

CREATE VIEW view_a AS (
  SELECT table_a.id,
         table_a.key, 
         table_a.value,
         table_b.key as b_key
    FROM a AS table_a
    JOIN a AS table_b
      ON table_a.id = table_b.id
);

 
CREATE VIEW view_a_eq AS (
  SELECT table_a.id,
         table_a.key, 
         table_a.value,
         table_b.key as b_key
    FROM a AS table_a
    JOIN a AS table_b
      ON table_a.id = table_b.id
     AND table_a.key = table_b.key
);

CREATE VIEW view_a_distinct AS (
  SELECT table_a.id,
         table_a.key, 
         table_a.value,
         table_b.key as b_key
    FROM a AS table_a
    JOIN a AS table_b
      ON table_a.id = table_b.id
     AND table_a.key IS NOT DISTINCT FROM table_b.key
);

EXPLAIN SELECT * FROM view_a WHERE key = 'test';
 
 
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Hash Join  (cost=12.69..34.42 rows=4 width=100)
   Hash Cond: (table_b.id = table_a.id)
   ->  Seq Scan on a table_b  (cost=0.00..18.50 rows=850 width=36)
   ->  Hash  (cost=12.64..12.64 rows=4 width=68)
         ->  Bitmap Heap Scan on a table_a  (cost=4.18..12.64 rows=4 width=68)
               Recheck Cond: ((key)::text = 'test'::text)
               ->  Bitmap Index Scan on a_key_idx  (cost=0.00..4.18 rows=4 width=0)
                     Index Cond: ((key)::text = 'test'::text)
 

We only get index scan on table_a

EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test';
 
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Nested Loop  (cost=8.36..25.53 rows=1 width=100)
   Join Filter: (table_a.id = table_b.id)
   ->  Bitmap Heap Scan on a table_a  (cost=4.18..12.64 rows=4 width=68)
         Recheck Cond: ((key)::text = 'test'::text)
         ->  Bitmap Index Scan on a_key_idx  (cost=0.00..4.18 rows=4 width=0)
               Index Cond: ((key)::text = 'test'::text)
   ->  Materialize  (cost=4.18..12.66 rows=4 width=36)
         ->  Bitmap Heap Scan on a table_b  (cost=4.18..12.64 rows=4 width=36)
               Recheck Cond: ((key)::text = 'test'::text)
               ->  Bitmap Index Scan on a_key_idx  (cost=0.00..4.18 rows=4 width=0)
                     Index Cond: ((key)::text = 'test'::text)
We get index scan on both tables and the where clause is pushed all the way down

 
EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test';
 
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Hash Join  (cost=12.69..34.43 rows=1 width=100)
   Hash Cond: (table_b.id = table_a.id)
   Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text))
   ->  Seq Scan on a table_b  (cost=0.00..18.50 rows=850 width=36)
   ->  Hash  (cost=12.64..12.64 rows=4 width=68)
         ->  Bitmap Heap Scan on a table_a  (cost=4.18..12.64 rows=4 width=68)
               Recheck Cond: ((key)::text = 'test'::text)
               ->  Bitmap Index Scan on a_key_idx  (cost=0.00..4.18 rows=4 width=0)
                     Index Cond: ((key)::text = 'test'::text)
Same as the first example

In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables.

It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls.

SELECT * FROM view_a_eq WHERE key IS NULL;
 id | key | value | b_key 
----+-----+-------+-------
(0 rows)

Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Scott Marlowe
Дата:
On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> Hi
>
> I was wondering if there is a way to hint that two columns in two different
> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> table_a.key = 'test' THEN table_b.key = 'test' .
>
> The equals operator already does this but it does not handle NULLS very well
> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> doesn't establish the same inference rules as equals.

The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.

Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
for slow queries.


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Tom Lane
Дата:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.

> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.

The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL.  You'd
be best off to rethink the representation.  We've not seen enough info
about your requirements to suggest just how, though.

            regards, tom lane


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> I was wondering if there is a way to hint that two columns in two different
>>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>>> table_a.key = 'test' THEN table_b.key = 'test' .
>>>
>>> The equals operator already does this but it does not handle NULLS very well
>>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>>> doesn't establish the same inference rules as equals.
>
>> The whole idea behing Postgres' query planner is that you don't have
>> to use any hints. Late model versions of postgres handle nulls fine,
>> but nulls are never "equal" to anything else. I.e. where xxx is null
>> works with indexes. Where x=y does not, since null <> null.
>
> The bigger picture here is that if you've designed a data representation
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL.  You'd
> be best off to rethink the representation.  We've not seen enough info
> about your requirements to suggest just how, though.

Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
optimized"?  It ought to be, at least in some cases. Internally
indexes handle nulls so you should be able to implement them to
satisfy those kinds of scans.  I guess that's an easy thing to say
though.

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Fri, Oct 28, 2016 at 1:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Scott Marlowe <scott.marlowe@gmail.com> writes:
>>> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>>> I was wondering if there is a way to hint that two columns in two different
>>>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>>>> table_a.key = 'test' THEN table_b.key = 'test' .
>>>>
>>>> The equals operator already does this but it does not handle NULLS very well
>>>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>>>> doesn't establish the same inference rules as equals.
>>
>>> The whole idea behing Postgres' query planner is that you don't have
>>> to use any hints. Late model versions of postgres handle nulls fine,
>>> but nulls are never "equal" to anything else. I.e. where xxx is null
>>> works with indexes. Where x=y does not, since null <> null.
>>
>> The bigger picture here is that if you've designed a data representation
>> that requires that a null be considered "equal to" another null, you're
>> really going to be fighting against the basic semantics of SQL.  You'd
>> be best off to rethink the representation.  We've not seen enough info
>> about your requirements to suggest just how, though.
>
> Well, ISTM OP is simply asking, "why can't IS [NOT] DISTINCT FROM be
> optimized"?  It ought to be, at least in some cases. Internally
> indexes handle nulls so you should be able to implement them to
> satisfy those kinds of scans.  I guess that's an easy thing to say
> though.

hm. ISTM "a IS NOT DISTINCT FROM b"...could be rewritten into

((a IS NULL AND b IS NULL) OR (a = b)) ... which is an indexable expression.

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
John R Pierce
Дата:
On 10/28/2016 9:29 AM, Kim Rose Carlsen wrote:
    JOIN a AS table_b
      ON table_a.id = table_b.id
     AND table_a.key = table_b.key

Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.



-- 
john r pierce, recycling bits in santa cruz

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

> The bigger picture here is that if you've designed a data representation
> that requires that a null be considered "equal to" another null, you're
> really going to be fighting against the basic semantics of SQL.  You'd
> be best off to rethink the representation.  We've not seen enough info
> about your requirements to suggest just how, though.

Sometimes I do wake up in the night scared and afraid that I have used NULLs the wrong way 
my whole life. I usually use NULLs to denote a value has not been provided. In my dreams I fear 
I should have used empty string instead but if that is true, then I no longer know what is right and 
wrong :).

In this specific case, its a resource management table. If a resource is not allocated to any 
customers, then customer_id is set to null else the customer_id is set.



From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, October 28, 2016 8:17:01 PM
To: Scott Marlowe
Cc: Kim Rose Carlsen; pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
 
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> I was wondering if there is a way to hint that two columns in two different
>> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
>> table_a.key = 'test' THEN table_b.key = 'test' .
>>
>> The equals operator already does this but it does not handle NULLS very well
>> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> doesn't establish the same inference rules as equals.

> The whole idea behing Postgres' query planner is that you don't have
> to use any hints. Late model versions of postgres handle nulls fine,
> but nulls are never "equal" to anything else. I.e. where xxx is null
> works with indexes. Where x=y does not, since null <> null.

The bigger picture here is that if you've designed a data representation
that requires that a null be considered "equal to" another null, you're
really going to be fighting against the basic semantics of SQL.  You'd
be best off to rethink the representation.  We've not seen enough info
about your requirements to suggest just how, though.

                        regards, tom lane

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:
>>    JOIN a AS table_b
>>     ON table_a.id = table_b.id
>>     AND table_a.key = table_b.key

> Anyways, to use an index for that join, you'd need a composite index on id *AND* key, not two separate indexes.
Its not as much as for using the index, but to be able to push the where clause inside both JOINED tables. 

The = operator already does this. It gives the planner the option to join the table in using either id = id or key = key. It can deduce that if I have a WHERE condition with table_a.key = 'Something', then table_b.key must also be 'Something'. It can then decide to filter table_b on key. When using IS NOT DISTINCT FROM, the planner is not considering the same options. Now its like it doesn't know table_a.key is same the same as table_b.key.

I would somehow expect the IS NOT DISTINCT FROM operator to do the same. As it establish the same rules.. If a = 'test' and a IS NOT DISTINCT FROM b then b = 'test' also

-
Kim Carlsen

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:

> Hi
>
> I was wondering if there is a way to hint that two columns in two different
> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if
> table_a.key = 'test' THEN table_b.key = 'test' .
>
> The equals operator already does this but it does not handle NULLS very well
> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
> doesn't establish the same inference rules as equals.

The whole idea behing Postgres' query planner is that you don't have
to use any hints. Late model versions of postgres handle nulls fine,
but nulls are never "equal" to anything else. I.e. where xxx is null
works with indexes. Where x=y does not, since null <> null.

Suggestion for getting help, put a large-ish aka production sized
amount of data into your db, run your queries with explain analyze and
feed them to https://explain.depesz.com/ and post the links here along
with the slow queries. A lot of times the fix is non-obvious if you're
coming from another db with a different set of troubleshooting skills
> for slow queries.

The problem is how to reduce the problem into its core, without introducing 
all the unnecessary.

Maybe simplifying the problem, also makes it impossible to say where I go 
wrong. It might be that I try to push too much logic into the SQL layer 
and Im adding too many layers of abstraction to accomplish what I want. 
So let me try and elaborate a little more.

I have couple a tables describing resources (circuits) and allocation 
of resources to customers and products.

First layer is a view called view_circuit. This view (left) join any table 
the circuit table reference through a foreign key (it gives exactly the same
rows and columns as circuit table + some extra information like customer_id).

Second layer is 2 views
1) a view describing if the circuit is active or inactive, lets call it 
   view_circuit_product_main
2) a view describing line_speed about the circuit, lets call it 
   view_circuit_product

These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...)) 
if this has any relevance.

Third layer
Next step is to add a view that tells both (joins the two views together
on circuit_id). lets call the new view view_circuit_with_status

This view is defined as

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id IS NOT DISTINCT FROM s.customer_id
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);

SELECT * FROM view_circuit_with_status WHERE customer_id = 1;

Since customer_id is exposed through view_circuit the planner assumes view_circuit.customer_id = 1 and from there attempts to join 
view_circuit_product_main and view_circuit_product using circuit_id. 
This is not running optimal.

However if we change our query to allow the inference rule to take place, the query is executed very fast.

SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND s_customer_id = 1 AND p_customer_id = 1;

If a circuit is not assigned to any customers customer_id is set to NULL. This is the reason I can't use = operator. If I do use = then I can't find circuit which are unassigned, but the query do run effective.

I can see this still ends up being quite abstract, but the point is it would be quite beneficial if IS NOT DISTINCT used the same rules as = operator.

I have attached the 2 query plans


-
Kim Carlsen
Do you use potatoes for long posts here?







Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Fri, Oct 28, 2016 at 2:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>
>> > Hi
>> >
>> > I was wondering if there is a way to hint that two columns in two
>> > different
>> > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume
>> > if
>> > table_a.key = 'test' THEN table_b.key = 'test' .
>> >
>> > The equals operator already does this but it does not handle NULLS very
>> > well
>> > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and
>> > doesn't establish the same inference rules as equals.
>>
>> The whole idea behing Postgres' query planner is that you don't have
>> to use any hints. Late model versions of postgres handle nulls fine,
>> but nulls are never "equal" to anything else. I.e. where xxx is null
>> works with indexes. Where x=y does not, since null <> null.
>>
>> Suggestion for getting help, put a large-ish aka production sized
>> amount of data into your db, run your queries with explain analyze and
>> feed them to https://explain.depesz.com/ and post the links here along
>> with the slow queries. A lot of times the fix is non-obvious if you're
>> coming from another db with a different set of troubleshooting skills
>> for slow queries.
>
> The problem is how to reduce the problem into its core, without introducing
> all the unnecessary.
>
> Maybe simplifying the problem, also makes it impossible to say where I go
> wrong. It might be that I try to push too much logic into the SQL layer
> and Im adding too many layers of abstraction to accomplish what I want.
> So let me try and elaborate a little more.
>
> I have couple a tables describing resources (circuits) and allocation
> of resources to customers and products.
>
> First layer is a view called view_circuit. This view (left) join any table
> the circuit table reference through a foreign key (it gives exactly the same
> rows and columns as circuit table + some extra information like
> customer_id).
>
> Second layer is 2 views
> 1) a view describing if the circuit is active or inactive, lets call it
>    view_circuit_product_main
> 2) a view describing line_speed about the circuit, lets call it
>    view_circuit_product
>
> These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...))
> if this has any relevance.
>
> Third layer
> Next step is to add a view that tells both (joins the two views together
> on circuit_id). lets call the new view view_circuit_with_status
>
> This view is defined as
>
> CREATE VIEW view_circuit_with_status AS (
>    SELECT r.*,
>           s.circuit_status,
>           s.customer_id AS s_customer_id,
>           p.line_speed,
>           p.customer_id AS p_customer_id
>      FROM view_circuit r
>      JOIN view_circuit_product_main s
>        ON r.circuit_id = s.circuit_id
>       AND r.customer_id IS NOT DISTINCT FROM s.customer_id
>      JOIN view_circuit_product p
>        ON r.circuit_id = p.circuit_id
>       AND r.customer_id IS NOT DISTINCT FROM s.customer_id
> );
>
> SELECT * FROM view_circuit_with_status WHERE customer_id = 1;
>
> Since customer_id is exposed through view_circuit the planner assumes
> view_circuit.customer_id = 1 and from there attempts to join
> view_circuit_product_main and view_circuit_product using circuit_id.
> This is not running optimal.
>
> However if we change our query to allow the inference rule to take place,
> the query is executed very fast.
>
> SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND
> s_customer_id = 1 AND p_customer_id = 1;
>
> If a circuit is not assigned to any customers customer_id is set to NULL.
> This is the reason I can't use = operator. If I do use = then I can't find
> circuit which are unassigned, but the query do run effective.
>
> I can see this still ends up being quite abstract, but the point is it would
> be quite beneficial if IS NOT DISTINCT used the same rules as = operator.
>
> I have attached the 2 query plans
>
> Bad plan: https://explain.depesz.com/s/SZN
> Good plan: https://explain.depesz.com/s/61Ro

try this :-D
create or replace function indf(anyelement, anyelement) returns anyelement as
$$
  select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND indf(r.customer_id, s.customer_id)
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND indf(r.customer_id, s.customer_id)

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
"David G. Johnston"
Дата:
On Fri, Oct 28, 2016 at 12:54 PM, Kim Rose Carlsen <krc@hiper.dk> wrote:

​​

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id IS NOT DISTINCT FROM s.customer_id
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND r.customer_id IS NOT DISTINCT FROM s.customer_id
);

SELECT * FROM view_circuit_with_status WHERE customer_id = 1;


​A couple of random thoughts:

1.
TABLE circuit (circuit_id)
TABLE customer (customer_id)
TABLE circuit_customer (circuit_id, customer_id)

2.
INSERT INTO customer (0, 'Customer number standing in for customer not assigned');
While IS DISTINCT FROM could very well use some more love I surmise the reason it hasn't gotten it is because most data model set things up so that it never needs to be used.

​David J.​

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

> try this :-D

create or replace function indf(anyelement, anyelement) returns anyelement as
$$
  select $1 = $2 or ($1 is null and $2 is null);
$$ language sql;

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND indf(r.customer_id, s.customer_id)
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND indf(r.customer_id, s.customer_id)

merlin

This doesn't do much good. This doesn't tell the planner that the 3 customer_ids are actually of same value, and it therefore can't filter them as it sees fit.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Scott Marlowe
Дата:
On Sat, Oct 29, 2016 at 6:55 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> try this :-D
>
>> create or replace function indf(anyelement, anyelement) returns anyelement
>> as
>> $$
>>   select $1 = $2 or ($1 is null and $2 is null);
>> $$ language sql;
>>
>> CREATE VIEW view_circuit_with_status AS (
>>    SELECT r.*,
>>           s.circuit_status,
>>           s.customer_id AS s_customer_id,
>>           p.line_speed,
>>           p.customer_id AS p_customer_id
>>      FROM view_circuit r
>>      JOIN view_circuit_product_main s
>>        ON r.circuit_id = s.circuit_id
>>       AND indf(r.customer_id, s.customer_id)
>>      JOIN view_circuit_product p
>>        ON r.circuit_id = p.circuit_id
>>       AND indf(r.customer_id, s.customer_id)
>>
>> merlin
>
> This doesn't do much good. This doesn't tell the planner that the 3
> customer_ids are actually of same value, and it therefore can't filter them
> as it sees fit.

You do know you can index on a function, and the planner then keeps
stats on it when you run analyze right?


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

>> This doesn't do much good. This doesn't tell the planner that the 3

>> customer_ids are actually of same value, and it therefore can't filter them
>> as it sees fit.

> You do know you can index on a function, and the planner then keeps
> stats on it when you run analyze right?

Yes, but I don't think it will make any difference. I don't think I can solve this with
an index lookup. I think my savior is the inference that the 2 columns are of 
same value and the planner are free to choose which order to do the filter and join
with this extra information.

I have tried creating a function called 
zero_if_null(int) : int that just select COALESCE($1, 0)
and adding a index on (zero_if_null(customer_id)) on table that contains customer_id. The only thing I get from is the planner now only knows how to compare customer_id, but it still doesn't know that they are of same value, only I know that and I want to declare it for the planner. 

I could probably rewrite the whole view in one query, and then fix it with a proper index. But I think I will loose alot of readability. 

I could also change the structure to save an explicit state, instead of a calculated state. But then I get some redundancy I need to make sure always stays the same.

In the end one of these will probably be the solution.

I guess the question is more or less,

why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are there any alternatives? And a plausible use case for when it would be useful.

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> This doesn't do much good. This doesn't tell the planner that the 3
>
>>> customer_ids are actually of same value, and it therefore can't filter
>>> them
>>> as it sees fit.
>
>> You do know you can index on a function, and the planner then keeps
>> stats on it when you run analyze right?
>
> Yes, but I don't think it will make any difference. I don't think I can
> solve this with
> an index lookup. I think my savior is the inference that the 2 columns are
> of
> same value and the planner are free to choose which order to do the filter
> and join
> with this extra information.
>
> I have tried creating a function called
> zero_if_null(int) : int that just select COALESCE($1, 0)
> and adding a index on (zero_if_null(customer_id)) on table that contains
> customer_id. The only thing I get from is the planner now only knows how to
> compare customer_id, but it still doesn't know that they are of same value,
> only I know that and I want to declare it for the planner.
>
> I could probably rewrite the whole view in one query, and then fix it with a
> proper index. But I think I will loose alot of readability.
>
> I could also change the structure to save an explicit state, instead of a
> calculated state. But then I get some redundancy I need to make sure always
> stays the same.
>
> In the end one of these will probably be the solution.
>
> I guess the question is more or less,
>
> why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are
> there any alternatives? And a plausible use case for when it would be
> useful.

Well, the *behavior* is mandated by the sql standard.  Our
implementation is slow however.  I'm surprised the attached function
didn't help, it can be inlined and I was able to get bitmap or which
is pretty good.  As I said upthread I think INDF could theoretically
run as fast as equality -- it just doesn't today.

As your joins are written you could probably convert this by reserving
a customer_id to the work that you're trying to do with null, say, 0,
or -1.  This is a somewhat dubious practice but seems a better fit for
your use case.  I don't think INDF is good in this usage.

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:

> I have tried creating a function called
> zero_if_null(int) : int that just select COALESCE($1, 0)
> and adding a index on (zero_if_null(customer_id)) on table that contains
> customer_id. The only thing I get from is the planner now only knows how to
> compare customer_id, but it still doesn't know that they are of same value,
> > only I know that and I want to declare it for the planner.


> Well, the *behavior* is mandated by the sql standard.  Our
implementation is slow however.  

Sorry I'm not following, what behavior is mandated by the sql standard?

I'm surprised the attached function
didn't help, it can be inlined and I was able to get bitmap or which
is pretty good.  As I said upthread I think INDF could theoretically
run as fast as equality -- it just doesn't today.

It might be harsh to say that it doesn't help at all. I does half the running time, 
but I need it to run an order of magnitude faster. Here is the plan with the 
empty_if_null (customer_id is actually varchar)

https://explain.depesz.com/s/M1LV with empty_if_null + functional index

As your joins are written you could probably convert this by reserving
a customer_id to the work that you're trying to do with null, say, 0,
or -1.  This is a somewhat dubious practice but seems a better fit for
your use case.  I don't think INDF is good in this usage.

merlin

This will work well, I think. 

But I'm not sure I can mentally accept an unfilled value should not be 
null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
work well with this.

It might raise another problem, that the nulls are generated through LEFT
JOINS where now rows are defined. Then the 0 or -1 value need to be 
a computed value. Won't this throw of index lookups? (I might be 
more confused in this area).

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Mon, Oct 31, 2016 at 9:28 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
> On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>
>> > I have tried creating a function called
>> > zero_if_null(int) : int that just select COALESCE($1, 0)
>> > and adding a index on (zero_if_null(customer_id)) on table that contains
>> > customer_id. The only thing I get from is the planner now only knows how
>> > to
>> > compare customer_id, but it still doesn't know that they are of same
>> > value,
>> > only I know that and I want to declare it for the planner.
>
>
>> Well, the *behavior* is mandated by the sql standard.  Our
>> implementation is slow however.
>
> Sorry I'm not following, what behavior is mandated by the sql standard?

The semantics of IS DISTINCT FROM, basically, equality with special
consideration for nulls.

> But I'm not sure I can mentally accept an unfilled value should not be
> null (eg. 0, '', '0000-01-01'). But I can see how the equals operator will
> work well with this.

Yeah, this is a dubious tactic and I would normally only consider
using it for surrogate identifiers.

> It might raise another problem, that the nulls are generated through LEFT
> JOINS where now rows are defined. Then the 0 or -1 value need to be
> a computed value. Won't this throw of index lookups? (I might be
> more confused in this area).

Not following this.  BTW, if you want a fast plan over the current
data without consideration of aesthetics, try this:

CREATE VIEW view_circuit_with_status AS (
   SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id, s.customer_id
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id
      AND r.customer_id, s.customer_id
  UNION ALL SELECT r.*,
          s.circuit_status,
          s.customer_id AS s_customer_id,
          p.line_speed,
          p.customer_id AS p_customer_id
     FROM view_circuit r
     JOIN view_circuit_product_main s
       ON r.circuit_id = s.circuit_id
      AND r.customer_id IS NULL
      AND  s.customer_id IS NULL
     JOIN view_circuit_product p
       ON r.circuit_id = p.circuit_id


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:

>> It might raise another problem, that the nulls are generated through LEFT

>> JOINS where no rows are defined. Then the 0 or -1 value need to be
>> a computed value. Won't this throw off index lookups? (I might be
>> more confused in this area).
>
>Not following this.  

The nulls are generated by something like this
    SELECT c.circuit_id,
                   cc.customer_id
       FROM circuit AS c
LEFT JOIN circuit_customer AS cc
             ON c.circuit_id = cc.circuit_id

To make a magic '0' customer we would be required to use 
  COALESCE(cc.customer_id, '0')
I dont think the optimizer will do anything clever with the '0' we have 
computed from null.

I could ofc. by default assign all unassigned circuits to '0' in 
circuit_customer. I'm not a fan though.

>BTW, if you want a fast plan over the current
>data without consideration of aesthetics, try this:
>
>CREATE VIEW view_circuit_with_status AS (
>    SELECT r.*,
>          s.circuit_status,
>          s.customer_id AS s_customer_id,
>          p.line_speed,
>          p.customer_id AS p_customer_id
>     FROM view_circuit r
>     JOIN view_circuit_product_main s
>      ON r.circuit_id = s.circuit_id
>      AND r.customer_id, s.customer_id
>     JOIN view_circuit_product p
>       ON r.circuit_id = p.circuit_id
>      AND r.customer_id, s.customer_id
>  UNION ALL SELECT r.*,
>          s.circuit_status,
>          s.customer_id AS s_customer_id,
>          p.line_speed,
>          p.customer_id AS p_customer_id
>     FROM view_circuit r
>     JOIN view_circuit_product_main s
>       ON r.circuit_id = s.circuit_id
>      AND r.customer_id IS NULL
>      AND  s.customer_id IS NULL
>     JOIN view_circuit_product p
>       ON r.circuit_id = p.circuit_id>

I will have to figure something out, but this specific case is still problematic
since we would like to filter this view using different criteria's, like circuit_no, 
products or customers.

But with all these detours, I assume that a change to IS NOT DISTINCT FROM, 
is difficult or not wanted?

Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> It might raise another problem, that the nulls are generated through LEFT
>
>>> JOINS where no rows are defined. Then the 0 or -1 value need to be
>>> a computed value. Won't this throw off index lookups? (I might be
>>> more confused in this area).
>>
>>Not following this.
>
> The nulls are generated by something like this
>     SELECT c.circuit_id,
>                    cc.customer_id
>        FROM circuit AS c
> LEFT JOIN circuit_customer AS cc
>              ON c.circuit_id = cc.circuit_id
>
> To make a magic '0' customer we would be required to use
>   COALESCE(cc.customer_id, '0')
> I dont think the optimizer will do anything clever with the '0' we have
> computed from null.

It would if you explicitly indexed it as such;
CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));

> I could ofc. by default assign all unassigned circuits to '0' in
> circuit_customer. I'm not a fan though.

hm, why not?  null generally means 'unknown' and that's why it fails
any equality test.

>>BTW, if you want a fast plan over the current
>>data without consideration of aesthetics, try this:
>>
>>CREATE VIEW view_circuit_with_status AS (
>>    SELECT r.*,
>>          s.circuit_status,
>>          s.customer_id AS s_customer_id,
>>          p.line_speed,
>>          p.customer_id AS p_customer_id
>>     FROM view_circuit r
>>     JOIN view_circuit_product_main s
>>      ON r.circuit_id = s.circuit_id
>>      AND r.customer_id, s.customer_id
>>     JOIN view_circuit_product p
>>       ON r.circuit_id = p.circuit_id
>>      AND r.customer_id, s.customer_id
>>  UNION ALL SELECT r.*,
>>          s.circuit_status,
>>          s.customer_id AS s_customer_id,
>>          p.line_speed,
>>          p.customer_id AS p_customer_id
>>     FROM view_circuit r
>>     JOIN view_circuit_product_main s
>>       ON r.circuit_id = s.circuit_id
>>      AND r.customer_id IS NULL
>>      AND  s.customer_id IS NULL
>>     JOIN view_circuit_product p
>>       ON r.circuit_id = p.circuit_id>
>
> I will have to figure something out, but this specific case is still
> problematic
> since we would like to filter this view using different criteria's, like
> circuit_no,
> products or customers.

the above is logically equivalent to IS NOT DISTINCT FROM; you should
be able to query it as you would have done the original view.

> But with all these detours, I assume that a change to IS NOT DISTINCT FROM,
> is difficult or not wanted?

Well, not exactly.  In your case you are trying to treat null as a
specific value and pass it through join operations.

TBH, this is a pretty dubious approach: null is not supposed to be
equal to anything and any join vs null should come up empty --
logically at least.  INDF works around this of course but it's not a
recommended approach (my usage is generally restricted to, "has this
value changed since yesterday? etc").

I'm not an expert backend structures for indexing and optimization but
I know enough to suspect that optimizing INDF might cause
implementation headaches in various places, as do other irregular
syntactical approaches in SQL.  I think minimally optimizing INDF
would require converting it to an operator on par with '=' which is a
pretty large infrastructure change for an edge optimization case.
The fact that there are solid optimization strategies already on the
table (UNION ALL, expr index COALESCE()) does not help.

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Alban Hertroys
Дата:
On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>> The nulls are generated by something like this
>>     SELECT c.circuit_id,
>>                    cc.customer_id
>>        FROM circuit AS c
>> LEFT JOIN circuit_customer AS cc
>>              ON c.circuit_id = cc.circuit_id
>>
>> To make a magic '0' customer we would be required to use
>>   COALESCE(cc.customer_id, '0')
>> I dont think the optimizer will do anything clever with the '0' we have
>> computed from null.
>
> It would if you explicitly indexed it as such;
> CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));

Merlin, it's a LEFT JOIN. There probably are no NULLs in the
circuit_customer.customer_id column, so that COALESCE isn't going to
achieve anything at all.

I haven't been following this particular discussion in detail, so
unfortunately I can't contribute more than that remark at the moment.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Merlin Moncure
Дата:
On Fri, Nov 4, 2016 at 9:38 AM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 4 November 2016 at 14:41, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Fri, Nov 4, 2016 at 8:08 AM, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>> The nulls are generated by something like this
>>>     SELECT c.circuit_id,
>>>                    cc.customer_id
>>>        FROM circuit AS c
>>> LEFT JOIN circuit_customer AS cc
>>>              ON c.circuit_id = cc.circuit_id
>>>
>>> To make a magic '0' customer we would be required to use
>>>   COALESCE(cc.customer_id, '0')
>>> I dont think the optimizer will do anything clever with the '0' we have
>>> computed from null.
>>
>> It would if you explicitly indexed it as such;
>> CREATE INDEX ON circuit_customer((COALESCE(customer_id, '0'));
>
> Merlin, it's a LEFT JOIN. There probably are no NULLs in the
> circuit_customer.customer_id column, so that COALESCE isn't going to
> achieve anything at all.

Hang on -- upthread the context was inner join, and the gripe was join
fast with '=', slow with INDF.  When he said the nulls were
'generated', I didn't follow that they were part of the original
query.  If the nulls are generated along with the query, sure, an
index won't help.

I maintain my earlier point; with respect to the original query, to
get from performance of INDF to =, you have three options:
a) expr index the nulls  (assuming they are physically stored)
b) convert to ((a = b) or a is null and b is null) which can help with
a bitmap or plan
c) covert to union all equivalent of "b"

merlin


Re: How to hint 2 coulms IS NOT DISTINCT FROM each other

От
Kim Rose Carlsen
Дата:


> Hang on -- upthread the context was inner join, and the gripe was join

> fast with '=', slow with INDF.  When he said the nulls were
> 'generated', I didn't follow that they were part of the original
> query.  If the nulls are generated along with the query, sure, an
> index won't help.
>
> I maintain my earlier point; with respect to the original query, to
> get from performance of INDF to =, you have three options:
> a) expr index the nulls  (assuming they are physically stored)
> b) convert to ((a = b) or a is null and b is null) which can help with
> a bitmap or plan
> c) covert to union all equivalent of "b"

> merlin

a) and b) would be workaround that would run an order of magnitude slower. The query 
starts with a full table scan of a large table. If the planner had started elsewhere it could
have reduced the result to 1-2 rows from the start. It won't choose this plan without the help
from =.

c) could be a acceptable workaround, but it would clutter up if you would want more 
than one column to be IS NOT DISTINCT FROM. You end up with 2^n unions to simulate 
IS NOT DISTINCT FROM.

Without knowing the work required, I will still argue that having IS NOT DISTINCT FROM 
use the same transitive rules as equality,  would be a better approach. 

With fear of talking about things I know little(nothing) of, I think the description of EquivalenceClasses 
in postgres/src/backend/optimizer/README, should be extended to also include EquivalenceClasses
of IS NOT DISTINCT FROM.