Обсуждение: [GENERAL] query planner placement of sort/limit w.r.t. joins

Поиск
Список
Период
Сортировка

[GENERAL] query planner placement of sort/limit w.r.t. joins

От
Dave Vitek
Дата:
Hi all,

I have a query I'd like to speed up.  I am wondering whether the query
planner is capable of coming up with a certain kind of plan for this
query, and if I might tickle it into doing so, or if I have to
explicitly use subqueries to get what I want.

Imagine we have tables A, B, and C each having a one to one relationship.

SELECT A.val, B.val, C.val FROM A
    JOIN B ON A.b_id = B.id
    JOIN C ON B.c_id = C.id
    ORDER BY A.rank
    LIMIT 100;


Assume there are btree indices for all the "id" columns, but there is no
index on "rank."  None of the fields are nullable.  id columns are
unique.  Assume that there are about 1M tuples in the result set without
the LIMIT.  ANALYZE has run recently.

In this case, I want postgres to do a sequential scan and a top-N sort
against A since the ORDER BY only depends on columns of A, and then do
100 index scans to implement the joins.

What I'm observing is that it is doing sequential scans on A, B, and C,
joining, and then sorting.  The cost of the sequential scans of B and C
is large enough to be painful.

Is what I want in the query planner's vocabulary?  It would need to
exploit the fact that the _id columns are not nullable, and either
exploit the uniqueness of the id columns or do an extra LIMIT step after
the join.  I think I want it to effectively give the same result I
expect (haven't tested) it would give for:

SELECT D.val, B.val, C.val FROM
    (SELECT * FROM A ORDER BY A.rank LIMIT 100) AS D
    JOIN B ON D.b_id = B.id
    JOIN C ON B.c_id = C.id
    LIMIT 100;

Perhaps there are reasons why this optimization is not safe that I
haven't thought about?

The query is being generated for use by an ORM, so changing it to use a
subquery for A and the ORDER BY+LIMIT is not so easy (plus, with
different user input, it might need B or C in the ORDER BY).


- Dave



Re: [GENERAL] query planner placement of sort/limit w.r.t. joins

От
David Rowley
Дата:
On 29 April 2017 at 07:59, Dave Vitek <dvitek@grammatech.com> wrote:
> Is what I want in the query planner's vocabulary?  It would need to exploit
> the fact that the _id columns are not nullable, and either exploit the
> uniqueness of the id columns or do an extra LIMIT step after the join.  I
> think I want it to effectively give the same result I expect (haven't
> tested) it would give for:

Unfortunately, it's not a plan that the current planner will consider.

> SELECT D.val, B.val, C.val FROM
>    (SELECT * FROM A ORDER BY A.rank LIMIT 100) AS D
>    JOIN B ON D.b_id = B.id
>    JOIN C ON B.c_id = C.id
>    LIMIT 100;
>
> Perhaps there are reasons why this optimization is not safe that I haven't
> thought about?

Yeah, I think so. What happens if an A row cannot find a match in B or
C? This version of the query will end up returning fewer rows due to
that, but the original version would consider other rows with a higher
rank.

We've danced around a bit with using foreign keys as proofs that rows
will exist for other optimisations in the past, but it's tricky ground
since foreign keys are not updated immediately, so there are windows
where they may not actually hold true to their word.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [GENERAL] query planner placement of sort/limit w.r.t. joins

От
"David G. Johnston"
Дата:
On Fri, Apr 28, 2017 at 3:24 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 29 April 2017 at 07:59, Dave Vitek <dvitek@grammatech.com> wrote:
> Is what I want in the query planner's vocabulary?  It would need to exploit
> the fact that the _id columns are not nullable, and either exploit the
> uniqueness of the id columns or do an extra LIMIT step after the join.  I
> think I want it to effectively give the same result I expect (haven't
> tested) it would give for:

Unfortunately, it's not a plan that the current planner will consider.

> SELECT D.val, B.val, C.val FROM
>    (SELECT * FROM A ORDER BY A.rank LIMIT 100) AS D
>    JOIN B ON D.b_id = B.id
>    JOIN C ON B.c_id = C.id
>    LIMIT 100;
>
> Perhaps there are reasons why this optimization is not safe that I haven't
> thought about?

Yeah, I think so. What happens if an A row cannot find a match in B or
C? This version of the query will end up returning fewer rows due to
that, but the original version would consider other rows with a higher
rank.

We've danced around a bit with using foreign keys as proofs that rows
will exist for other optimisations in the past, but it's tricky ground
since foreign keys are not updated immediately, so there are windows
where they may not actually hold true to their word.

​​I read this query as having a relation cardinality of one-to-one mandatory - which precludes the scenario described.

Is the above saying that, today, there is no planning benefit to setting up two deferrable references constraints to enforce the non-optional requirement?

I know I'm guilty of not enforcing the non-optional part of the constraint.  Mostly due to not really realizing it but also having to deal the added syntax to perform inserts.  ORMs I suspect generally would be unaccommodating here as well...

David J.​

Re: [GENERAL] query planner placement of sort/limit w.r.t. joins

От
David Rowley
Дата:
On 29 April 2017 at 11:37, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> > Perhaps there are reasons why this optimization is not safe that I
>> > haven't
>> > thought about?
>>
>> Yeah, I think so. What happens if an A row cannot find a match in B or
>> C? This version of the query will end up returning fewer rows due to
>> that, but the original version would consider other rows with a higher
>> rank.
>>
>> We've danced around a bit with using foreign keys as proofs that rows
>> will exist for other optimisations in the past, but it's tricky ground
>> since foreign keys are not updated immediately, so there are windows
>> where they may not actually hold true to their word.
>
>
> I read this query as having a relation cardinality of one-to-one mandatory -
> which precludes the scenario described.

What makes you say so?

It's pretty easy to show how the queries are not the same.

create table a (
  id int primary key,
  b_id int not null,
  val int not null,
  rank int not null
);

create table b (
  id int primary key,
  c_id int not null,
  val int not null
);

create table c (
  id int primary key,
  val int not null
);
insert into a select x,x,x,x from generate_series(1,150) x;
insert into b select x,x,x from generate_series(51,150) x;
insert into c select x,x from generate_series(51,150) x;

SELECT A.val, B.val, C.val FROM A
   JOIN B ON A.b_id = B.id
   JOIN C ON B.c_id = C.id
   ORDER BY A.rank
   LIMIT 100; -- returns 100 rows

 SELECT D.val, B.val, C.val FROM
   (SELECT * FROM A ORDER BY A.rank LIMIT 100) AS D
   JOIN B ON D.b_id = B.id
   JOIN C ON B.c_id = C.id
   LIMIT 100; -- returns 50 rows


> Is the above saying that, today, there is no planning benefit to setting up
> two deferrable references constraints to enforce the non-optional
> requirement?

There is no place in the planner where a foreign key is used as a
proof that a joined row must exist, with the exception of row
estimations for statistics.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services