Thread: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions


Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:
Hi there,

I am trying to optimize a simple query that returns the first 100 rows that have been updated since a given timestamp (ordered by timestamp and id desc).  If there are several rows with the same timestamp I need a second condition, stating that I want to return rows having the given timestamp and id > given id.

The obvious query is 

SELECT * FROM register_uz_accounting_entities
    WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND id > 1459)
    ORDER BY effective_on, id
    LIMIT 100

With a composite index on (effective_on, id)
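The index definition itself is not shown in the thread; presumably it was created along these lines (the name idx2 is taken from the EXPLAIN output below, the exact definition is assumed):

```sql
-- Assumed definition of the composite index; the name idx2 appears
-- in the EXPLAIN output, the column list is as stated above.
CREATE INDEX idx2 ON register_uz_accounting_entities (effective_on, id);
```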

Query plan

"Limit  (cost=4613.70..4613.95 rows=100 width=1250) (actual time=0.122..0.130 rows=22 loops=1)"
"  Buffers: shared hit=28"
"  ->  Sort  (cost=4613.70..4617.33 rows=1453 width=1250) (actual time=0.120..0.122 rows=22 loops=1)"
"        Sort Key: effective_on, id"
"        Sort Method: quicksort  Memory: 30kB"
"        Buffers: shared hit=28"
"        ->  Bitmap Heap Scan on register_uz_accounting_entities  (cost=35.42..4558.17 rows=1453 width=1250) (actual time=0.036..0.083 rows=22 loops=1)"
"              Recheck Cond: ((effective_on > '2014-07-11'::date) OR ((effective_on = '2014-07-11'::date) AND (id > 1459)))"
"              Buffers: shared hit=28"
"              ->  BitmapOr  (cost=35.42..35.42 rows=1453 width=0) (actual time=0.026..0.026 rows=0 loops=1)"
"                    Buffers: shared hit=6"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..6.49 rows=275 width=0) (actual time=0.017..0.017 rows=15 loops=1)"
"                          Index Cond: (effective_on > '2014-07-11'::date)"
"                          Buffers: shared hit=3"
"                    ->  Bitmap Index Scan on idx2  (cost=0.00..28.21 rows=1178 width=0) (actual time=0.007..0.007 rows=7 loops=1)"
"                          Index Cond: ((effective_on = '2014-07-11'::date) AND (id > 1459))"
"                          Buffers: shared hit=3"
"Total runtime: 0.204 ms"


Everything works as expected. However, if I change the constraint to a timestamp/date further in the past (resulting in far more matching rows), the query slows down drastically.

SELECT * FROM register_uz_accounting_entities
WHERE effective_on > '2014-06-11' OR (effective_on = '2014-06-11' AND id > 1459)
ORDER BY effective_on, id
LIMIT 100
 
"Limit  (cost=0.42..649.46 rows=100 width=1250) (actual time=516.125..516.242 rows=100 loops=1)"
"  Buffers: shared hit=576201"
"  ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..106006.95 rows=16333 width=1250) (actual time=516.122..516.229 rows=100 loops=1)"
"        Filter: ((effective_on > '2014-06-11'::date) OR ((effective_on = '2014-06-11'::date) AND (id > 1459)))"
"        Rows Removed by Filter: 797708"
"        Buffers: shared hit=576201"
"Total runtime: 516.304 ms"


I've tried to optimize this query by pushing the LIMIT and ORDER BY down into explicit subselects.

SELECT * FROM (
   SELECT * FROM register_uz_accounting_entities 
   WHERE effective_on > '2014-06-11'
   ORDER BY effective_on, id LIMIT 100
   ) t1
UNION
  SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on = '2014-06-11' AND id > 1459
    ORDER BY effective_on, id LIMIT 100
) t2
ORDER BY effective_on, id
LIMIT 100
 
-- query plan
"Limit  (cost=684.29..684.54 rows=100 width=1250) (actual time=2.648..2.708 rows=100 loops=1)"
"  Buffers: shared hit=203"
"  ->  Sort  (cost=684.29..684.79 rows=200 width=1250) (actual time=2.646..2.672 rows=100 loops=1)"
"        Sort Key: register_uz_accounting_entities.effective_on, register_uz_accounting_entities.id"
"        Sort Method: quicksort  Memory: 79kB"
"        Buffers: shared hit=203"
"        ->  HashAggregate  (cost=674.65..676.65 rows=200 width=1250) (actual time=1.738..1.971 rows=200 loops=1)"
"              Buffers: shared hit=203"
"              ->  Append  (cost=0.42..661.15 rows=200 width=1250) (actual time=0.054..0.601 rows=200 loops=1)"
"                    Buffers: shared hit=203"
"                    ->  Limit  (cost=0.42..338.62 rows=100 width=1250) (actual time=0.053..0.293 rows=100 loops=1)"
"                          Buffers: shared hit=101"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities  (cost=0.42..22669.36 rows=6703 width=1250) (actual time=0.052..0.260 rows=100 loops=1)"
"                                Index Cond: (effective_on > '2014-06-11'::date)"
"                                Buffers: shared hit=101"
"                    ->  Limit  (cost=0.42..318.53 rows=100 width=1250) (actual time=0.037..0.228 rows=100 loops=1)"
"                          Buffers: shared hit=102"
"                          ->  Index Scan using idx2 on register_uz_accounting_entities register_uz_accounting_entities_1  (cost=0.42..30888.88 rows=9710 width=1250) (actual time=0.036..0.187 rows=100 loops=1)"
"                                Index Cond: ((effective_on = '2014-06-11'::date) AND (id > 1459))"
"                                Buffers: shared hit=102"
"Total runtime: 3.011 ms"

=> Very fast.

The question is... why is the query planner unable to make this optimization for the slow query? What am I missing?


Thanks in advance.


Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: David G Johnston
Date:
johno wrote
> The question is... why is the query planner unable to make this
> optimization for the slow query? What am I missing?

Short answer - your first and last queries are not relationally equivalent,
and the optimizer cannot change the behavior of the query it is
optimizing.  I.e., you did not make an optimization but rather chose to
reformulate the question so that it could be answered more easily while
still providing an acceptable answer.

The main question is better phrased as:

Give me 100 rows updated at t(0), but only those subsequent to a given ID.
If there are fewer than 100 such records, give me enough additional rows
having t > t(0) so that the total number of rows returned equals 100.

Both queries give the same answer, but only due to the final LIMIT 100. They
arrive there in different ways, which necessitates generating different
plans.  At a basic level the planner is unable to push a LIMIT down into a
WHERE clause, and it cannot add sub-queries that do not exist in the
original plan - which includes adding a UNION node.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812285.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:
Thanks for the quick reply David!

However, I am still unsure how these two queries are not relationally equivalent. I am struggling to find a counterexample where the first and third queries (in the email, not in the gist) would yield different results. Any ideas?

Jano




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: David G Johnston
Date:
johno wrote
> Thanks for the quick reply David!
>
> However I am still unsure how these two queries are not relationally
> equivalent. I am struggling to find a counterexample where the first and
> third query (in email, not in gist) would yield different results. Any
> ideas?

Remove the outer LIMIT 100 from both queries...

The first query would return the maximal number of rows that meet the OR
criteria, while the second query would return at most 200 rows, since both
sub-queries would still have their own independent LIMIT 100 clauses.
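A sketch of that difference, with the outer LIMIT 100 removed from both forms (queries reconstructed from the thread, values as in the slow example):

```sql
-- Form 1 without the outer LIMIT: returns *every* matching row.
SELECT * FROM register_uz_accounting_entities
WHERE effective_on > '2014-06-11'
   OR (effective_on = '2014-06-11' AND id > 1459)
ORDER BY effective_on, id;

-- Form 2 without the outer LIMIT: still returns at most 200 rows,
-- because each branch keeps its own inner LIMIT 100.
SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on > '2014-06-11'
    ORDER BY effective_on, id LIMIT 100
) t1
UNION
SELECT * FROM (
    SELECT * FROM register_uz_accounting_entities
    WHERE effective_on = '2014-06-11' AND id > 1459
    ORDER BY effective_on, id LIMIT 100
) t2
ORDER BY effective_on, id;
```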

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812289.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:
Oh, yes, I do understand that if I remove the outer limit, the semantics of the query change. However, I am looking for a counterexample *with* the limit clauses. Maybe I just don't understand what relationally equivalent means, sorry about that.

BTW, to my understanding this is a very similar scenario to how partitioned tables work and push down LIMIT and WHERE conditions. Why is that not possible in this case?

Jano



Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: David G Johnston
Date:
Try following my lead and bottom-post, please.

Anyway, the query has no clue that, because of the final LIMIT 100, the
two different feeding queries are just going to happen to end up providing
the same result.  Maybe, in this particular instance, it is theoretically
possible to make such a proof but generally that is not the case and so such
an optimization has not made it into the codebase even if it theoretically
could be done (I'm not convinced it could but do not know enough to explain
to someone else why I have that impression).

I do not know enough to answer why this situation is any different from a
similar partitioning scenario.  An example showing exactly what a similar
partitioning query looks like would help in this regard.

If you are looking for considerably more insight into the planner workings
and why it does or doesn't do something you will need to wait for others.  I
can, to a reasonable degree, deconstruct a pair of queries and either
explain or guess as to why things are happening but that is mostly applied
deductive reasoning and not because I have any particular insight into the
codebase.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812291.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:

> Try following my lead and bottom-post, please.

Sorry for that.

> Anyway, the query has no clue that, because of the final LIMIT 100, the
> two different feeding queries are just going to happen to end up providing
> the same result.  Maybe, in this particular instance, it is theoretically
> possible to make such a proof but generally that is not the case and so such
> an optimization has not made it into the codebase even if it theoretically
> could be done (I'm not convinced it could but do not know enough to explain
> to someone else why I have that impression).
>
> I do not know enough to answer why this situation is any different from a
> similar partitioning scenario.  An example showing exactly what a similar
> partitioning query looks like would help in this regard.
>
> If you are looking for considerably more insight into the planner workings
> and why it does or doesn't do something you will need to wait for others.  I
> can, to a reasonable degree, deconstruct a pair of queries and either
> explain or guess as to why things are happening but that is mostly applied
> deductive reasoning and not because I have any particular insight into the
> codebase.

Thanks again for your time. Let's just wait for someone else and see where this will end up going.

Jano

Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: Tom Lane
Date:
johno <jan.suchal@gmail.com> writes:
> I am trying to optimize a simple query that returns first 100 rows that
> have been updated since a given timestamp (ordered by timestamp and id
> desc).  If there are several rows with the same timestamp I need a
> second condition, that states that I want to return rows having the given
> timestamp and id > given id.

> The obvious query is

> SELECT * FROM register_uz_accounting_entities
> WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
> id > 1459)
> ORDER BY effective_on, id
> LIMIT 100

A more readily optimizable query is

SELECT * FROM register_uz_accounting_entities
WHERE (effective_on, id) > ('2014-07-11'::date, 1459)
ORDER BY effective_on, id
LIMIT 100

This formulation allows the planner to match both the WHERE and ORDER BY
clauses directly to the two-column index.
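The same row-comparison form also gives the usual keyset-pagination pattern for subsequent pages; a sketch (the cursor values are illustrative, taken as the last row of the previous page):

```sql
-- Fetch the next page: rows strictly after the last (effective_on, id)
-- pair already returned.  The composite index on (effective_on, id)
-- matches both the WHERE clause and the ORDER BY, so the scan can stop
-- as soon as 100 rows have been produced.
SELECT *
FROM register_uz_accounting_entities
WHERE (effective_on, id) > (DATE '2014-07-11', 1459)  -- last row seen
ORDER BY effective_on, id
LIMIT 100;
```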

> I've tried to optimize this query by pushing down the limit and order by's
> into explicit subselects.

As noted earlier, that's unlikely to be an improvement, because on its
face it specifies more computation.  Postgres is not terribly bright
about UNIONs, either.

            regards, tom lane


Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:



On Tue, Jul 22, 2014 at 4:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> johno <jan.suchal@gmail.com> writes:
>> I am trying to optimize a simple query that returns first 100 rows that
>> have been updated since a given timestamp (ordered by timestamp and id
>> desc).  If there are several rows with the same timestamp I need a
>> second condition, that states that I want to return rows having the given
>> timestamp and id > given id.
>
>> The obvious query is
>
>> SELECT * FROM register_uz_accounting_entities
>> WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
>> id > 1459)
>> ORDER BY effective_on, id
>> LIMIT 100
>
> A more readily optimizable query is
>
> SELECT * FROM register_uz_accounting_entities
> WHERE (effective_on, id) > ('2014-07-11'::date, 1459)
> ORDER BY effective_on, id
> LIMIT 100

Yes, but that query has completely different semantics - I can't change that.

> This formulation allows the planner to match both the WHERE and ORDER BY
> clauses directly to the two-column index.

Are both fields really used? I was under the impression that only the first column of the index can be used when there is a range query.

>> I've tried to optimize this query by pushing down the limit and order by's
>> into explicit subselects.
>
> As noted earlier, that's unlikely to be an improvement, because on its
> face it specifies more computation.  Postgres is not terribly bright
> about UNIONs, either.

Despite the cost calculations in EXPLAIN, the actual query times are very different. I get consistent sub-50ms responses from the optimized query (the UNION with pushed-down limits) and 500+ms from the plain one (when it does not use a bitmap index scan).

Is this possible optimization considered by the query planner, or do I have to "force" it?

Thanks again for your time and effort, I appreciate it.

Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: Tom Lane
Date:
johno <jan.suchal@gmail.com> writes:
> On Tue, Jul 22, 2014 at 4:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> johno <jan.suchal@gmail.com> writes:
>>> The obvious query is
>>> SELECT * FROM register_uz_accounting_entities
>>> WHERE effective_on > '2014-07-11' OR (effective_on = '2014-07-11' AND
>>> id > 1459)
>>> ORDER BY effective_on, id
>>> LIMIT 100

>> A more readily optimizable query is
>> SELECT * FROM register_uz_accounting_entities
>> WHERE (effective_on, id) > ('2014-07-11'::date, 1459)
>> ORDER BY effective_on, id
>> LIMIT 100

> Yes, but that query has completely different semantics - I can't change
> that.

No, it doesn't.  Read it again ... or read up on row comparisons,
if you're unfamiliar with that notation.  The above queries are
exactly equivalent per spec.
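For reference, the SQL-standard expansion of the row-wise comparison makes the equivalence explicit:

```sql
-- (a, b) > (x, y)  is defined as  a > x OR (a = x AND b > y),
-- so the following two predicates select exactly the same rows:
SELECT * FROM register_uz_accounting_entities
WHERE (effective_on, id) > (DATE '2014-07-11', 1459);       -- row-wise form

SELECT * FROM register_uz_accounting_entities
WHERE effective_on > DATE '2014-07-11'
   OR (effective_on = DATE '2014-07-11' AND id > 1459);     -- expanded form
```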

            regards, tom lane


Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions

From: johno
Date:
> No, it doesn't.  Read it again ... or read up on row comparisons,
> if you're unfamiliar with that notation.  The above queries are
> exactly equivalent per spec.

Wow, this is great. Thanks.