Обсуждение: [PERFORM] Efficiently merging and sorting collections of sorted rows

Поиск
Список
Период
Сортировка

[PERFORM] Efficiently merging and sorting collections of sorted rows

От
Clint Miller
Дата:
Let's say I have the following table and index:

create table foo(s text, i integer);
create index foo_idx on foo (s, i);

If I run the following commands:

start transaction;
set local enable_sort = off;
explain analyze select * from foo where s = 'a' order by i;
end;

I get the following query plan:

Index Only Scan using foo_idx on foo (cost=0.14..8. 15 row=1 width=36) (actual time=0.008..0.0 10 rows=3 loops=1)
  Index Cond: (s = 'a'::text)
  Heap Fetches: 3

That's a good plan because it's not doing a quick sort. Instead, it's just reading the sort order off of the index, which is exactly what I want. (I had to disable enable_sort because I didn't have enough rows of test data in the table to get Postgres to use the index. But if I had enough rows, the enable_sort stuff wouldn't be necessary. My real table has lots of rows and doesn't need enable_sort turned off to do the sort with the index.)

But, if I run the following commands:

start transaction;
set local enable_sort = off;
explain analyze select * from foo where s = 'a' or s = 'b' order by i;
end;

I get the following plan:

Sort  (cost=10000000001.16..10000000001.16 rows=2 width=36) (actual time=0.020..0.021 rows=7 loops=1)
  Sort Key: i
  Sort Method: quicksort  Memory: 25kB
  ->  Seq Scan on foo  (cost=0.00..1.15 rows=2 width=36) (actual time=0.009..0.011 rows=7 loops=1)
        Filter: ((s = 'a'::text) OR (s = 'b'::text))
        Rows Removed by Filter: 3

Here, it's loading the full result set into memory and doing a quick sort. (I think that's what it's doing, at least. If that's not the case, let me know.) That's not good.

What I'd really like Postgres to do is use the index to get a sorted list of rows where s = 'a'. Then, use the index again to get a sorted list of rows where s = 'b'. Then it seems like Postgres should be able to merge the sorted lists into a single sorted result set in O(n) time and O(1) memory using a single merge operation.

Am I doing something wrong here? Is there a way to get Postgres to not do a quick sort here?

My concern is that my real table has a significant number of rows, and the result set will not fit into memory. So instead of getting a quick sort, I'll end up getting a slow, disk-based merge sort. I really need the bulk of the sort operation to come off of the index so that time and memory are small.

Thanks for any help on this issue.

Re: [PERFORM] Efficiently merging and sorting collections of sorted rows

От
Peter Geoghegan
Дата:
On Fri, Jun 23, 2017 at 3:58 PM, Clint Miller <clint.miller1@gmail.com> wrote:
> Here, it's loading the full result set into memory and doing a quick sort.
> (I think that's what it's doing, at least. If that's not the case, let me
> know.) That's not good.

It's not sorting stuff that doesn't need to be read into memory in the
first place. In the case of your plan with the sequential scan, some
rows are eliminated early, before being input to the sort node.

> What I'd really like Postgres to do is use the index to get a sorted list of
> rows where s = 'a'. Then, use the index again to get a sorted list of rows
> where s = 'b'. Then it seems like Postgres should be able to merge the
> sorted lists into a single sorted result set in O(n) time and O(1) memory
> using a single merge operation.
>
> Am I doing something wrong here? Is there a way to get Postgres to not do a
> quick sort here?

I would like that too. There is a patch that does what I think you're
describing, but it seems to be in limbo:

https://commitfest.postgresql.org/11/409/

--
Peter Geoghegan


Re: [PERFORM] Efficiently merging and sorting collections of sorted rows

От
Tom Lane
Дата:
Clint Miller <clint.miller1@gmail.com> writes:
> That's a good plan because it's not doing a quick sort. Instead, it's just
> reading the sort order off of the index, which is exactly what I want. (I
> had to disable enable_sort because I didn't have enough rows of test data
> in the table to get Postgres to use the index. But if I had enough rows,
> the enable_sort stuff wouldn't be necessary. My real table has lots of rows
> and doesn't need enable_sort turned off to do the sort with the index.)

TBH, I think this whole argument is proceeding from false premises.
Using an indexscan as a substitute for an explicit sort of lots of
rows isn't all that attractive, because it implies a whole lot of
random access to the table (unless the table is nearly in index
order, which isn't a condition you can count on without expending
a lot of maintenance effort to keep it that way).  seqscan-and-sort
is often a superior alternative, especially if you're willing to give
the sort a reasonable amount of work_mem.

> What I'd really like Postgres to do is use the index to get a sorted list
> of rows where s = 'a'. Then, use the index again to get a sorted list of
> rows where s = 'b'. Then it seems like Postgres should be able to merge the
> sorted lists into a single sorted result set in O(n) time and O(1) memory
> using a single merge operation.

If there's no duplicates to remove, I think this will work:

explain
(select * from foo a where s = 'a' order by i)
union all
(select * from foo b where s = 'b' order by i)
order by i;

 Merge Append  (cost=0.32..48.73 rows=12 width=36)
   Sort Key: a.i
   ->  Index Only Scan using foo_idx on foo a  (cost=0.15..24.26 rows=6 width=36)
         Index Cond: (s = 'a'::text)
   ->  Index Only Scan using foo_idx on foo b  (cost=0.15..24.26 rows=6 width=36)
         Index Cond: (s = 'b'::text)

In this case it's pretty obvious that the two union arms can never
return the same row, but optimizing OR into UNION in general is
difficult because of the possibility of duplicates.  I wouldn't
recommend holding your breath waiting for the planner to do this
for you.

            regards, tom lane


Re: [PERFORM] Efficiently merging and sorting collections of sorted rows

От
Merlin Moncure
Дата:
On Fri, Jun 23, 2017 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Clint Miller <clint.miller1@gmail.com> writes:
>> That's a good plan because it's not doing a quick sort. Instead, it's just
>> reading the sort order off of the index, which is exactly what I want. (I
>> had to disable enable_sort because I didn't have enough rows of test data
>> in the table to get Postgres to use the index. But if I had enough rows,
>> the enable_sort stuff wouldn't be necessary. My real table has lots of rows
>> and doesn't need enable_sort turned off to do the sort with the index.)
>
> TBH, I think this whole argument is proceeding from false premises.
> Using an indexscan as a substitute for an explicit sort of lots of
> rows isn't all that attractive, because it implies a whole lot of
> random access to the table (unless the table is nearly in index
> order, which isn't a condition you can count on without expending
> a lot of maintenance effort to keep it that way).  seqscan-and-sort
> is often a superior alternative, especially if you're willing to give
> the sort a reasonable amount of work_mem.

Hm, if he reverses the index terms he gets his sort order for free and
a guaranteed IOS.   This would only be sensible to do only if several
conditions applied, you'd have to live under the IOS criteria
generally, the number of rows returned to what relative to what was
thrown out would have to be reasonably high (this is key), and the
result set would have to be large making the sort an expensive
consideration relative to the filtering.  You'd also have to be
uninterested in explicit filters on 's' or be willing to create
another index to do that if you were.

merlin

postgres=# \d foo
      Table "public.foo"
 Column │  Type   │ Modifiers
────────┼─────────┼───────────
 s      │ text    │
 i      │ integer │
Indexes:
    "foo_i_s_idx" btree (i, s)  -- reversed

postgres=# set enable_sort = false;
SET

postgres=# explain analyze select * from foo where s = 'a' or s = 'b'
order by i;
                                                       QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Index Only Scan using foo_i_s_idx on foo  (cost=0.15..68.75 rows=12
width=36) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: ((s = 'a'::text) OR (s = 'b'::text))
   Heap Fetches: 0
 Planning time: 0.215 ms
 Execution time: 0.025 ms







merlin