Обсуждение: [PATCH] Teach planner to further optimize sort in distinct

Поиск
Список
Период
Сортировка

[PATCH] Teach planner to further optimize sort in distinct

От
Ankit Kumar Pandey
Дата:
Hi, this is extension of `teach planner to evaluate multiple windows in 
the optimal order` work applied to distinct operation.

Based on discussions before 

(https://www.postgresql.org/message-id/flat/CAApHDvr7rSCVXzGfVa1L9pLpkKj6-s8NynK8o%2B98X9sKjejnQQ%40mail.gmail.com#e01327a3053d9281c40f281ef7105b42)

,

 > All I imagine you need to do for it
 > is to invent a function in pathkeys.c which is along the lines of what
 > pathkeys_count_contained_in() does, but returns a List of pathkeys
 > which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
 > that does not exist as a pathkey in keys1. In
 > create_final_distinct_paths(), you can then perform an incremental
 > sort on any input_path which has a non-empty return list and in
 > create_incremental_sort_path(), you'll pass presorted_keys as the
 > number of pathkeys in the path, and the required pathkeys the
 > input_path->pathkeys + the pathkeys returned from the new function.


There is bit confusion in wording here:

"returns a List of pathkeys
which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
that does not exist as a pathkey in keys1."

You mean extract common keys without ordering right?

Example: keys1 = (a,b,c), keys2 = (b,a)

returns (a,b)

and

keys1 = (a,b,c), keys = (d)

returns = ()

which translates to

needed_pathkeys = (a,b,c) = key2

input_pathkeys = (b,a) key1

returns (b,a) = common_keys

new needed_pathkeys = unique(common_keys + old needed_pathkeys)

=> new needed_pathkeys = (b,a,c)

The new needed_pathkeys matches input_pathkeys.

This is what I implemented in the patch.


The patched version yields the following plans:

set enable_hashagg=0;
set enable_seqscan=0;

explain (costs off) select distinct relname,relkind,count(*) over 
(partition by
relkind) from pg_Class;
                        QUERY PLAN
---------------------------------------------------------
  Unique
    ->  Incremental Sort
          Sort Key: relkind, relname, (count(*) OVER (?))
          Presorted Key: relkind
          ->  WindowAgg
                ->  Sort
                      Sort Key: relkind
                      ->  Seq Scan on pg_class
(8 rows)

explain (costs off) select distinct a, b, count(*) over (partition by b, 
a) from abcd;
                        QUERY PLAN
--------------------------------------------------------
  Unique
    ->  Incremental Sort
          Sort Key: b, a, (count(*) OVER (?))
          Presorted Key: b, a
          ->  WindowAgg
                ->  Incremental Sort
                      Sort Key: b, a
                      Presorted Key: b
                      ->  Index Scan using b_idx on abcd
(9 rows)

explain (costs off) select distinct a, b, count(*) over (partition by c, 
d) from abcd;
                        QUERY PLAN
--------------------------------------------------------
  Unique
    ->  Sort
          Sort Key: a, b, (count(*) OVER (?))
          ->  WindowAgg
                ->  Incremental Sort
                      Sort Key: c, d
                      Presorted Key: c
                      ->  Index Scan using c_idx on abcd
(8 rows)


Issue with index path still remains as pathkeys get purged by 
truncate_useless_pathkeys

and hence are not available in create_final_distinct_paths for the above 
optimizations.


I have attached a patch for the reference.


Thanks,

Ankit


Вложения

Re: [PATCH] Teach planner to further optimize sort in distinct

От
David Rowley
Дата:
On Wed, 18 Jan 2023 at 08:27, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> There is bit confusion in wording here:
>
> "returns a List of pathkeys
> which are in keys1 but not in keys2 and NIL if keys2 has a pathkey
> that does not exist as a pathkey in keys1."
>
> You mean extract common keys without ordering right?

I think you should write a function like:

bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2,
List **reorderedkeys, int *n_common)

which works very similarly to pathkeys_count_contained_in, but
populates *reorderedkeys so it contains all of the keys in keys1, but
put the matching ones in the same order as they are in keys2.  If you
find a keys2 that does not exist in keys1 then just add the additional
unmatched keys1 keys to *reorderedkeys.  Set *n_common to the number
of common keys excluding any that come after a key2 key that does not
exist as a key1 key.

You can just switch to using that function in
create_final_distinct_paths(). You'll need to consider if the query is
a DISTINCT ON query and not try the unordered version of the function
in that case.

I also just noticed that in build_index_paths() we'll leave the index
path's pathkeys empty if we deem the pathkeys as useless.  I'm not
sure what the repercussions of setting those to the return value of
build_index_pathkeys() if useful_pathkeys is otherwise empty.  It's
possible that truncate_useless_pathkeys() needs to be modified to
check if the pathkeys might be useful for DISTINCT, but now that I see
we don't populate the IndexPath's pathkeys when we deem them not
useful makes me wonder if this entire patch is a good idea. When I
thought about it I assumed that we always set IndexPath's pathkeys to
whatever (if any) sort order that the index provides.

David



Re: [PATCH] Teach planner to further optimize sort in distinct

От
Ankit Kumar Pandey
Дата:
> On 19/01/23 18:49, David Rowley wrote:

> I think you should write a function like:

> bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2,
> List **reorderedkeys, int *n_common)

> which works very similarly to pathkeys> _count_contained_in, but
> populates *reorderedkeys so it contains all of the keys in keys1, but
> put the matching ones in the same order as they are in keys2.  If you
> find a keys2 that does not exist in keys1 then just add the additional
> unmatched keys1 keys to *reorderedkeys.  Set *n_common to the number
> of common keys excluding any that come after a key2 key that does not
> exist as a key1 key.

> You can just switch to using that function in
> create_final_distinct_paths(). You'll need to consider if the query is
> a DISTINCT ON query and not try the unordered version of the function
> in that case.

Tried this, it worked as expected. Tests are green as well.

> I also just noticed that in build_index_paths() we'll leave the index
> path's pathkeys empty if we deem the pathkeys as useless.  I'm not
> sure what the repercussions of setting those to the return value of
> build_index_pathkeys() if useful_pathkeys is otherwise empty.

This is very rigid indeed.

> It's possible that truncate_useless_pathkeys() needs to be modified to
> check if the pathkeys might be useful for DISTINCT 

We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering.

Can we not have pathkeys_useful_for_distinct?

Also, pathkeys_useful_for_ordering calls pathkeys_count_contained_in.

We can add code path on similar lines.

> When I
> thought about it I assumed that we always set IndexPath's pathkeys to
> whatever (if any) sort order that the index provides.

Can we not added original path keys in IndexPath? It could be useful

at other places as well. Atleast, I can see it useful in sorting cases.

> makes me wonder if this entire patch is a good idea. 

We are still getting some benefit even without index paths for now.


I have attached patch with pathkeys_count_contained_in_unordered

and corresponding changes in create_final_distinct_paths for reference.


Thanks,

Ankit

Вложения

Re: [PATCH] Teach planner to further optimize sort in distinct

От
David Rowley
Дата:
On Fri, 20 Jan 2023 at 08:26, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> > On 19/01/23 18:49, David Rowley wrote:
> > You can just switch to using that function in
> > create_final_distinct_paths(). You'll need to consider if the query is
> > a DISTINCT ON query and not try the unordered version of the function
> > in that case.
>
> Tried this, it worked as expected. Tests are green as well.

Looking at the patch, you've not added any additional tests.  If the
existing tests are all passing then that just tells me that either the
code is not functioning as intended or we have no tests that look at
the EXPLAIN output which can make use of this optimization. If the
former is true, then the patch needs to be fixed. If it's the latter
then you need to write new tests.

> > It's possible that truncate_useless_pathkeys() needs to be modified to
> > check if the pathkeys might be useful for DISTINCT
>
> We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering.
>
> Can we not have pathkeys_useful_for_distinct?

I don't know all the repercussions. If you look at add_path() then
you'll see we do a pathkey comparison when the costs are not fuzzily
different from an existing path so that we try to keep a path with the
best pathkeys.  If we start keeping paths around with other weird
pathkeys that are not along the lines of the query_pathkeys requires,
then add_path might start throwing away paths that are actually good
for something.  It seems probable that could cause some regressions.

> I have attached patch with pathkeys_count_contained_in_unordered
> and corresponding changes in create_final_distinct_paths for reference.

Does this patch actually work?  I tried:

create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,1000)
a,generate_series(1,1000) b;
analyze ab;
create index on ab(a);
set enable_hashagg=0;
explain select distinct b,a from ab where a < 10;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Unique  (cost=729.70..789.67 rows=7714 width=8)
   ->  Sort  (cost=729.70..749.69 rows=7996 width=8)
         Sort Key: b, a
         ->  Index Scan using ab_a_idx on ab  (cost=0.42..211.36
rows=7996 width=8)
               Index Cond: (a < 10)
(5 rows)

I'd have expected an incremental sort here. I don't see that you're
adjusting IndexPath's pathkeys anywhere. The nested loop in
pathkeys_count_contained_in_unordered() seems to be inside out. The
reordered_pathkeys must have the common pathkeys in the order they
appear in keys2. In your patch, they'll be ordered according to the
keys1 list, which is wrong. Testing would tell you this, so all the
more reason to make the patch work and write some queries to ensure it
does actually work, then some tests to ensure that remains in a
working state.

Feel free to take the proper time to write a working patch which
contains new tests to ensure it's functioning as intended. It's
disheartening to review patches that don't seem to work. If it wasn't
meant to work, then you didn't make that clear. I'll likely not be
able to do any further reviews until the March commitfest, so it might
be better to only post if you're stuck.  Please don't rush out the
next patch. Take your time and study the code and see if you can build
up your own picture for what the repercussions might be of IndexPaths
having additional pathkeys when they were previously empty.  If you're
uncertain of aspects of the patch you've written feel free to leave
XXX type comments to indicate this. That way the reviewer will know
you might need more guidance on that and you'll not forget yourself
when you come back and look again after a few weeks.

David



Re: [PATCH] Teach planner to further optimize sort in distinct

От
Ankit Kumar Pandey
Дата:
> On 20/01/23 06:07, David Rowley wrote:

> Looking at the patch, you've not added any additional tests.  If the
> existing tests are all passing then that just tells me that either the
> code is not functioning as intended or we have no tests that look at
> the EXPLAIN output which can make use of this optimization. If the
> former is true, then the patch needs to be fixed. If it's the latter
> then you need to write new tests.

I definitely need to add tests because this scenario is missing.



> I don't know all the repercussions. If you look at add_path() then
> you'll see we do a pathkey comparison when the costs are not fuzzily
> different from an existing path so that we try to keep a path with the
> best pathkeys.  If we start keeping paths around with other weird
> pathkeys that are not along the lines of the query_pathkeys requires,
> then add_path might start throwing away paths that are actually good
> for something.  It seems probable that could cause some regressions.

Okay, in that case I think it is  better idea to store original pathkeys
(apart from what get assigned by useful_pathkeys). I need to dig deeper for this.


> Does this patch actually work?  I tried:
> I don't see that you're
> adjusting IndexPath's pathkeys anywhere. 

I had removed the changes for indexPath (it was in v1) because I hadn't investigated
repercussions. But I failed to mention this.

> The nested loop in
> pathkeys_count_contained_in_unordered() seems to be inside out. The
> reordered_pathkeys must have the common pathkeys in the order they
> appear in keys2. In your patch, they'll be ordered according to the
> keys1 list, which is wrong. Testing would tell you this, so all the
> more reason to make the patch work and write some queries to ensure it
> does actually work, then some tests to ensure that remains in a
> working state.
> Feel free to take the proper time to write a working patch which
> contains new tests to ensure it's functioning as intended. It's
> disheartening to review patches that don't seem to work. If it wasn't
> meant to work, then you didn't make that clear.
> Please don't rush out the next patch. Take your time and study the code 
> and see if you can build up your own picture for what the repercussions 
> might be of IndexPaths having additional pathkeys when they were previously empty.  
> If you're uncertain of aspects of the patch you've written feel free to leave
> XXX type comments to indicate this. That way the reviewer will know
> you might need more guidance on that and you'll not forget yourself
> when you come back and look again after a few weeks.

I deeply regret this. I will be mindful of my patches and ensure that they are
complete by themselves.
Thanks for your pointers as well, I can see errors in my approach which I will address.
  

> I'll likely not be
> able to do any further reviews until the March commitfest, so it might
> be better to only post if you're stuck.  

Yes sure, I will work on patches and limit posts to discussion only (if blocked).

Thanks,
Ankit