Обсуждение: [PATCH] Teach planner to further optimize sort in distinct
Hi, this is extension of `teach planner to evaluate multiple windows in the optimal order` work applied to distinct operation. Based on discussions before (https://www.postgresql.org/message-id/flat/CAApHDvr7rSCVXzGfVa1L9pLpkKj6-s8NynK8o%2B98X9sKjejnQQ%40mail.gmail.com#e01327a3053d9281c40f281ef7105b42) , > All I imagine you need to do for it > is to invent a function in pathkeys.c which is along the lines of what > pathkeys_count_contained_in() does, but returns a List of pathkeys > which are in keys1 but not in keys2 and NIL if keys2 has a pathkey > that does not exist as a pathkey in keys1. In > create_final_distinct_paths(), you can then perform an incremental > sort on any input_path which has a non-empty return list and in > create_incremental_sort_path(), you'll pass presorted_keys as the > number of pathkeys in the path, and the required pathkeys the > input_path->pathkeys + the pathkeys returned from the new function. There is bit confusion in wording here: "returns a List of pathkeys which are in keys1 but not in keys2 and NIL if keys2 has a pathkey that does not exist as a pathkey in keys1." You mean extract common keys without ordering right? Example: keys1 = (a,b,c), keys2 = (b,a) returns (a,b) and keys1 = (a,b,c), keys = (d) returns = () which translates to needed_pathkeys = (a,b,c) = key2 input_pathkeys = (b,a) key1 returns (b,a) = common_keys new needed_pathkeys = unique(common_keys + old needed_pathkeys) => new needed_pathkeys = (b,a,c) The new needed_pathkeys matches input_pathkeys. This is what I implemented in the patch. The patched version yields the following plans: set enable_hashagg=0; set enable_seqscan=0; explain (costs off) select distinct relname,relkind,count(*) over (partition by relkind) from pg_Class; QUERY PLAN --------------------------------------------------------- Unique -> Incremental Sort Sort Key: relkind, relname, (count(*) OVER (?)) Presorted Key: relkind -> WindowAgg -> Sort Sort Key: relkind -> Seq Scan on pg_class (8 rows) explain (costs off) select distinct a, b, count(*) over (partition by b, a) from abcd; QUERY PLAN -------------------------------------------------------- Unique -> Incremental Sort Sort Key: b, a, (count(*) OVER (?)) Presorted Key: b, a -> WindowAgg -> Incremental Sort Sort Key: b, a Presorted Key: b -> Index Scan using b_idx on abcd (9 rows) explain (costs off) select distinct a, b, count(*) over (partition by c, d) from abcd; QUERY PLAN -------------------------------------------------------- Unique -> Sort Sort Key: a, b, (count(*) OVER (?)) -> WindowAgg -> Incremental Sort Sort Key: c, d Presorted Key: c -> Index Scan using c_idx on abcd (8 rows) Issue with index path still remains as pathkeys get purged by truncate_useless_pathkeys and hence are not available in create_final_distinct_paths for the above optimizations. I have attached a patch for the reference. Thanks, Ankit
Вложения
On Wed, 18 Jan 2023 at 08:27, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote: > There is bit confusion in wording here: > > "returns a List of pathkeys > which are in keys1 but not in keys2 and NIL if keys2 has a pathkey > that does not exist as a pathkey in keys1." > > You mean extract common keys without ordering right? I think you should write a function like: bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2, List **reorderedkeys, int *n_common) which works very similarly to pathkeys_count_contained_in, but populates *reorderedkeys so it contains all of the keys in keys1, but put the matching ones in the same order as they are in keys2. If you find a keys2 that does not exist in keys1 then just add the additional unmatched keys1 keys to *reorderedkeys. Set *n_common to the number of common keys excluding any that come after a key2 key that does not exist as a key1 key. You can just switch to using that function in create_final_distinct_paths(). You'll need to consider if the query is a DISTINCT ON query and not try the unordered version of the function in that case. I also just noticed that in build_index_paths() we'll leave the index path's pathkeys empty if we deem the pathkeys as useless. I'm not sure what the repercussions of setting those to the return value of build_index_pathkeys() if useful_pathkeys is otherwise empty. It's possible that truncate_useless_pathkeys() needs to be modified to check if the pathkeys might be useful for DISTINCT, but now that I see we don't populate the IndexPath's pathkeys when we deem them not useful makes me wonder if this entire patch is a good idea. When I thought about it I assumed that we always set IndexPath's pathkeys to whatever (if any) sort order that the index provides. David
> On 19/01/23 18:49, David Rowley wrote: > I think you should write a function like: > bool pathkeys_count_contained_in_unordered(List *keys1, List *keys2, > List **reorderedkeys, int *n_common) > which works very similarly to pathkeys> _count_contained_in, but > populates *reorderedkeys so it contains all of the keys in keys1, but > put the matching ones in the same order as they are in keys2. If you > find a keys2 that does not exist in keys1 then just add the additional > unmatched keys1 keys to *reorderedkeys. Set *n_common to the number > of common keys excluding any that come after a key2 key that does not > exist as a key1 key. > You can just switch to using that function in > create_final_distinct_paths(). You'll need to consider if the query is > a DISTINCT ON query and not try the unordered version of the function > in that case. Tried this, it worked as expected. Tests are green as well. > I also just noticed that in build_index_paths() we'll leave the index > path's pathkeys empty if we deem the pathkeys as useless. I'm not > sure what the repercussions of setting those to the return value of > build_index_pathkeys() if useful_pathkeys is otherwise empty. This is very rigid indeed. > It's possible that truncate_useless_pathkeys() needs to be modified to > check if the pathkeys might be useful for DISTINCT We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering. Can we not have pathkeys_useful_for_distinct? Also, pathkeys_useful_for_ordering calls pathkeys_count_contained_in. We can add code path on similar lines. > When I > thought about it I assumed that we always set IndexPath's pathkeys to > whatever (if any) sort order that the index provides. Can we not added original path keys in IndexPath? It could be useful at other places as well. Atleast, I can see it useful in sorting cases. > makes me wonder if this entire patch is a good idea. We are still getting some benefit even without index paths for now. I have attached patch with pathkeys_count_contained_in_unordered and corresponding changes in create_final_distinct_paths for reference. Thanks, Ankit
Вложения
On Fri, 20 Jan 2023 at 08:26, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote: > > On 19/01/23 18:49, David Rowley wrote: > > You can just switch to using that function in > > create_final_distinct_paths(). You'll need to consider if the query is > > a DISTINCT ON query and not try the unordered version of the function > > in that case. > > Tried this, it worked as expected. Tests are green as well. Looking at the patch, you've not added any additional tests. If the existing tests are all passing then that just tells me that either the code is not functioning as intended or we have no tests that look at the EXPLAIN output which can make use of this optimization. If the former is true, then the patch needs to be fixed. If it's the latter then you need to write new tests. > > It's possible that truncate_useless_pathkeys() needs to be modified to > > check if the pathkeys might be useful for DISTINCT > > We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering. > > Can we not have pathkeys_useful_for_distinct? I don't know all the repercussions. If you look at add_path() then you'll see we do a pathkey comparison when the costs are not fuzzily different from an existing path so that we try to keep a path with the best pathkeys. If we start keeping paths around with other weird pathkeys that are not along the lines of the query_pathkeys requires, then add_path might start throwing away paths that are actually good for something. It seems probable that could cause some regressions. > I have attached patch with pathkeys_count_contained_in_unordered > and corresponding changes in create_final_distinct_paths for reference. Does this patch actually work? I tried: create table ab (a int, b int); insert into ab select a,b from generate_Series(1,1000) a,generate_series(1,1000) b; analyze ab; create index on ab(a); set enable_hashagg=0; explain select distinct b,a from ab where a < 10; QUERY PLAN ------------------------------------------------------------------------------------ Unique (cost=729.70..789.67 rows=7714 width=8) -> Sort (cost=729.70..749.69 rows=7996 width=8) Sort Key: b, a -> Index Scan using ab_a_idx on ab (cost=0.42..211.36 rows=7996 width=8) Index Cond: (a < 10) (5 rows) I'd have expected an incremental sort here. I don't see that you're adjusting IndexPath's pathkeys anywhere. The nested loop in pathkeys_count_contained_in_unordered() seems to be inside out. The reordered_pathkeys must have the common pathkeys in the order they appear in keys2. In your patch, they'll be ordered according to the keys1 list, which is wrong. Testing would tell you this, so all the more reason to make the patch work and write some queries to ensure it does actually work, then some tests to ensure that remains in a working state. Feel free to take the proper time to write a working patch which contains new tests to ensure it's functioning as intended. It's disheartening to review patches that don't seem to work. If it wasn't meant to work, then you didn't make that clear. I'll likely not be able to do any further reviews until the March commitfest, so it might be better to only post if you're stuck. Please don't rush out the next patch. Take your time and study the code and see if you can build up your own picture for what the repercussions might be of IndexPaths having additional pathkeys when they were previously empty. If you're uncertain of aspects of the patch you've written feel free to leave XXX type comments to indicate this. That way the reviewer will know you might need more guidance on that and you'll not forget yourself when you come back and look again after a few weeks. David
> On 20/01/23 06:07, David Rowley wrote: > Looking at the patch, you've not added any additional tests. If the > existing tests are all passing then that just tells me that either the > code is not functioning as intended or we have no tests that look at > the EXPLAIN output which can make use of this optimization. If the > former is true, then the patch needs to be fixed. If it's the latter > then you need to write new tests. I definitely need to add tests because this scenario is missing. > I don't know all the repercussions. If you look at add_path() then > you'll see we do a pathkey comparison when the costs are not fuzzily > different from an existing path so that we try to keep a path with the > best pathkeys. If we start keeping paths around with other weird > pathkeys that are not along the lines of the query_pathkeys requires, > then add_path might start throwing away paths that are actually good > for something. It seems probable that could cause some regressions. Okay, in that case I think it is better idea to store original pathkeys (apart from what get assigned by useful_pathkeys). I need to dig deeper for this. > Does this patch actually work? I tried: > I don't see that you're > adjusting IndexPath's pathkeys anywhere. I had removed the changes for indexPath (it was in v1) because I hadn't investigated repercussions. But I failed to mention this. > The nested loop in > pathkeys_count_contained_in_unordered() seems to be inside out. The > reordered_pathkeys must have the common pathkeys in the order they > appear in keys2. In your patch, they'll be ordered according to the > keys1 list, which is wrong. Testing would tell you this, so all the > more reason to make the patch work and write some queries to ensure it > does actually work, then some tests to ensure that remains in a > working state. > Feel free to take the proper time to write a working patch which > contains new tests to ensure it's functioning as intended. It's > disheartening to review patches that don't seem to work. If it wasn't > meant to work, then you didn't make that clear. > Please don't rush out the next patch. Take your time and study the code > and see if you can build up your own picture for what the repercussions > might be of IndexPaths having additional pathkeys when they were previously empty. > If you're uncertain of aspects of the patch you've written feel free to leave > XXX type comments to indicate this. That way the reviewer will know > you might need more guidance on that and you'll not forget yourself > when you come back and look again after a few weeks. I deeply regret this. I will be mindful of my patches and ensure that they are complete by themselves. Thanks for your pointers as well, I can see errors in my approach which I will address. > I'll likely not be > able to do any further reviews until the March commitfest, so it might > be better to only post if you're stuck. Yes sure, I will work on patches and limit posts to discussion only (if blocked). Thanks, Ankit