Thread: Optimizer on sort aggregate


Optimizer on sort aggregate

From: Feng Tian
Date:
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN                              
------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query,
select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to,
select count(distinct j) from t group by i, t;  -- note switching the ordering
select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea

Run times are just about 5 and 6.5 seconds.  The reason is clear: comparing strings with collation is slow, which is well understood by PG hackers.  However, here the sort order is forced by the planner, not the user.  The planner could do the following optimizations:

1. For the sort generated for the sort aggregate, the planner can switch the column ordering, putting the int before the string.
2. For the sort generated for the sort aggregate, it can use a bytea comparison instead of a string comparison.

These would bring a big improvement to this common query.  Is this something reasonable?
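
As an illustration, optimization (2) can be approximated by hand today by overriding the collation in the GROUP BY expression (a sketch, not part of the measurements above; it assumes byte-wise comparison is acceptable for grouping on t, and any speedup would need to be verified):

-- hypothetical rewrite: group on t under the "C" collation so the grouping
-- sort falls back to plain byte comparisons instead of locale-aware ones
select count(distinct j) from t group by t collate "C", i;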

Thanks,




Re: Optimizer on sort aggregate

From: Tatsuo Ishii
Date:
> The query,
> select count(distinct j) from t group by t, i;
> 
> runs for 35 seconds.  However, if I change the query to,
> select count(distinct j) from t group by i, t;  -- note switching the
> ordering
> select count(distinct j) from t group by decode(t, 'escape'), i; -- convert
> t to bytea
> 
> Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a
> string with collation is slow, which is well understood by pg hackers.
> However, here, the sorting order is forced by the planner, not user.
> Planner can do the following optimizations,

Interesting. I got the following result:

test=# explain analyze select count(distinct j) from t group by t, i;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual time=1332.937..2431.238 rows=1000000 loops=1)
   Group Key: t, i
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=1332.922..1507.413 rows=1000000 loops=1)
         Sort Key: t, i
         Sort Method: external merge  Disk: 33232kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual time=0.006..131.406 rows=1000000 loops=1)
 Planning time: 0.031 ms
 Execution time: 2484.271 ms
(8 rows)

Time: 2484.520 ms

test=# explain analyze select count(distinct j) from t group by i, t;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual time=602.510..1632.087 rows=1000000 loops=1)
   Group Key: i, t
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=602.493..703.274 rows=1000000 loops=1)
         Sort Key: i, t
         Sort Method: external sort  Disk: 33240kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual time=0.014..129.213 rows=1000000 loops=1)
 Planning time: 0.176 ms
 Execution time: 1685.575 ms
(8 rows)

Time: 1687.641 ms

Not so big a difference here (maybe because I use an SSD), but there is
still about a 50% difference in execution time. Note that I have locale
support disabled.

> 1. for the sort we generated for sort agg, planner can switch column
> ordering, put int before string,
> 2. for the sort we generated for sort agg, use bytea compare instead of
> string compare.
> 
> They will bring big improvement to this common query.   Is this something
> reasonable?

Not having looked at the sort and planner code closely, I guess the
planner does not recognize the cost difference between "external merge"
and "external sort", because the cost estimates for the sort step in each
plan are exactly the same (137519.84..140019.84). However, I am not sure
whether we should fix the planner or our sort machinery, since it is
possible that the sort machinery should not expose such a big difference
between the two sort methods.
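
One way to separate the two effects (just a sketch, not something measured here; 256MB is an arbitrary value that is large enough for this data set): give the sorts enough work_mem that both stay in memory, so any remaining gap is purely comparison cost rather than external merge vs. external sort.

set work_mem = '256MB';
explain analyze select count(distinct j) from t group by t, i;
explain analyze select count(distinct j) from t group by i, t;
reset work_mem;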
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: Optimizer on sort aggregate

From: David Rowley
Date:
On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <fengttt@gmail.com> wrote:
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN                              
------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query,
select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to,
select count(distinct j) from t group by i, t;  -- note switching the ordering
select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea

Run times are just about 5 and 6.5 seconds.  The reason is clear: comparing strings with collation is slow, which is well understood by PG hackers.  However, here the sort order is forced by the planner, not the user.  The planner could do the following optimizations:

1. For the sort generated for the sort aggregate, the planner can switch the column ordering, putting the int before the string.
2. For the sort generated for the sort aggregate, it can use a bytea comparison instead of a string comparison.

These would bring a big improvement to this common query.  Is this something reasonable?


It seems like it might be worth looking into, but I think it's likely more complex than just reordering the GROUP BY columns to put the narrowest column first.

If the query was:

select count(distinct j) from t group by t, i order by t, i;

Then if that were rewritten to group by i, t, the planner would need to perform an additional sort on t, i to get the correct ordering for the final result, which may or may not be faster; it would depend on how many groups there were to sort. Though I guess you could just skip the optimisation if the next node up didn't need any specific ordering.

I also wonder if taking the column's width into account is not quite enough. For example, if the 'i' column only had 1 distinct value, then performing the group by on 'i' first wouldn't help at all. Keep in mind that the columns could be much closer in width than in your example, e.g. int and bigint. Perhaps you'd need some sort of heuristics that look at both the number of distinct values and the width, and form a weighted estimate of which column to put first.
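
As a sketch of that degenerate case (hypothetical data, not benchmarked):

create table t2 (i int, j int, t text);
insert into t2 select 1, g % 100, 'AABBCCDD' || g from generate_series(1, 1000000) g;
-- every row has i = 1, so the sort still has to compare t for essentially
-- every pair of rows; putting the int column first buys nothing here
explain analyze select count(distinct j) from t2 group by i, t;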

Regards

David Rowley

Re: Optimizer on sort aggregate

From: David Rowley
Date:
On Sat, Oct 18, 2014 at 12:35 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> The query,
> select count(distinct j) from t group by t, i;
>
> runs for 35 seconds.  However, if I change the query to,
> select count(distinct j) from t group by i, t;  -- note switching the
> ordering
> select count(distinct j) from t group by decode(t, 'escape'), i; -- convert
> t to bytea
>
> Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a
> string with collation is slow, which is well understood by pg hackers.
> However, here, the sorting order is forced by the planner, not user.
> Planner can do the following optimizations,

Interesting. I got the following result:

test=# explain analyze select count(distinct j) from t group by t, i;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual time=1332.937..2431.238 rows=1000000 loops=1)
   Group Key: t, i
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=1332.922..1507.413 rows=1000000 loops=1)
         Sort Key: t, i
         Sort Method: external merge  Disk: 33232kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual time=0.006..131.406 rows=1000000 loops=1)
 Planning time: 0.031 ms
 Execution time: 2484.271 ms
(8 rows)

Time: 2484.520 ms

test=# explain analyze select count(distinct j) from t group by i, t;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=137519.84..157519.84 rows=1000000 width=22) (actual time=602.510..1632.087 rows=1000000 loops=1)
   Group Key: i, t
   ->  Sort  (cost=137519.84..140019.84 rows=1000000 width=22) (actual time=602.493..703.274 rows=1000000 loops=1)
         Sort Key: i, t
         Sort Method: external sort  Disk: 33240kB
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22) (actual time=0.014..129.213 rows=1000000 loops=1)
 Planning time: 0.176 ms
 Execution time: 1685.575 ms
(8 rows)

Time: 1687.641 ms

Not so big a difference here (maybe because I use an SSD), but there is
still about a 50% difference in execution time. Note that I have locale
support disabled.


I think this is more likely down to your locale settings, since if I do:

 create table t(i int, j int, k int, t text collate "C");

the GROUP BY t, i query runs about 25% faster.
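
For reference, a sketch of that experiment end to end (the table is renamed to t_c here so it does not clobber t; the 25% figure is from the run above and will vary by machine and locale):

show lc_collate;
create table t_c (i int, j int, k int, t text collate "C");
insert into t_c select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;
explain analyze select count(distinct j) from t_c group by t, i;
explain analyze select count(distinct j) from t_c group by i, t;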

I've not looked at it yet, but Peter G's patch here https://commitfest.postgresql.org/action/patch_view?id=1462 will quite likely narrow the performance gap between the 2 queries.

Regards

David Rowley

Re: Optimizer on sort aggregate

From: Feng Tian
Date:
Hi, David,

Yes, switching the sort order would lose an interesting order, so if the user dictates order by t, i the planner needs to resort to its cost model.  Estimating the cardinality of a group by is a much bigger topic than this thread.

I feel sorting a string as if it were bytea is particularly interesting.  I am aware of Peter G's patch and I think it is great, but for this sort-agg case, first, I believe it is still slower than sorting bytea, and second, Peter G's patch depends on the data.  A common long prefix will make the patch less effective, and that is probably not so uncommon (for example, URLs sharing a domain prefix).  I don't see any downside to sorting as bytea, other than losing the interesting ordering.
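
A sketch of the common-prefix case in question (made-up data, not measured):

create table urls (i int, u text);
insert into urls select g, 'http://www.example.com/article/' || g from generate_series(1, 1000000) g;
-- every u shares the same 31-byte prefix, so an abbreviated key built from the
-- leading bytes of the raw string cannot distinguish any of the rows
explain analyze select count(distinct i) from urls group by u;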

Thanks,
Feng

   

On Fri, Oct 17, 2014 at 4:36 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <fengttt@gmail.com> wrote:
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN                              
------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query,
select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to,
select count(distinct j) from t group by i, t;  -- note switching the ordering
select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea

Run times are just about 5 and 6.5 seconds.  The reason is clear: comparing strings with collation is slow, which is well understood by PG hackers.  However, here the sort order is forced by the planner, not the user.  The planner could do the following optimizations:

1. For the sort generated for the sort aggregate, the planner can switch the column ordering, putting the int before the string.
2. For the sort generated for the sort aggregate, it can use a bytea comparison instead of a string comparison.

These would bring a big improvement to this common query.  Is this something reasonable?


It seems like it might be worth looking into, but I think it's likely more complex than just reordering the GROUP BY columns to put the narrowest column first.

If the query was:

select count(distinct j) from t group by t, i order by t, i;

Then if that were rewritten to group by i, t, the planner would need to perform an additional sort on t, i to get the correct ordering for the final result, which may or may not be faster; it would depend on how many groups there were to sort. Though I guess you could just skip the optimisation if the next node up didn't need any specific ordering.

I also wonder if taking the column's width into account is not quite enough. For example, if the 'i' column only had 1 distinct value, then performing the group by on 'i' first wouldn't help at all. Keep in mind that the columns could be much closer in width than in your example, e.g. int and bigint. Perhaps you'd need some sort of heuristics that look at both the number of distinct values and the width, and form a weighted estimate of which column to put first.

Regards

David Rowley

Re: Optimizer on sort aggregate

From: Peter Geoghegan
Date:
On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ftian@vitessedata.com> wrote:
> I feel sorting string as if it is bytea is particularly interesting.  I am
> aware Peter G's patch and I think it is great, but for this sort agg case,
> first, I believe it is still slower than sorting bytea, and second, Peter
> G's patch depends on data.   A common long prefix will make the patch less
> effective, which is probably not so uncommon (for example, URL with a domain
> prefix).  I don't see any downside of sort bytea, other than lost the
> interest ordering.

FWIW, that's probably less true than you'd think. Using Robert's test program:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something"
"http://www.something" ->
131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444
(59 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another"
"http://www.another" ->
131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444
(53 bytes)

So the first eight bytes of the first string are 0x131F1F1B2222221E,
and of the second 0x131F1F1B2222220C. The last byte is different. That's
because of the way the Unicode collation algorithm [1] works: there is
often a significantly greater concentration of entropy in the first 8
bytes of the transformed blob than in the first 8 bytes of the raw C
string as seen by memcmp() - punctuation characters and so on are not
actually described at the primary weight level. If we can get even a
single byte to somewhat differentiate each string, we can still win by a
very significant amount - just not an enormous amount. The break-even
point is hard to characterize exactly, but I'm quite optimistic that a
large majority of real-world text sorts will see at least some benefit,
while a smaller majority will be much, much faster.

[1] http://www.unicode.org/reports/tr10/#Notation
-- 
Peter Geoghegan



Re: Optimizer on sort aggregate

From: Greg Stark
Date:
On Sat, Oct 18, 2014 at 3:10 AM, Peter Geoghegan <pg@heroku.com> wrote:
> So the first eight bytes of the first string is 0x131F1F1B2222221E,
> and the second 0x131F1F1B2222220C. The last byte is different.

That's interesting, but I think it's mostly a quirk of your example.
Afaics the difference is only that the en_US locale ignores
punctuation like : and / (or at least treats them as less significant
than alphabetic characters). If you had strings that had less
punctuation, or differences that didn't happen to arrive shortly after
the 8-byte boundary, then it wouldn't make any difference.

And we still have to run strxfrm at least once, write out the whole
binary blob to memory somewhere, and if it spills to disk we still have
to write and read much more data. I think recognizing cases where
equality is the only thing we're interested in, so locale-sensitive
sorting isn't necessary, and using a memcmp there would be a clear win.

I'm not immediately clear on what the cleanest way to integrate it
would be. A btree opclass function like the cmp function but that
doesn't need to be consistent with < and >, only = ? Or perhaps a flag
on the btree opclass that indicates that the data types can safely be
compared with memcmp when equality is all that's needed? The latter is
pretty tempting since it would tell code something interesting about
the data type's internal storage that may lead to other optimizations.
On the other hand the former is nice in that the operator could maybe
handle other cases like padding by doing memcmp on only the
significant bits.
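
For reference, the catalog entries such a flag or extra support proc would sit alongside can be seen today (a sketch; this just lists the existing btree support functions for text, it is not the proposed change):

select opc.opcname, p.amprocnum, p.amproc::regproc
from pg_opclass opc
join pg_am am on am.oid = opc.opcmethod and am.amname = 'btree'
join pg_amproc p on p.amprocfamily = opc.opcfamily
where opc.opcname = 'text_ops'
  and p.amproclefttype = 'text'::regtype
  and p.amprocrighttype = 'text'::regtype
order by p.amprocnum;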



-- 
greg



Re: Optimizer on sort aggregate

From: Peter Geoghegan
Date:
On Sat, Oct 18, 2014 at 5:27 AM, Greg Stark <stark@mit.edu> wrote:
> That's interesting but I think it's mostly a quirk of your example.
> Afaics the difference is only that the en_US locale ignores
> punctuation like : and /  (or at least treats them as less significant
> than alphabetic characters). If you had strings that had less
> punctuation or differences that didn't happen to arrive shortly after
> the 8-byte boundary then it wouldn't make any difference.

The en_US locale treats those kinds of punctuation characters as less
significant than alphabetical characters, but it is not at all unusual in
doing so - all the locales I tested do the same. They also do the same
for spaces. And there is another property of the transformation that is
very important for those outside the English-speaking world -
diacritics are not represented at the primary weight level. So even
though representing certain characters with a diacritic will take 2
bytes in UTF-8, the corresponding primary weight only takes 1 byte. In
short, as I said, the concentration of entropy can easily be a lot
higher within the first n bytes of the primary weight level of the
transformed blob as compared to the first n bytes of the original
string, and frequently *will* be. This will make worst cases for
abbreviated keys significantly less common than they would otherwise
be, while more generally improving performance. Of course, that might
not always be as effective as we'd prefer, but that's something that
you more or less have to live with in order to have competitive sort
performance (even with the HyperLogLog amelioration, you still pay
something for considering the technique). You can always contrive a
case that puts things just out of reach, no matter how much entropy
you manage to concentrate in the abbreviated key. Fortunately, I don't
think those cases are all that representative of what people want from
sort performance.

> And we still have to run strxfrm at least once, write out the whole
> binary blob to memory somewhere and if it spills to disk we still have
> to write and read much more data. I think recognizing cases where
> equality is the only thing we're interested in and locale-sensitive
> sorting isn't necessary and using a memcmp would be a clear win.

Yeah. I was making a point that strictly concerned abbreviated keys as
proposed in response to Feng's remarks. I am not 100% sure that the
benefits of abbreviated keys with locale support aren't worth it here,
and I say that in full agreement with Feng about locale-aware sorts
not actually being necessary. It's clear that cache performance is
essential to getting good sort performance, which strongly recommends
abbreviated keys. When we consider the concentration of entropy with
"ordinary" locale-aware abbreviated keys, as compared to abbreviated
keys that just use the C locale artificially for this case, it's not
clear that the concentration of entropy isn't reason enough to prefer
locale aware abbreviated keys. Now, it might well be that paying for n
strxfrm() operations, rather than doing n straight-up memcpy()
operations isn't worth it. But I think it might well be worth it -
particularly when you factor in the complexity avoided by not
special-casing this. I'm not really sure.

The general idea of abbreviated keys is almost old hat, to be honest.
It just happens to be essential for competitive sort performance.
-- 
Peter Geoghegan



Re: Optimizer on sort aggregate

From: David Rowley
Date:
On Sat, Oct 18, 2014 at 2:25 PM, Feng Tian <ftian@vitessedata.com> wrote:
Hi, David,

Yes, switching the sort order would lose an interesting order, so if the user dictates order by t, i the planner needs to resort to its cost model.  Estimating the cardinality of a group by is a much bigger topic than this thread.


Well, I don't want to jump in and make the idea more complex than it needs to be. More research on this would be really good!

Just to make my thoughts a bit more clear, I had thought that you'd likely use stawidth from pg_statistic to determine the average width of the column, in order to know whether you'd want to play about with the order or not. If not, then how would you know which column to put last in the group by if there happened to be 2 text-type columns in the grouping list?

Maybe this could be determined based on some ratio between stadistinct and stawidth in pg_statistic. If, for example, 2 columns had the same width, then you'd likely want to use the one with more distinct values estimated. Perhaps the heuristics for determining this would be more complex, as, if you had a bigint with 1000 distinct values, then it perhaps would be better to group by that first, before some int with, say, 5 distinct values. Or maybe not? Some series of benchmarks might give some sort of indication of whether there is any sort of magical tipping point to be sought after here.

Of course, the width alone might not be a great thing to base any benchmarks on, as a multibyte text type with, for example, 6 in the stawidth column would likely be slower to group by first than a bigint, which would have 8 in the stawidth column. People who had carefully written their group bys a certain way for performance might get upset if we made their query slower by messing with them too, so I guess the logic should likely only kick in if it's a clear win to swap the order.
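
The per-column numbers such a heuristic would consult are already gathered by ANALYZE; as a sketch, looking at them for the example table via the pg_stats view (which exposes stawidth/stadistinct as avg_width/n_distinct):

analyze t;
select attname, avg_width, n_distinct
from pg_stats
where schemaname = 'public' and tablename = 't'
order by attname;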

Regards

David Rowley