Discussion: Extremely slow HashAggregate in simple UNION query


Extremely slow HashAggregate in simple UNION query

From:
Felix Geisendörfer
Date:
Hi all,

today I debugged a query that was executing about 100x slower than expected, and was very surprised by what I found.

I'm posting to this list to see if this might be an issue that should be fixed in PostgreSQL itself.

Below is a simplified version of the query in question:

SET work_mem='64MB';
EXPLAIN ANALYZE
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b
UNION
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b;

HashAggregate  (cost=80020.01..100020.01 rows=2000000 width=8) (actual time=19.349..23.123 rows=1 loops=1)
  Group Key: a.a, b.b
  ->  Append  (cost=0.01..70020.01 rows=2000000 width=8) (actual time=0.022..0.030 rows=2 loops=1)
        ->  Nested Loop  (cost=0.01..20010.01 rows=1000000 width=8) (actual time=0.021..0.022 rows=1 loops=1)
              ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.014..0.014 rows=1 loops=1)
              ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.003 rows=1 loops=1)
        ->  Nested Loop  (cost=0.01..20010.01 rows=1000000 width=8) (actual time=0.006..0.007 rows=1 loops=1)
              ->  Function Scan on generate_series a_1  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.003 rows=1 loops=1)
              ->  Function Scan on generate_series b_1  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.002 rows=1 loops=1)
Planning Time: 0.101 ms
Execution Time: 45.986 ms

As you can see, it takes over 45ms (!) to execute what appears to be a very simple query. How is this possible?

Based on my debugging, I think the following is going on:

1. The query overestimates the final output rows by a factor of 2 million. [1]
2. The executor uses the bad estimate to allocate a huge hash table [2], and the increased work_mem [3] gives it enough rope to hang itself [4].
3. Somehow EXPLAIN gets confused by this and only ends up tracking 23ms of the query execution instead of 45ms [5].

I'm certainly a novice when it comes to PostgreSQL internals, but I'm wondering if this could be fixed by taking a more dynamic approach for allocating HashAggregate hash tables?
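For what it's worth, a quick way to confirm that the hash table setup is the dominant cost (just a sketch of the experiment, not something I measured carefully) is to force a sort-based plan for the same query and compare:

SET enable_hashagg = off;
EXPLAIN ANALYZE
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b
UNION
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b;

With the HashAggregate replaced by a sort-based plan (e.g. Unique over Sort), the same mis-estimated query should run in a tiny fraction of the time, since sorting the two actual rows is trivial.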

Anyway, for my query using UNION ALL was acceptable, so I'm not too troubled by this. I was just really caught by surprise that bad estimates can not only cause bad query plans, but also cause good query plans to execute extremely slowly. I had never seen that before.
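For reference, the UNION ALL variant of the simplified query above (which skips duplicate elimination entirely, so it only applies when duplicates are acceptable) looks like this:

EXPLAIN ANALYZE
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b
UNION ALL
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b;

Without a HashAggregate on top of the Append, the row estimate no longer drives any up-front allocation, so the bad estimate becomes harmless.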

Best Regards
Felix

[1] My actual query had bad estimates for other reasons (GIN Index), but that's another story. The query above was of course deliberately designed to have bad estimates.
[2] nodeAgg.c: build_hash_table()
[3] A lot of queries in my application benefit from increased work_mem.
[4] execGrouping.c: nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / entrysize));
[5] In my actual query it was even worse, only 6 out of 40ms of the total execution time were reported as being spent in the query nodes.
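(A rough illustration of [4], assuming a purely hypothetical entry size of 24 bytes:

    nbuckets cap = work_mem * 1024 / entrysize = 65536 KB * 1024 / 24 bytes ≈ 2.8 million

so the planner's estimate of 2,000,000 groups passes through essentially unclamped, and a hash table sized for two million entries gets allocated and zeroed for a result that contains a single row.)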


Re: Extremely slow HashAggregate in simple UNION query

From:
Michael Lewis
Date:
I believe this would be relevant-

It seems there is hope down the road to improve those estimates.

Re: Extremely slow HashAggregate in simple UNION query

From:
Pavel Stehule
Date:


út 20. 8. 2019 v 17:12 odesílatel Felix Geisendörfer <felix@felixge.de> napsal:
Hi all,

today I debugged a query that was executing about 100x slower than expected, and was very surprised by what I found.

I'm posting to this list to see if this might be an issue that should be fixed in PostgreSQL itself.

Below is a simplified version of the query in question:

SET work_mem='64MB';
EXPLAIN ANALYZE
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b
UNION
SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b;

HashAggregate  (cost=80020.01..100020.01 rows=2000000 width=8) (actual time=19.349..23.123 rows=1 loops=1)
  Group Key: a.a, b.b
  ->  Append  (cost=0.01..70020.01 rows=2000000 width=8) (actual time=0.022..0.030 rows=2 loops=1)
        ->  Nested Loop  (cost=0.01..20010.01 rows=1000000 width=8) (actual time=0.021..0.022 rows=1 loops=1)
              ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.014..0.014 rows=1 loops=1)
              ->  Function Scan on generate_series b  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.003 rows=1 loops=1)
        ->  Nested Loop  (cost=0.01..20010.01 rows=1000000 width=8) (actual time=0.006..0.007 rows=1 loops=1)
              ->  Function Scan on generate_series a_1  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.003 rows=1 loops=1)
              ->  Function Scan on generate_series b_1  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.002..0.002 rows=1 loops=1)
Planning Time: 0.101 ms
Execution Time: 45.986 ms

As you can see, it takes over 45ms (!) to execute what appears to be a very simple query. How is this possible?

Based on my debugging, I think the following is going on:

1. The query overestimates the final output rows by a factor of 2 million. [1]
2. The executor uses the bad estimate to allocate a huge hash table [2], and the increased work_mem [3] gives it enough rope to hang itself [4].
3. Somehow EXPLAIN gets confused by this and only ends up tracking 23ms of the query execution instead of 45ms [5].

I'm certainly a novice when it comes to PostgreSQL internals, but I'm wondering if this could be fixed by taking a more dynamic approach for allocating HashAggregate hash tables?

Table functions have no statistics - and the default ROWS value is 1000, so that is the reason for the huge over-estimate here. This specific case can be handled better in PostgreSQL 12, where some functions like generate_series have support functions that provide better estimates.
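If you want to check whether a function already has such a support function attached (just a sketch, it works on PostgreSQL 12 and later where pg_proc.prosupport exists):

SELECT proname, prosupport
FROM pg_proc
WHERE proname = 'generate_series';

When prosupport points at a support function, the planner can ask it for a row estimate instead of falling back to the default of 1000 rows.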

You can get a profile of this query with some profiler, and you can see the reason why the query is slow.

On PostgreSQL 12 the speed of your example is good - about 1 ms.

But when I repeated your example, the speed was much worse. On the other hand - nobody can expect an optimal plan with such a wild misestimate. It looks like something wrong is happening in some cleanup part.

Pavel




Anyway, for my query using UNION ALL was acceptable, so I'm not too troubled by this. I was just really caught by surprise that bad estimates can not only cause bad query plans, but also cause good query plans to execute extremely slowly. I had never seen that before.

Best Regards
Felix

[1] My actual query had bad estimates for other reasons (GIN Index), but that's another story. The query above was of course deliberately designed to have bad estimates.
[2] nodeAgg.c: build_hash_table()
[3] A lot of queries in my application benefit from increased work_mem.
[4] execGrouping.c: nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / entrysize));
[5] In my actual query it was even worse, only 6 out of 40ms of the total execution time were reported as being spent in the query nodes.

Re: Extremely slow HashAggregate in simple UNION query

From:
Andres Freund
Date:
Hi,

On 2019-08-20 17:11:58 +0200, Felix Geisendörfer wrote:
> today I debugged a query that was executing about 100x slower than expected, and was very surprised by what I found.
>
> I'm posting to this list to see if this might be an issue that should be fixed in PostgreSQL itself.
>
> Below is a simplified version of the query in question:
>
> SET work_mem='64MB';
> EXPLAIN ANALYZE
> SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b
> UNION
> SELECT * FROM generate_series(1, 1) a, generate_series(1, 1) b;
>
> HashAggregate  (cost=80020.01..100020.01 rows=2000000 width=8) (actual time=19.349..23.123 rows=1 loops=1)

FWIW, that's not a mis-estimate I'm getting on master ;).  Obviously
that doesn't actually address your concern...


> 1. The query overestimates the final output rows by a factor of 2 million. [1]

Right. There's not really that much we can do about that in
general. That'll always be possible. Although we can obviously improve
the estimates a good bit more.


> I'm certainly a novice when it comes to PostgreSQL internals, but I'm
> wondering if this could be fixed by taking a more dynamic approach for
> allocating HashAggregate hash tables?

Under-sizing the hashtable just out of caution will add overhead to
a lot more common cases. That requires copying data around during
growth, which is far far from free. Or you can use hashtables that don't
need to copy, but they're also considerably slower in the more common
cases.


> 3. Somehow EXPLAIN gets confused by this and only ends up tracking 23ms of the query execution instead of 45ms [5].

Well, there's plenty of work that's not attributed to nodes. IIRC we don't
track executor startup/shutdown overhead on a per-node basis. So I don't
really think this is necessarily anything that suspicious.  Which
indeed seems to be what's happening here (this is with 11, to be able to
hit the problem with your reproducer):

+   33.01%  postgres  postgres          [.] tuplehash_iterate
-   18.39%  postgres  libc-2.28.so      [.] __memset_avx2_erms
   - 90.94% page_fault
        __memset_avx2_erms
        tuplehash_allocate
        tuplehash_create
        BuildTupleHashTableExt
        build_hash_table
        ExecInitAgg
        ExecInitNode
        InitPlan
        standard_ExecutorStart

Greetings,

Andres Freund



Re: Extremely slow HashAggregate in simple UNION query

From:
Felix Geisendörfer
Date:
Hi,

> On 20. Aug 2019, at 19:32, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-08-20 17:11:58 +0200, Felix Geisendörfer wrote:
>>
>> HashAggregate  (cost=80020.01..100020.01 rows=2000000 width=8) (actual time=19.349..23.123 rows=1 loops=1)
>
> FWIW, that's not a mis-estimate I'm getting on master ;).  Obviously
> that doesn't actually address your concern...

I suppose this is thanks to the new optimizer support functions
mentioned by Michael and Pavel?

Of course I'm very excited about those improvements, but yeah, my
real query is mis-estimating things for totally different reasons not
involving any SRFs.

>> I'm certainly a novice when it comes to PostgreSQL internals, but I'm
>> wondering if this could be fixed by taking a more dynamic approach for
>> allocating HashAggregate hash tables?
>
> Under-sizing the hashtable just out of caution will add overhead to
> a lot more common cases. That requires copying data around during
> growth, which is far far from free. Or you can use hashtables that don't
> need to copy, but they're also considerably slower in the more common
> cases.

How does PostgreSQL currently handle the case where the initial hash
table is under-sized due to the planner having underestimated things?
Are the growth costs getting amortized by using an exponential growth
function?

Anyway, I can accept my situation to be an edge case that doesn't justify
making things more complicated.

>> 3. Somehow EXPLAIN gets confused by this and only ends up tracking 23ms of the query execution instead of 45ms [5].
>
> Well, there's plenty of work that's not attributed to nodes. IIRC we don't
> track executor startup/shutdown overhead on a per-node basis.

I didn't know that, thanks for clarifying : ).


Re: Extremely slow HashAggregate in simple UNION query

From:
Andres Freund
Date:
Hi,

On 2019-08-20 19:55:56 +0200, Felix Geisendörfer wrote:
> > On 20. Aug 2019, at 19:32, Andres Freund <andres@anarazel.de> wrote:
> > FWIW, that's not a mis-estimate I'm getting on master ;).  Obviously
> > that doesn't actually address your concern...
> 
> I suppose this is thanks to the new optimizer support functions
> mentioned by Michael and Pavel?

Right.


> > Under-sizing the hashtable just out of caution will add overhead to
> > a lot more common cases. That requires copying data around during
> > growth, which is far far from free. Or you can use hashtables that don't
> > need to copy, but they're also considerably slower in the more common
> > cases.
> 
> How does PostgreSQL currently handle the case where the initial hash
> table is under-sized due to the planner having underestimated things?
> Are the growth costs getting amortized by using an exponential growth
> function?

Yes. But that's still far from free.
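(Rough arithmetic for why it is amortized but still not free: growing by doubling up to n entries moves about 1 + 2 + 4 + ... + n/2 < n entries in total, i.e. O(1) amortized per insert - but every doubling still rehashes all existing entries at once, which is pure overhead compared to sizing the table correctly up front.)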

Greetings,

Andres Freund



Re: Extremely slow HashAggregate in simple UNION query

From:
Jeff Janes
Date:
On Tue, Aug 20, 2019 at 11:12 AM Felix Geisendörfer <felix@felixge.de> wrote:
 ...
 
[1] My actual query had bad estimates for other reasons (GIN Index), but that's another story. The query above was of course deliberately designed to have bad estimates.

As noted elsewhere, v12 thwarts your attempts to deliberately design the bad estimates.  You can still get them, you just have to work a bit harder at it:

CREATE FUNCTION j (bigint, bigint) returns setof bigint as $$ select generate_series($1,$2) $$ rows 1000 language sql;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b                        
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=80021.00..100021.00 rows=2000000 width=16) (actual time=11.332..13.241 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..70021.00 rows=2000000 width=16) (actual time=0.118..0.163 rows=2 loops=1)
         ->  Nested Loop  (cost=0.50..20010.50 rows=1000000 width=16) (actual time=0.117..0.118 rows=1 loops=1)
               ->  Function Scan on j a  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.087..0.088 rows=1 loops=1)
               ->  Function Scan on j b  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.027..0.027 rows=1 loops=1)
         ->  Nested Loop  (cost=0.50..20010.50 rows=1000000 width=16) (actual time=0.044..0.044 rows=1 loops=1)
               ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.022..0.022 rows=1 loops=1)
               ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.020..0.021 rows=1 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 69.277 ms
(11 rows)

But the same advance in v12 which makes it harder to fool with your test case also opens the possibility of fixing your real case.

I've made an extension which has a function which always returns true, but lies about how often it is expected to return true. See the attachment.  With that, you can fine-tune the planner.

CREATE EXTENSION pg_selectivities ;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.00001)
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.00001);
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=45021.40..45021.60 rows=20 width=16) (actual time=0.226..0.227 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..45021.30 rows=20 width=16) (actual time=0.105..0.220 rows=2 loops=1)
         ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual time=0.104..0.105 rows=1 loops=1)
               Join Filter: pg_always('1e-05'::double precision)
               ->  Function Scan on j a  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.066..0.066 rows=1 loops=1)
               ->  Function Scan on j b  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.035..0.035 rows=1 loops=1)
         ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual time=0.112..0.113 rows=1 loops=1)
               Join Filter: pg_always('1e-05'::double precision)
               ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.077..0.077 rows=1 loops=1)
               ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.034..0.034 rows=1 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 0.281 ms

Cheers,

Jeff

Attachments

Re: Extremely slow HashAggregate in simple UNION query

From:
Pavel Stehule
Date:


čt 22. 8. 2019 v 3:11 odesílatel Jeff Janes <jeff.janes@gmail.com> napsal:
On Tue, Aug 20, 2019 at 11:12 AM Felix Geisendörfer <felix@felixge.de> wrote:
 ...
 
[1] My actual query had bad estimates for other reasons (GIN Index), but that's another story. The query above was of course deliberately designed to have bad estimates.

As noted elsewhere, v12 thwarts your attempts to deliberately design the bad estimates.  You can still get them, you just have to work a bit harder at it:

CREATE FUNCTION j (bigint, bigint) returns setof bigint as $$ select generate_series($1,$2) $$ rows 1000 language sql;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b                        
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=80021.00..100021.00 rows=2000000 width=16) (actual time=11.332..13.241 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..70021.00 rows=2000000 width=16) (actual time=0.118..0.163 rows=2 loops=1)
         ->  Nested Loop  (cost=0.50..20010.50 rows=1000000 width=16) (actual time=0.117..0.118 rows=1 loops=1)
               ->  Function Scan on j a  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.087..0.088 rows=1 loops=1)
               ->  Function Scan on j b  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.027..0.027 rows=1 loops=1)
         ->  Nested Loop  (cost=0.50..20010.50 rows=1000000 width=16) (actual time=0.044..0.044 rows=1 loops=1)
               ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.022..0.022 rows=1 loops=1)
               ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.020..0.021 rows=1 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 69.277 ms
(11 rows)

But the same advance in v12 which makes it harder to fool with your test case also opens the possibility of fixing your real case.

I think what is much more interesting is the long time after query processing - the last row was produced at 13 ms, but Execution Time was 69 ms .. so some cleanup takes about 56 ms - that is pretty long.


I've made an extension which has a function which always returns true, but lies about how often it is expected to return true. See the attachment.  With that, you can fine-tune the planner.

CREATE EXTENSION pg_selectivities ;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.00001)
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.00001);
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=45021.40..45021.60 rows=20 width=16) (actual time=0.226..0.227 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..45021.30 rows=20 width=16) (actual time=0.105..0.220 rows=2 loops=1)
         ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual time=0.104..0.105 rows=1 loops=1)
               Join Filter: pg_always('1e-05'::double precision)
               ->  Function Scan on j a  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.066..0.066 rows=1 loops=1)
               ->  Function Scan on j b  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.035..0.035 rows=1 loops=1)
         ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual time=0.112..0.113 rows=1 loops=1)
               Join Filter: pg_always('1e-05'::double precision)
               ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.077..0.077 rows=1 loops=1)
               ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000 width=8) (actual time=0.034..0.034 rows=1 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 0.281 ms

Cheers,

Jeff

Re: Extremely slow HashAggregate in simple UNION query

From:
Felix Geisendörfer
Date:

> On 21. Aug 2019, at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> As noted elsewhere, v12 thwarts your attempts to deliberately design the bad estimates.  You can still get them, you just have to work a bit harder at it:
>
> CREATE FUNCTION j (bigint, bigint) returns setof bigint as $$ select generate_series($1,$2) $$ rows 1000 language sql;

Yeah, that's awesome! I didn't know about this until I ran into this issue, I'll definitely be using it for future estimation problems that are difficult to fix otherwise!

> I've made an extension which has a function which always returns true, but lies about how often it is expected to return true. See the attachment.  With that, you can fine-tune the planner.
>
> CREATE EXTENSION pg_selectivities ;

Very cool and useful : )!

I think in most cases I'll be okay with declaring a function with a static ROWS estimate, but I'll consider your extension if I need more flexibility in the future!

Thanks
Felix


Re: Extremely slow HashAggregate in simple UNION query

From:
Jeff Janes
Date:
On Thu, Aug 22, 2019 at 1:09 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
čt 22. 8. 2019 v 3:11 odesílatel Jeff Janes <jeff.janes@gmail.com> napsal:
...
 
But the same advance in v12 which makes it harder to fool with your test case also opens the possibility of fixing your real case.

I think what is much more interesting is the long time after query processing - the last row was produced at 13 ms, but Execution Time was 69 ms .. so some cleanup takes about 56 ms - that is pretty long.

Most of the time is not after the clock stops, but before the stepwise ANALYZE clock starts.  If you just do an EXPLAIN rather than EXPLAIN ANALYZE, that is also slow.  The giant hash table is created during the planning step (or somewhere around there--I notice that EXPLAIN ANALYZE output doesn't count it in what it labels as the planning step--but it is some step that EXPLAIN without ANALYZE does execute, which to me makes it a planning step).
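A quick way to see that from psql (a sketch, reusing the j() function defined earlier in the thread) is to time the bare EXPLAIN:

\timing on
EXPLAIN
SELECT * FROM j(1, 1) a, j(1, 1) b
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b;

The \timing result for the plain EXPLAIN should already account for most of the "missing" milliseconds, even though no rows are ever produced.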

For me, "perf top" shows kernel's __do_page_fault as the top function.  tuplehash_iterate does show up at 20% (which I think is overattributed, considering how little the speedup is when dropping ANALYZE), but everything else just looks like kernel memory management code.

Cheers,

Jeff

Re: Extremely slow HashAggregate in simple UNION query

From:
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> Most of the time is not after the clock stops, but before the stepwise
> ANALYZE clock starts.  If you just do an EXPLAIN rather than EXPLAIN
> ANALYZE, that is also slow.  The giant hash table is created during the
> planning step (or somewhere around there

It's in executor startup, I believe.

            regards, tom lane



Re: Extremely slow HashAggregate in simple UNION query

From:
Andres Freund
Date:
Hi,

On August 24, 2019 12:41:03 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Jeff Janes <jeff.janes@gmail.com> writes:
>> Most of the time is not after the clock stops, but before the
>stepwise
>> ANALYZE clock starts.  If you just do an EXPLAIN rather than EXPLAIN
>> ANALYZE, that is also slow.  The giant hash table is created during
>the
>> planning step (or somewhere around there
>
>It's in executor startup, I believe.

I'm not sure I'm personally interested in doing so, but it'd not be hard to measure the executor startup time separately, and display it either on a per node basis, or as a total number.

I'm quite unconvinced this thread is a convincing reason to do so (or really do anything). But for some other workloads executor startup is a very large fraction of the total time, without massive misestimations. Optimizing that could be easier with that information available.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.