Discussion: Performance regressions found using sqlfuzz
Jinho Jung
Re: Jung, Jinho 2019-02-11 <BN6PR07MB3409922471073F2B619A8CA4EE640@BN6PR07MB3409.namprd07.prod.outlook.com>

> We are developing a tool called sqlfuzz for automatically finding performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing to generate SQL queries that take more time to execute on the latest version of PostgreSQL compared to prior versions. We hope that these queries would help further increase the utility of the regression test suite.

Hi,

is sqlfuzz available anywhere? Does it have anything to do with sqlsmith? The query formatting looks very much like sqlsmith's.

Christoph
Hello,

We are developing a tool called sqlfuzz for automatically finding performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing to generate SQL queries that take more time to execute on the latest version of PostgreSQL compared to prior versions. We hope that these queries would help further increase the utility of the regression test suite.

We would greatly appreciate feedback from the community regarding the queries found by the tool so far. We have already incorporated prior feedback from the community in the latest version of sqlfuzz.
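As a rough illustration of the mutational-fuzzing idea described above, the loop below applies small random rewrites to a seed query. This is a toy sketch only; sqlfuzz's actual mutation operators are not shown in this thread, so the `MUTATIONS` list and seed query here are made up for illustration. In the real tool, each variant would then be timed against two server versions by an external harness.

```python
import random

# Hypothetical mutation operators; sqlfuzz's real ones are not public.
MUTATIONS = [
    lambda q: q.replace("=", "<", 1),             # weaken a comparison
    lambda q: q + " limit 100",                   # bound the result size
    lambda q: q.replace("join", "left join", 1),  # change the join type
]

def mutate(seed: str, rounds: int, rng: random.Random) -> str:
    """Apply `rounds` randomly chosen mutations to a seed query."""
    q = seed
    for _ in range(rounds):
        q = rng.choice(MUTATIONS)(q)
    return q

seed = "select i_id from item join order_line on i_id = ol_i_id"
rng = random.Random(0)  # seeded for reproducibility
variants = [mutate(seed, rounds=2, rng=rng) for _ in range(5)]
```

Each variant is a syntactically small perturbation of the seed, which is what keeps the generated queries realistic rather than arbitrary strings.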
Hi Jeff,
Thanks for the feedback! The git bisect idea was particularly helpful.
We use query complexity constraints in sqlfuzz to ensure that the constructed queries are realistic (e.g., limit the query size, automatically minimize the query, avoid esoteric expressions and functions, restrict number of joins, etc.).
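The automatic query minimization mentioned above is commonly implemented as a delta-debugging-style reduction loop. The sketch below is a hypothetical illustration, not sqlfuzz's implementation: it treats the query as a list of clauses and greedily drops any clause whose removal still reproduces the regression (here simulated by a toy oracle).

```python
def minimize(clauses, still_regresses):
    """Greedily drop clauses while the reduced query still shows the regression."""
    reduced = list(clauses)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if candidate and still_regresses(candidate):
                reduced = candidate  # keep the smaller query, restart the pass
                changed = True
                break
    return reduced

# Toy oracle: pretend the regression only needs the join and the filter clause.
needed = {"join order_line", "where ol_number is null"}
oracle = lambda cs: needed.issubset(set(cs))

query = ["select c0", "from stock", "join order_line",
         "where ol_number is null", "order by c0"]
minimal = minimize(query, oracle)
```

In practice the oracle would re-run the candidate query on both PostgreSQL versions and check that the slowdown persists, which makes each probe expensive; that is why limiting query size up front also matters.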
Our goal is to augment the test suite with queries that will assist developers in more comprehensively evaluating the impact of new optimization heuristics, query processing strategies, etc. We are working on improving the utility of the tool, and your feedback on these reports will be super helpful. Thanks.
For each regression, we share:
1) the associated query,
2) the commit that activated it,
3) our high-level analysis, and
4) query execution plans in old and new versions of PostgreSQL.
All these regressions are observed on the latest version (dev HEAD).
Sent: Tuesday, February 12, 2019 1:03 PM
To: Jung, Jinho
Cc: pgsql-performance@postgresql.org
Subject: Re: Performance regressions found using sqlfuzz
>>>>> "Jung" == Jung, Jinho <jinho.jung@gatech.edu> writes:

 Jung> select distinct
 Jung> ref_0.i_im_id as c0,
 Jung> ref_1.ol_dist_info as c1
 Jung> from
 Jung> public.item as ref_0 right join
 Jung> public.order_line as ref_1
 Jung> on (ref_0.i_id = 5)

 Jung> - Commit: 84f9a35 (Improve estimate of distinct values in estimate_num_groups())

 Jung> - Our analysis: We believe that this regression is related to the
 Jung> new logic for estimating the number of distinct values in the
 Jung> optimizer. This is affecting even queries with point lookups
 Jung> (ref_0.i_id = 5) in the TPC-C benchmark.

So what's happening here is that the old plan was mis-estimating the
result, believed incorrectly that it would fit into work_mem, and
generated a hashaggregate plan accordingly; it ran fast because
hashaggregate doesn't spill to disk but silently overflows work_mem.
The new plan correctly estimates the result size, and therefore is
forbidden from generating the hashaggregate plan at the default
work_mem setting; it generates a sort plan, and the sort of course
spills to disk since work_mem is exceeded.

Had the value of work_mem been set to something appropriate for the
workload, then the query plan would not have changed. So the problem
(from an automated testing perspective) is that an actual _improvement_
in the code is being reported as a regression.

 Jung> ####### QUERY 3:

 Jung> select
 Jung> cast(ref_1.ol_i_id as int4) as c0
 Jung> from
 Jung> public.stock as ref_0
 Jung> left join public.order_line as ref_1
 Jung> on (ref_1.ol_number is not null)
 Jung> where ref_1.ol_number is null

 Jung> - Commit: 77cd477 (Enable parallel query by default.)

 Jung> - Our analysis: We believe that this regression is due to
 Jung> parallel queries being enabled by default. Surprisingly, we found
 Jung> that even on a larger TPC-C database (scale factor of 50, roughly
 Jung> 4GB), parallel scan is still slower than the non-parallel one in
 Jung> the old version, when the query is not returning any tuples.

The problem here is not actually with parallel scans as such, but
rather the omission of a Materialize node in the parallel plan, and
what looks like some rather serious mis-costing of the nestloop
antijoin.

 Jung> ####### QUERY 4:

 Jung> select
 Jung> ref_0.s_dist_06 as c0
 Jung> from
 Jung> public.stock as ref_0
 Jung> where (ref_0.s_w_id < cast(least(0, 1) as int8))

 Jung> - Commit: 5edc63b (Account for the effect of lossy pages when costing bitmap scans)

 Jung> - Our analysis: We believe that this regression has to do with
 Jung> two factors: 1) conditional expression (e.g., LEAST or NULLIF)
 Jung> are not reduced to constants unlike string functions (e.g.,
 Jung> CHAR_LENGTH) 2) change in the cost estimation function for bitmap
 Jung> scan. Execution time grows by 3 orders of magnitude. We note that
 Jung> this regression is only observed on large databases (e.g., scale
 Jung> factor of 50).

Again, this is showing up because of a large database and a small
work_mem. The bitmap scan on stock only becomes lossy if the number of
rows matched in the index is very large relative to work_mem; the lack
of plan-time evaluation of LEAST means that the planner doesn't have
any good way to estimate the selectivity, so it's taking a default
estimate.

 Jung> ####### QUERY 1:

 Jung> select
 Jung> ref_0.o_d_id as c0
 Jung> from
 Jung> public.oorder as ref_0
 Jung> where EXISTS (
 Jung> select
 Jung> 1
 Jung> from
 Jung> (select distinct
 Jung> ref_0.o_entry_d as c0,
 Jung> ref_1.c_credit as c1
 Jung> from
 Jung> public.customer as ref_1
 Jung> where (false)
 Jung> ) as subq_1
 Jung> );

 Jung> - Commit: bf6c614 (Do execGrouping.c via expression eval machinery, take two)

 Jung> - Our analysis: We are not sure about the root cause of this
 Jung> regression. This might have to do with grouping logic.

What this query is basically exercising is how fast one can do
ExecReScan on a DISTINCT query, without also considering the
performance effects of actually doing the grouping (the constant-false
qual here means that the grouping comparison is never actually
performed). An optimization tradeoff that speeds up comparisons within
a scan at the cost of a fixed overhead for the scan will therefore make
this query slower, but it still seems a good tradeoff to make (of
course it would be even better to make the overhead per-query rather
than per-scan, and there were other issues with this commit that should
have been caught at the time).

-- 
Andrew (irc:RhodiumToad)
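The per-scan-overhead tradeoff Andrew describes for QUERY 1 can be made concrete with a toy cost model. The numbers below are purely illustrative (they are not PostgreSQL's actual costing); the point is that with many rescans over a scan that produces zero rows, only the fixed per-scan setup term survives, so a change that cheapens per-row work but adds per-scan setup is a strict loss for this particular query shape.

```python
def total_cost(rescans, rows_per_scan, setup_cost, per_row_cost):
    """Total time for `rescans` restarts of a scan over `rows_per_scan` rows."""
    return rescans * (setup_cost + rows_per_scan * per_row_cost)

# 30044 rescans, as in the plans for QUERY 1; zero rows per scan because
# the constant-false qual means the grouping comparison never runs.
# "Old" code: no per-scan setup charge, slower per-row comparisons.
old = total_cost(rescans=30044, rows_per_scan=0, setup_cost=0.0, per_row_cost=2.0)
# "New" code: cheaper per-row comparisons, but a fixed setup charge per rescan.
new = total_cost(rescans=30044, rows_per_scan=0, setup_cost=1.0, per_row_cost=1.0)
```

With any nonzero number of rows per scan the cheaper per-row cost would eventually win, which is why the tradeoff is still considered a good one overall.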
Hi,

On 2019-02-14 17:27:40 +0000, Jung, Jinho wrote:
> ####### QUERY 2:
>
> select distinct
>   ref_0.i_im_id as c0,
>   ref_1.ol_dist_info as c1
> from
>   public.item as ref_0 right join
>   public.order_line as ref_1
>   on (ref_0.i_id = 5)
>
> - Commit: 84f9a35 (Improve estimate of distinct values in estimate_num_groups())
>
> - Our analysis: We believe that this regression is related to the new logic for estimating the number of distinct values in the optimizer. This is affecting even queries with point lookups (ref_0.i_id = 5) in the TPC-C benchmark.
>
> - Query Execution Plans
>
> [OLD version]
> HashAggregate (cost=11972.38..12266.20 rows=29382 width=29) (actual time=233.543..324.973 rows=300144 loops=1)
>   Group Key: ref_0.i_im_id, ref_1.ol_dist_info
>   -> Nested Loop Left Join (cost=0.29..10471.64 rows=300148 width=29) (actual time=0.012..114.955 rows=300148 loops=1)
>     -> Seq Scan on order_line ref_1 (cost=0.00..6711.48 rows=300148 width=25) (actual time=0.004..25.061 rows=300148 loops=1)
>     -> Materialize (cost=0.29..8.32 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=300148)
>       -> Index Scan using item_pkey on item ref_0 (cost=0.29..8.31 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
>         Index Cond: (i_id = 10)
> Planning time: 0.267 ms
> Execution time: 338.027 ms
>
> [NEW version]
> Unique (cost=44960.08..47211.19 rows=300148 width=29) (actual time=646.545..885.502 rows=300144 loops=1)
>   -> Sort (cost=44960.08..45710.45 rows=300148 width=29) (actual time=646.544..838.933 rows=300148 loops=1)
>     Sort Key: ref_0.i_im_id, ref_1.ol_dist_info
>     Sort Method: external merge Disk: 11480kB
>     -> Nested Loop Left Join (cost=0.29..10471.64 rows=300148 width=29) (actual time=0.016..111.889 rows=300148 loops=1)
>       -> Seq Scan on order_line ref_1 (cost=0.00..6711.48 rows=300148 width=25) (actual time=0.004..24.612 rows=300148 loops=1)
>       -> Materialize (cost=0.29..8.32 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=300148)
>         -> Index Scan using item_pkey on item ref_0 (cost=0.29..8.31 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
>           Index Cond: (i_id = 10)
> Planning Time: 0.341 ms
> Execution Time: 896.662 ms

This seems perfectly alright - the old version used more memory than work_mem actually should have allowed. I'd bet you get the performance back if you set work_mem to a bigger value.

> ####### QUERY 3:
>
> select
>   cast(ref_1.ol_i_id as int4) as c0
> from
>   public.stock as ref_0
>   left join public.order_line as ref_1
>   on (ref_1.ol_number is not null)
> where ref_1.ol_number is null
>
> - Commit: 77cd477 (Enable parallel query by default.)
>
> - Our analysis: We believe that this regression is due to parallel queries being enabled by default. Surprisingly, we found that even on a larger TPC-C database (scale factor of 50, roughly 4GB), parallel scan is still slower than the non-parallel one in the old version, when the query is not returning any tuples.
>
> - Query Execution Plans
>
> [OLD version]
> Nested Loop Anti Join (cost=0.00..18006.81 rows=1 width=4) (actual time=28.689..28.689 rows=0 loops=1)
>   -> Seq Scan on stock ref_0 (cost=0.00..5340.00 rows=100000 width=0) (actual time=0.028..15.722 rows=100000 loops=1)
>   -> Materialize (cost=0.00..9385.22 rows=300148 width=4) (actual time=0.000..0.000 rows=1 loops=100000)
>     -> Seq Scan on order_line ref_1 (cost=0.00..6711.48 rows=300148 width=4) (actual time=0.004..0.004 rows=1 loops=1)
>       Filter: (ol_number IS NOT NULL)
> Planning time: 0.198 ms
> Execution time: 28.745 ms
>
> [NEW version]
> Gather (cost=1000.00..15164.93 rows=1 width=4) (actual time=91.022..92.634 rows=0 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   -> Nested Loop Anti Join (cost=0.00..14164.83 rows=1 width=4) (actual time=88.889..88.889 rows=0 loops=3)
>     -> Parallel Seq Scan on stock ref_0 (cost=0.00..4756.67 rows=41667 width=0) (actual time=0.025..7.331 rows=33333 loops=3)
>     -> Seq Scan on order_line ref_1 (cost=0.00..6711.48 rows=300148 width=4) (actual time=0.002..0.002 rows=1 loops=100000)
>       Filter: (ol_number IS NOT NULL)
> Planning Time: 0.258 ms
> Execution Time: 92.699 ms

I'm not particularly bothered - this is a pretty small difference. Most of the time here is likely spent starting the workers, the cost of which is hard to predict/model accurately.

> ####### QUERY 4:
>
> select
>   ref_0.s_dist_06 as c0
> from
>   public.stock as ref_0
> where (ref_0.s_w_id < cast(least(0, 1) as int8))
>
> - Commit: 5edc63b (Account for the effect of lossy pages when costing bitmap scans)
>
> - Our analysis: We believe that this regression has to do with two factors: 1) conditional expression (e.g., LEAST or NULLIF) are not reduced to constants unlike string functions (e.g., CHAR_LENGTH) 2) change in the cost estimation function for bitmap scan. Execution time grows by 3 orders of magnitude. We note that this regression is only observed on large databases (e.g., scale factor of 50).
>
> - Query Execution Plans
>
> [OLD version]
> Bitmap Heap Scan on stock ref_0 (cost=31201.11..273173.13 rows=1666668 width=25) (actual time=0.005..0.005 rows=0 loops=1)
>   Recheck Cond: (s_w_id < (LEAST(0, 1))::bigint)
>   -> Bitmap Index Scan on stock_pkey (cost=0.00..30784.44 rows=1666668 width=0) (actual time=0.005..0.005 rows=0 loops=1)
>     Index Cond: (s_w_id < (LEAST(0, 1))::bigint)
> Planning time: 0.228 ms
> Execution time: 0.107 ms
>
> [NEW version]
> Seq Scan on stock ref_0 (cost=0.00..304469.17 rows=1666613 width=25) (actual time=716.397..716.397 rows=0 loops=1)
>   Filter: (s_w_id < (LEAST(0, 1))::bigint)
>   Rows Removed by Filter: 5000000
> Planning Time: 0.221 ms
> Execution Time: 716.513 ms

Hm. The primary problem here is that the estimation both before and after are really bad. So I don't think the commit you point out here is really to blame. I'm not that bothered by the query not being great, given the weird construction with LEAST(), but we probably could fix that pretty easily.

> ####### QUERY 1:
>
> select
>   ref_0.o_d_id as c0
> from
>   public.oorder as ref_0
> where EXISTS (
>   select
>     1
>   from
>     (select distinct
>       ref_0.o_entry_d as c0,
>       ref_1.c_credit as c1
>     from
>       public.customer as ref_1
>     where (false)
>     ) as subq_1
> );
>
> - Commit: bf6c614 (Do execGrouping.c via expression eval machinery, take two)
>
> - Our analysis: We are not sure about the root cause of this regression. This might have to do with grouping logic.
>
> - Query Execution Plans
>
> [OLD version]
> Seq Scan on oorder ref_0 (cost=0.00..77184338.54 rows=15022 width=4) (actual time=34.173..34.173 rows=0 loops=1)
>   Filter: (SubPlan 1)
>   Rows Removed by Filter: 30044
>   SubPlan 1
>     -> Subquery Scan on subq_1 (cost=2569.01..2569.03 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=30044)
>       -> HashAggregate (cost=2569.01..2569.02 rows=1 width=3) (actual time=0.000..0.000 rows=0 loops=30044)
>         Group Key: ref_0.o_entry_d, ref_1.c_credit
>         -> Result (cost=0.00..2569.00 rows=1 width=3) (actual time=0.000..0.000 rows=0 loops=30044)
>           One-Time Filter: false
>           -> Seq Scan on customer ref_1 (cost=0.00..2569.00 rows=1 width=3) (never executed)
> Planning time: 0.325 ms
> Execution time: 34.234 ms
>
> [NEW version]
> Seq Scan on oorder ref_0 (cost=0.00..1152.32 rows=15022 width=4) (actual time=74.799..74.799 rows=0 loops=1)
>   Filter: (SubPlan 1)
>   Rows Removed by Filter: 30044
>   SubPlan 1
>     -> Subquery Scan on subq_1 (cost=0.00..0.02 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=30044)
>       -> HashAggregate (cost=0.00..0.01 rows=1 width=20) (actual time=0.000..0.000 rows=0 loops=30044)
>         Group Key: ref_0.o_entry_d, c_credit
>         -> Result (cost=0.00..0.00 rows=0 width=20) (actual time=0.000..0.000 rows=0 loops=30044)
>           One-Time Filter: false
> Planning Time: 0.350 ms
> Execution Time: 79.237 ms

I think that might be fixed in the latest point release. I screwed up and made resets of tuple hash tables (and there's 30044 of those here) more expensive. It's fixed in the latest minor release however.

Greetings,

Andres Freund
Jinho Jung
Sent: Saturday, February 16, 2019 5:37:49 PM
To: Andres Freund
Cc: Jung, Jinho; Jeff Janes; pgsql-performance@postgresql.org
Subject: Re: Performance regressions found using sqlfuzz
> On 2019-02-14 17:27:40 +0000, Jung, Jinho wrote:
>> - Our analysis: We believe that this regression has to do with two factors: 1) conditional expression (e.g., LEAST or NULLIF) are not reduced to constants unlike string functions (e.g., CHAR_LENGTH) 2) change in the cost estimation function for bitmap scan. Execution time grows by 3 orders of magnitude. We note that this regression is only observed on large databases (e.g., scale factor of 50).
> Hm. The primary problem here is that the estimation both before and
> after are really bad. So I don't think the commit you point out here is
> really to blame. I'm not that bothered by the query not being great,
> given the weird construction with LEAST(), but we probably could fix
> that pretty easily.
We already did:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master [6f19a8c41] 2018-12-30 13:42:04 -0500
Teach eval_const_expressions to constant-fold LEAST/GREATEST expressions.
Doing this requires an assumption that the invoked btree comparison
function is immutable. We could check that explicitly, but in other
places such as contain_mutable_functions we just assume that it's true,
so we may as well do likewise here. (If the comparison function's
behavior isn't immutable, the sort order in indexes built with it would
be unstable, so it seems certainly wrong for it not to be so.)
Vik Fearing
Discussion: https://postgr.es/m/c6e8504c-4c43-35fa-6c8f-3c0b80a912cc@2ndquadrant.com
BTW, const-folding NULLIF would not be a similarly tiny fix, because
it would need to check for immutability of the underlying operator
(since it is possibly a cross-type comparison, we can't get
away with just assuming it's immutable). I'm not convinced that
case is worth carrying extra code for.
regards, tom lane
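The constant-folding Tom describes can be sketched on a toy expression representation. This is a hypothetical illustration, not PostgreSQL's eval_const_expressions: LEAST folds to a constant only when every argument is already a constant and the btree comparison function can be assumed immutable, the same assumption Tom notes is made elsewhere in the planner.

```python
def fold_least(args, comparison_is_immutable=True):
    """Fold a LEAST(...) node to a constant when all arguments are constants
    and the comparison function can be assumed immutable; otherwise keep the
    expression node unchanged."""
    if not comparison_is_immutable:
        return ("LEAST", args)                 # cannot fold safely
    if all(isinstance(a, (int, float)) for a in args):
        return min(args)                       # fold to a constant
    return ("LEAST", args)                     # non-constant argument: keep as-is

folded = fold_least([0, 1])            # like the LEAST(0, 1) in QUERY 4
unfolded = fold_least([0, "s_w_id"])   # a column reference blocks folding
```

Once the planner sees a plain constant instead of an opaque LEAST() call, it can estimate the selectivity of `s_w_id < 0` from statistics rather than falling back to a default, which is the estimation failure behind QUERY 4.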