Обсуждение: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

Поиск
Список
Период
Сортировка

BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16624
Logged by:          XINYU LIU
Email address:      XINYULIU@UMICH.EDU
PostgreSQL version: 13rc1
Operating system:   Ubuntu 20.04
Description:

Hello,

We are developing a tool for automatically finding performance bugs in
PostgreSQL. Our key insight is that given a pair of semantic equivalent
queries, a robust DBMS should return the same result within a similar
execution time. Significant time difference suggests a potential performance
bug in the DBMS.

We are sharing a pair of TPC-H queries that exhibit a potential performance
bug in this report:

First query:
SELECT "ps_suppkey" 
FROM   "partsupp" 
WHERE  "ps_partkey" = 1486; 

Second query:
SELECT "ps_suppkey" 
FROM   "partsupp" 
WHERE  "ps_partkey" + 1486 = 2972; 

[Actual Behavior]
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes only 1.059 millisecond, while the second query takes 247.176
millisecond. We think the time difference results from different plans
selected.

[Query Execution Plan]
* First query:
                                                         QUERY PLAN
                                                

-----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using partsupp_pkey on partsupp  (cost=0.43..4.59 rows=9
width=4) (actual time=0.692..0.694 rows=4 loops=1)
   Index Cond: (ps_partkey = 1486)
   Heap Fetches: 0
 Planning Time: 4.748 ms
 Execution Time: 1.059 ms
(5 rows)

* Second query:
                                                                      QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.43..91865.94 rows=19994 width=4) (actual
time=2.032..246.821 rows=4 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Index Only Scan using partsupp_pkey on partsupp
(cost=0.43..88866.54 rows=8331 width=4) (actual time=159.371..240.012 rows=1
loops=3)
         Filter: ((ps_partkey + 1486) = 2972)
         Rows Removed by Filter: 1333332
         Heap Fetches: 0
 Planning Time: 4.556 ms
 Execution Time: 247.176 ms
(9 rows)


[Expected Behavior]
I would have expected the DBMS to run these two queries with similar
execution time, given that they both have the same semantics. Notably, the
execution time difference between these two queries will grow significantly
when the size of the database grows. On the TPC-H benchmark of scale factor
100, the first query takes 1.9 millisecond, while the second query takes 83
seconds.

[Test Environment]
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v13 beta3
Database: TPC-H benchmark (with scale factor 5)

[Steps for reproducing our observations]
    
* Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
* Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
    echo $i
    name=`echo $i|cut -d'.' -f1`
    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
* Execute the queries


Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> We are developing a tool for automatically finding performance bugs in
> PostgreSQL. Our key insight is that given a pair of semantic equivalent
> queries, a robust DBMS should return the same result within a similar
> execution time. Significant time difference suggests a potential performance
> bug in the DBMS.

Unfortunately, that's a viewpoint that is not going to win you a lot of
converts.

> First query:
> SELECT "ps_suppkey" 
> FROM   "partsupp" 
> WHERE  "ps_partkey" = 1486; 

> Second query:
> SELECT "ps_suppkey" 
> FROM   "partsupp" 
> WHERE  "ps_partkey" + 1486 = 2972; 

Sure, in theory the optimizer could rearrange that to
"WHERE ps_partkey = 2972 - 1486" and thus still have an indexable
condition.  In practice, we have essentially no interest in doing so.
It would require vastly more semantic knowledge than the optimizer
has got, and looking for such cases would consume lots of planner
cycles that would be better spent elsewhere.

If you want to complain that that makes Postgres' optimizer "not
robust", you're entitled to that opinion.  But it's very unlikely
to change in the foreseeable future.  We expect that if a user
cares about the performance of a particular query, they'll be
willing to write it in a form that the optimizer can deal with.

            regards, tom lane



Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
David Rowley
Дата:
On Sat, 19 Sep 2020 at 09:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > We are developing a tool for automatically finding performance bugs in
> > PostgreSQL. Our key insight is that given a pair of semantic equivalent
> > queries, a robust DBMS should return the same result within a similar
> > execution time. Significant time difference suggests a potential performance
> > bug in the DBMS.
>
> Unfortunately, that's a viewpoint that is not going to win you a lot of
> converts.
>
> > First query:
> > SELECT "ps_suppkey"
> > FROM   "partsupp"
> > WHERE  "ps_partkey" = 1486;
>
> > Second query:
> > SELECT "ps_suppkey"
> > FROM   "partsupp"
> > WHERE  "ps_partkey" + 1486 = 2972;
>
> Sure, in theory the optimizer could rearrange that to
> "WHERE ps_partkey = 2972 - 1486" and thus still have an indexable
> condition.  In practice, we have essentially no interest in doing so.
> It would require vastly more semantic knowledge than the optimizer
> has got, and looking for such cases would consume lots of planner
> cycles that would be better spent elsewhere.
>
> If you want to complain that that makes Postgres' optimizer "not
> robust", you're entitled to that opinion.  But it's very unlikely
> to change in the foreseeable future.  We expect that if a user
> cares about the performance of a particular query, they'll be
> willing to write it in a form that the optimizer can deal with.

I'd go a bit further and say that the queries are not even equivalent.

The reason for that is that "ps_partkey" + 1486 = 2972 could cause an
arithmetic overflow ERROR whereas "ps_partkey" = 1486 couldn't.
Perhaps nobody would complain if we didn't throw an ERROR where we
otherwise might have. However, when you consider that you must do 2972
- 1486 to obtain the new constant to compare to the index column, you
could start getting the overflows there instead.  In that case, you
could cause overflows in cases that would have worked perfectly fine
without applying the optimisation.  So it seems to me that the author
the query could have written it that way exactly to eliminate the
chances of such overflows. They might not be happy if we rewrite their
query to make overflows possible again.

David



Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Peter Geoghegan
Дата:
On Fri, Sep 18, 2020 at 2:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Unfortunately, that's a viewpoint that is not going to win you a lot of
> converts.

I think that this perspective is held by all groups that works on
query optimization, not just the PostgreSQL developers:

https://use-the-index-luke.com/sql/where-clause/obfuscation/math

Evidently nobody even tries to do this stuff.

-- 
Peter Geoghegan



Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Tom Lane
Дата:
Peter Geoghegan <pg@bowt.ie> writes:
> Evidently nobody even tries to do this stuff.

Yeah.  But it would be particularly hard to do in Postgres, because of
our resolutely extensible approach to data types and operators/functions.
We'd have to devise some general-purpose APIs whereby a specific data
type or function could decide whether a given transformation is valid
or safe.  It's not even very clear where would be the right place to
lodge the knowledge.

If you were really intent on making this happen, you could imagine
for instance attaching a SupportRequestSimplify hook to "int4eq",
which could transform "x - c1 = c2" into "x = c1 + c2" after verifying
that the constant values wouldn't produce overflow.  But this idea
doesn't scale very well, since you'd really be talking about looking
for a whole bunch of different possible rearrangements not just one.
And it's unclear how you could avoid reimplementing a bunch of that
logic for each different equality operator.  On top of that, you would
have little visibility into whether this was worth the cycles, since
such code would be unlikely to know whether "x" is an indexable
value or whether the expression occurs in a place where it could
be used in an indexqual anyway.

            regards, tom lane



Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Peter Geoghegan
Дата:
On Wed, Sep 23, 2020 at 12:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you were really intent on making this happen, you could imagine
> for instance attaching a SupportRequestSimplify hook to "int4eq",
> which could transform "x - c1 = c2" into "x = c1 + c2" after verifying
> that the constant values wouldn't produce overflow.  But this idea
> doesn't scale very well, since you'd really be talking about looking
> for a whole bunch of different possible rearrangements not just one.

Yeah. Apparently even sophisticated C compilers can only perform
simpler algebraic reductions on integers, even though it probably
matters a lot more than it would during database query optimization
[1]. It's not as universal as you might think at first. Algebraic
reductions are unsafe with floats because you risk losing precision. I
bet that there are various subtle behaviors that get broken once you
think it all through for each datatype.

It's not that hard to create algebraic expressions involving integers
that production quality C compilers cannot perform relatively obvious
algebraic reductions on. I'm not an expert by any means, but I suspect
that it's a surprisingly hard problem.

As you said, the fix for this problem is don't do that. People that
write SQL for a living all seem to figure that out on their own. It's
not something we hear any complaints about.

[1] https://www.agner.org/optimize/optimizing_cpp.pdf
-- 
Peter Geoghegan



Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Xinyu Liu
Дата:
Thank you so much for the helpful discussion!

On Wed, Sep 23, 2020 at 4:49 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Sep 23, 2020 at 12:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you were really intent on making this happen, you could imagine
> for instance attaching a SupportRequestSimplify hook to "int4eq",
> which could transform "x - c1 = c2" into "x = c1 + c2" after verifying
> that the constant values wouldn't produce overflow.  But this idea
> doesn't scale very well, since you'd really be talking about looking
> for a whole bunch of different possible rearrangements not just one.

Yeah. Apparently even sophisticated C compilers can only perform
simpler algebraic reductions on integers, even though it probably
matters a lot more than it would during database query optimization
[1]. It's not as universal as you might think at first. Algebraic
reductions are unsafe with floats because you risk losing precision. I
bet that there are various subtle behaviors that get broken once you
think it all through for each datatype.

It's not that hard to create algebraic expressions involving integers
that production quality C compilers cannot perform relatively obvious
algebraic reductions on. I'm not an expert by any means, but I suspect
that it's a surprisingly hard problem.

As you said, the fix for this problem is don't do that. People that
write SQL for a living all seem to figure that out on their own. It's
not something we hear any complaints about.

[1] https://www.agner.org/optimize/optimizing_cpp.pdf
--
Peter Geoghegan


--
-Xinyu

Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification

От
Peter Geoghegan
Дата:
On Thu, Sep 24, 2020 at 7:44 PM Xinyu Liu <xinyuliu@umich.edu> wrote:
> Thank you so much for the helpful discussion!

You're welcome. Also, I hope that my remarks were not discouraging in
a general way.

I think that people from academic institutions sometimes don't have
the practical context that is available to practitioners. Having a
sense of the pressures that caused the system to develop (or not
develop) in a certain direction is crucially important. I was blunt
because I thought it was the most helpful way of communicating my
message.

-- 
Peter Geoghegan