Обсуждение: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

Поиск
Список
Период
Сортировка

BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16625
Logged by:          XINYU LIU
Email address:      xinyuliu@umich.edu
PostgreSQL version: 12.3
Operating system:   Ubuntu 19.10
Description:

Hello,

We are developing a tool for automatically finding performance bugs in
PostgreSQL. Our key insight is that given a pair of semantic equivalent
queries, a robust DBMS should return the same result within a similar
execution time. Significant time difference suggests a potential performance
bug in the DBMS.

We are sharing a pair of TPC-H queries that exhibit a potential performance
bug in this report:

First query:
SELECT "s_suppkey" 
FROM   "supplier" 
WHERE  s_suppkey > 100;  

Second query:
SELECT "s_suppkey" 
FROM   "supplier" 
WHERE  s_suppkey > 100 
GROUP  BY s_suppkey; 

[Actual Behavior]
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes only 17 millisecond, while the second query takes 42
millisecond. We think the time difference results from different plans
selected.

[Query Execution Plan]
First query:
                                                       QUERY PLAN
                                      
--------------------------------------------------------------------------------------------------------------
 Seq Scan on supplier  (cost=0.00..1734.00 rows=49909 width=4) (actual
time=0.047..14.898 rows=49900 loops=1)
   Filter: (s_suppkey > 100)
   Rows Removed by Filter: 100
 Planning Time: 0.639 ms
 Execution Time: 17.469 ms
(5 rows)




Second query:
                                                                      QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1858.77..2357.86 rows=49909 width=4) (actual
time=30.093..38.541 rows=49900 loops=1)
   Group Key: s_suppkey
   ->  Seq Scan on supplier  (cost=0.00..1734.00 rows=49909 width=4) (actual
time=0.047..13.656 rows=49900 loops=1)
         Filter: (s_suppkey > 100)
         Rows Removed by Filter: 100
 Planning Time: 0.669 ms
 Execution Time: 42.270 ms
(7 rows)


[Expected Behavior]
Since these two queries are semantically equivalent, we were hoping that
PostgreSQL will return the same results in roughly the same amount of
time.


[Test Environment]
Ubuntu 19.10
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)


[Steps for reproducing our observations]
    
* Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
* Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
    echo $i
    name=`echo $i|cut -d'.' -f1`
    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
*Execute the queries


Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От
Christophe Pettus
Дата:

> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote:
[snip]
> First query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100
> GROUP  BY s_suppkey;

[snip]

> [Expected Behavior]
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will return the same results in roughly the same amount of
> time.

These two queries are not semantically equivalent, as described.  I might guess that the table definition has a primary
key(or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in
thereport or the query plans that indicate this is the case. 
--
-- Christophe Pettus
   xof@thebuild.com




Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От
Xinyu Liu
Дата:
Thank you so much for your reply!
Yes, I should have pointed out in the bug report that "supplier"."s_suppkey" is the primary key and thus the GROUP BY should be redundant.
I am also attaching the result of "\d supplier" to this email:
tpch5=# \d supplier
                        Table "public.supplier"
   Column    |          Type          | Collation | Nullable | Default
-------------+------------------------+-----------+----------+---------
 s_suppkey   | integer                |           | not null |
 s_name      | character(25)          |           | not null |
 s_address   | character varying(40)  |           | not null |
 s_nationkey | integer                |           | not null |
 s_phone     | character(15)          |           | not null |
 s_acctbal   | numeric(15,2)          |           | not null |
 s_comment   | character varying(101) |           | not null |
Indexes:
    "supplier_pkey" PRIMARY KEY, btree (s_suppkey)
Foreign-key constraints:
    "supplier_fk1" FOREIGN KEY (s_nationkey) REFERENCES nation(n_nationkey)
Referenced by:
    TABLE "partsupp" CONSTRAINT "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)

We will make sure we conclude such important context information in our future bug reports.


On Sat, Sep 19, 2020 at 11:57 AM Christophe Pettus <xof@thebuild.com> wrote:


> On Sep 18, 2020, at 20:32, PG Bug reporting form <noreply@postgresql.org> wrote:
[snip]
> First query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100; 
>
> Second query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100
> GROUP  BY s_suppkey;

[snip]

> [Expected Behavior]
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL will return the same results in roughly the same amount of
> time.

These two queries are not semantically equivalent, as described.  I might guess that the table definition has a primary key (or should have had) on "supplier"."s_suppkey", and thus the GROUP BY should be redundant, but there's nothing in the report or the query plans that indicate this is the case.
--
-- Christophe Pettus
   xof@thebuild.com



--
-Xinyu

Re: BUG #16625: Query Optimizer - Performance bug related to removal of unnecessary aggregate function

От
David Rowley
Дата:
On Sun, 20 Sep 2020 at 03:53, PG Bug reporting form
<noreply@postgresql.org> wrote:
> We are developing a tool for automatically finding performance bugs in
> PostgreSQL. Our key insight is that given a pair of semantic equivalent
> queries, a robust DBMS should return the same result within a similar
> execution time. Significant time difference suggests a potential performance
> bug in the DBMS.

It's important to differentiate between bugs and optimisations that
PostgreSQL does not perform.   I imagine the findings of running this
tool is best directed towards a wiki page on
https://wiki.postgresql.org/

It's also important to always go ahead and apply such optimisations
without any regard to the cost of checking if the optimisation can
apply.  For optimisations like this, it's not always just a simple
case of attempting to apply them regardless. Checking for such cases
will penalise queries where the optimisation cannot be applied. The
people that benefit are the ones that write bad SQL and the people
that lose out are the ones who quite good SQL. That's not a
particularly good incentive to write good SQL.  In some cases, the
cost of checking if the optimisation can be applied will be so
negligible that it's worth it as the gains are good if it can be
applied.  The answer to whether this the case for this particilar
optimisation will depend on who you ask.

> We are sharing a pair of TPC-H queries that exhibit a potential performance
> bug in this report:
>
> First query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100;
>
> Second query:
> SELECT "s_suppkey"
> FROM   "supplier"
> WHERE  s_suppkey > 100
> GROUP  BY s_suppkey;
>
> [Actual Behavior]
> We executed both queries on the TPC-H benchmark of scale factor 5: the first
> query takes only 17 millisecond, while the second query takes 42
> millisecond. We think the time difference results from different plans
> selected.

Accounting for the information on the other email that mentions
s_suppkey is the primary key of the supplier table, this is not a bug.
It's simply an optimisation that we currently don't apply. There's a
patch around that aims to implement this but it's not yet been
applied. So the optimisation may appear in some future version of
PostgreSQL. If you'd like to help with that then please look at
https://commitfest.postgresql.org/29/2433/

I'd suggest a wiki page would be a good place to note down other
possible future optimisations. There's no shortage of possible
optimisations that we don't apply, and an endless stream of bug
reports is not the way to have new optimisations added to the
PostgreSQL planner.

David