Обсуждение: [BUGS] BUG #14780: PostgreSQL 9.6 selects a wrong plan during aggregationagainst timestamp columns

Поиск
Список
Период
Сортировка

[BUGS] BUG #14780: PostgreSQL 9.6 selects a wrong plan during aggregationagainst timestamp columns

От
sogawa@yandex.ru
Дата:
The following bug has been logged on the website:

Bug reference:      14780
Logged by:          sogawa-sps
Email address:      sogawa@yandex.ru
PostgreSQL version: 9.6.2
Operating system:   Windows 10
Description:

PostgreSQL 9.6 selects a wrong plan during aggregation against indexed
timestamp columns while it's ok for other types.

Given: table “log” that has three columns: user_id, day, hours.
   user_id character varying(36) COLLATE pg_catalog."default" NOT NULL,   day timestamp without time zone,   hours
doubleprecision
 

All columns have indexes.

The issue is that aggregation against the 'day' field works extremely slow
because makes a full scan filtering the entries that doesn’t relate to
user_id = 'ab056f5a-390b-41d7-ba56-897c14b679bf'
   select min(day) from log where user_id =
'ab056f5a-390b-41d7-ba56-897c14b679bf'
   [     {       "Execution Time": 146502.05,       "Planning Time": 0.893,       "Plan": {         "Startup Cost":
789.02,        "Actual Rows": 1,         "Plans": [           {             "Startup Cost": 0.44,             "Actual
Rows":1,             "Plans": [               {                 "Index Cond": "(log.day IS NOT NULL)",
"StartupCost": 0.44,                 "Scan Direction": "Forward",                 "Plan Width": 8,
"RowsRemoved by Index Recheck": 0,                 "Actual Rows": 1,                 "Node Type": "Index Scan",
       "Total Cost": 1395792.54,                 "Plan Rows": 1770,                 "Relation Name": "log",
   "Alias": "log",                 "Parallel Aware": false,                 "Actual Total Time": 146502.015,
    "Output": [                   "log.day"                 ],                 "Parent Relationship": "Outer",
      "Actual Startup Time": 146502.015,                 "Schema": "public",                 "Filter":
"((log.user_id)::text= 
'ab056f5a-390b-41d7-ba56-897c14b679bf'::text)",                 "Actual Loops": 1,                 "Rows Removed by
Filter":12665610,                 "Index Name": "index_log_day"               }             ],             "Node Type":
"Limit",            "Plan Rows": 1,             "Parallel Aware": false,             "Actual Total Time": 146502.016,
         "Output": [               "log.day"             ],             "Parent Relationship": "InitPlan",
"ActualStartup Time": 146502.016,             "Plan Width": 8,             "Subplan Name": "InitPlan 1 (returns $0)",
         "Actual Loops": 1,             "Total Cost": 789.02           }         ],         "Node Type": "Result",
  "Plan Rows": 1,         "Parallel Aware": false,         "Actual Total Time": 146502.019,         "Output": [
 "$0"         ],         "Actual Startup Time": 146502.019,         "Plan Width": 8,         "Actual Loops": 1,
"TotalCost": 789.03       },       "Triggers": []     }   ]
 

However the almost similar query but for the double type has a correct.
   select min(hours) from log where user_id =
'ab056f5a-390b-41d7-ba56-897c14b679bf'

Server selects entries for user_id = 'ab056f5a-390b-41d7-ba56-897c14b679bf'
first and then aggregates among them what is correct
   [     {       "Execution Time": 5.989,       "Planning Time": 1.186,       "Plan": {         "Partial Mode":
"Simple",        "Startup Cost": 6842.66,         "Actual Rows": 1,         "Plans": [           {             "Startup
Cost":66.28,             "Plan Width": 8,             "Rows Removed by Index Recheck": 0,             "Actual Rows":
745,            "Plans": [               {                 "Startup Cost": 0,                 "Plan Width": 0,
      "Actual Rows": 745,                 "Node Type": "Bitmap Index Scan",                 "Index Cond":
"((log.user_id)::text= 
'ab056f5a-390b-41d7-ba56-897c14b679bf'::text)",                 "Plan Rows": 1770,                 "Parallel Aware":
false,                "Actual Total Time": 0.25,                 "Parent Relationship": "Outer",
"ActualStartup Time": 0.25,                 "Total Cost": 65.84,                 "Actual Loops": 1,
"IndexName": "index_log_user_id"               }             ],             "Recheck Cond": "((log.user_id)::text = 
'ab056f5a-390b-41d7-ba56-897c14b679bf'::text)",             "Exact Heap Blocks": 742,             "Node Type": "Bitmap
HeapScan",             "Plan Rows": 1770,             "Relation Name": "log",             "Alias": "log",
"ParallelAware": false,             "Actual Total Time": 5.793,             "Output": [               "day",
  "hours",               "user_id"             ],             "Lossy Heap Blocks": 0,             "Parent
Relationship":"Outer",             "Actual Startup Time": 0.357,             "Total Cost": 6838.23,             "Actual
Loops":1,             "Schema": "public"           }         ],         "Node Type": "Aggregate",         "Strategy":
"Plain",        "Plan Rows": 1,         "Parallel Aware": false,         "Actual Total Time": 5.946,         "Output":
[          "min(hours)"         ],         "Actual Startup Time": 5.946,         "Plan Width": 8,         "Actual
Loops":1,         "Total Cost": 6842.67       },       "Triggers": []     }   ]
 

Optimizer have to select correct plan for the timestamp fields like it does
for double.

WA: Rewrite query into:
    select user_id, min(day) from log where user_id =
'ac43a155-4fbb-49eb-a670-02c307eb3d4f' group by user_id



--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On 16 August 2017 at 02:47,  <sogawa@yandex.ru> wrote:
>
>     select min(day) from log where user_id =
> 'ab056f5a-390b-41d7-ba56-897c14b679bf'

This is a classic database optimization problem that is difficult to
always get right. Whether to use an index on day to find the lowest
values and scan until you find the specified user_id or to scan all
the records for that user_id to find the minimum day will depend on
the number of records each user_id has and how they're distributed
across the days. If you have some users with many records then using
an index on user_id can perform terribly when those ids are
referenced.

The classic solution is to have an index on <user_id, day> and then
the database can look up the correct value in a single index probe.

These kinds of problems are better addressed to pgsql-general

-- 
greg


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs