Re: planner support functions: handle GROUP BY estimates ?

From: Justin Pryzby
Subject: Re: planner support functions: handle GROUP BY estimates ?
Msg-id: 20191223001648.GG30414@telsasoft.com
In reply to: planner support functions: handle GROUP BY estimates ?  (Justin Pryzby <pryzby@telsasoft.com>)
Responses: Re: planner support functions: handle GROUP BY estimates ?  (Justin Pryzby <pryzby@telsasoft.com>)
List: pgsql-hackers
On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
> 
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by".  Currently that has to be done
> manually, by creating an expression index, and only if the function is
> IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
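
For reference, the manual workaround is roughly the sketch below (table "tbl"
and timestamp column "ts" are hypothetical names).  It works only because
date_trunc on plain timestamp is IMMUTABLE; the same CREATE INDEX is rejected
for date_trunc on timestamptz:

  -- An expression index makes ANALYZE collect statistics (including
  -- ndistinct) for the expression itself:
  CREATE INDEX tbl_trunc_idx ON tbl (date_trunc('day', ts));
  ANALYZE tbl;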

I didn't hear back, so I tried implementing this for date_trunc().  Currently,
the planner assumes that a function outputs as many groups as its input
variables have.  Most of our reports invoke date_trunc() (or similar), so my
earlier attempt to alert on rowcount misestimates was very short-lived.

I currently assume that the input data has 1 second granularity; in effect,
the estimate becomes the input column's ndistinct divided by the truncation
width in seconds (60 for 'minute', 3600 for 'hour', 86400 for 'day'):
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a;
|postgres=# ANALYZE t;
 
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group  (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate  (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group  (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)
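
The estimated rowcounts above exactly match the actual group counts, which is
easy to cross-check (the numbers in the comment are just arithmetic for a
7-day, 1-second series):

  -- 169 hours = 7*24 + 1; 10081 minutes = 7*24*60 + 1; 8 days = 7 + 1
  SELECT count(DISTINCT date_trunc('hour',   i)) AS hours,
         count(DISTINCT date_trunc('minute', i)) AS minutes,
         count(DISTINCT date_trunc('day',    i)) AS days
  FROM t;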

If the input timestamps have (say) hourly granularity, the rowcount will be
*underestimated* by 3600x: the column's distinct values are already only one
per hour, but the patch still divides as if they were one per second.  That is
worse than the behavior in master of overestimating by (for "day") 24x.

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying the corrective factors for seconds or minutes when they
would bring the rowcount below some threshold (say, 2 or 100).  That would
still divide by 24 for "day" but might skip the final /60 or /60/60.
Ultimately, that's more "fudge" than anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, 10pow2, 10pow3, etc., each of which
builds in the given corrective factor.  That should allow more accurate
rowcounts for input data whose granularity matches the given number of
seconds.

Ideally the factor would be user-specified, but I don't think that's possible
to specify in SQL; the constant has to be built into the C function.  At
Telsasoft, our data mostly has 15-minute granularity (900sec), so we'd maybe
make a "date_trunc" function in the user schema which calls
pg_catalog.date_trunc and is attached to the support function
timestamp_support_10pow3 (see the sketch after this list);

There could be a "base" support function that accepts a multiplier argument,
and then any user-provided C extension would be a one-liner specifying an
arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;
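
To make option 2 concrete, the user-schema wrapper could look like the sketch
below; timestamp_support_10pow3 is the hypothetical template support function
from option 2 (it would have to come from a C extension, and attaching a
SUPPORT clause requires superuser):

  -- Wrapper behaving exactly like date_trunc(), but carrying a support
  -- function whose built-in corrective factor (~1000s) roughly matches
  -- our 900s data granularity:
  CREATE FUNCTION myschema.date_trunc(text, timestamptz)
      RETURNS timestamptz
      LANGUAGE sql STABLE
      SUPPORT timestamp_support_10pow3
      AS $$ SELECT pg_catalog.date_trunc($1, $2) $$;

With myschema listed ahead of pg_catalog in search_path, unqualified calls to
date_trunc() would resolve to the wrapper and pick up the improved estimate.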

Justin
