Re: PoC/WIP: Extended statistics on expressions

From	Tomas Vondra
Subject	Re: PoC/WIP: Extended statistics on expressions
Date	7 December 2020, 14:15:17
Msg-id	b2995773-c9a1-5d3b-fc90-4f3ab189be11@enterprisedb.com
In reply to	Re: PoC/WIP: Extended statistics on expressions (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses	Re: PoC/WIP: Extended statistics on expressions Re: PoC/WIP: Extended statistics on expressions
List	pgsql-hackers


On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
> 
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
> 
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
> 
>   CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
> 
> fails, but
> 
>   CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
> 
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
> 
>   CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
> 
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
> 
>   CREATE STATISTICS s ON col FROM tbl;
>   ERROR:  extended statistics require at least 2 columns
> 
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
> 
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     ON (expression)
>     FROM table_name
> 
>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>     [ ( statistics_kind [, ... ] ) ]
>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>     FROM table_name
> 
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
> 
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
> 
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
> 
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
> 

I think it makes sense in general. I see two issues with this approach,
though:

* Adding separate expression/standard statistics objects makes the
list of statistics longer - I wonder if this might have a measurable
impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always
build the per-expression stats. Firstly, it seems a bit strange that it
behaves differently than the other kinds. Secondly, I wonder if there
are cases where it'd be desirable to explicitly disable building these
per-expression stats. For example, what if we have multiple extended
statistics objects overlapping on a couple of expressions? It seems
pointless to build the stats for all of them.
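
To make the overlap case concrete, here is a minimal sketch. It assumes
the second (multi-column) syntax proposed in the quoted message; the
table, its columns, and the statistics names are hypothetical, and the
statements are not guaranteed to be accepted by any particular server
version.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE tbl (a int, b int, c int);

-- Two statistics objects that both include the expression (a + b).
-- If the multi-column syntax always builds per-expression stats,
-- the stats for (a + b) would be built once for s1 and again for s2.
CREATE STATISTICS s1 (mcv) ON (a + b), c FROM tbl;
CREATE STATISTICS s2 (ndistinct) ON (a + b), (b * 2) FROM tbl;

ANALYZE tbl;
```

With no way to switch off the per-expression part, both objects would
carry their own copy of the same single-expression stats.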


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
