Re: Collecting statistics about contents of JSONB columns

From: Mahendra Singh Thalor
Subject: Re: Collecting statistics about contents of JSONB columns
Date:
Msg-id: CAKYtNArmMg=cU6SogL6u-b21Gx=rMoR9G1ah6kLANRw7LhHukA@mail.gmail.com
In response to: Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Tue, 25 Jan 2022 at 03:50, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it.  We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage.  Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think a custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work to extract the path stats:
> >   - iterate through MCV/Histogram JSONB arrays
> >   - cast numeric values to float, strings to text, etc.
> >   - build PG arrays from the extracted datums
> >   - form the pg_statistic tuple.
> >
> > With a custom data type we could store the pg_statistic tuple unmodified
> > and use it without any overhead.  But then we need to modify
> > VariableStatData and several functions a bit to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
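Just to make the conversion step described above concrete: for every value stored in the JSONB-encoded MCV/histogram, something like the following unwrapping and casting has to happen. This is only a rough SQL analogue for illustration; the actual work is done in C on the JSONB representation.

SELECT (elem #>> '{}')::numeric::float8 AS as_float
FROM jsonb_array_elements('[1.5, 2.5, 3.5]'::jsonb) AS elem;
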
> > Maybe a simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or a 1:M table. I'm not sure it's going
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> the primary question, and worry about the exact storage format after that.
>
>
> > Also there is an idea to store per-path stats separately in pg_statistic_ext
> > rows using an expression like (jb #> '{foo,bar}') as stxexprs.  This could
> > also help the user select which paths to analyze simply by using some
> > sort of CREATE STATISTICS.  But it is really unclear how to:
> >   * store pg_statistic_ext rows from typanalyze
> >   * form path expressions for array elements (maybe add a new jsonpath
> >     operator)
> >   * match various -> operator chains to stxexprs
> >   * jsonpath-like languages need a simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces a JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
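For context, per-path expression statistics can already be created manually today (PG 14+). A sketch, with purely illustrative names, assuming a table test with a jsonb column jb:

-- Collects standard stats for the expression, but it only helps queries that
-- use exactly this expression, and jb #> '{foo,bar}' still yields jsonb,
-- so scalar selectivity remains limited, as noted above.
CREATE STATISTICS jb_foo_bar_stats ON (jb #> '{foo,bar}') FROM test;
ANALYZE test;
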
> >
> >
> > Per-path statistics should only be collected for scalars.  This can be
> > enabled by the flag JsonAnalyzeContext.scalarsOnly.  But there is a
> > problem: the computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in the general case.  Also
> > we will not be able to use them internally in scalarineqsel() and var_eq_const()
> > (see jsonSelectivity()).  Instead, we will have to implement our own
> > estimator functions for the JSONB comparison operators that will correctly
> > use our hacked MCVs and histograms (and maybe not all cases will be
> > supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set, to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars.  So we also need to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
> >
> >
> > I tried to fix most of the places marked as XXX and FIXME; the new
> > version of the patches is attached.  There are a lot of changes; you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>

Thanks, Nikita and Tomas, for these patches.

For the last few days I have been trying to understand these patches, and based on Tomas's suggestion, I ran some performance tests.

With the attached .SQL file, I can see that ANALYZE takes much longer with these patches.

Setup:
autovacuum = off
All other settings are default.

Load the attached file with and without the patches and compare the time taken by ANALYZE.
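For reference, the test data is roughly of this shape; the actual attached .SQL file may differ, and the table/column names here are only illustrative:

CREATE TABLE test (id int, jb jsonb);

-- Each document gets one key out of roughly 951 possible ones plus a common
-- key, so ANALYZE sees a large number of distinct paths across the sample.
INSERT INTO test
SELECT i,
       jsonb_build_object('k' || (i % 951), i, 'common', i % 100)
FROM generate_series(1, 100000) AS i;

\timing on
ANALYZE test;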

With json patches:
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=#
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 328 kB
(1 row)
--

Without json patches:
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 272 kB
(1 row)

I haven't found the root cause of this yet, but I suspect the time goes into looping over all the paths.
In my test data there are 951 distinct paths in total. During analyze, we first scan all the sample rows (30,000) and collect all the distinct paths (here 951), and then for every single path we loop over all the sample rows again to collect stats for that path. That is roughly 951 x 30,000 = ~28.5 million row visits, and I suspect these nested loops are where the time is spent.
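If it is useful, the number of distinct key paths in the data can be cross-checked with a recursive query along these lines (assuming the jsonb column is named jb; it only descends into objects, not into array elements):

WITH RECURSIVE paths(path, value) AS (
    SELECT ARRAY[e.key], e.value
    FROM test,
         jsonb_each(CASE WHEN jsonb_typeof(jb) = 'object'
                         THEN jb ELSE '{}'::jsonb END) AS e(key, value)
  UNION ALL
    SELECT p.path || e.key, e.value
    FROM paths AS p,
         jsonb_each(CASE WHEN jsonb_typeof(p.value) = 'object'
                         THEN p.value ELSE '{}'::jsonb END) AS e(key, value)
)
SELECT count(DISTINCT path) AS distinct_paths FROM paths;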

I will run perf and try to find the root cause of this.

Apart from this performance issue, I haven't found any crashes or other problems.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachments
