Re: Collecting statistics about contents of JSONB columns

From: Mahendra Singh Thalor
Subject: Re: Collecting statistics about contents of JSONB columns
Date:
Msg-id: CAKYtNArmMg=cU6SogL6u-b21Gx=rMoR9G1ah6kLANRw7LhHukA@mail.gmail.com
In response to: Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Tue, 25 Jan 2022 at 03:50, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it.  We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage.  Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think a custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work to extract the path stats:
> >   - iterate through MCV/Histogram JSONB arrays
> >   - cast numeric values to float, strings to text, etc.
> >   - build PG arrays from the extracted datums
> >   - form the pg_statistic tuple.
> >
> > With a custom data type we could store the pg_statistic tuple unmodified
> > and use it without any overhead.  But then we need to modify
> > VariableStatData and several functions a bit to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
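Just to make the conversion step described above concrete: for every value stored in the JSONB-encoded MCV/histogram, something like the following unwrapping and casting has to happen. This is only a rough SQL analogue for illustration; the actual work is done in C on the JSONB representation.

SELECT (elem #>> '{}')::numeric::float8 AS as_float
FROM jsonb_array_elements('[1.5, 2.5, 3.5]'::jsonb) AS elem;
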
> > Maybe a simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or a 1:M table. I'm not sure it's going
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> the primary question, and worry about the exact storage format after that.
>
>
> > Also there is an idea to store per-path stats separately in pg_statistic_ext
> > rows using an expression like (jb #> '{foo,bar}') as stxexprs.  This could
> > also help the user select which paths to analyze simply by using some
> > sort of CREATE STATISTICS.  But it is really unclear how to:
> >   * store pg_statistic_ext rows from typanalyze
> >   * form path expressions for array elements (maybe add a new jsonpath
> >     operator)
> >   * match various -> operator chains to stxexprs
> >   * jsonpath-like languages need a simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces a JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
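For context, per-path expression statistics can already be created manually today (PG 14+). A sketch, with purely illustrative names, assuming a table test with a jsonb column jb:

-- Collects standard stats for the expression, but it only helps queries that
-- use exactly this expression, and jb #> '{foo,bar}' still yields jsonb,
-- so scalar selectivity remains limited, as noted above.
CREATE STATISTICS jb_foo_bar_stats ON (jb #> '{foo,bar}') FROM test;
ANALYZE test;
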
> >
> >
> > Per-path statistics should only be collected for scalars.  This can be
> > enabled by the flag JsonAnalyzeContext.scalarsOnly.  But there is a
> > problem: the computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in the general case.  Also
> > we will not be able to use them internally in scalarineqsel() and var_eq_const()
> > (see jsonSelectivity()).  Instead, we will have to implement our own
> > estimator functions for the JSONB comparison operators that will correctly
> > use our hacked MCVs and histograms (and maybe not all cases will be
> > supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set, to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars.  So we also need to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
> >
> >
> > I tried to fix most of the places marked as XXX and FIXME; the new
> > version of the patches is attached.  There are a lot of changes; you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>

Thanks, Nikita and Tomas, for these patches.

For the last few days I have been trying to understand these patches, and based on Tomas's suggestion, I ran some performance tests.

With the attached .SQL file, I can see that ANALYZE takes much longer with these patches.

Setup:
autovacuum = off
All other settings are default.

Load the attached file with and without the patches and compare the time taken by ANALYZE.
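For reference, the test data is roughly of this shape; the actual attached .SQL file may differ, and the table/column names here are only illustrative:

CREATE TABLE test (id int, jb jsonb);

-- Each document gets one key out of roughly 951 possible ones plus a common
-- key, so ANALYZE sees a large number of distinct paths across the sample.
INSERT INTO test
SELECT i,
       jsonb_build_object('k' || (i % 951), i, 'common', i % 100)
FROM generate_series(1, 100000) AS i;

\timing on
ANALYZE test;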

With json patches:
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=#
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 328 kB
(1 row)
--

Without json patches:
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 272 kB
(1 row)

I haven't found the root cause of this yet, but I suspect the time goes into looping over all the paths.
In my test data there are 951 distinct paths in total. During analyze, we first scan all the sample rows (30,000) and collect all the distinct paths (here 951), and then for every single path we loop over all the sample rows again to collect stats for that path. That is roughly 951 x 30,000 = ~28.5 million row visits, and I suspect these nested loops are where the time is spent.
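If it is useful, the number of distinct key paths in the data can be cross-checked with a recursive query along these lines (assuming the jsonb column is named jb; it only descends into objects, not into array elements):

WITH RECURSIVE paths(path, value) AS (
    SELECT ARRAY[e.key], e.value
    FROM test,
         jsonb_each(CASE WHEN jsonb_typeof(jb) = 'object'
                         THEN jb ELSE '{}'::jsonb END) AS e(key, value)
  UNION ALL
    SELECT p.path || e.key, e.value
    FROM paths AS p,
         jsonb_each(CASE WHEN jsonb_typeof(p.value) = 'object'
                         THEN p.value ELSE '{}'::jsonb END) AS e(key, value)
)
SELECT count(DISTINCT path) AS distinct_paths FROM paths;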

I will run perf and try to find the root cause of this.

Apart from this performance issue, I haven't found any crashes or other problems.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachments
