Обсуждение: extended stats on partitioned tables

Поиск
Список
Период
Сортировка

extended stats on partitioned tables

От
Justin Pryzby
Дата:
extended stats objects are allowed on partitioned tables since v10.
https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com
8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b

But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty.
This was the consequence of a commit to avoid an error I reported with stats on
inheritence parents (not partitioned tables).

preceding 859b3003de, stats on the parent table *did* improve the estimate,
so this part of the commit message seems to have been wrong?
|commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb
|    Don't build extended statistics on inheritance trees
...
|    Moreover, the current selectivity estimation code only works with individual
|    relations, so building statistics on inheritance trees would be pointless
|    anyway.

|CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i);
|CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100);
|TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
|CREATE STATISTICS pp ON (a),(b) FROM p;
|VACUUM ANALYZE p;
|SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass;

|postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort;
| HashAggregate  (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1)

|postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2;
| HashAggregate  (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1)

So I think this is a regression, and extended stats should be populated for
partitioned tables - I had actually done that for some parent tables and hadn't
noticed that the stats objects no longer do anything.

That begs the question if the current behavior for inheritence parents is
correct..

CREATE TABLE p (i int, a int, b int);
CREATE TABLE pd () INHERITS (p);
INSERT INTO pd SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
CREATE STATISTICS pp ON (a),(b) FROM p;
VACUUM ANALYZE p;
explain analyze SELECT a,b FROM p GROUP BY 1,2;

| HashAggregate  (cost=25.99..26.99 rows=100 width=8) (actual time=3.268..3.284 rows=10 loops=1)

Since child tables can be queried directly, it's a legitimate question whether
we should collect stats for the table heirarchy or (since the catalog only
supports one) only the table itself.  I'd think that stats for the table
hierarchy would be more commonly useful (but we shouldn't change the behavior
in existing releases again).  Anyway it seems unfortunate that
statistic_ext_data still has no stxinherited.

Note that for partitioned tables if I enable enable_partitionwise_aggregate,
then stats objects on the child tables can be helpful (but that's also
confusing to the question at hand).



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 9/23/21 11:26 PM, Justin Pryzby wrote:
> extended stats objects are allowed on partitioned tables since v10.
> https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com
> 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b
> 
> But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty.
> This was the consequence of a commit to avoid an error I reported with stats on
> inheritence parents (not partitioned tables).
> 
> preceding 859b3003de, stats on the parent table *did* improve the estimate,
> so this part of the commit message seems to have been wrong?
> |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb
> |    Don't build extended statistics on inheritance trees
> ...
> |    Moreover, the current selectivity estimation code only works with individual
> |    relations, so building statistics on inheritance trees would be pointless
> |    anyway.
> 
> |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i);
> |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100);
> |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
> |CREATE STATISTICS pp ON (a),(b) FROM p;
> |VACUUM ANALYZE p;
> |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass;
> 
> |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort;
> | HashAggregate  (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1)
> 
> |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2;
> | HashAggregate  (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1)
> 
> So I think this is a regression, and extended stats should be populated for
> partitioned tables - I had actually done that for some parent tables and hadn't
> noticed that the stats objects no longer do anything.
> 
> That begs the question if the current behavior for inheritence parents is
> correct..
> 
> CREATE TABLE p (i int, a int, b int);
> CREATE TABLE pd () INHERITS (p);
> INSERT INTO pd SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
> CREATE STATISTICS pp ON (a),(b) FROM p;
> VACUUM ANALYZE p;
> explain analyze SELECT a,b FROM p GROUP BY 1,2;
> 
> | HashAggregate  (cost=25.99..26.99 rows=100 width=8) (actual time=3.268..3.284 rows=10 loops=1)
> 

Agreed, that seems like a regression, but I don't see how to fix that 
without having the extra flag in the catalog. Otherwise we can store 
just one version for each statistics object :-(

> Since child tables can be queried directly, it's a legitimate question whether
> we should collect stats for the table heirarchy or (since the catalog only
> supports one) only the table itself.  I'd think that stats for the table
> hierarchy would be more commonly useful (but we shouldn't change the behavior
> in existing releases again).  Anyway it seems unfortunate that
> statistic_ext_data still has no stxinherited.
> 

Yeah, we probably need the flag - I planned to get it into 14, but then 
I got distracted by something else :-/

Attached is a PoC that I quickly bashed together today. It's pretty raw, 
but it passed "make check" and I think it does most of the things right. 
Can you try if this fixes the estimates with partitioned tables?

Extended statistics use two catalogs, pg_statistic_ext for definition, 
while pg_statistic_ext_data stores the built statistics objects - the 
flag needs to be in the "data" catalog, and managing the records is a 
bit challenging - the current PoC code mostly works, but I had to relax 
some error checks and I'm sure there are cases when we fail to remove a 
row, or something like that.

> Note that for partitioned tables if I enable enable_partitionwise_aggregate,
> then stats objects on the child tables can be helpful (but that's also
> confusing to the question at hand).
> 

Yeah. I think it'd be helpful to assemble a script with various test 
cases demonstrating how we estimate various cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: extended stats on partitioned tables

От
Justin Pryzby
Дата:
On Sat, Sep 25, 2021 at 09:27:10PM +0200, Tomas Vondra wrote:
> On 9/23/21 11:26 PM, Justin Pryzby wrote:
> > extended stats objects are allowed on partitioned tables since v10.
> > https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com
> > 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b
> > 
> > But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty.
> > This was the consequence of a commit to avoid an error I reported with stats on
> > inheritence parents (not partitioned tables).
> > 
> > preceding 859b3003de, stats on the parent table *did* improve the estimate,
> > so this part of the commit message seems to have been wrong?
> > |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb
> > |    Don't build extended statistics on inheritance trees
> > ...
> > |    Moreover, the current selectivity estimation code only works with individual
> > |    relations, so building statistics on inheritance trees would be pointless
> > |    anyway.
> > 
> > |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i);
> > |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100);
> > |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
> > |CREATE STATISTICS pp ON (a),(b) FROM p;
> > |VACUUM ANALYZE p;
> > |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass;
> > 
> > |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort;
> > | HashAggregate  (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1)
> > 
> > |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2;
> > | HashAggregate  (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1)
> > 
> > So I think this is a regression, and extended stats should be populated for
> > partitioned tables - I had actually done that for some parent tables and hadn't
> > noticed that the stats objects no longer do anything.
...
> Agreed, that seems like a regression, but I don't see how to fix that
> without having the extra flag in the catalog. Otherwise we can store just
> one version for each statistics object :-(

Do you think it's possible to backpatch a fix to handle partitioned tables
specifically ?

The "tuple already updated" error which I reported and which was fixed by
859b3003 involved inheritence children.  Since partitioned tables have no data
themselves, the !inh check could be relaxed.  It's not totally clear to me if
the correct statistics would be used in that case.  I suppose the wrong
(inherited) stats would be wrongly applied affect queries FROM ONLY a
partitioned table, which seems pointless to write and also hard for the
estimates to be far off :)

> Attached is a PoC that I quickly bashed together today. It's pretty raw, but
> it passed "make check" and I think it does most of the things right. Can you
> try if this fixes the estimates with partitioned tables?

I think pg_stats_ext_exprs also needs to expose the inherited flag.

Thanks,
-- 
Justin



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:

On 9/25/21 9:53 PM, Justin Pryzby wrote:
> On Sat, Sep 25, 2021 at 09:27:10PM +0200, Tomas Vondra wrote:
>> On 9/23/21 11:26 PM, Justin Pryzby wrote:
>>> extended stats objects are allowed on partitioned tables since v10.
>>> https://www.postgresql.org/message-id/flat/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA%40mail.gmail.com
>>> 8c5cdb7f4f6e1d6a6104cb58ce4f23453891651b
>>>
>>> But since 859b3003de they're not populated - pg_statistic_ext(_data) is empty.
>>> This was the consequence of a commit to avoid an error I reported with stats on
>>> inheritence parents (not partitioned tables).
>>>
>>> preceding 859b3003de, stats on the parent table *did* improve the estimate,
>>> so this part of the commit message seems to have been wrong?
>>> |commit 859b3003de87645b62ee07ef245d6c1f1cd0cedb
>>> |    Don't build extended statistics on inheritance trees
>>> ...
>>> |    Moreover, the current selectivity estimation code only works with individual
>>> |    relations, so building statistics on inheritance trees would be pointless
>>> |    anyway.
>>>
>>> |CREATE TABLE p (i int, a int, b int) PARTITION BY RANGE (i);
>>> |CREATE TABLE pd PARTITION OF p FOR VALUES FROM (1)TO(100);
>>> |TRUNCATE p; INSERT INTO p SELECT 1, a/100, a/100 FROM generate_series(1,999)a;
>>> |CREATE STATISTICS pp ON (a),(b) FROM p;
>>> |VACUUM ANALYZE p;
>>> |SELECT * FROM pg_statistic_ext WHERE stxrelid ='p'::regclass;
>>>
>>> |postgres=# begin; DROP STATISTICS pp; explain analyze SELECT a,b FROM p GROUP BY 1,2; abort;
>>> | HashAggregate  (cost=20.98..21.98 rows=100 width=8) (actual time=1.088..1.093 rows=10 loops=1)
>>>
>>> |postgres=# explain analyze SELECT a,b FROM p GROUP BY 1,2;
>>> | HashAggregate  (cost=20.98..21.09 rows=10 width=8) (actual time=1.082..1.086 rows=10 loops=1)
>>>
>>> So I think this is a regression, and extended stats should be populated for
>>> partitioned tables - I had actually done that for some parent tables and hadn't
>>> noticed that the stats objects no longer do anything.
> ...
>> Agreed, that seems like a regression, but I don't see how to fix that
>> without having the extra flag in the catalog. Otherwise we can store just
>> one version for each statistics object :-(
> 
> Do you think it's possible to backpatch a fix to handle partitioned tables
> specifically ?
> 
> The "tuple already updated" error which I reported and which was fixed by
> 859b3003 involved inheritence children.  Since partitioned tables have no data
> themselves, the !inh check could be relaxed.  It's not totally clear to me if
> the correct statistics would be used in that case.  I suppose the wrong
> (inherited) stats would be wrongly applied affect queries FROM ONLY a
> partitioned table, which seems pointless to write and also hard for the
> estimates to be far off :)
> 

Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure 
we never build stats with and without inheritance at the same time (for 
the same rel). The 859b3003de ensures that by only building extended 
stats in the (!inh) case, but we might tweak that based on relkind. See 
the attached patch. But I wonder if there are cases that might be hurt 
by this - that'd be a regression too, of course.

>> Attached is a PoC that I quickly bashed together today. It's pretty raw, but
>> it passed "make check" and I think it does most of the things right. Can you
>> try if this fixes the estimates with partitioned tables?
> 
> I think pg_stats_ext_exprs also needs to expose the inherited flag.
> 

Yeah, I only did the bare minimum to get the PoC working. I'm sure there 
are various other loose ends.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: extended stats on partitioned tables

От
Justin Pryzby
Дата:
On Sat, Sep 25, 2021 at 11:01:21PM +0200, Tomas Vondra wrote:
> > Do you think it's possible to backpatch a fix to handle partitioned tables
> > specifically ?
> > 
> > The "tuple already updated" error which I reported and which was fixed by
> > 859b3003 involved inheritence children.  Since partitioned tables have no data
> > themselves, the !inh check could be relaxed.  It's not totally clear to me if
> > the correct statistics would be used in that case.  I suppose the wrong
> > (inherited) stats would be wrongly applied affect queries FROM ONLY a
> > partitioned table, which seems pointless to write and also hard for the
> > estimates to be far off :)
> 
> Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure we
> never build stats with and without inheritance at the same time (for the
> same rel). The 859b3003de ensures that by only building extended stats in
> the (!inh) case, but we might tweak that based on relkind. See the attached
> patch. But I wonder if there are cases that might be hurt by this - that'd
> be a regression too, of course.

I think we should leave the inheritance case alone, since it hasn't changed in
2 years, and building stats on the table ONLY is a legitimate interpretation,
and it's as good as we can do without the catalog change.

But the partitioned case used to work, and there's no utility in selecting FROM
ONLY a partitioned table, so we might as well build the stats including its
partitions.

I don't think anything would get worse for the partitioned case.
Obviously building inherited ext stats could change plans - that's the point.
It's weird that the stats objects which existed for 18 months before being
"built" after the patch was applied, but no so weird that the release notes
wouldn't be ample documentation.

If building statistics caused the plan to change undesirably, the solution
would be to drop the stats object, of course.

+               build_ext_stats = (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ? inh : (!inh);
                                                                                                     
 

It's weird to build inherited extended stats for partitioned tables but not for
inheritence parents.  We could be clever and say "build inherited ext stats for
inheritence parents only if we didn't insert any stats for the table itself
(because it's empty)".  But I think that's fragile: a single tuple in the
parent table could cause stats to be built there instead of on its heirarchy,
and the extended stats would be used for *both* FROM and FROM ONLY, which is an
awful combination.

Since do_analyze_rel is only called once for partitioned tables, I think you
could write that as:

/* Do not build inherited stats (since the catalog cannot support it) except
 * for partitioned tables, for which numrows==0 and have no non-inherited stats */
build_ext_stats = !inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;

-- 
Justin



Re: extended stats on partitioned tables

От
Justin Pryzby
Дата:
(Resending with -hackers)

It seems like your patch should also check "inh" in examine_variable and
statext_expressions_load.

Which leads to another issue in stable branches:

ANALYZE builds only non-inherited stats, but they're incorrectly used for
inherited queries - the rowcount estimate is worse on inheritence parents with
extended stats than without.

 CREATE TABLE p(i int, j int);
 CREATE TABLE p1() INHERITS(p);
 INSERT INTO p SELECT a, a/10 FROM generate_series(1,9)a;
 INSERT INTO p1 SELECT a, a FROM generate_series(1,999)a;
 CREATE STATISTICS ps ON i,j FROM p;
 VACUUM ANALYZE p,p1;

postgres=# explain analyze SELECT * FROM p GROUP BY 1,2;
 HashAggregate  (cost=26.16..26.25 rows=9 width=8) (actual time=2.571..3.282 rows=1008 loops=1)

postgres=# begin; DROP STATISTICS ps; explain analyze SELECT * FROM p GROUP BY 1,2; rollback;
 HashAggregate  (cost=26.16..36.16 rows=1000 width=8) (actual time=2.167..2.872 rows=1008 loops=1)

I guess examine_variable() should have corresponding logic to the hardcoded
!inh in analyze.c.

-- 
Justin



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 9/25/21 11:46 PM, Justin Pryzby wrote:
> On Sat, Sep 25, 2021 at 11:01:21PM +0200, Tomas Vondra wrote:
>>> Do you think it's possible to backpatch a fix to handle partitioned tables
>>> specifically ?
>>>
>>> The "tuple already updated" error which I reported and which was fixed by
>>> 859b3003 involved inheritence children.  Since partitioned tables have no data
>>> themselves, the !inh check could be relaxed.  It's not totally clear to me if
>>> the correct statistics would be used in that case.  I suppose the wrong
>>> (inherited) stats would be wrongly applied affect queries FROM ONLY a
>>> partitioned table, which seems pointless to write and also hard for the
>>> estimates to be far off :)
>>
>> Hmmm, maybe. To prevent the "tuple concurrently updated" we must ensure we
>> never build stats with and without inheritance at the same time (for the
>> same rel). The 859b3003de ensures that by only building extended stats in
>> the (!inh) case, but we might tweak that based on relkind. See the attached
>> patch. But I wonder if there are cases that might be hurt by this - that'd
>> be a regression too, of course.
> 
> I think we should leave the inheritance case alone, since it hasn't changed in
> 2 years, and building stats on the table ONLY is a legitimate interpretation,
> and it's as good as we can do without the catalog change.
> 
> But the partitioned case used to work, and there's no utility in selecting FROM
> ONLY a partitioned table, so we might as well build the stats including its
> partitions.
> 
> I don't think anything would get worse for the partitioned case.
> Obviously building inherited ext stats could change plans - that's the point.
> It's weird that the stats objects which existed for 18 months before being
> "built" after the patch was applied, but no so weird that the release notes
> wouldn't be ample documentation.
> 

Agreed.

> If building statistics caused the plan to change undesirably, the solution
> would be to drop the stats object, of course.
> 
> +               build_ext_stats = (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ? inh : (!inh);
                                                                                                       
 
> 
> It's weird to build inherited extended stats for partitioned tables but not for
> inheritence parents.  We could be clever and say "build inherited ext stats for
> inheritence parents only if we didn't insert any stats for the table itself
> (because it's empty)".  But I think that's fragile: a single tuple in the
> parent table could cause stats to be built there instead of on its heirarchy,
> and the extended stats would be used for *both* FROM and FROM ONLY, which is an
> awful combination.
> 

I don't think there's a good way to check if there are any rows in the
parent relation. And even then, a single row might cause huge changes to
query plans (essentially switching to very different stats).

> Since do_analyze_rel is only called once for partitioned tables, I think you
> could write that as:
> 
> /* Do not build inherited stats (since the catalog cannot support it) except
>  * for partitioned tables, for which numrows==0 and have no non-inherited stats */
> build_ext_stats = !inh || onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
> 

Good point.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Justin Pryzby
Дата:
On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote:
> It seems like your patch should also check "inh" in examine_variable and
> statext_expressions_load.

I tried adding that - I mostly kept my patches separate.
Hopefully this is more helpful than a complication.
I added at: https://commitfest.postgresql.org/35/3332/

+       /* create only the "stxdinherit=false", because that always exists */
+       datavalues[Anum_pg_statistic_ext_data_stxdinherit - 1] = ObjectIdGetDatum(false);

That'd be confusing for partitioned tables, no?
They'd always have an row with no data.
I guess it could be stxdinherit = BoolGetDatum(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE).
(not ObjectIdGetDatum).
Then, that affects the loops which delete the tuples - neither inh nor !inh is
guaranteed, unless you check relkind there, too.

BTW, you'd need to add an "inherited" column to \dX if you added the "built"
data back.

Also, I think in backbranches we should document what's being stored in
pg_statistic_ext, since it's pretty unintuitive:
 - noninherted stats (FROM ONLY) for inheritence parents;
 - inherted stats (FROM *) for partitioned tables;

I think the !inh decision in 859b3003de was basically backwards.
I think it'd be rare for someone to put extended stats on a parent for
improving plans involving FROM ONLY.

But it's not worth trying to fix now, since it would change plans in
irreversible ways.  Also, if the stx data were already populated, users would
have to run a manual analyze after upgrading to populate the catalog with the
data the planner would expect in the new version, or else it would end up being
the opposite of the issue I mentioned: non-inherited stats (from before the
upgrade) would be applied by the planner (after the upgrade) to inherited
queries.

Вложения

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 10/8/21 12:45 AM, Justin Pryzby wrote:
> On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote:
>> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
>>> On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote:
>>>> It seems like your patch should also check "inh" in examine_variable and
>>>> statext_expressions_load.
>>>
>>> I tried adding that - I mostly kept my patches separate.
>>> Hopefully this is more helpful than a complication.
>>> I added at: https://commitfest.postgresql.org/35/3332/
>>>
>>
>> Actually, this is confusing. Which patch is the one we should be
>> reviewing?
> 
> It is confusing, but not as much as I first thought.  Please check the commit
> messages.
> 
> The first two patches are meant to be applied to master *and* backpatched.  The
> first one intends to fixes the bug that non-inherited stats are being used for
> queries of inheritance trees.  The 2nd one fixes the regression that stats are
> not collected for inheritence trees of partitioned tables (which is the only
> type of stats they could ever possibly have).
> 

I think 0001 and 0002 seem mostly fine, but it seems a bit strange to do
the (!rte->inh) check in the rel->statlist loops. AFAICS both places
could do that right at the beginning, because it does not depend on the
statistics object at all, just the RelOptInfo.

> And the 3rd+4th patches (Tomas' plus my changes) allow collecting both
> inherited and non-inherited stats, only in master, since it requires a catalog
> change.  It's a bit confusing that patch #4 removes most what I added in
> patches 1 and 2.  But that's exactly what's needed to collect and apply both
> inherited and non-inherited stats: the first two patches avoid applying stats
> collected with the wrong inheritence.  That's also what's needed for the
> patchset to follow the normal "apply to master and backpatch" process, rather
> than 2 patches which are backpatched but not applied to master, and one which
> is applied to master and not backpatched..
> 

Yeah. Af first I was a bit confused because after applying 0003 there
are both the fixes and the "correct" way, but then I realized 0004
removes the unnecessary bits.

The one thing 0003 still needs is to rework the places that need to
touch both inh and !inh stats. The patch simply does

    for (inh = 0; inh <= 1; inh++) { ... }

but that feels a bit too hackish. But if we don't know which of the two
stats exist, I'm not sure what to do about it. And I'm not sure we do
the right thing after removing children, for example (that should drop
the inheritance stats, I guess).

The 1:2 mapping between pg_statistic_ext and pg_statistic_ext_data is a
bit strange, but I can't think of a better way.


> @Tomas: I just found commit 427c6b5b9, which is a remarkably similar issue
> affecting column stats 15 years ago.
> 

What can I say? The history repeats itself ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:

On 11/4/21 12:19 AM, Justin Pryzby wrote:
> On Wed, Nov 03, 2021 at 11:48:44PM +0100, Tomas Vondra wrote:
>> On 10/8/21 12:45 AM, Justin Pryzby wrote:
>>> On Thu, Oct 07, 2021 at 03:26:46PM -0500, Jaime Casanova wrote:
>>>> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
>>>>> On Sat, Sep 25, 2021 at 05:31:52PM -0500, Justin Pryzby wrote:
>>>>>> It seems like your patch should also check "inh" in examine_variable and
>>>>>> statext_expressions_load.
>>>>>
>>>>> I tried adding that - I mostly kept my patches separate.
>>>>> Hopefully this is more helpful than a complication.
>>>>> I added at: https://commitfest.postgresql.org/35/3332/
>>>>>
>>>>
>>>> Actually, this is confusing. Which patch is the one we should be
>>>> reviewing?
>>>
>>> It is confusing, but not as much as I first thought.  Please check the commit
>>> messages.
>>>
>>> The first two patches are meant to be applied to master *and* backpatched.  The
>>> first one intends to fixes the bug that non-inherited stats are being used for
>>> queries of inheritance trees.  The 2nd one fixes the regression that stats are
>>> not collected for inheritence trees of partitioned tables (which is the only
>>> type of stats they could ever possibly have).
>>
>> I think 0001 and 0002 seem mostly fine, but it seems a bit strange to do
>> the (!rte->inh) check in the rel->statlist loops. AFAICS both places
>> could do that right at the beginning, because it does not depend on the
>> statistics object at all, just the RelOptInfo.
> 
> I probably did this to make the code change small, to avoid indentin the whole
> block.

But indenting the block is not necessary. It's possible to do something
like this:

    if (!rel->inh)
        return 1.0;

or whatever is the "default" result for that function.

> 
>>> And the 3rd+4th patches (Tomas' plus my changes) allow collecting both
>>> inherited and non-inherited stats, only in master, since it requires a catalog
>>> change.  It's a bit confusing that patch #4 removes most what I added in
>>> patches 1 and 2.  But that's exactly what's needed to collect and apply both
>>> inherited and non-inherited stats: the first two patches avoid applying stats
>>> collected with the wrong inheritence.  That's also what's needed for the
>>> patchset to follow the normal "apply to master and backpatch" process, rather
>>> than 2 patches which are backpatched but not applied to master, and one which
>>> is applied to master and not backpatched..
>>>
>>
>> Yeah. Af first I was a bit confused because after applying 0003 there
>> are both the fixes and the "correct" way, but then I realized 0004
>> removes the unnecessary bits.
> 
> This was to leave your 0003 (mostly) unchanged, so you can see and/or apply my
> changes.  They should be squished together.
> 

Yep.

>> The one thing 0003 still needs is to rework the places that need to
>> touch both inh and !inh stats. The patch simply does
>>
>>     for (inh = 0; inh <= 1; inh++) { ... }
>>
>> but that feels a bit too hackish. But if we don't know which of the two
>> stats exist, I'm not sure what to do about it. 
> 
> There's also this:
> 
> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
>> +       /* create only the "stxdinherit=false", because that always exists */
>> +       datavalues[Anum_pg_statistic_ext_data_stxdinherit - 1] = ObjectIdGetDatum(false);
>>
>> That'd be confusing for partitioned tables, no?
>> They'd always have an row with no data.
>> I guess it could be stxdinherit = BoolGetDatum(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE).
>> (not ObjectIdGetDatum).
>> Then, that affects the loops which delete the tuples - neither inh nor !inh is
>> guaranteed, unless you check relkind there, too.
> 
> Maybe the for inh<=1 loop should instead be two calls to new functions factored
> out of get_relation_statistics() and RemoveStatisticsById(), which take "bool
> inh".
> 

Well, yeah. That's part of the strange 1:2 mapping between the stats
definition and data. Although, even with regular stats we have such
mapping, except the "definition" is the pg_attribute row.

>> And I'm not sure we do the right thing after removing children, for example
>> (that should drop the inheritance stats, I guess).
> 
> Do you mean for inheritance only ?  Or partitions too ?
> I think for partitions, the stats should stay.
> And for inheritence, they can stay, for consistency with partitions, and since
> it does no harm.
> 

I think the behavior should be the same as for data in pg_statistic,
i.e. if we keep/remove those, we should do the same thing for extended
statistics.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
Hi,

Attached is a rebased and cleaned-up version of these patches, with more
comments, refactorings etc. Justin and Zhihong, can you take a look?


0001 - Ignore extended statistics for inheritance trees

0002 - Build inherited extended stats on partitioned tables

Those are mostly just Justin's patches, with more detailed comments and
updated commit message. I've considered moving the rel->inh check to
statext_clauselist_selectivity, and then removing the check from
dependencies and MCV. But I decided no to do that, because someone might
be calling those functions directly (even if that's very unlikely).

The one thing bugging me a bit is that the regression test checks only a
GROUP BY query. It'd be nice to add queries testing MCV/dependencies
too, but that seems tricky because most queries will use per-partitions
stats.


0003 - Add stxdinherit flag to pg_statistic_ext_data

This is the patch for master, allowing to build stats for both inherits
flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
how we deal with iterating both flags etc. I've adopted most of the
Justin's fixup patches, except that in plancat.c I've refactored how we
load the stats to process keys/expressions just once.

It has the same issue with regression test using just a GROUP BY query,
but if we add a test to 0001/0002, that'll fix this too.


0004 - Refactor parent ACL check

Not sure about this - I doubt saving 30 rows in an 8kB file is really
worth it. Maybe it is, but then maybe we should try cleaning up the
other ACL checks in this file too? Seems mostly orthogonal to this
thread, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: extended stats on partitioned tables

От
Zhihong Yu
Дата:


On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

Attached is a rebased and cleaned-up version of these patches, with more
comments, refactorings etc. Justin and Zhihong, can you take a look?


0001 - Ignore extended statistics for inheritance trees

0002 - Build inherited extended stats on partitioned tables

Those are mostly just Justin's patches, with more detailed comments and
updated commit message. I've considered moving the rel->inh check to
statext_clauselist_selectivity, and then removing the check from
dependencies and MCV. But I decided no to do that, because someone might
be calling those functions directly (even if that's very unlikely).

The one thing bugging me a bit is that the regression test checks only a
GROUP BY query. It'd be nice to add queries testing MCV/dependencies
too, but that seems tricky because most queries will use per-partitions
stats.


0003 - Add stxdinherit flag to pg_statistic_ext_data

This is the patch for master, allowing to build stats for both inherits
flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
how we deal with iterating both flags etc. I've adopted most of the
Justin's fixup patches, except that in plancat.c I've refactored how we
load the stats to process keys/expressions just once.

It has the same issue with regression test using just a GROUP BY query,
but if we add a test to 0001/0002, that'll fix this too.


0004 - Refactor parent ACL check

Not sure about this - I doubt saving 30 rows in an 8kB file is really
worth it. Maybe it is, but then maybe we should try cleaning up the
other ACL checks in this file too? Seems mostly orthogonal to this
thread, though.


Hi,
For patch 3, in commit message:

and there no clear winner. -> and there is no clear winner. 

and it seem wasteful -> and it seems wasteful

The there may be -> There may be

+       /* skip statistics with mismatching stxdinherit value */
+       if (stat->inherit != rte->inh)

Should a log be added for the above case ?

+                    * Determine if we'redealing with inheritance tree.

There should be a space between re and dealing.

Cheers

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:

On 12/12/21 05:38, Zhihong Yu wrote:
> 
> 
> On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     Hi,
> 
>     Attached is a rebased and cleaned-up version of these patches, with more
>     comments, refactorings etc. Justin and Zhihong, can you take a look?
> 
> 
>     0001 - Ignore extended statistics for inheritance trees
> 
>     0002 - Build inherited extended stats on partitioned tables
> 
>     Those are mostly just Justin's patches, with more detailed comments and
>     updated commit message. I've considered moving the rel->inh check to
>     statext_clauselist_selectivity, and then removing the check from
>     dependencies and MCV. But I decided no to do that, because someone might
>     be calling those functions directly (even if that's very unlikely).
> 
>     The one thing bugging me a bit is that the regression test checks only a
>     GROUP BY query. It'd be nice to add queries testing MCV/dependencies
>     too, but that seems tricky because most queries will use per-partitions
>     stats.
> 
> 
>     0003 - Add stxdinherit flag to pg_statistic_ext_data
> 
>     This is the patch for master, allowing to build stats for both inherits
>     flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
>     how we deal with iterating both flags etc. I've adopted most of the
>     Justin's fixup patches, except that in plancat.c I've refactored how we
>     load the stats to process keys/expressions just once.
> 
>     It has the same issue with regression test using just a GROUP BY query,
>     but if we add a test to 0001/0002, that'll fix this too.
> 
> 
>     0004 - Refactor parent ACL check
> 
>     Not sure about this - I doubt saving 30 rows in an 8kB file is really
>     worth it. Maybe it is, but then maybe we should try cleaning up the
>     other ACL checks in this file too? Seems mostly orthogonal to this
>     thread, though.
> 
> 
> Hi,
> For patch 3, in commit message:
> 
> and there no clear winner. -> and there is no clear winner. 
> 
> and it seem wasteful -> and it seems wasteful
> 
> The there may be -> There may be
> 

Thanks, will fix.

> +       /* skip statistics with mismatching stxdinherit value */
> +       if (stat->inherit != rte->inh)
> 
> Should a log be added for the above case ?
> 

Why should we log this? It's an entirely expected case - there's a
mismatch between inheritance for the relation and statistics, simply
skipping it is the right thing to do.

> +                    * Determine if we'redealing with inheritance tree.
> 
> There should be a space between re and dealing.
> 

Thanks, will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Zhihong Yu
Дата:


On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 12/12/21 05:38, Zhihong Yu wrote:
>
>
> On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
>
>     Hi,
>
>     Attached is a rebased and cleaned-up version of these patches, with more
>     comments, refactorings etc. Justin and Zhihong, can you take a look?
>
>
>     0001 - Ignore extended statistics for inheritance trees
>
>     0002 - Build inherited extended stats on partitioned tables
>
>     Those are mostly just Justin's patches, with more detailed comments and
>     updated commit message. I've considered moving the rel->inh check to
>     statext_clauselist_selectivity, and then removing the check from
>     dependencies and MCV. But I decided no to do that, because someone might
>     be calling those functions directly (even if that's very unlikely).
>
>     The one thing bugging me a bit is that the regression test checks only a
>     GROUP BY query. It'd be nice to add queries testing MCV/dependencies
>     too, but that seems tricky because most queries will use per-partitions
>     stats.
>
>
>     0003 - Add stxdinherit flag to pg_statistic_ext_data
>
>     This is the patch for master, allowing to build stats for both inherits
>     flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
>     how we deal with iterating both flags etc. I've adopted most of the
>     Justin's fixup patches, except that in plancat.c I've refactored how we
>     load the stats to process keys/expressions just once.
>
>     It has the same issue with regression test using just a GROUP BY query,
>     but if we add a test to 0001/0002, that'll fix this too.
>
>
>     0004 - Refactor parent ACL check
>
>     Not sure about this - I doubt saving 30 rows in an 8kB file is really
>     worth it. Maybe it is, but then maybe we should try cleaning up the
>     other ACL checks in this file too? Seems mostly orthogonal to this
>     thread, though.
>
>
> Hi,
> For patch 3, in commit message:
>
> and there no clear winner. -> and there is no clear winner. 
>
> and it seem wasteful -> and it seems wasteful
>
> The there may be -> There may be
>

Thanks, will fix.

> +       /* skip statistics with mismatching stxdinherit value */
> +       if (stat->inherit != rte->inh)
>
> Should a log be added for the above case ?
>

Why should we log this? It's an entirely expected case - there's a
mismatch between inheritance for the relation and statistics, simply
skipping it is the right thing to do.

Hi,
I agree that skipping should be fine (to avoid too much logging).

Thanks


> +                    * Determine if we'redealing with inheritance tree.
>
> There should be a space between re and dealing.
>

Thanks, will fix.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: extended stats on partitioned tables

От
Zhihong Yu
Дата:


On Sat, Dec 11, 2021 at 8:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

Attached is a rebased and cleaned-up version of these patches, with more
comments, refactorings etc. Justin and Zhihong, can you take a look?


0001 - Ignore extended statistics for inheritance trees

0002 - Build inherited extended stats on partitioned tables

Those are mostly just Justin's patches, with more detailed comments and
updated commit message. I've considered moving the rel->inh check to
statext_clauselist_selectivity, and then removing the check from
dependencies and MCV. But I decided no to do that, because someone might
be calling those functions directly (even if that's very unlikely).

The one thing bugging me a bit is that the regression test checks only a
GROUP BY query. It'd be nice to add queries testing MCV/dependencies
too, but that seems tricky because most queries will use per-partitions
stats.


0003 - Add stxdinherit flag to pg_statistic_ext_data

This is the patch for master, allowing to build stats for both inherits
flag values. It adds the flag to pg_stats_ext_exprs view to, reworked
how we deal with iterating both flags etc. I've adopted most of the
Justin's fixup patches, except that in plancat.c I've refactored how we
load the stats to process keys/expressions just once.

It has the same issue with regression test using just a GROUP BY query,
but if we add a test to 0001/0002, that'll fix this too.


0004 - Refactor parent ACL check

Not sure about this - I doubt saving 30 rows in an 8kB file is really
worth it. Maybe it is, but then maybe we should try cleaning up the
other ACL checks in this file too? Seems mostly orthogonal to this
thread, though.

Hi,
For patch 1, minor comment:

+           if (planner_rt_fetch(onerel->relid, root)->inh) 

Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking inh, I think it would be better if the above style of checking is used throughout the patch (without introducing rte variable).

Cheers

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 12/12/21 14:47, Zhihong Yu wrote:
> 
> 
> On Sat, Dec 11, 2021 at 9:14 PM Tomas Vondra 
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> 
> wrote:
 >
> ...
> 
>      > +       /* skip statistics with mismatching stxdinherit value */
>      > +       if (stat->inherit != rte->inh)
>      >
>      > Should a log be added for the above case ?
>      >
> 
>     Why should we log this? It's an entirely expected case - there's a
>     mismatch between inheritance for the relation and statistics, simply
>     skipping it is the right thing to do.
> 
> 
> Hi,
> I agree that skipping should be fine (to avoid too much logging).
> 

I'm not sure it's related to the amount of logging, really. It'd be just 
noise without any practical use, even for debugging purposes. If you 
have an inheritance tree, it'll automatically have one set of statistics 
for inh=true and one for inh=false. And this condition will always skip 
one of those, depending on what query is being estimated.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 12/12/21 16:37, Zhihong Yu wrote:
> 
> Hi,
> For patch 1, minor comment:
> 
> +           if (planner_rt_fetch(onerel->relid, root)->inh)
> 
> Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking 
> inh, I think it would be better if the above style of checking is used 
> throughout the patch (without introducing rte variable).
> 

It's mostly a matter of personal taste, but I always found this style of 
condition (i.e. dereferencing a pointer returned by a function) much 
less readable. It's hard to parse what exactly is happening, what struct 
type are we dealing with, etc. YMMV but the separate variable makes it 
much clearer for me. And I'd expect the compilers to produce pretty much 
the same code too for those cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tom Lane
Дата:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 12/12/21 16:37, Zhihong Yu wrote:
>> Since the rte (RangeTblEntry*) doesn't seem to be used beyond checking 
>> inh, I think it would be better if the above style of checking is used 
>> throughout the patch (without introducing rte variable).

> It's mostly a matter of personal taste, but I always found this style of 
> condition (i.e. dereferencing a pointer returned by a function) much 
> less readable. It's hard to parse what exactly is happening, what struct 
> type are we dealing with, etc. YMMV but the separate variable makes it 
> much clearer for me. And I'd expect the compilers to produce pretty much 
> the same code too for those cases.

FWIW, I agree.  Also, it's possible that future patches would create a
need to touch the RTE again nearby, in which case having the variable
makes it easier to write non-crummy code for that.

            regards, tom lane



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 12/12/21 18:52, Justin Pryzby wrote:
> +          * XXX Can't we simply look at rte->inh?
> +          */
> +         inh = root->append_rel_array == NULL ? false :
> +                 root->append_rel_array[onerel->relid]->parent_relid != 0;
> 
> I think so.  That's what I came up with while trying to figured this out, and
> it's no great surprise that it needed to be cleaned up - thanks.
> 

OK, fixed.

> In your 0003 patch, the "if inh: break" isn't removed from examine_variable(),
> but the corresponding thing is removed everywhere else.
> 

Ah, you're right. And it wasn't updated in the 0002 patch either - it
should do the relkind check too, to allow partitioned tables. Fixed.

> In 0003, mcv_clauselist_selectivity still uses simple_rte_array rather than
> rt_fetch.
> 

That's mostly a conscious choice, so that I don't have to include
parsetree.h. But maybe that'd be better ...

> The regression tests changed as a result of not populating stx_data; I think
> it's may be better to update like this:
> 
> SELECT stxname, stxdndistinct, stxddependencies, stxdmcv, stxoid IS NULL
>   FROM pg_statistic_ext s LEFT JOIN pg_statistic_ext_data d
>   ON d.stxoid = s.oid
>   WHERE s.stxname = 'ab1_a_b_stats';
> 

Not sure I understand. Why would this be better than inner join?

> There's this part about documentation for the changes in backbranches:
> 
> On Sun, Sep 26, 2021 at 03:25:50PM -0500, Justin Pryzby wrote:
>> Also, I think in backbranches we should document what's being stored in
>> pg_statistic_ext, since it's pretty unintuitive:
>>  - noninherted stats (FROM ONLY) for inheritence parents;
>>  - inherted stats (FROM *) for partitioned tables;
> 
> spellcheck: inheritence should be inheritance.
> 

Thanks, fixed. Can you read through the commit messages and check the
attribution is correct for all the patches?

> All for now.  I'm going to update the regression tests for dependencies and the
> other code paths.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:

On 12/12/21 22:32, Justin Pryzby wrote:
> On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote:
>> The one thing bugging me a bit is that the regression test checks only a
>> GROUP BY query. It'd be nice to add queries testing MCV/dependencies
>> too, but that seems tricky because most queries will use per-partitions
>> stats.
> 
> You mean because the quals are pushed down to the scan node.
> 
> Does that indicate a deficiency ?
> 
> If extended stats are collected for a parent table, selectivity estimates based
> from the parent would be better; but instead we use uncorrected column
> estimates from the child tables.
> 
>  From what I see, we could come up with a way to avoid the pushdown, involving
> volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc.
>  > But would it be better if extended stats objects on partitioned 
tables were to
> collect stats for both parent AND CHILD ?  I'm not sure.  Maybe that's the
> wrong solution, but maybe we should still document that extended stats on
> (empty) parent tables are often themselves not used/useful for selectivity
> estimates, and the user should instead (or in addition) create stats on child
> tables.
> 
> Or, maybe if there's no extended stats on the child tables, stats on the parent
> table should be consulted ?
> 

Maybe, but that seems like a mostly separate improvement. At this point 
I'm interested only in testing the behavior implemented in the current 
patches.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 12/13/21 14:48, Justin Pryzby wrote:
> On Sun, Dec 12, 2021 at 10:29:39PM +0100, Tomas Vondra wrote:
>>> In your 0003 patch, the "if inh: break" isn't removed from examine_variable(),
>>> but the corresponding thing is removed everywhere else.
>>
>> Ah, you're right. And it wasn't updated in the 0002 patch either - it
>> should do the relkind check too, to allow partitioned tables. Fixed.
> 
> I think you fixed it in 0002 (thanks) but still wasn't removed from 0003?
> 

D'oh! Those repeated rebase conflicts got me quite confused.

> In these comments:
> +        * When dealing with regular inheritance trees, ignore extended stats
> +        * (which were built without data from child rels, and thus do not
> +        * represent them). For partitioned tables data from partitions are
> +        * in the stats (and there's no data in the non-leaf relations), so
> +        * in this case we do consider extended stats.
> 
> I suggest to add a comment after "For partitioned tables".
> 

I've reworded the comment a bit, hopefully it's a bit clearer now.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 12/13/21 14:53, Justin Pryzby wrote:
> On Sun, Dec 12, 2021 at 11:23:19PM +0100, Tomas Vondra wrote:
>> On 12/12/21 22:32, Justin Pryzby wrote:
>>> On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote:
>>>> The one thing bugging me a bit is that the regression test checks only a
>>>> GROUP BY query. It'd be nice to add queries testing MCV/dependencies
>>>> too, but that seems tricky because most queries will use per-partitions
>>>> stats.
>>>
>>> You mean because the quals are pushed down to the scan node.
>>>
>>> Does that indicate a deficiency ?
>>>
>>> If extended stats are collected for a parent table, selectivity estimates based
>>> from the parent would be better; but instead we use uncorrected column
>>> estimates from the child tables.
>>>
>>> From what I see, we could come up with a way to avoid the pushdown, involving
>>> volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc.
>>> But would it be better if extended stats objects on partitioned tables were to
>>> collect stats for both parent AND CHILD ?  I'm not sure.  Maybe that's the
>>> wrong solution, but maybe we should still document that extended stats on
>>> (empty) parent tables are often themselves not used/useful for selectivity
>>> estimates, and the user should instead (or in addition) create stats on child
>>> tables.
>>>
>>> Or, maybe if there's no extended stats on the child tables, stats on the parent
>>> table should be consulted ?
>>
>> Maybe, but that seems like a mostly separate improvement. At this point I'm
>> interested only in testing the behavior implemented in the current patches.
> 
> I don't want to change the scope of the patch, or this thread, but my point is
> that the behaviour already changed once (the original regression) and now we're
> planning to change it again to fix that, so we ought to decide on the expected
> behavior before writing tests to verify it.
> 

OK. Makes sense.

> I think it may be impossible to use the "dependencies" statistic with inherited
> stats.  Normally the quals would be pushed down to the child tables.  But, if
> they weren't pushed down, they'd be attached to something other than a scan
> node on the parent table, so the stats on that table wouldn't apply (right?).
> 

Yeah, that's probably right. But I'm not 100% sure the whole inheritance
tree can't be treated as a single relation by some queries ...

> Maybe the useless stats types should have been prohibited on partitioned tables
> since v10.  It's too late to change that, but perhaps now they shouldn't even
> be collected during analyze.  The dependencies and MCV paths are never called
> with rte->inh==true, so maybe we should Assert(!inh), or add a comment to that
> effect.  Or the regression tests should "memorialize" the behavior.  I'm still
> thinking about it.
> 

Yeah, we can't really prohibit them in backbranches - that'd mean some
CREATE STATISTICS commands suddenly start failing, which would be quite
annoying. Not building them for partitioned tables seems like a better
option - BuildRelationExtStatistics can check relkind and pick what to
ignore. But I'd choose to do that in a separate patch, probably - after
all, it shouldn't really change the behavior of any tests/queries, no?

This reminds me we need to consider if these patches could cause any
issues. The way I see it is this:

1) If the table is a separate relation (not part of an inheritance
tree), this should make no difference. -> OK

2) If the table is using "old" inheritance, this reverts back to
pre-regression behavior. So people will keep using the old statistics
until the ANALYZE, and we need to tell them to ANALYZE or something.

3) If the table is using partitioning, it's guaranteed to be empty and
there are no stats at all. Again, we should tell people to run ANALYZE.

Of course, we can't be sure query plans will change in the right
direction :-(


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
On 1/15/22 06:11, Justin Pryzby wrote:
> On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote:
>> 1) If the table is a separate relation (not part of an inheritance
>> tree), this should make no difference. -> OK
>>
>> 2) If the table is using "old" inheritance, this reverts back to
>> pre-regression behavior. So people will keep using the old statistics
>> until the ANALYZE, and we need to tell them to ANALYZE or something.
>>
>> 3) If the table is using partitioning, it's guaranteed to be empty and
>> there are no stats at all. Again, we should tell people to run ANALYZE.
> 
> I think these can be mentioned in the commit message, which can end up in the
> minor release notes as a recommendation to rerun ANALYZE.
> 

Good point. I pushed the 0002 part and added a short paragraph 
suggesting ANALYZE might be necessary. I did not go into details about 
the individual cases, because that'd be too much for a commit message.

> Thanks for pushing 0001.
> 

Thanks for posting the patches!

I've pushed the second part - attached are the two remaining parts. I'll 
wait a bit before pushing the rebased 0001 (which goes into master 
branch only). Not sure about 0002 - I'm not convinced the refactored ACL 
checks are an improvement, but I'll think about it.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:

On 1/15/22 19:35, Tomas Vondra wrote:
> On 1/15/22 06:11, Justin Pryzby wrote:
>> On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote:
>>> 1) If the table is a separate relation (not part of an inheritance
>>> tree), this should make no difference. -> OK
>>>
>>> 2) If the table is using "old" inheritance, this reverts back to
>>> pre-regression behavior. So people will keep using the old statistics
>>> until the ANALYZE, and we need to tell them to ANALYZE or something.
>>>
>>> 3) If the table is using partitioning, it's guaranteed to be empty and
>>> there are no stats at all. Again, we should tell people to run ANALYZE.
>>
>> I think these can be mentioned in the commit message, which can end up 
>> in the
>> minor release notes as a recommendation to rerun ANALYZE.
>>
> 
> Good point. I pushed the 0002 part and added a short paragraph 
> suggesting ANALYZE might be necessary. I did not go into details about 
> the individual cases, because that'd be too much for a commit message.
> 
>> Thanks for pushing 0001.
>>
> 
> Thanks for posting the patches!
> 
> I've pushed the second part - attached are the two remaining parts. I'll 
> wait a bit before pushing the rebased 0001 (which goes into master 
> branch only). Not sure about 0002 - I'm not convinced the refactored ACL 
> checks are an improvement, but I'll think about it.
> 

BTW when backpatching the first part, I had to decide what to do about 
tests. The 10 & 11 branches did not have the check_estimated_rows() 
function. I considered removing the tests, reworking the tests not to 
need the function, or adding the function. Ultimately I added the 
function, which seemed like the best option.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: extended stats on partitioned tables

От
Tomas Vondra
Дата:
I've pushed the part adding stxdinherit flag to the catalog, so that on 
master we build statistics both with and without data from the child 
relations.

I'm not going to push the ACL refactoring. We have similar code on 
various other places (not addressed by the proposed patch), and it'd 
make backpatching harder. So I'm not sure it'd be an improvement.

In any case, I'm going to mark this as committed. Justin, if you think 
we should reconsider the ACL refactoring, please submit it as a separate 
patch. It seems mostly unrelated to the issue this thread was about.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company