Discussion: BUG #17997: Assert failed in validatePartitionedIndex() when attaching partition index to child of valid index

The following bug has been logged on the website:

Bug reference:      17997
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta1
Operating system:   Ubuntu 22.04
Description:

The following script:
create table t(a int) partition by range (a);
create index on t(a);
create table tp1(a int) partition by range (a);
create table tp1_1 partition of tp1 for values from (1) to (10);
create index on only tp1(a);
alter table t attach partition tp1 for values from (1) to (100);
create index on tp1_1(a);
\d+ t
\d+ tp1
alter index tp1_a_idx attach partition tp1_1_a_idx;

outputs:
                                      Partitioned table "public.t"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           |          |         | plain   |             |              | 
Partition key: RANGE (a)
Indexes:
    "t_a_idx" btree (a)
Partitions: tp1 FOR VALUES FROM (1) TO (100), PARTITIONED

                                     Partitioned table "public.tp1"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           |          |         | plain   |             |              | 
Partition of: t FOR VALUES FROM (1) TO (100)
Partition constraint: ((a IS NOT NULL) AND (a >= 1) AND (a < 100))
Partition key: RANGE (a)
Indexes:
    "tp1_a_idx" btree (a) INVALID
Partitions: tp1_1 FOR VALUES FROM (1) TO (10)

(Note that tp1_a_idx is invalid, but t_a_idx is valid.)

and triggers an assertion failure with the following stack trace:
Core was generated by `postgres: law regression [local] ALTER INDEX
                        '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/1943764' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140492487026624) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140492487026624) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140492487026624) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140492487026624, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x00007fc6f4027476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x00007fc6f400d7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055b6a19eb40f in ExceptionalCondition (
    conditionName=0x55b6a1ff8800 "!parentIdx->rd_index->indisvalid",
fileName=0x55b6a1fea7e0 "tablecmds.c", 
    lineNumber=19206) at assert.c:66
#6  0x000055b6a038ea9d in validatePartitionedIndex
(partedIdx=0x7fc6e68a3ec8, partedTbl=0x7fc6e68a2368)
    at tablecmds.c:19206
#7  0x000055b6a038d581 in ATExecAttachPartitionIdx (wqueue=0x7ffe57073d00,
parentIdx=0x7fc6e68a3ec8, 
    name=0x61d000019da0) at tablecmds.c:19087
#8  0x000055b6a03009bc in ATExecCmd (wqueue=0x7ffe57073d00,
tab=0x61900003d6c8, cmd=0x61900003ba08, 
    lockmode=4, cur_pass=10, context=0x7ffe57073fd0) at tablecmds.c:5287
#9  0x000055b6a02fc7a5 in ATRewriteCatalogs (wqueue=0x7ffe57073d00,
lockmode=4, context=0x7ffe57073fd0)
    at tablecmds.c:4969
#10 0x000055b6a02f90b2 in ATController (parsetree=0x625000005d60,
rel=0x7fc6e68a3ec8, cmds=0x625000005d90, 
    recurse=true, lockmode=4, context=0x7ffe57073fd0) at tablecmds.c:4546
#11 0x000055b6a02f8395 in AlterTable (stmt=0x625000005d60, lockmode=4,
context=0x7ffe57073fd0)
    at tablecmds.c:4193
#12 0x000055b6a11f5eb2 in ProcessUtilitySlow (pstate=0x61900003b798,
pstmt=0x625000005e60, 
    queryString=0x625000005218 "alter index tp1_a_idx attach partition
tp1_1_a_idx;", 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x625000006120, qc=0x7ffe570748b0)
    at utility.c:1328
#13 0x000055b6a11f448b in standard_ProcessUtility (pstmt=0x625000005e60, 
    queryString=0x625000005218 "alter index tp1_a_idx attach partition
tp1_1_a_idx;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x625000006120, qc=0x7ffe570748b0)
    at utility.c:1077
#14 0x000055b6a11f1b86 in ProcessUtility (pstmt=0x625000005e60, 
    queryString=0x625000005218 "alter index tp1_a_idx attach partition
tp1_1_a_idx;", readOnlyTree=false, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x625000006120, qc=0x7ffe570748b0)
    at utility.c:530
#15 0x000055b6a11ed0bd in PortalRunUtility (portal=0x625000025a18,
pstmt=0x625000005e60, isTopLevel=true, 
    setHoldSnapshot=false, dest=0x625000006120, qc=0x7ffe570748b0) at
pquery.c:1158
#16 0x000055b6a11ede34 in PortalRunMulti (portal=0x625000025a18,
isTopLevel=true, setHoldSnapshot=false, 
    dest=0x625000006120, altdest=0x625000006120, qc=0x7ffe570748b0) at
pquery.c:1315
#17 0x000055b6a11eaa90 in PortalRun (portal=0x625000025a18,
count=9223372036854775807, isTopLevel=true, 
    run_once=true, dest=0x625000006120, altdest=0x625000006120,
qc=0x7ffe570748b0) at pquery.c:791
#18 0x000055b6a11d492d in exec_simple_query (
    query_string=0x625000005218 "alter index tp1_a_idx attach partition
tp1_1_a_idx;") at postgres.c:1274
#19 0x000055b6a11e247a in PostgresMain (dbname=0x62900001b358 "regression",
username=0x6250000020f8 "law")
    at postgres.c:4632
#20 0x000055b6a0e0301f in BackendRun (port=0x614000000440) at
postmaster.c:4461
#21 0x000055b6a0e01723 in BackendStartup (port=0x614000000440) at
postmaster.c:4189
#22 0x000055b6a0df74dd in ServerLoop () at postmaster.c:1779
#23 0x000055b6a0df59fe in PostmasterMain (argc=3, argv=0x6030000006a0) at
postmaster.c:1463
#24 0x000055b6a078750e in main (argc=3, argv=0x6030000006a0) at main.c:198

Reproduced on REL_11 .. master.


PG Bug reporting form <noreply@postgresql.org> writes:
> The following script:
> create table t(a int) partition by range (a);
> create index on t(a);
> create table tp1(a int) partition by range (a);
> create table tp1_1 partition of tp1 for values from (1) to (10);
> create index on only tp1(a);
> alter table t attach partition tp1 for values from (1) to (100);
> create index on tp1_1(a);
> \d+ t
> \d+ tp1
> alter index tp1_a_idx attach partition tp1_1_a_idx;
> ... triggers an assertion failure with the following stack trace:

> #5  0x000055b6a19eb40f in ExceptionalCondition (
>     conditionName=0x55b6a1ff8800 "!parentIdx->rd_index->indisvalid",
> fileName=0x55b6a1fea7e0 "tablecmds.c", 
>     lineNumber=19206) at assert.c:66
> #6  0x000055b6a038ea9d in validatePartitionedIndex
> (partedIdx=0x7fc6e68a3ec8, partedTbl=0x7fc6e68a2368)
>     at tablecmds.c:19206

Thanks for the report!  This is not new in 16, it reproduces at
least as far back as v11.  I suppose we can't simply remove the
assertion, but have to figure out what should happen and write
code for that.

            regards, tom lane



On Sun, Jun 25, 2023 at 10:40:58AM -0400, Tom Lane wrote:
> Thanks for the report!  This is not new in 16, it reproduces at
> least as far back as v11.  I suppose we can't simply remove the
> assertion, but have to figure out what should happen and write
> code for that.

The cause of the crash is the index created by the second command on
the partitioned table "t": that index is created with indisvalid set
to true because "t" has no partitions yet.

The crash then happens in the last command, when we attach the index
of tp1_1: the index validation walks through the partition tree up to
the parent in two steps:
- First it looks at the index of tp1, which has indisvalid = false.
- Then it goes up to the index of t, which has indisvalid = true,
failing on the assertion.
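
For reference, the inconsistent state can be observed just before the
failing ALTER INDEX with a catalog query like this one (object names
taken from the script in the report; pg_partition_tree() exists since
v12):

select relid, parentrelid, isleaf, level, indisvalid
  from pg_partition_tree('t_a_idx'), pg_index
  where indexrelid = relid;
-- expected to show t_a_idx with indisvalid = true and tp1_a_idx with
-- indisvalid = false, i.e. a valid parent above an invalid child.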

71a05b22 made the creation of invalid indexes for partitioned tables
much softer, but I still think that it is correct to mark an index
as valid even if there are no partitions yet.

It seems to me that the mistake in this sequence happens when we
attach tp1 to t, in the 6th command of the reported script.  I mean
this one:
alter table t attach partition tp1 for values from (1) to (100);

At this stage, tp1_1 is already attached to tp1, making the partition
tree become t -> tp1 -> tp1_1.  The index on tp1 is marked as
invalid, because tp1_1 does not have an index.

It seems to me that there are two potential solutions here:
1) We could recurse and make sure that the index of t *also* becomes
*invalid*, because t_a_idx and tp1_a_idx form a partition chain while
the leaf tp1_1 does *not* have an index yet, making the chain unusable.
2) Then I found out about AttachPartitionEnsureIndexes().  This is
called when attaching a partition to make sure that indexes are created
on the new partition to match the indexes of the partitioned table, but
it does not recurse when the partition being attached is itself
partitioned; otherwise an index would have been created on tp1_1 when
running the 6th command, completing the index tree all the way down.

In the end, I'd like to think that 2) is the right path forward.  For
example, this creates a partitioned index automatically when attaching
a partition:
create table t(a int) partition by range (a);
create index on t(a);
create table tp1(a int) partition by range (a);
alter table t attach partition tp1 for values from (1) to (100);
=# \di
                    List of relations
 Schema |   Name    |       Type        | Owner  | Table
--------+-----------+-------------------+--------+-------
 public | t_a_idx   | partitioned index | popopo | t
 public | tp1_a_idx | partitioned index | popopo | tp1
(2 rows)
--
Michael

On Mon, Jun 26, 2023 at 03:49:05PM +0900, Michael Paquier wrote:
> It seems to me that there are two potential solutions here:
> 1) We could recurse and make sure that the index of t *also* becomes
> *invalid*, because t_a_idx and tp1_a_idx form a partition chain, still
> the leaf tp1_1 does *not* have an index yet, making it unusable.
> 2) Then I found about AttachPartitionEnsureIndexes().  This is called
> when attaching a partition to make sure that indexes are created to
> match the indexes between the partitioned table and the new partition
> attached to it, but it does not recurse when attaching a partitioned
> table, or an index would have been created on tp1_1 when running the
> 6th command to complete the index tree all the way down.
>
> At the end, I'd like to think that 2) is the right path forward.  For
> example, this creates a partitioned index automatically when attaching
> a partition:
> create table t(a int) partition by range (a);
> create index on t(a);
> create table tp1(a int) partition by range (a);
> alter table t attach partition tp1 for values from (1) to (100);
> =# \di
>                     List of relations
>  Schema |   Name    |       Type        | Owner  | Table
> --------+-----------+-------------------+--------+-------
>  public | t_a_idx   | partitioned index | popopo | t
>  public | tp1_a_idx | partitioned index | popopo | tp1
> (2 rows)

A third solution that came into my mind just now would be to revisit
the choice done in AttachPartitionEnsureIndexes() where an invalid
index can be chosen as a match when creating the indexes on the
partitions, so as we are able to get to the bottom of a chain with
valid indexes for the whole tree.  I have been testing the attached
and it has been working here the way I'd expect when manipulating
partition trees with ATTACH PARTITION, though this breaks the scenario
of this bug report because we would now get a failure when attempting
to attach an index in the last command.

Thoughts or comments?
--
Michael

26.06.2023 11:05, Michael Paquier wrote:
> A third solution that came into my mind just now would be to revisit
> the choice done in AttachPartitionEnsureIndexes() where an invalid
> index can be chosen as a match when creating the indexes on the
> partitions, so as we are able to get to the bottom of a chain with
> valid indexes for the whole tree.  I have been testing the attached
> and it has been working here the way I'd expect when manipulating
> partition trees with ATTACH PARTITION, though this breaks the scenario
> of this bug report because we would now get a failure when attempting
> to attach an index in the last command.

Thanks for the fix!

This solution seems sensible to me. The only downside I see is that an
invalid index would be left orphaned after ATTACH PARTITION, but I couldn't
find in doc/ or src/test/regress/ any promises that such an index must be
used. I also don't see a way to make a previously valid index inside the
partition index tree invalid and available for attaching a child index to
it at the same time.

Beside that, maybe it would be better to place the test for the fix in
indexing.sql, where many similar operations with partitioned indexes are
performed.

Best regards,
Alexander



27.06.2023 14:00, Alexander Lakhin wrote:
> 26.06.2023 11:05, Michael Paquier wrote:
>> A third solution that came into my mind just now would be to revisit
>> the choice done in AttachPartitionEnsureIndexes() where an invalid
>> index can be chosen as a match when creating the indexes on the
>> partitions, so as we are able to get to the bottom of a chain with
>> valid indexes for the whole tree.  I have been testing the attached
>> and it has been working here the way I'd expect when manipulating
>> partition trees with ATTACH PARTITION, though this breaks the scenario
>> of this bug report because we would now get a failure when attempting
>> to attach an index in the last command.
>
> Thanks for the fix!
>
> This solution seems sensible to me. The only downside I see is that an
> invalid index would be left orphaned after ATTACH PARTITION, but I couldn't
> find in doc/ or src/test/regress/ any promises that such an index must be
> used. I also don't see a way to make a previously valid index inside the
> partition index tree invalid and available for attaching a child index to
> it at the same time.

There is also another scenario where the new behavior could be considered as more sensible:
create table t(a int, b int) partition by range (a);
create index on t((a / b));
create table tp1(a int, b int);
insert into tp1 values (1, 0);
create index concurrently on tp1((a/b)); --  division by zero occurs, but the index is created (as invalid)
alter table t attach partition tp1 for values from (1) to (10);

Without the fix you get partition tp1 attached and the following partition indexes:
Partitioned index "public.t_expr_idx"
btree, for table "public.t"
Partitions: tp1_expr_idx

Index "public.tp1_expr_idx"
btree, for table "public.tp1", invalid

But with the patch applied ATTACH PARTITION fails with ERROR:  division by zero.

Though we still can get a partition index chain with invalid indexes as follows:
create table t(a int, b int) partition by range (a);
create table tp1(a int, b int) partition by range (a);
alter table t attach partition tp1 for values from (1) to (100);
create table tp1_1(a int, b int);
insert into tp1_1 values (1, 0);
create index concurrently on tp1_1((a/b)); --  division by zero occurs, but the index is created (as invalid)
alter table tp1 attach partition tp1_1 for values from (1) to (10);
create index on t((a / b));

here we get the following index chain:
Partitioned index "public.t_expr_idx"
btree, for table "public.t"
Partitions: tp1_expr_idx, PARTITIONED

Partitioned index "public.tp1_expr_idx"
btree, for table "public.tp1", invalid
Partitions: tp1_1_expr_idx

Index "public.tp1_1_expr_idx"
btree, for table "public.tp1_1", invalid

It's also interesting that REINDEX for the index tree validates only a leaf index:
reindex index t_expr_idx;
ERROR:  division by zero

update tp1_1 set b=1;
reindex index t_expr_idx; -- or even reindex index tp1_expr_idx;

Partitioned index "public.t_expr_idx"
btree, for table "public.t"
Partitions: tp1_expr_idx, PARTITIONED

Partitioned index "public.tp1_expr_idx"
btree, for table "public.tp1", invalid
Partitions: tp1_1_expr_idx

Index "public.tp1_1_expr_idx"
btree, for table "public.tp1_1"

Although it looks like the invalid mark for a non-leaf index doesn't prevent using an index below it:
set enable_seqscan = off;
explain select * from t where a / b = 1;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Index Scan using tp1_1_expr_idx on tp1_1 t  (cost=0.12..8.14 rows=1 width=8)
   Index Cond: ((a / b) = 1)

So it's not clear (to me, at least) what exactly indisvalid means for indexes in a partition tree.

Best regards,
Alexander
On Wed, Jun 28, 2023 at 09:00:00AM +0300, Alexander Lakhin wrote:
> There is also another scenario where the new behavior could be
> considered as more sensible:
> create table t(a int, b int) partition by range (a);
> create index on t((a / b));
> create table tp1(a int, b int);
> insert into tp1 values (1, 0);
> create index concurrently on tp1((a/b)); --  division by zero occurs, but the index is created (as invalid)
> alter table t attach partition tp1 for values from (1) to (10);
>
> Without the fix you get partition tp1 attached and the following partition indexes:
> Partitioned index "public.t_expr_idx"
> btree, for table "public.t"
> Partitions: tp1_expr_idx
>
> Index "public.tp1_expr_idx"
> btree, for table "public.tp1", invalid
>
> But with the patch applied ATTACH PARTITION fails with ERROR: division by zero.

This rather makes sense to me.  You cannot really attach the index
in this case.  And the partition trees are clean.

> Though we still can get a partition index chain with invalid indexes
> as follows:
> create table t(a int, b int) partition by range (a);
> create table tp1(a int, b int) partition by range (a);
> alter table t attach partition tp1 for values from (1) to (100);
> create table tp1_1(a int, b int);
> insert into tp1_1 values (1, 0);
> create index concurrently on tp1_1((a/b)); -- division by zero occurs, but the index is created (as invalid)
> alter table tp1 attach partition tp1_1 for values from (1) to (10);
> create index on t((a / b));

Exotic case, leading to this tree:
=# select indisvalid, relid, parentrelid, isleaf, level
   from pg_partition_tree('t_expr_idx'), pg_index where indexrelid = relid;
 indisvalid |     relid      | parentrelid  | isleaf | level
------------+----------------+--------------+--------+-------
 t          | t_expr_idx     | null         | f      |     0
 f          | tp1_expr_idx   | t_expr_idx   | f      |     1
 f          | tp1_1_expr_idx | tp1_expr_idx | t      |     2
(3 rows)

This looks like a second bug to me in DefineIndex() where we shouldn't
even try to link to the invalid index tp1_1_expr_idx, and we should
try to create a new index instead.  Repeating the last CREATE INDEX
command leads to a failure, which is what I'd expect, because we get
the computation failure.

> It's also interesting that REINDEX for the index tree validates only
> a leaf index:
> reindex index t_expr_idx;
> ERROR:  division by zero

REINDEX on a partitioned index only works on the leaf partitions; it
does nothing for the partitioned indexes themselves.

> So it's not clear (to me, at least), what exactly indisvalid means
> for indexes in a partition tree.

It seems to me that validatePartitionedIndex() is what defines the
use of the flag for partitioned indexes.  indisvalid = false for a
partitioned index means that at least one of its partitions *may* not
have a valid index mapping to it, and that it should be switched to
true once we are sure that all its partitions have a valid index.
indisvalid = true on a partitioned index has a much stronger meaning:
all the partitions *have* to have valid indexes.  Your first
case would follow that, but not the second.
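
As a self-contained illustration of that rule (table and index names
invented for this sketch, they are not taken from the thread):

create table p (a int) partition by range (a);
create table p1 partition of p for values from (1) to (10);
create index on only p (a);        -- p_a_idx is created invalid: p1 has no attached index yet
create index p1_a_idx on p1 (a);   -- plain, valid index on the partition
alter index p_a_idx attach partition p1_a_idx;
-- Once every partition of p has a valid index attached, the parent
-- index p_a_idx is switched to indisvalid = true.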

For now, I have applied a patch down to v11 to fix the first issue
with partition attaching.  Digging into the second one...
--
Michael

On Wed, Jun 28, 2023 at 04:02:15PM +0900, Michael Paquier wrote:
> Exotic case, leading to this tree:
> =# select indisvalid, relid, parentrelid, isleaf, level
>    from pg_partition_tree('t_expr_idx'), pg_index where indexrelid = relid;
>  indisvalid |     relid      | parentrelid  | isleaf | level
> ------------+----------------+--------------+--------+-------
>  t          | t_expr_idx     | null         | f      |     0
>  f          | tp1_expr_idx   | t_expr_idx   | f      |     1
>  f          | tp1_1_expr_idx | tp1_expr_idx | t      |     2
> (3 rows)
>
> This looks like a second bug to me in DefineIndex() where we shouldn't
> even try to link to the invalid index tp1_1_expr_idx, and we should
> try to create a new index instead.  Repeating the last CREATE INDEX
> command leads to a failure, which is what I'd expect, because we get
> the computation failure.

I have looked into this one, and decided that we should just rely on
the existing business in DefineIndex() where we switch indisvalid to
false on a parent if the index found for the match is itself invalid.
The code had two bugs:
- First, we did not re-check whether a freshly created index was itself
invalid, which could happen when recursing through more than one
level for the creation of a partitioned index.
- We need a CCI (CommandCounterIncrement) after switching the parent's
indisvalid to false, so that the follow-up recursions are able to see
it and do the update of indisvalid across the parents up to the
top-most parent.

Attached is a patch to fix all that, with regression tests that play
with multiple partition levels to show the recursion and the problems
we'd have without the patch.

Alexander, what do you think?
--
Michael

Hi Michael,

28.06.2023 10:02, Michael Paquier wrote:
>> This looks like a second bug to me in DefineIndex() where we shouldn't
>> even try to link to the invalid index tp1_1_expr_idx, and we should
>> try to create a new index instead.  Repeating the last CREATE INDEX
>> command leads to a failure, which is what I'd expect, because we get
>> the computation failure.
>
>
> I have looked into this one, and decided that we should just rely on
> the existing business in DefineIndex() where we switch indisvalid to
> false on a parent if the index found for the match is itself invalid.
> The code had two bugs:
> - First, we did not re-check whether a freshly created index was itself
> invalid, which could happen when recursing through more than one
> level for the creation of a partitioned index.
> - We need a CCI (CommandCounterIncrement) after switching the parent's
> indisvalid to false, so that the follow-up recursions are able to see
> it and do the update of indisvalid across the parents up to the
> top-most parent.
>
> Attached is a patch to fix all that, with regression tests that play
> with multiple partition levels to show the recursion and the problems
> we'd have without the patch.
>
> Alexander, what do you think?

I'm somewhat confused by two things.
First, if we extend your test case by adding DETACH parted_isvalid_tab_11, we get:
            indexrelid           | indisvalid |       indrelid        |           inhparent
--------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    |
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 |
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
(5 rows)

That is, the partition tree now contains no invalid leaf indexes, but the
upper-level indexes in the tree are still invalid.
Moreover, I don't know how to make them valid:
reindex index parted_isvalid_idx; / reindex index parted_isvalid_tab_1_expr_idx;
doesn't affect their indisvalid flags.
(Though REINDEX for a top-level index can make leaf indexes valid.)
Reattaching parted_isvalid_tab_11 after "update parted_isvalid_tab_11 set b=1"
doesn't help either.

> It seems to me that validatePartitionedIndex() is what defines the
> use of the flag for partitioned indexes.  indisvalid = false for a
> partitioned index means that at least one of its partitions *may* not
> have a valid index mapping to it, and that it should be switched to
> true once we are sure that all its partitions have a valid index.
> indisvalid = true on a partitioned index has a much stronger meaning:
> all the partitions *have* to have valid indexes.  Your first
> case would follow that, but not the second.

So the above test case makes me wonder: is it realistic in principle to
keep this definition of indisvalid up-to-date for a partition tree?
And even if so, what visible behavior does this definition lead (or
should it lead) to?
As I noted previously, a leaf index is still used when its parent has
indisvalid = false. (That's the second thing that confuses me.)
For example, with the partition index tree:
            indexrelid           | indisvalid |       indrelid        |           inhparent
--------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    |
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 |
  parted_isvalid_tab_11_expr_idx | t          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
(6 rows)

explain select * from parted_isvalid_tab where a / b = 1;
  Append  (cost=0.12..23.14 rows=12 width=8)
    ->  Index Scan using parted_isvalid_tab_11_expr_idx on parted_isvalid_tab_11 parted_isvalid_tab_1  (cost=0.12..8.14 rows=1 width=8)
          Index Cond: ((a / b) = 1)
    ->  Bitmap Heap Scan on parted_isvalid_tab_12 parted_isvalid_tab_2  (cost=4.24..14.94 rows=11 width=8)
          Recheck Cond: ((a / b) = 1)
          ->  Bitmap Index Scan on parted_isvalid_tab_12_expr_idx (cost=0.00..4.24 rows=11 width=0)
                Index Cond: ((a / b) = 1)

So I can't resist asking a question: what would break if the indisvalid flags
were always true (or always false) for the upper-level partition indexes?
I've made validatePartitionedIndex() a no-op just to check that, and the
only differences I observe when running `make check` are the indisvalid flags
printed; there are no behavioral changes.
Though `make check-world` fails on src/bin/pg_upgrade (only); I need more
time to investigate this, but maybe a more sophisticated query to select the
appropriate indexes for processing could help there.

Best regards,
Alexander



On Thu, Jun 29, 2023 at 12:00:00PM +0300, Alexander Lakhin wrote:
> That is, the partition tree is containing no invalid indexes now, but the
> upper-level indexes in the tree are still invalid.
> Moreover, I don't know how to make them valid:
> reindex index parted_isvalid_idx; / reindex index parted_isvalid_tab_1_expr_idx;
> doesn't affect their flags indisvalid.
> (Though REINDEX for a top-level index can make leaf indexes valid.)
> Reattaching parted_isvalid_tab_11 after "update parted_isvalid_tab_11 set b=1"
> doesn't help either.

DETACH PARTITION is not possible for indexes, but you should be able
to get back to a cleaner tree by detaching a partition, dropping its
index, and re-attaching the partition, so that
validatePartitionedIndex() is able to trigger a refresh of
indisvalid depending on the state of the tree.  Yeah, that's annoying,
and things have been this way for a few years now.  It does not seem
like we've looked at how all that should behave in combination with
concurrent index builds, for example, which can leave a tree in a
transient state.

> So the above test case makes me wonder, is it realistic in principle to
> maintain this definition of indisvalid up-to-date for a partition
> tree?
>
> So I can't resist asking a question: what would break if indisvalid flags
> were always true (or always false) for the upper-level partition indexes?
> I've made validatePartitionedIndex() a no-op just to check that and the
> only differences I observe doing `make check` are indisvalid flags printed;
> there are no behavioral changes.
> Though `make check-world` fails on src/bin/pg_upgrade (only); I need more
> time to investigate this — maybe more sophisticated query to select
> appropriate indexes for processing could help there.

It offers a hint to the planner about which indexes should be used,
but I would spend a bit more time on plancat.c to see the implications
of that for partitioned indexes.  validatePartitionedIndex() has a
clean implied definition, at least.

Anyway, DefineIndex() and its handling of indisvalid is clearly wrong
when it comes to multiple partition layers as it forgets that we may
have to update the parents recursively or we could once again trigger
the assertion from validatePartitionedIndex(), which would not be a
good idea..  So my previous patch fixes that, at least.
--
Michael

29.06.2023 13:39, Michael Paquier wrote:
> On Thu, Jun 29, 2023 at 12:00:00PM +0300, Alexander Lakhin wrote:
>> That is, the partition tree is containing no invalid indexes now, but the
>> upper-level indexes in the tree are still invalid.
>> Moreover, I don't know how to make them valid:
>> reindex index parted_isvalid_idx; / reindex index parted_isvalid_tab_1_expr_idx;
>> doesn't affect their flags indisvalid.
>> (Though REINDEX for a top-level index can make leaf indexes valid.)
>> Reattaching parted_isvalid_tab_11 after "update parted_isvalid_tab_11 set b=1"
>> doesn't help either.
> DETACH PARTITION is not possible for indexes, but you should be able
> to get back to a cleaner tree by detaching a partition, dropping an
> index, and re-attaching back the partitions, so as
> validatePartitionedIndex() is able to trigger a refresh of
> indisvalid depending on the state of the tree.

Yeah, I meant DETACH PARTITION for a table. And when I tried the sequence
you have described, I couldn't get a completely valid tree:
alter table parted_isvalid_tab_1 detach partition parted_isvalid_tab_11;
drop index parted_isvalid_idx_11;
update parted_isvalid_tab_11 set b = 1;
alter table parted_isvalid_tab_1 attach partition parted_isvalid_tab_11 for values from (1) to (5);

            indexrelid           | indisvalid | indrelid        |           inhparent
--------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    |
  parted_isvalid_tab_11_expr_idx | t          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
(5 rows)

The index parted_isvalid_tab_11_expr_idx is created anew and is valid,
but the higher-level indexes are still invalid.

> Anyway, DefineIndex() and its handling of indisvalid is clearly wrong
> when it comes to multiple partition layers as it forgets that we may
> have to update the parents recursively or we could once again trigger
> the assertion from validatePartitionedIndex(), which would not be a
> good idea..  So my previous patch fixes that, at least.

Yeah, in regard to that, your patch looks good to me.
Thank you!

Best regards,
Alexander



On Fri, Jun 30, 2023 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Yeah, I meant DETACH PARTITION for a table. And when I tried the sequence
> you have described, I couldn't get a completely valid tree:
> alter table parted_isvalid_tab_1 detach partition parted_isvalid_tab_11;
> drop index parted_isvalid_idx_11;
> update parted_isvalid_tab_11 set b = 1;
> alter table parted_isvalid_tab_1 attach partition parted_isvalid_tab_11 for values from (1) to (5);
>
>            indexrelid           | indisvalid | indrelid        |           inhparent
> --------------------------------+------------+-----------------------+-------------------------------
>  parted_isvalid_idx             | f          | parted_isvalid_tab    |
>  parted_isvalid_tab_11_expr_idx | t          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
>  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
>  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
>  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
> (5 rows)
>
> The index  parted_isvalid_tab_11_expr_idx is created anew and it's valid,
> but the higher-level indexes are still invalid.

Indeed.  Hmm.  There is room for improvement to make some of the
partitioned indexes valid when doing such manipulations, by being more
opportunistic when switching the flag on.  At least, based on what
validatePartitionedIndex() defines, this is not wrongly shaped as far
as I can tell; indisvalid is a bit lossy when set to false.  Perhaps
this could justify having an ALTER INDEX .. VALID or something like
that in the grammar?  Some folks I know of wanted an INVALID flavor to
influence the planner in its index choices, similarly to a hint.

One thing that can be done is to drop the top-parent and recreate it,
though it is not the best option performance-wise, of course.
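
Roughly, reusing the names of the earlier example and assuming its
(a / b) index expression:

drop index parted_isvalid_idx;
-- dropping the top-most partitioned index also drops the index
-- partitions attached to it
create index parted_isvalid_idx on parted_isvalid_tab ((a / b));
-- recreates the whole tree, rebuilding every leaf index from scratch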

>> Anyway, DefineIndex() and its handling of indisvalid is clearly wrong
>> when it comes to multiple partition layers as it forgets that we may
>> have to update the parents recursively or we could once again trigger
>> the assertion from validatePartitionedIndex(), which would not be a
>> good idea..  So my previous patch fixes that, at least.
>
> Yeah, in regard to that, your patch looks good to me.
> Thank you!

Cool, I have fixed this issue, then.  The buildfarm looks OK with it.
--
Michael

Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Jun 30, 2023 at 07:00:00AM +0300, Alexander Lakhin wrote:
>> Yeah, in regard to that, your patch looks good to me.
>> Thank you!

> Cool, I have fixed this issue, then.  The buildfarm looks OK with it.

While going through the commit log to prepare release notes, I was
struck by the fact that this commit (fc55c7ff8 et al) takes the
approach of ignoring invalid child indexes during ATTACH PARTITION:

    I have studied a few options here (like the possibility to switch
    indisvalid to false for the parent), but came down to the conclusion
    that we'd better rely on a simple rule: invalid indexes had better never
    be chosen, so as the partition attached uses and creates indexes that
    the parent expects.

However, your later fix for handling invalid child indexes during
CREATE PARTITIONED INDEX (cfc43aeb3 et al) does the opposite:

    This patch makes sure that indisvalid is set to false on a partitioned
    index if at least one of its partition is invalid.  The flag is set to
    true if *all* its partitions are valid.

This seems inconsistent: why do we behave one way during CREATE and
another during ATTACH?  Do we even need fc55c7ff8's logic change
anymore?  That is, is it possible that the fixes added by cfc43aeb3
would be enough to resolve the problem (though with a different
outcome that the partitioned index remains invalid after ATTACH
PARTITION)?

Perhaps the week before minor releases is not the time to be fooling
around with this, but it seems like an odd set of choices.

            regards, tom lane



On Wed, Aug 02, 2023 at 02:01:40PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Cool, I have fixed this issue, then.  The buildfarm looks OK with it.

(I was out for the last couple of days, apologies for the reply
delay.)

> While going through the commit log to prepare release notes, I was
> struck by the fact that this commit (fc55c7ff8 et al) takes the
> approach of ignoring invalid child indexes during ATTACH PARTITION:
>
>     I have studied a few options here (like the possibility to switch
>     indisvalid to false for the parent), but came down to the conclusion
>     that we'd better rely on a simple rule: invalid indexes had better never
>     be chosen, so as the partition attached uses and creates indexes that
>     the parent expects.
>
> However, your later fix for handling invalid child indexes during
> CREATE PARTITIONED INDEX (cfc43aeb3 et al) does the opposite:
>
>     This patch makes sure that indisvalid is set to false on a partitioned
>     index if at least one of its partition is invalid.  The flag is set to
>     true if *all* its partitions are valid.
>
> This seems inconsistent: why do we behave one way during CREATE and
> another during ATTACH?  Do we even need fc55c7ff8's logic change
> anymore?  That is, is it possible that the fixes added by cfc43aeb3
> would be enough to resolve the problem (though with a different
> outcome that the partitioned index remains invalid after ATTACH
> PARTITION)?

Removing fc55c7f is not enough, as shown by the test added in that
commit.  There is a lot of logic in tablecmds.c to flip indisvalid
from false to true in a partitioned index after attaching a new
partition table once all its index partitions are valid, but this
would require us to support the opposite switch as well.

In short, there would be cases where a partitioned index is switched
from true to false, while we work hard at making sure that only the
opposite is possible.  Say (a SQL sketch of this follows the list):
- A partitioned table is created to be a top-most parent "p", with an
index "i".  indisvalid is true for "i" since its creation, as it has no
partitions.
- Now assume that there's a second partitioned table "p1", not attached
anywhere, that has an existing set of partitions.  It will be attached
to the previous partitioned table "p".  "p1" has an index "i1" that has
been created with ONLY, and some of the partitions may or may not have
a valid index matching it.  Hence, "i1" has indisvalid=false.
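
A minimal SQL sketch of that situation (same names as above; "p1_leaf"
is an invented leaf partition):

create table p (a int) partition by range (a);
create index i on p (a);               -- "i" is valid, "p" has no partitions yet
create table p1 (a int) partition by range (a);
create table p1_leaf partition of p1 for values from (1) to (10);
create index i1 on only p1 (a);        -- "i1" is invalid, p1_leaf has no matching index
alter table p attach partition p1 for values from (1) to (100);
-- At this point "i1" is invalid; the question below is what should
-- happen to "i".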

When attaching "p1" to "p", should we force "i" to become invalid as
an effect of "i1" being invalid, meaning that for some reason a
portion of the partitioned index tree is not complete yet (either
index not created, or concurrent index creation failure)?  Note that
we don't move to the creation of indexes in the partitions if finding
an invalid index.  Or it is better to ignore "i1" and go through the
index creation, finding out that suddenly a part of the tree is
invalid?

CREATE INDEX .. ON ONLY gives a lot of ways to manipulate index
partitions (one cannot detach indexes with ALTER INDEX), and I don't
really see it being used in many useful ways, TBH, but we support the
grammar.  Anyway, I think that the point is usability here: by
ensuring that the creation of valid indexes is enforced when
attaching a partition, we have fewer bad surprises with portions of a
partitioned index tree suddenly becoming invalid when they were
previously valid.

Perhaps Alvaro could weigh in here, as the original author of 8b08f7d
and CREATE INDEX ON ONLY, so I am attaching him in CC.
--
Michael
