Discussion: how to create index concurrently on partitioned table
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> Partitioning is necessary for very large tables.
> However, I found that postgresql does not support create index concurrently on partitioned tables.
> The document shows that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
> This is undoubtedly a complex operation for a DBA, especially when there are many partitions.
>
> Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
> What are the difficulties of this function?
> If I want to implement it, what should I pay attention to?

Maybe I'm wrong, but I don't think there's any known difficulty - just that
nobody did it yet.

You should pay attention to what happens on error, but hopefully you wouldn't
need to add much code and can rely on existing code paths to handle that
right.

I think you'd look at the commits and code implementing indexes on partitioned
tables and CREATE INDEX CONCURRENTLY.  And maybe any following commits with
fixes.

You'd first loop around all children (recursively if there are partitions which
are themselves partitioned) and create indexes concurrently.

--
Justin
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > Partitioning is necessary for very large tables.
> > However, I found that postgresql does not support create index concurrently on partitioned tables.
> > The document shows that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
> > This is undoubtedly a complex operation for a DBA, especially when there are many partitions.
> >
> > Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
> > What are the difficulties of this function?
> > If I want to implement it, what should I pay attention to?
>
> Maybe I'm wrong, but I don't think there's any known difficulty - just that
> nobody did it yet.

I said that but I was actually thinking about the code for "REINDEX
CONCURRENTLY" (which should also handle partitioned tables).

I looked at CIC now and came up with the attached.  All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand().  There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.

--
Justin
Attachments
On Sun, Jun 07, 2020 at 01:04:48PM -0500, Justin Pryzby wrote:
> On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > > Partitioning is necessary for very large tables.  However, I found that
> > > postgresql does not support create index concurrently on partitioned
> > > tables.  The document shows that we need to create an index on each
> > > partition individually and then finally create the partitioned index
> > > non-concurrently.  This is undoubtedly a complex operation for a DBA,
> > > especially when there are many partitions.

I added functionality for C-I-C, REINDEX-CONCURRENTLY, and CLUSTER of
partitioned tables.  We already recursively handle VACUUM and ANALYZE
since v10.

Added here: https://commitfest.postgresql.org/28/2584/

Adger, if you're familiar with compilation and patching, do you want to try the
patch?

Note, you could do this now using psql like:
SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
\gexec

--
Justin
Attachments
On Thu, Jun 11, 2020 at 10:35:02AM -0500, Justin Pryzby wrote:
> Note, you could do this now using psql like:
> SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
> \gexec

I have skimmed quickly through the patch set, and something has caught
my attention.

>  drop table idxpart;
> --- Some unsupported features
> +-- CIC on partitioned table
>  create table idxpart (a int, b int, c text) partition by range (a);
>  create table idxpart1 partition of idxpart for values from (0) to (10);
>  create index concurrently on idxpart (a);
> -ERROR:  cannot create index on partitioned table "idxpart" concurrently
> +\d idxpart1

When it comes to testing behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
  partitions.
- Test the commands on the top-most parent, a member in the middle of
  the partition tree, the partition with no leaves, and one leaf, making
  sure that relfilenode changes where it should and that partition trees
  remain intact (you can use pg_partition_tree() for that.)

That's to say that the amount of regression tests added here is not
sufficient in my opinion.

--
Michael
Attachments
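[Editor's note: as a concrete sketch of Michael's three points above, a test schema could be built like this - the table names here are illustrative, not taken from the patch's regression tests:]

create table idxpart (a int) partition by range (a);
-- second layer of partitioning
create table idxpart1 partition of idxpart for values from (0) to (100) partition by range (a);
create table idxpart11 partition of idxpart1 for values from (0) to (50);  -- one leaf
-- a partitioned partition with no leaf partitions
create table idxpart2 partition of idxpart for values from (100) to (200) partition by range (a);
-- after running CREATE INDEX CONCURRENTLY against idxpart, idxpart1,
-- idxpart2, and idxpart11, verify the partition tree is intact:
select relid, parentrelid, isleaf, level from pg_partition_tree('idxpart');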
> > > Partitioning is necessary for very large tables.
> > > However, I found that postgresql does not support create index concurrently on partitioned tables.
> > > The document shows that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
> > > This is undoubtedly a complex operation for DBA, especially when there are many partitions.
> >
> > > Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
> > > What are the difficulties of this function?
> > > If I want to implement it, what should I pay attention to?
> >
> > Maybe I'm wrong, but I don't think there's any known difficulty - just that
> > nobody did it yet.
> I said that but I was actually thinking about the code for "REINDEX
> CONCURRENTLY" (which should also handle partitioned tables).
> I looked at CIC now and came up with the attached. All that's needed to allow
> this case is to close the relation before recursing to partitions - it needs to
> be closed before calling CommitTransactionCommand(). There's probably a better
> way to write this, but I can't see that there's anything complicated about
> handling partitioned tables.
#0 PopActiveSnapshot () at snapmgr.c:822
#1 0x00000000005ca687 in DefineIndex (relationId=relationId@entry=16400,
stmt=stmt@entry=0x1aa5e28, indexRelationId=16408, indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=16406,
parentConstraintId=0, is_alter_table=is_alter_table@entry=false,
check_rights=true, check_not_in_use=true, skip_build=false, quiet=false)
at indexcmds.c:1426
#2 0x00000000005ca5ab in DefineIndex (relationId=relationId@entry=16384,
stmt=stmt@entry=0x1b35278, indexRelationId=16406, indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0,
parentConstraintId=parentConstraintId@entry=0, is_alter_table=
is_alter_table@entry=false, check_rights=true, check_not_in_use=true,
skip_build=false, quiet=false) at indexcmds.c:1329
#3 0x000000000076bf80 in ProcessUtilitySlow (pstate=pstate@entry=0x1b350c8,
pstmt=pstmt@entry=0x1a2bf40,
queryString=queryString@entry=0x1a2b2c8 "create index CONCURRENTLY
idxpart_a_idx on idxpart (a);", context=context@entry=PROCESS_UTILITY_TOPLEVEL,
params=params@entry=0x0,
queryEnv=queryEnv@entry=0x0, qc=0x7ffc86cc7630, dest=0x1a2c200) at
utility.c:1474
#4 0x000000000076afeb in standard_ProcessUtility (pstmt=0x1a2bf40,
queryString=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x1a2c200, qc=0x7ffc86cc7630) at utility.c:1069
#5 0x0000000000768992 in PortalRunUtility (portal=0x1a8d1f8, pstmt=0x1a2bf40,
isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>,
qc=0x7ffc86cc7630) at pquery.c:1157
#6 0x00000000007693f3 in PortalRunMulti (portal=portal@entry=0x1a8d1f8,
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x1a2c200,
altdest=altdest@entry=0x1a2c200, qc=qc@entry=0x7ffc86cc7630) at pquery.c:1310
#7 0x0000000000769ed3 in PortalRun (portal=portal@entry=0x1a8d1f8, count=count
@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once
@entry=true, dest=dest@entry=0x1a2c200,
altdest=altdest@entry=0x1a2c200, qc=0x7ffc86cc7630) at pquery.c:779
#8 0x0000000000765b06 in exec_simple_query (query_string=0x1a2b2c8 "create
index CONCURRENTLY idxpart_a_idx on idxpart (a);") at postgres.c:1239
#9 0x0000000000767de5 in PostgresMain (argc=<optimized out>, argv=argv@entry
=0x1a552c8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4315
#10 0x00000000006f2b23 in BackendRun (port=0x1a4d1e0, port=0x1a4d1e0) at postmaster.c:4523
#11 BackendStartup (port=0x1a4d1e0) at postmaster.c:4215
#12 ServerLoop () at postmaster.c:1727
#13 0x00000000006f3a1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1a25ea0)
at postmaster.c:1400
#14 0x00000000004857f9 in main (argc=3, argv=0x1a25ea0) at main.c:210
You can reproduce it like this:
create table idxpart (a int, b int, c text) partition by range (a);
create table idxpart1 partition of idxpart for values from (0) to (10);
create table idxpart2 partition of idxpart for values from (10) to (20);
create index CONCURRENTLY idxpart_a_idx on idxpart (a);
Thank you very much,
Regards, Adger
> hopefully you wouldn't need to add much code and can rely on existing code
> paths to handle that right.
> I think you'd look at the commits and code implementing indexes on partitioned
> tables and CREATE INDEX CONCURRENTLY. And maybe any following commits with
> fixes.
> You'd first loop around all children (recursively if there are partitions which
> are themselves partitioned) and create indexes concurrently.
Thank you very much,
Regards, Adger
On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
> As we all know, CIC has three transactions.  If we recurse into n partitioned tables,
> it will become 3n transactions.  If an error occurs in these transactions, we have too many things to deal with...
>
> If an error occurs when an index is created in one of the partitions,
> what should we do with our new index?

My (tentative) understanding is that these types of things should use a
"subtransaction" internally.  So if the toplevel transaction rolls back, its
changes are lost.  In some cases, it might be desirable to not roll back, in
which case the user (client) should first create indexes (concurrently if
needed) on every child, and then later create the index on the parent (that
has the advantage of working on older servers, too).

postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG:  building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG:  index "t1_i_idx" can safely use deduplication
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR:  canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT:  CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL:  terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT:  CREATE INDEX ON t(i);
Cancel request sent

If the index creation is interrupted at this point, no indexes will exist.

On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
> > On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > I looked at CIC now and came up with the attached.  All that's needed to allow
> > this case is to close the relation before recursing to partitions - it needs to
> > be closed before calling CommitTransactionCommand().  There's probably a better
> > way to write this, but I can't see that there's anything complicated about
> > handling partitioned tables.
>
> I'm so sorry about getting back late.
> Thank you very much for helping me consider this issue.
> I compiled the patch v1 you provided.  And I applied patch v2-001 again to postgresql.
> I got a coredump that was easy to reproduce.  As follows:
> I have been trying to get familiar with the source code of create index.
> Can you fix this bug first?  I will try my best to implement CIC with you.
> Next, I will read your patches v2-002 and v2-003.

Thanks, fixed.

On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
> When it comes to testing behaviors specific to partitioning, there are in
> my experience three things to be careful about and stress in the tests:
> - Use at least two layers of partitioning.
> - Include into the partition tree a partition that has no leaf
>   partitions.
> - Test the commands on the top-most parent, a member in the middle of
>   the partition tree, the partition with no leaves, and one leaf, making
>   sure that relfilenode changes where it should and that partition trees
>   remain intact (you can use pg_partition_tree() for that.)

Added, thanks for looking.

--
Justin
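[Editor's note: the "create indexes on every child first" approach above can already be done by hand on released versions using CREATE INDEX ... ON ONLY plus ALTER INDEX ... ATTACH PARTITION; a sketch, assuming the idxpart table with leaf partitions idxpart1 and idxpart2 from earlier in the thread:]

create index concurrently idxpart1_a_idx on idxpart1 (a);
create index concurrently idxpart2_a_idx on idxpart2 (a);
-- create the parent index without recursing to children; it starts invalid
create index idxpart_a_idx on only idxpart (a);
-- once an index is attached for every partition, the parent index becomes valid
alter index idxpart_a_idx attach partition idxpart1_a_idx;
alter index idxpart_a_idx attach partition idxpart2_a_idx;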
Attachments
> "subtransaction" internally.. So if the toplevel transaction rolls back, its
> changes are lost. In some cases, it might be desirable to not roll back, in
> which case the user(client) should first create indexes (concurrently if
> needed) on every child, and then later create index on parent (that has the
> advantage of working on older servers, too).
postgres=# CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE
postgres=# CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (25);
CREATE TABLE
postgres=# CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (25) TO (50);
CREATE TABLE
postgres=# CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (50) TO (75);
CREATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (26,1,'FM0026');
INSERT 0 1
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p2_a_idx"
DETAIL: Key (a)=(26) is duplicated.
postgres=# \d+ prt1
Partitioned table "public.prt1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition key: RANGE (a)
Indexes:
"idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
prt1_p2 FOR VALUES FROM (25) TO (50),
prt1_p3 FOR VALUES FROM (50) TO (75)
postgres=# \d+ prt1_p1
Table "public.prt1_p1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
"prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p2
Table "public.prt1_p2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
"prt1_p2_a_idx" UNIQUE, btree (a) INVALID
Access method: heap
postgres=# \d+ prt1_p3
Table "public.prt1_p3"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
postgres=# truncate table prt1;
TRUNCATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (51,1,'FM0051');
INSERT 0 1
postgres=# drop index idexpart_cic;
DROP INDEX
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p3_a_idx"
DETAIL: Key (a)=(51) is duplicated.
postgres=# \d+ prt1
Partitioned table "public.prt1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition key: RANGE (a)
Indexes:
"idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
prt1_p2 FOR VALUES FROM (25) TO (50),
prt1_p3 FOR VALUES FROM (50) TO (75)
postgres=# \d+ prt1_p1
Table "public.prt1_p1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
"prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p2
Table "public.prt1_p2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
"prt1_p2_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p3
Table "public.prt1_p3"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
Now we can see that the first two partitions have indexes,
but the third partition has no index due to an error.
Therefore, in our first case, the third partition being left without an index is not what we expected.
That is to say, when our CIC goes wrong, it should either roll everything back or keep going, instead of stopping in the middle.
This is my shallow opinion; please take it as a reference.
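[Editor's note: the leftover invalid indexes from such a failed attempt can be listed with a plain catalog query, independent of the patch:]

select indexrelid::regclass as index, indrelid::regclass as table, indisready
from pg_index
where not indisvalid;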
Thank you very much,
Regards, Adger
On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating an index in the second partition.
> It can be clearly seen that the index of the partitioned table is invalid
> and the index of the first partition is normal, the second partition is invalid,
> and the third partition index does not exist at all.

That's a problem.  I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction.  Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.

--
Michael
Attachments
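[Editor's note: in other words, the proposal is to spread the usual CIC phases across the whole partition tree rather than running all three phases per partition; a rough sketch of the intended sequencing, not actual patch code:]

-- transaction 1: create catalog entries for the parent index and every
--                partition index, all marked invalid and not ready
-- transaction(s) 2: index_concurrently_build() for each leaf index
-- transaction(s) 3: validate_index() for each leaf, then flip the whole
--                   index partition tree to valid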
> the indexes should be created as invalid for all the parts of the
> partition tree, including partitioned tables as well as their
> partitions, in the same transaction. Then a second new transaction
> gets used for the index build, followed by a third one for the
> validation that switches the indexes to become valid.
On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partition table.
> However, this is not a small change in the code.
> Maybe we need to make a new design for the DefineIndex function...

Indeed.  I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:

+            /*
+             * CIC needs to mark a partitioned table as VALID, which itself
+             * requires setting READY, which is unset for CIC (even though
+             * it's meaningless for an index without storage).
+             */
+            if (concurrent)
+            {
+                PopActiveSnapshot();
+                CommitTransactionCommand();
+                StartTransactionCommand();
+                index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+                CommandCounterIncrement();
+                index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
             }

I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees.  That's the point you
raised.  This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC,
> we will spend more time than on ordinary tables, especially in high
> concurrency cases.  While waiting on one partition that users are
> using frequently, the creation of the other partition indexes is delayed.
> Is it worth doing this?

CIC is an operation that exists while allowing reads and writes to
still happen in parallel, so that's fine IMO if it takes time.  Now it
is true that it may not scale well so we should be careful with the
approach taken.  Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.

--
Michael
Attachments
> risks of having inconsistent partition trees. That's the point you
> raised. This one is going to need more thoughts.
> still happen in parallel, so that's fine IMO if it takes time. Now it
> is true that it may not scale well so we should be careful with the
> approach taken. Also, I think that the case of REINDEX would require
> less work overall because we already have some code in place to gather
> multiple indexes from one or more relations and work on these
> consistently, all at once.
Thank you very much,
Regards, Adger
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem.  If there are many partitions,
> we may need to handle too many missing index entries when
> validate_index().  Especially for the first partition, the time may
> have been long and many entries are missing.  In this case, why
> don't we put the second and third phases together into a transaction
> for each partition?

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.

--
Michael
Attachments
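[Editor's note: index partition trees can be inspected the same way as table partition trees, which is handy for checking that the swap phase left the tree consistent; the index name here is illustrative:]

select relid, parentrelid, isleaf
from pg_partition_tree('idxpart_a_idx');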
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree. See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create(). It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.
StartTransaction three
StartTransaction four
StartTransaction five
StartTransaction six
Thank you very much,
Regards, Adger
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree. See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create(). It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.
StartTransaction four
StartTransaction six
StartTransaction xx
StartTransaction xx
Thank you very much,
Regards, Adger
From: 李杰(慎追) <adger.lj@alibaba-inc.com>
Sent: Thursday, June 18, 2020
To: Michael Paquier <michael@paquier.xyz>
Cc: Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
Subject: Re: how to create index concurrently on partitioned table

> Not sure I am following.  In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree.  See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create().  It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Let's look at this example. A partitioned table has five partitions:

parttable: part1, part2, part3, part4, part5

We simply abstract the following definitions:
phase 1: index_create(), the index is only registered in the catalogs.
phase 2: index_concurrently_build(), build the indexes.
phase 3: validate_index(), insert any missing index entries and mark the index as valid.

(schema 1)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction

StartTransaction two
parttable phase 2
part 1 phase 2
part 2 phase 2
part 3 phase 2 (error occurred)
part 4 phase 2
part 5 phase 2
CommitTransaction

StartTransaction three
parttable phase 3
part 1 phase 3
part 2 phase 3
part 3 phase 3
part 4 phase 3
part 5 phase 3
CommitTransaction
...
```

Now, the following steps cannot continue due to an error in transaction two. So transaction two rolls back, and transaction three never starts. All of our indexes are invalid. In this way, we ensure the strong consistency of indexes in the partition tree. However, we need to rebuild all indexes when we reindex.

(schema 2)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction

StartTransaction two
part 1 phase 2
part 1 phase 3
CommitTransaction

StartTransaction three
part 2 phase 2
part 2 phase 3
CommitTransaction

StartTransaction four
part 3 phase 2 (error occurred)
part 3 phase 3
CommitTransaction

StartTransaction five
part 4 phase 2
part 4 phase 3
CommitTransaction

StartTransaction six
part 5 phase 2
part 5 phase 3
CommitTransaction

StartTransaction seven
parttable phase 2
parttable phase 3
CommitTransaction
```

Now, the following steps cannot continue due to an error in transaction four. So transaction four rolls back, and the transactions after it never start. The indexes of the part1 and part2 partitions are available; the other indexes are invalid. In REINDEX, we can skip the rebuild of part1 and part2. This seems better, although it seems to be inconsistent. Do you think this scheme is more suitable for CIC?

Thank you very much,
Regards, Adger
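The trade-off between the two schedules above can be checked with a small simulation (toy Python with an invented model, not PostgreSQL code): a failing transaction rolls back and stops the schedule, and only partitions whose phase 3 already committed keep a valid index.

```python
# Toy comparison of the two schedules discussed above (invented model):
# schema 1 runs each phase for all partitions in one transaction, while
# schema 2 runs build+validate per partition in separate transactions.

def run(schedule, failing_step):
    valid = set()
    for txn in schedule:
        if failing_step in txn:
            break           # the transaction rolls back; later ones never start
        for part, phase in txn:
            if phase == 3:  # validate_index marks the index valid
                valid.add(part)
    return valid

parts = ["part1", "part2", "part3", "part4", "part5"]

schema1 = [[(p, 1) for p in parts],
           [(p, 2) for p in parts],
           [(p, 3) for p in parts]]

schema2 = [[(p, 1) for p in parts]] + [[(p, 2), (p, 3)] for p in parts]

fail = ("part3", 2)  # the build fails on the third partition
assert run(schema1, fail) == set()               # no index ends up valid
assert run(schema2, fail) == {"part1", "part2"}  # earlier partitions survive
```

This matches the observation in the message: with per-partition transactions, a later REINDEX can skip the partitions that already completed.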
On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier wrote: > On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote: > > As shown above, an error occurred while creating an index in the second partition. > > It can be clearly seen that the index of the partitioned table is invalid > > and the index of the first partition is normal, the second partition is invalid, > > and the Third Partition index does not exist at all. > > That's a problem. I really think that we should make the steps of the > concurrent operation consistent across all relations, meaning that all > the indexes should be created as invalid for all the parts of the > partition tree, including partitioned tables as well as their > partitions, in the same transaction. Then a second new transaction > gets used for the index build, followed by a third one for the > validation that switches the indexes to become valid. Note that the mentioned problem wasn't serious: there was a missing index on a child table, therefore the parent index was invalid, as intended. However, I agree that it's not nice that the command can fail so easily and leave behind some indexes created successfully, some failed, and some not created at all. But I took your advice and initially create the indexes as invalid. On Tue, Jun 16, 2020 at 10:02:21AM +0900, Michael Paquier wrote: > CIC is an operation that exists while allowing read and writes to > still happen in parallel, so that's fine IMO if it takes time. Now it > is true that it may not scale well so we should be careful with the > approach taken. Also, I think that the case of REINDEX would require > less work overall because we already have some code in place to gather > multiple indexes from one or more relations and work on these > consistently, all at once. I'm not sure if by REINDEX you mean my old 0002. That's now 0001, so if it can be simplified somehow, that's great. That gave me the idea to layer CIC on top of REINDEX, since I think it does exactly what's needed. -- Justin
Attachments
On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote: > That gave me the idea to layer CIC on top of Reindex, since I think it does > exactly what's needed. For now, I would recommend to focus first on 0001 to add support for partitioned tables and indexes to REINDEX. CIC is much more complicated btw, but I am not entering in the details now.

-	/*
-	 * This may be useful when implemented someday; but that day is not today.
-	 * For now, avoid erroring out when called in a multi-table context
-	 * (REINDEX SCHEMA) and happen to come across a partitioned table.  The
-	 * partitions may be reindexed on their own anyway.
-	 */
+	/* Avoid erroring out */
	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
	{

This comment does not help, and actually it becomes incorrect, as reindex for this relkind becomes supported once 0001 is done.

+			case RELKIND_INDEX:
+				reindex_index(inhrelid, false, get_rel_persistence(inhrelid),
+							  options | REINDEXOPT_REPORT_PROGRESS);
+				break;
+			case RELKIND_RELATION:
+				(void) reindex_relation(inhrelid,
+										REINDEX_REL_PROCESS_TOAST |
+										REINDEX_REL_CHECK_CONSTRAINTS,
+										options | REINDEXOPT_REPORT_PROGRESS);

ReindexPartitionedRel() fails to consider the concurrent case here for partition indexes and tables, as reindex_index()/reindex_relation() are the APIs used in the non-concurrent case. Once you consider the concurrent case correctly, we also need to be careful with partitions that have a temporary persistency (note that we don't allow partition trees to mix persistency types: all partitions have to be temporary or permanent). I think that you are right to make the entry point to handle a partitioned index in ReindexIndex() and a partitioned table in ReindexTable(), but the structure of the patch should be different:
- The second portion of ReindexMultipleTables() should be moved into a separate routine, taking in input a list of relation OIDs. This needs to be extended a bit so that reindex_index() gets called for an index relkind if the relpersistence is temporary or if we have a non-concurrent reindex. The idea is that we finish with a single code path able to work on a list of relations. And your patch adds more of that as of ReindexPartitionedRel().
- We should *not* handle partitioned indexes and/or tables directly in ReindexRelationConcurrently(), so as not to complicate the logic where we gather all the indexes of a table/matview. So I think that the list of partition indexes/tables to work on should be built directly in ReindexIndex() and ReindexTable(), and then this should call the second part of ReindexMultipleTables() refactored in the previous point. This way, each partition index gets done individually in its own transaction. For a partition table, all indexes of this partition are rebuilt in the same set of transactions. For the concurrent case, we already have reindex_concurrently_swap, which is able to switch the dependencies of two indexes within a partition tree, so we can rely on that so that a failure in the middle of the operation never leaves a partition structure in an inconsistent state. -- Michael
Attachments
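The structure Michael suggests — expand the partition tree in the entry points, then hand a flat OID list to one shared routine that works one relation at a time — might be sketched like this (toy Python with invented names, not the actual PostgreSQL functions):

```python
# Sketch of the proposed control flow (invented names, not PG code):
# the entry points expand a partitioned relation into its members, keep
# only the ones with storage, and a single shared routine then processes
# the resulting OID list, one relation per transaction.

def expand_partitioned(relid, children):
    """Recursively collect all members of a partition tree."""
    result = [relid]
    for child in children.get(relid, []):
        result.extend(expand_partitioned(child, children))
    return result

def reindex_multiple(oids, concurrent, processed):
    for oid in oids:
        # each relation is handled in its own transaction(s), so a failure
        # leaves earlier members rebuilt and later ones untouched
        processed.append((oid, "concurrent" if concurrent else "plain"))

# toy partition tree: parent_tab -> part1, part2; part2 -> part2a
children = {"parent_tab": ["part1", "part2"], "part2": ["part2a"]}
processed = []
oids = [o for o in expand_partitioned("parent_tab", children)
        if o != "parent_tab" and o not in children]  # keep leaves with storage
reindex_multiple(oids, concurrent=True, processed=processed)
```

Here only the leaf partitions (`part1`, `part2a`) end up in the work list; the partitioned relations themselves have no storage to rebuild.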
Thanks for looking. On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote: > > exactly what's needed. > > For now, I would recommend to focus first on 0001 to add support for > partitioned tables and indexes to REINDEX. CIC is much more > complicated btw, but I am not entering in the details now. > > + /* Avoid erroring out */ > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > { > This comment does not help, and actually this becomes incorrect as > reindex for this relkind becomes supported once 0001 is done. I made a minimal change to avoid forgetting to eventually change that part. > ReindexPartitionedRel() fails to consider the concurrent case here for > partition indexes and tables, as reindex_index()/reindex_relation() > are the APIs used in the non-concurrent case. Once you consider the > concurrent case correctly, we also need to be careful with partitions > that have a temporary persistency (note that we don't allow partition > trees to mix persistency types, all partitions have to be temporary or > permanent). Fixed. > I think that you are right to make the entry point to handle > partitioned index in ReindexIndex() and partitioned table in > ReindexTable(), but the structure of the patch should be different: > - The second portion of ReindexMultipleTables() should be moved into a > separate routine, taking in input a list of relation OIDs. This needs > to be extended a bit so as reindex_index() gets called for an index > relkind if the relpersistence is temporary or if we have a > non-concurrent reindex. The idea is that we finish with a single code > path able to work on a list of relations. And your patch adds more of > that as of ReindexPartitionedRel(). It's a good idea. > - We should *not* handle directly partitioned index and/or table in > ReindexRelationConcurrently() to not complicate the logic where we > gather all the indexes of a table/matview. 
So I think that the list > of partition indexes/tables to work on should be built directly in > ReindexIndex() and ReindexTable(), and then this should call the > second part of ReindexMultipleTables() refactored in the previous > point. I think I addressed these mostly as you intended. > This way, each partition index gets done individually in its > own transaction. For a partition table, all indexes of this partition > are rebuilt in the same set of transactions. For the concurrent case, > we have already reindex_concurrently_swap that it able to switch the > dependencies of two indexes within a partition tree, so we can rely on > that so as a failure in the middle of the operation never leaves the > a partition structure in an inconsistent state. -- Justin
Attachments
On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote: > On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote: >> For now, I would recommend to focus first on 0001 to add support for >> partitioned tables and indexes to REINDEX. CIC is much more >> complicated btw, but I am not entering in the details now. >> >> + /* Avoid erroring out */ >> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) >> { >> This comment does not help, and actually this becomes incorrect as >> reindex for this relkind becomes supported once 0001 is done. > > I made a minimal change to avoid forgetting to eventually change > that part. Why not changing it then? We already filter out per relkind in all the code paths calling reindex_relation(), be it in indexcmds.c for schema-level reindex or even tablecmds.c, so I have switched this part to an elog(). >> - We should *not* handle directly partitioned index and/or table in >> ReindexRelationConcurrently() to not complicate the logic where we >> gather all the indexes of a table/matview. So I think that the list >> of partition indexes/tables to work on should be built directly in >> ReindexIndex() and ReindexTable(), and then this should call the >> second part of ReindexMultipleTables() refactored in the previous >> point. > > I think I addressed these mostly as you intended. Mostly. I have been hacking on this patch, and basically rewrote it as the attached. The handling of the memory context used to keep the list of partitions intact across transactions was rather clunky: the context was not reset when we are done, and we would call more APIs than necessary while switching to it, like find_all_inheritors() which could do much more allocations. I have fixed that by minimizing the areas where the private context is used, switching to it only when saving a new OID in the list of partitions, or a session lock (see below for this part). 
While on it, I found that the test coverage was not enough, so I have extended the set of tests to make sure any concurrent and non-concurrent operation for partitioned tables and indexes change the correct set of relfilenodes for each operation. I have written some custom functions to minimize the duplication (the whole thing cannot be grouped as those commands cannot run in a transaction block). Speaking of which, the patch missed that REINDEX INDEX/TABLE should not run in a transaction block when working on a partitioned relation. And the documentation needs to be clear about the limitation of each operation, so I have written more about all that. The patch also has commented out areas with slashes or such, and I have added some elog() and some asserts to make sure that we don't cross any areas that should not work with partitioned relations. While hacking on this patch, I have found an old bug in the REINDEX logic: we build a list of relations to reindex in ReindexMultipleTables() for schema and database reindexes, but it happens that we don't recheck if the relations listed actually exists or not, so dropping a relation during a large reindex can cause sparse failures because of relations that cannot be found anymore. In the case of this thread, the problem is different though (the proposed patch was full of holes regarding that) and we need to use session locks on the parent *table* partitions (not the indexes) to avoid any issues within the first transaction building the list of relations to work on, similarly to REINDEX CONCURRENTLY. So I fixed this problem this way. For the schema and database cases, I think that we would need to do something similar to VACUUM, aka have an extra code path to skip relations not defined. I'll leave that for another thread. One last thing. I think that the patch is in a rather good shape, but there is one error message I am not happy with when running some commands in a transaction block. 
Say, this sequence:

CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE INDEX parent_index ON parent_tab (id);
BEGIN;
REINDEX INDEX parent_index; -- error
ERROR:  25001: REINDEX INDEX cannot run inside a transaction block
LOCATION:  PreventInTransactionBlock, xact.c:3386

This error can be confusing, because we don't tell directly that the relation involved here is partitioned, and REINDEX INDEX/TABLE are fine when doing their stuff on non-partitions. For other code paths, we have leveraged such errors to use the grammar specific to partitions, for example "CREATE TABLE .. PARTITION OF" or such as these don't cause translation issues, but we don't have a specific syntax of REINDEX for partitioned relations, and I don't think that we need more grammar just for that. The simplest idea I have here is to just use an error callback to set an errcontext(), saying roughly: "while reindexing partitioned table/index %s" while we go through PreventInTransactionBlock(). I have done nothing about that yet but adding an errcallback is simple enough. Perhaps somebody has a different idea here? -- Michael
Attachments
Thanks for helping with this. On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:

> +++ b/src/backend/catalog/index.c
> @@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
> +		elog(ERROR, "unsupported relation kind for relation \"%s\"",
> +			 RelationGetRelationName(rel));

I guess it should show the relkind (%c) in the message, like these:

src/backend/commands/tablecmds.c: elog(ERROR, "unexpected relkind: %d", (int) relkind);
src/backend/tcop/utility.c: elog(ERROR, "unexpected relkind \"%c\"on partition \"%s\"",

ISTM reindex_index is missing that, too: 8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99

+	if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+		elog(ERROR, "unsupported relation kind for index \"%s\"",
+			 RelationGetRelationName(iRel));

> diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
> @@ -259,8 +263,12 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { IN
> </para>
>
> <para>
> -   Reindexing partitioned tables or partitioned indexes is not supported.
> -   Each individual partition can be reindexed separately instead.
> +   Reindexing partitioned indexes or partitioned tables is supported
> +   with respectively <command>REINDEX INDEX</command> or
> +   <command>REINDEX TABLE</command>.

Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".

> +   Each partition of the partitioned
> +   relation defined is rebuilt in its own transaction.

=> Each partition of the specified partitioned relation is reindexed in a separate transaction. -- Justin
On Wed, Aug 12, 2020 at 12:28:20AM -0500, Justin Pryzby wrote:
> On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:
>> +++ b/src/backend/catalog/index.c
>> @@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
>> +		elog(ERROR, "unsupported relation kind for relation \"%s\"",
>> +			 RelationGetRelationName(rel));
>
> I guess it should show the relkind(%c) in the message, like these:

Sure, but I don't see much the point in adding the relkind here knowing that we know which one it is.

> ISTM reindex_index is missing that, too:
>
> 8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
> +	if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
> +		elog(ERROR, "unsupported relation kind for index \"%s\"",
> +			 RelationGetRelationName(iRel));

The error string does not follow the usual project style either, so I have updated both.

>> <para>
>> -   Reindexing partitioned tables or partitioned indexes is not supported.
>> -   Each individual partition can be reindexed separately instead.
>> +   Reindexing partitioned indexes or partitioned tables is supported
>> +   with respectively <command>REINDEX INDEX</command> or
>> +   <command>REINDEX TABLE</command>.
>
> Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".
>
>> +   Each partition of the partitioned
>> +   relation defined is rebuilt in its own transaction.
>
> => Each partition of the specified partitioned relation is reindexed in a
> separate transaction.

Thanks, good idea. I have been able to work more on this patch today, and finally added an error context for the transaction block check, as that's cleaner. In my manual tests, I have also bumped into a case that failed with the original patch (where there were no session locks), and created an isolation test based on it: drop of a partition leaf concurrently to REINDEX done on the parent.
One last thing I have spotted is that we failed to properly discard foreign tables defined as leaves of a partition tree, causing a reindex to fail, so reindex_partitions() ought to just use RELKIND_HAS_STORAGE() to do its filtering work. I am leaving this patch alone for a couple of days now, and I'll try to come back to it after and potentially commit it. The attached has been indented by the way. -- Michael
Attachments
On Wed, Aug 12, 2020 at 10:37:08PM +0900, Michael Paquier wrote: > I have been able to work more on this patch today, and finally added > an error context for the transaction block check, as that's cleaner. > In my manual tests, I have also bumped into a case that failed with > the original patch (where there were no session locks), and created > an isolation test based on it: drop of a partition leaf concurrently > to REINDEX done on the parent. One last thing I have spotted is that > we failed to discard properly foreign tables defined as leaves of a > partition tree, causing a reindex to fail, so reindex_partitions() > ought to just use RELKIND_HAS_STORAGE() to do its filtering work. I > am leaving this patch alone for a couple of days now, and I'll try to > come back to it after and potentially commit it. The attached has > been indented by the way. I got to think more about the locking strategy used in this patch, and I am afraid that we should fix the bug with REINDEX SCHEMA/DATABASE first. What we have here is rather similar to a REINDEX SCHEMA with all the partitions on the same schema, so it would be better to apply the same set of assumptions and logic for the reindex of partitioned relations as we do for the others. This makes the whole logic more consistent, but it also reduces the surface of bugs. I have created a separate thread for the problem, and posted a patch: https://www.postgresql.org/message-id/20200813043805.GE11663@paquier.xyz Once this gets done, we should then be able to get rid of the extra session locking taken when building the list of partitions, limiting session locks to only be taken during the concurrent reindex of a single partition (the table itself for a partition table, and the parent table for a partition index), making the whole operation less invasive. -- Michael
Attachments
On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote: > Thanks for looking. The REINDEX patch is progressing its way, so I have looked a bit at the part for CIC. Visibly, the case of multiple partition layers is not handled correctly. Here is a sequence that gets broken:

CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE TABLE child_0_10 PARTITION OF parent_tab FOR VALUES FROM (0) TO (10);
CREATE TABLE child_10_20 PARTITION OF parent_tab FOR VALUES FROM (10) TO (20);
CREATE TABLE child_20_30 PARTITION OF parent_tab FOR VALUES FROM (20) TO (30);
INSERT INTO parent_tab VALUES (generate_series(0,29));
CREATE TABLE child_30_40 PARTITION OF parent_tab
  FOR VALUES FROM (30) TO (40) PARTITION BY RANGE(id);
CREATE TABLE child_30_35 PARTITION OF child_30_40 FOR VALUES FROM (30) TO (35);
CREATE TABLE child_35_40 PARTITION OF child_30_40 FOR VALUES FROM (35) TO (40);
INSERT INTO parent_tab VALUES (generate_series(30,39));
CREATE INDEX CONCURRENTLY parent_index ON parent_tab (id);

This fails as follows:

ERROR:  XX000: unrecognized node type: 2139062143
LOCATION:  copyObjectImpl, copyfuncs.c:5718

And the problem visibly comes from some handling with the second level of partitions, child_30_40 in my example above. Even with that, this outlines a rather severe problem in the patch: index_set_state_flags() flips indisvalid on/off using a non-transactional update (see the use of heap_inplace_update), meaning that if we fail in the middle of the operation we may finish with a partition index tree where some of the indexes are valid and some of them are invalid. In order to make this logic consistent, I am afraid that we will need to do two things:
- Change index_set_state_flags() so that it uses a transactional update. That's something I would like to change for other reasons, like making sure that the REPLICA IDENTITY of a parent relation is correctly reset when dropping a replica index.
- Make all the indexes of the partition tree valid in the *same* sub-transaction.

You can note that this case is different from a concurrent REINDEX, because in this case we just do an in-place change between the old and new index, meaning that even if a failure happens while processing, we may have some invalid indexes, but there are still valid indexes attached to the partition tree, at any time.

+		MemoryContext oldcontext = MemoryContextSwitchTo(ind_context);
		PartitionDesc partdesc = RelationGetPartitionDesc(rel);
		int			nparts = partdesc->nparts;
+		char	   *relname = pstrdup(RelationGetRelationName(rel));

Er, no. We should not play with the relation cache calls in a private memory context. I think that this needs much more thought. -- Michael
Attachments
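The consistency argument above can be illustrated with a toy model (plain Python with invented names, not PostgreSQL code): per-index non-transactional flag updates can strand a partition tree in a mixed valid/invalid state on failure, while grouping all the flips into one transactional commit cannot.

```python
# Toy model of the failure mode discussed above (invented, not PG code):
# tree maps index name -> indisvalid flag.

def flip_inplace(tree, fail_at):
    """Non-transactional updates: each flip is immediately durable."""
    for i, idx in enumerate(tree):
        if i == fail_at:
            raise RuntimeError("failure mid-operation")
        tree[idx] = True

def flip_grouped(tree, fail_at):
    """All flips in one transaction: applied only if nothing failed."""
    pending = {}
    for i, idx in enumerate(tree):
        if i == fail_at:
            raise RuntimeError("failure mid-operation")
        pending[idx] = True
    tree.update(pending)   # the "commit": all flips become visible at once

tree = {"parent_idx": False, "part1_idx": False, "part2_idx": False}
try:
    flip_inplace(tree, fail_at=2)
except RuntimeError:
    pass
# mixed state: some indexes valid, some invalid -- the tree is inconsistent
assert list(tree.values()) == [True, True, False]

tree = {"parent_idx": False, "part1_idx": False, "part2_idx": False}
try:
    flip_grouped(tree, fail_at=2)
except RuntimeError:
    pass
# the failed "transaction" left everything invalid -- still consistent
assert list(tree.values()) == [False, False, False]
```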
On Fri, Aug 14, 2020 at 09:29:45AM +0900, Michael Paquier wrote: > Once this gets done, we should then be able to get rid of the extra > session locking taken when building the list of partitions, limiting > session locks to only be taken during the concurrent reindex of a > single partition (the table itself for a partition table, and the > parent table for a partition index), making the whole operation less > invasive. The problem with dropped relations in REINDEX has been addressed by 1d65416, so I have gone through this patch again and simplified the use of session locks, these being taken only when doing a REINDEX CONCURRENTLY for a given partition. This part is in a rather committable shape IMO, so I would like to get it done first, before looking more at the other cases with CIC and CLUSTER. I am still planning to go through it once again. -- Michael
Attachments
Thank you for advancing this work.
I was reviewing the previous version, but the review became outdated before I sent it. Overall design is fine, but I see a bunch of things that need to be fixed before commit.
First of all, this patch fails at cfbot:
indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used [-Werror=unused-but-set-variable]
  Oid parentoid;
      ^
It seems to be just a typo. With this minor fix the patch compiles and passes tests.
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 75008eebde..f5b3c53a83 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2864,7 +2864,7 @@ reindex_partitions(Oid relid, int options, bool concurrent,
 			/* Save partition OID */
 			old_context = MemoryContextSwitchTo(reindex_context);
-			partitions = lappend_oid(partitions, partoid);
+			partitions = lappend_oid(partitions, parentoid);
 			MemoryContextSwitchTo(old_context);
 		}
If this guessed fix is correct, I see a problem in the patch logic. In reindex_partitions() we collect parent relations to pass them to reindex_multiple_internal(). It implicitly changes the logic from REINDEX INDEX to REINDEX RELATION, which is not the same thing if the table has more than one index. For example, if I add one more index to a partitioned table from the create_index.sql test:
create index idx on concur_reindex_part_0_2 using btree (c2);
and call

REINDEX INDEX CONCURRENTLY concur_reindex_part_index;

idx will be reindexed as well. I doubt that it is the desired behavior.
A few more random issues, that I noticed at first glance:
1) in create_index.sql
Are these two lines intentional checks that must fail? If so, I propose to add a comment.
REINDEX TABLE concur_reindex_part_index;
REINDEX TABLE CONCURRENTLY concur_reindex_part_index;
A few lines around also look like they were copy-pasted and need a second look.
2) This part of ReindexRelationConcurrently() is misleading.
		case RELKIND_PARTITIONED_TABLE:
		case RELKIND_PARTITIONED_INDEX:
		default:
			/* Return error if type of relation is not supported */
			ereport(ERROR,
					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
					 errmsg("cannot reindex this type of relation concurrently")));
Maybe an assertion is enough. It seems that we should never reach this code, because both ReindexTable and ReindexIndex handle those relkinds separately. Which leads me to the next question.
3) Is there a reason, why reindex_partitions() is not inside ReindexRelationConcurrently() ? I think that logically it belongs there.
4) I haven't tested multi-level partitions yet. In any case, it would be good to document this behavior explicitly.
I need a bit more time to review this patch more thoroughly. Please wait for it before committing.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Sep 03, 2020 at 10:02:58PM +0300, Anastasia Lubennikova wrote: > First of all, this patch fails at cfbot: > > indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used > [-Werror=unused-but-set-variable] > Oid parentoid;^ Missed this warning, thanks for pointing it out. This is an incorrect rebase: this variable was used as a workaround to take a session lock on the parent table to be reindexed, session lock logic existing to prevent cache lookup errors when dropping some portions of the tree concurrently (1d65416 as you already know). But we don't need that anymore. > If this guessed fix is correct, I see the problem in the patch logic. In > reindex_partitions() we collect parent relations to pass them to > reindex_multiple_internal(). It implicitly changes the logic from REINDEX > INDEX to REINDEX RELATION, which is not the same, if table has more than one > index. Incorrect guess here. parentoid refers to the parent table of an index, so by saving its OID in the list of things to reindex, you would finish by reindexing all the indexes of a partition. We need to use partoid, as returned by find_all_inheritors() for all the members of the partition tree (relid can be a partitioned index or partitioned table in reindex_partitions). > 1) in create_index.sql > > Are these two lines intentional checks that must fail? If so, I propose to > add a comment. > > REINDEX TABLE concur_reindex_part_index; > REINDEX TABLE CONCURRENTLY concur_reindex_part_index; > > A few lines around also look like they were copy-pasted and need a second > look. You can note some slight differences though. These are test cases for REINDEX INDEX with tables, and REINDEX TABLE with indexes. What you are quoting here is the part for indexes with REINDEX TABLE. 
And there are already comments, see: "-- REINDEX INDEX fails for partitioned tables" "-- REINDEX TABLE fails for partitioned indexes" Perhaps that's not enough, so I have added some more comments to outline that these commands fail (8 commands in total). > 2) This part of ReindexRelationConcurrently() is misleading. > > Maybe assertion is enough. It seems, that we should never reach this code > because both ReindexTable and ReindexIndex handle those relkinds > separately. Which leads me to the next question. Yes, we could use an assert, but I did not feel any strong need to change that either for this patch. > 3) Is there a reason, why reindex_partitions() is not inside > ReindexRelationConcurrently() ? I think that logically it belongs there. Yes, it simplifies the build of the index list indexIds, as there is no need to loop back into a different routine if working on a table or a matview. > 4) I haven't tested multi-level partitions yet. In any case, it would be > good to document this behavior explicitly. Not sure what addition we could do here. The patch states that each partition of the partitioned relation defined gets reindexed, which implies that this handles multiple layers automatically. > I need a bit more time to review this patch more thoroughly. Please, wait > for it, before committing. Glad to hear that, please take the time you need. -- Michael
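[Editor's note: the partition-tree walk that reindex_partitions() relies on can be pictured with a small sketch. This is illustrative Python, not backend code; find_all_inheritors here is a toy model of the breadth-first behavior of the backend function of the same name, and the tree layout is invented.]

```python
# Toy model of find_all_inheritors(): breadth-first expansion of a
# partition tree, returning the starting relation plus every direct and
# indirect child -- which is how reindex_partitions() can collect all
# members of a partitioned index or table, including multi-level trees.
from collections import deque

def find_all_inheritors(relid, children):
    result = [relid]
    queue = deque([relid])
    while queue:
        for child in children.get(queue.popleft(), []):
            result.append(child)
            queue.append(child)
    return result

# p is partitioned; p2 is itself partitioned (a multi-level tree).
tree = {"p": ["p1", "p2"], "p2": ["p21", "p22"]}
assert find_all_inheritors("p", tree) == ["p", "p1", "p2", "p21", "p22"]
```

This also shows why saving the parent table's OID (parentoid) instead of each member's OID (partoid) would reindex too much: the list must contain the tree members themselves.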
Attachments
On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote: > Glad to hear that, please take the time you need. Attached is a rebased version to address the various conflicts after 844c05a. -- Michael
Attachments
On 07.09.2020 04:58, Michael Paquier wrote: > On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote: >> Glad to hear that, please take the time you need. > Attached is a rebased version to address the various conflicts after > 844c05a. > -- > Michael > Thank you. With the fix for the cycle in reindex_partitions() and new comments added, I think this patch is ready for commit. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Thanks for completing and pushing the REINDEX patch and others. Here's a rebasified + fixed version of the others. On Tue, Sep 01, 2020 at 02:51:58PM +0900, Michael Paquier wrote: > The REINDEX patch is progressing its way, so I have looked a bit at > the part for CIC. > > Visibly, the case of multiple partition layers is not handled > correctly. Here is a sequence that gets broken: .. > This fails as follows: > ERROR: XX000: unrecognized node type: 2139062143 > LOCATION: copyObjectImpl, copyfuncs.c:5718 Because copyObject needed to be called within a long-lived context. Also, my previous revision failed to implement your suggestion to first build catalog entries with INVALID indexes and to then reindex them. Fixed. -- Justin
Attachments
On Mon, Sep 07, 2020 at 09:39:16PM -0500, Justin Pryzby wrote:
> Also, my previous revision failed to implement your suggestion to first build
> catalog entries with INVALID indexes and to then reindex them. Fixed.

-    childStmt->oldCreateSubid = InvalidSubTransactionId;
-    childStmt->oldFirstRelfilenodeSubid = childStmt->InvalidSubTransactionId;
+    // childStmt->oldCreateSubid = childStmt->InvalidSubTransactionId;
+    // childStmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;

In the CIC patch, what is that about? It is hard to follow what you are trying to achieve here.

+    index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+    CommandCounterIncrement();
+    index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);

Anyway, this part of the logic is still not good here. If we fail in the middle here, we may run into problems with a single partition index that has only a portion of its flags set. As this gets called for each partitioned index, it also means that we could still finish in a state where we have only a partially-valid partition tree. For example, imagine a partition tree with 2 levels (as reported by pg_partition_tree), then the following happens if an index is created concurrently from the partitioned table of level 0:
1) indexes are created in level 2 first
2) partitioned table of level 1 is switched to be ready and valid
3) indexes of level 1 are created
4) partitioned table of level 0 is switched to be ready and valid
If the command has a failure, say between 2 and 3, we would have as result a command that has partially succeeded in creating a partition tree, while the user was looking for an index for the full tree.
This comes back to my previous points, where we should first make index_set_state_flags() transactional, and I have begun a separate thread about that: https://commitfest.postgresql.org/30/2725/ Second, it also means that the patch should really switch all the indexes to be valid in one single transaction, and that we also need more careful refactoring of DefineIndex(). I also had a quick look at the patch for CLUSTER, and it does a much better job, though it still has issues.

+    MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+
+    inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+    foreach(lc, inhoids)

The area where a private memory context is used should be minimized. In this case, you just need to hold the context while saving the relation and clustered index information in the list of RelToCluster items. As a whole, this case is simpler than CIC, so I'd like to think that it would be good to work on that as the next target. Coming to my last point: this thread has dealt since the beginning with three different problems:
- REINDEX on partitioned relations.
- CLUSTER on partitioned relations.
- CIC on partitioned relations. (Should we also care about DROP INDEX CONCURRENTLY as well?)
The first problem has been solved, not the two others yet. Do you think that it could be possible to begin two new separate threads for the remaining issues, with dedicated CF entries? We could also mark the existing one as committed, retitled for REINDEX as a matter of clarity. Also, please note that I am not sure if I will be able to look more at this thread in this CF. -- Michael
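[Editor's note: the failure window described above can be modeled with a small sketch. This is illustrative Python, not backend code; the names, the level layout, and the step-counting failure injection are invented. It shows how switching flags one transaction at a time, level by level, can strand a partially-valid tree.]

```python
# Hypothetical model of per-transaction validity switches during a
# bottom-up concurrent index build. levels[0] is the top partitioned
# index, levels[-1] the leaves. Each flag switch "commits" on its own,
# so an error partway through leaves earlier commits in place.
def cic_per_level(levels, fail_after_step=None):
    valid = set()
    step = 0
    for level in reversed(levels):        # build bottom-up
        for idx in level:
            step += 1
            valid.add(idx)                # committed one at a time
            if step == fail_after_step:
                return valid              # error: committed flags survive
    return valid

tree = [["idx0"], ["idx1a", "idx1b"], ["idx2a", "idx2b", "idx2c"]]
# Uninterrupted run: the whole tree ends up valid.
assert cic_per_level(tree) == {"idx0", "idx1a", "idx1b",
                               "idx2a", "idx2b", "idx2c"}
# Failure after the leaves but before the top: a partially-valid tree.
partial = cic_per_level(tree, fail_after_step=4)
assert "idx0" not in partial and "idx2a" in partial
```

The fix discussed above amounts to moving all the `valid.add()` calls into one final transaction, so the tree is either entirely valid or not at all.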
Attachments
On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote: > - CIC on partitioned relations. (Should we also care about DROP INDEX > CONCURRENTLY as well?) Do you have any idea what you think that should look like for DROP INDEX CONCURRENTLY ? -- Justin
On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote: > On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote: >> - CIC on partitioned relations. (Should we also care about DROP INDEX >> CONCURRENTLY as well?) > > Do you have any idea what you think that should look like for DROP INDEX > CONCURRENTLY ? Making the maintenance of the partition tree consistent to the user is the critical part here, so my guess on this matter is:
1) Remove each index from the partition tree and mark the indexes as invalid in the same transaction. This makes sure that after commit no indexes would get used for scans, and the partition dependency tree is completely removed with the parent table. That's close to what we do in index_concurrently_swap() except that we just want to remove the dependencies with the partitions, and not just swap them of course.
2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction should be fine as that prevents inserts.
3) Finish the index drop.
Steps 2) and 3) could be completely done for each index as part of index_drop(). The tricky part is to integrate 1) cleanly within the existing dependency machinery while still knowing about the list of partitions that can be removed. I think that this part is not that straightforward, but perhaps we could just make this logic part of RemoveRelations() when listing all the objects to remove. -- Michael
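[Editor's note: the three steps above can be sketched as a toy state machine. This is illustrative Python, not backend code; the state strings are simplifications of the catalog flags, and the log entries stand in for transaction boundaries.]

```python
# Hypothetical model of the DROP INDEX CONCURRENTLY sequence outlined
# above: one transaction marks every index in the tree invalid (step 1),
# then each index is switched to "dead" and dropped per index (steps 2
# and 3, which could both live in index_drop()).
def drop_index_concurrently(indexes, log):
    # Step 1: a single transaction over the whole partition tree, so no
    # index in the tree can be used for scans afterwards.
    state = {idx: "invalid" for idx in indexes}
    log.append(("txn1", dict(state)))
    # Steps 2 and 3: one transaction per index.
    for idx in indexes:
        state[idx] = "dead"               # blocks inserts
        log.append(("txn", idx, "dead"))
        del state[idx]                    # physical drop
        log.append(("txn", idx, "dropped"))
    return state

log = []
remaining = drop_index_concurrently(["i", "i1", "i2"], log)
assert remaining == {}                    # everything dropped
assert log[0] == ("txn1", {"i": "invalid", "i1": "invalid", "i2": "invalid"})
```

The invariant the model captures is that an interruption after txn1 leaves only invalid indexes behind, never a half-usable tree.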
Attachments
On Sat, Sep 12, 2020 at 10:35:34AM +0900, Michael Paquier wrote: > On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote: > > On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote: > >> - CIC on partitioned relations. (Should we also care about DROP INDEX > >> CONCURRENTLY as well?) > > > > Do you have any idea what you think that should look like for DROP INDEX > > CONCURRENTLY ? > > Making the maintenance of the partition tree consistent to the user is > the critical part here, so my guess on this matter is: > 1) Remove each index from the partition tree and mark the indexes as > invalid in the same transaction. This makes sure that after commit no > indexes would get used for scans, and the partition dependency tree > is completely removed with the parent table. That's close to what we > do in index_concurrently_swap() except that we just want to remove the > dependencies with the partitions, and not just swap them of course. > 2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction > should be fine as that prevents inserts. > 3) Finish the index drop. > > Steps 2) and 3) could be completely done for each index as part of > index_drop(). The tricky part is to integrate 1) cleanly within the > existing dependency machinery while still knowing about the list of > partitions that can be removed. I think that this part is not that > straightforward, but perhaps we could just make this logic part of > RemoveRelations() when listing all the objects to remove. Thanks. I see three implementation ideas:
1. I think your way has the issue that the dependencies are lost. If there's an interruption, the user is maybe left with hundreds or thousands of detached indexes to clean up. This is strange since there's actually no detach command for indexes (but they're implicitly "attached" when a matching parent index is created). A 2nd issue is that DETACH currently requires an exclusive lock (but see Alvaro's WIP patch).
2.
Maybe the easiest way is to mark all indexes invalid and then drop all partitions (concurrently) and then the partitioned parent. If interrupted, this would leave a parent index marked "invalid", and some child tables with no indexes. I think this may be "ok". The same thing is possible if a concurrent build is interrupted, right?
3. I have a patch which changes index_drop() to "expand" a partitioned index into its list of children. Each of these becomes a List:
| indexId, heapId, userIndexRelation, userHeapRelation, heaplocktag, heaprelid, indexrelid
The same process is followed as for a single index, but handling all partitions at once in two transactions total. Arguably, this is bad since that function currently takes a single Oid but would now end up operating on a list of indexes.
Anyway, for now this is rebased on 83158f74d. -- Justin
Attachments
On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote: > Anyway, for now this is rebased on 83158f74d. I have not thought yet about all the details of CIC and DIC and what you said upthread, but I have gone through CLUSTER for now, as a start. So, the goal of the patch, as I would define it, is to give a way for CLUSTER to work on a partitioned table using a given partitioned index. Hence, we would perform CLUSTER automatically from the top-most parent for any partitions that have an index on the same partition tree as the partitioned index defined in USING, switching indisclustered automatically depending on the index used.

+CREATE TABLE clstrpart3 PARTITION OF clstrpart DEFAULT PARTITION BY RANGE(a);
+CREATE TABLE clstrpart33 PARTITION OF clstrpart3 DEFAULT;
 CREATE INDEX clstrpart_idx ON clstrpart (a);
-ALTER TABLE clstrpart CLUSTER ON clstrpart_idx

So, for any testing of partitioned trees, we should be careful to check whether relfilenodes have been changed as part of an operation, to see if the operation has actually done something. From what I can see, attempting to use CLUSTER on a top-most partitioned table fails to work on child tables, but isn't the goal of the patch to make sure that if we attempt the operation on a partitioned table using a partitioned index, then the operation is also done on the partitions whose partition index is part of the same partition tree as the parent partitioned index? If using CLUSTER on a new partitioned index with USING, it seems to me that we may want to check in the regression tests that indisclustered is correctly set for all the partitions concerned. It would be good also to check the case of a partition index tree that maps only partially onto a partition table tree (aka not all table partitions have a partition index), where these don't get clustered because there is no index to work on.
+    MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+    inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+    MemoryContextSwitchTo(old_context);

Er, isn't that incorrect? I would have assumed that what should be saved in the context of cluster is the list of RelToCluster items. And we should not do any lookup of the partitions in a different context, because this may do allocations of things we don't really need to keep around for the context dedicated to CLUSTER. Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on the parent? It seems to me that at least schema changes should be prevented with a ShareLock in the first transaction building the list of elements to cluster.

+    /*
+     * We have a full list of direct and indirect children, so skip
+     * partitioned tables and just handle their children.
+     */
+    if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
+        continue;

It would be better to use RELKIND_HAS_STORAGE here. All this stuff needs to be documented clearly. -- Michael
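[Editor's note: Michael's two points here, keeping only the RelToCluster list in the long-lived context and skipping relations without storage, can be pictured with a toy model. This is illustrative Python, not backend code; the helper name and inputs are invented, and a plain list stands in for the CLUSTER memory context.]

```python
# Hypothetical sketch: expand the partition tree in short-lived (caller)
# scope, and copy only what must outlive the scan -- the leaves to
# cluster -- into the long-lived "cluster context" list.
def get_rels_to_cluster(index_oid, children, relkind):
    cluster_context = []                   # survives across transactions
    # Short-lived work: breadth-first expansion of the tree. In C this
    # should allocate in the caller's context, not cluster_context.
    inhoids = [index_oid]
    for oid in inhoids:
        inhoids.extend(children.get(oid, []))
    # Keep only relations with storage (partitioned rels have none).
    for oid in inhoids:
        if relkind[oid] == "partitioned":
            continue                       # nothing to rewrite here
        cluster_context.append(oid)
    return cluster_context

kids = {"idx": ["idx1", "idx2"], "idx2": ["idx21"]}
kind = {"idx": "partitioned", "idx1": "index",
        "idx2": "partitioned", "idx21": "index"}
assert get_rels_to_cluster("idx", kids, kind) == ["idx1", "idx21"]
```

The "partitioned" check is the stand-in for the RELKIND_HAS_STORAGE test suggested above.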
Attachments
On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote: > start. So, the goal of the patch, as I would define it, is to give a > way to CLUSTER to work on a partitioned table using a given > partitioned index. Hence, we would perform CLUSTER automatically from > the top-most parent for any partitions that have an index on the same > partition tree as the partitioned index defined in USING, switching > indisclustered automatically depending on the index used. I think that's right, except there's no need to look for a compatible partitioned index: we just use the child index. Also, if a partitioned index is clustered, when we clear indisclustered for other indexes, should we also propagate that to their parent indexes, if any? > From what I can see, attempting to use a CLUSTER on a top-most > partitioned table fails to work on child tables, Oops - it looks like this patch never worked right, due to the RECHECK on indisclustered. I think maybe I returned to the CIC patch and never finished with cluster. > It would be good also to check if > we have a partition index tree that maps partially with a partition > table tree (aka not all table partitions have a partition index), where > these don't get clustered because there is no index to work on. This should not happen, since an incomplete partitioned index is "invalid". > Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on > the parent? It seems to me that at least schema changes should be > prevented with a ShareLock in the first transaction building the list > of elements to cluster. Thanks for noticing. I chose ShareUpdateExclusiveLock since it's self-exclusive. -- Justin
Attachments
On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote: > Also, if a partitioned index is clustered, when we clear indisclustered for > other indexes, should we also propagate that to their parent indexes, if any? I am not sure what you mean here. Each partition's cluster runs in its own individual transaction based on the patch you sent. Are you suggesting to update indisclustered for the partitioned index of a partitioned table and all its parent partitioned tables in the same transaction, aka a transaction working on the partitioned table? Doesn't that mean that if we have a partition tree with multiple layers then we finish by doing the same operation multiple times for the parents? >> It would be good also to check if >> we have a partition index tree that maps partially with a partition >> table tree (aka not all table partitions have a partition index), where >> these don't get clustered because there is no index to work on. > > This should not happen, since an incomplete partitioned index is "invalid". Indeed, I did not know this property. I can see also that you have added a test for this case, so that's good if we can rely on that. I am still in the process of reviewing this patch, all this handling around indisclustered makes it rather complex. -- Michael
Attachments
On Mon, Oct 05, 2020 at 05:46:27PM +0900, Michael Paquier wrote: > On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote: > > Also, if a partitioned index is clustered, when we clear indisclustered for > > other indexes, should we also propagate that to their parent indexes, if any? > > I am not sure what you mean here. Each partition's cluster runs in > its own individual transaction based on the patch you sent. Are you > suggesting to update indisclustered for the partitioned index of a > partitioned table and all its parent partitioned tables in the same > transaction, aka a transaction working on the partitioned table? No, I mean that if a partition is no longer clustered on some index, then its parent isn't clustered on that index's parent, either. It means that we might do N catalog updates for a partition hierarchy that's N levels deep. Normally, N=2, and we'd clear indisclustered for the index as well as its parent. This is not essential, though. > >> It would be good also to check if > >> we have a partition index tree that maps partially with a partition > >> table tree (aka not all table partitions have a partition index), where > >> these don't get clustered because there is no index to work on. > > > > This should not happen, since an incomplete partitioned index is "invalid". > > Indeed, I did not know this property. I think that's something we can apply for CIC/DIC, too. It's not essential to avoid leaving an "invalid" or partial index if interrupted. It's only essential that a partial, partitioned index is not "valid". For DROP IND CONCURRENTLY, I wrote: On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote: > 2. Maybe the easiest way is to mark all indexes invalid and then drop all > partitions (concurrently) and then the partitioned parent. If interrupted, > this would leave a parent index marked "invalid", and some child tables with no > indexes. I think this may be "ok".
> The same thing is possible if a concurrent build is interrupted, right?

-- Justin
On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote: > It means that we might do N catalog updates for a partition hierarchy that's N > levels deep. Normally, N=2, and we'd clear indisclustered for the index as > well as its parent. This is not essential, though. Hmm. I got to think more about this one, and being able to ensure a consistent state of indisclustered for partitioned tables and all their partitions across multiple transactions at all times would be a nice property, as we could easily track down if an operation has failed in-flight. The problem here is that we are limited by indisclustered being a boolean, so we may set indisclustered one way or another on some partitions, or on some parent partitioned tables, but we are not sure if what's down is actually clustered for the leaves of a partitioned index, or not. Or maybe we even have an inconsistent state, so this does not provide a good user experience. The consistency of a partition tree is a key thing, and we have such guarantees with REINDEX (with or without CONCURRENTLY), so I'd like to think that what makes sense for indisclustered on a partitioned index is the following set of guarantees (relying on indisvalid being true iff an index partition tree is complete on a given table partition tree is really important here):
- If indisclustered is set to true for a partitioned index, then we have the guarantee that all its partition and partitioned indexes have indisclustered set to true, across all the layers down to the leaves.
- If indisclustered is set to false for a partitioned index, then we have the guarantee that all its partition and partitioned indexes have indisclustered set to false.
If we happen to attach a new partition to a partitioned table of such a tree, I guess that it is then logical to have indisclustered set to the same value as the partitioned index when the partition inherits an index.
So, it seems to me that with the current catalogs we are limited by indisclustered being a boolean to maintain some kind of consistency across transactions of CLUSTER, as we would use one transaction per leaf for the work. In order to make things better here, we could switch indisclustered to a char, able to use three states:
- 'e' or enabled, equivalent to indisclustered == true. There should be only one index on a table with 'e' set at a given time.
- 'd' or disabled, equivalent to indisclustered == false.
- a new third state, for an equivalent of work-in-progress, let's call it 'w', or whatever.
Then, when we begin a CLUSTER on a partitioned table, I would imagine the following:
- Switch all the indexes selected to 'w' in a first transaction, and do not reset the state of other indexes if there is one 'e'. For CLUSTER without USING, we switch the existing 'e' to 'w' if there is such an index. If there are no indexes selectable, issue an error. If we find an index with 'w', we have a candidate from a previous failed command, so this gets used. I don't think that this design requires a new DDL either, as we could reset all 'w' and 'e' to 'd' if using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table. For CLUSTER with USING, the partitioned index selected and its leaves are switched to 'w', similarly.
- Then do the work for all the partitions, one partition per transaction, keeping 'w'.
- In a last transaction, switch all the partitions from 'w' to 'e', resetting any existing 'e'.
ALTER TABLE CLUSTER ON should also be a supported operation, where 'e' gets switched for all the partitions in a single transaction. Of course, this should not work for an invalid index. Even with such a design, I got to wonder if there could be cases where a user does *not* want to cluster the leaves of a partition tree with the same index.
If that were to happen, the partition tree may need a redesign, so making things work such that we force partitions to inherit what's wanted for the partitioned table makes the most sense to me. By the way, I mentioned that previously, but this thread got used for REINDEX, which is finished, and now we have a discussion going on with CLUSTER. There is also stuff related to CIC and DIC. There is also only one CF entry for all that. I really think that this work had better be split into separate threads, with separate CF entries. Do you mind if I change the contents of the CF app so that the existing item is renamed for REINDEX, which got committed? Then we could create a new entry for CIC/DIC (it may make sense to split these two as well, but I am not sure if there are any overlaps between the APIs of the two), and a new entry for CLUSTER, depending on the state of the work. The original subject of the thread is also unrelated, this makes the review process unnecessarily complicated, and that's really confusing. Each discussion should go into its own, independent, thread. -- Michael
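[Editor's note: the 'e'/'d'/'w' proposal above can be modeled to show both the happy path and the interrupted path. This is illustrative Python, not a catalog patch; a dict stands in for pg_index rows, and the failure flag is an invented stand-in for an error mid-command.]

```python
# Hypothetical model of the three-state indisclustered scheme:
# 'e' = enabled, 'd' = disabled, 'w' = work in progress.
def cluster_partitioned(states, target, fail_before_finish=False):
    """states maps each leaf index to 'e'/'d'/'w'; target is the set of
    indexes selected by this CLUSTER run."""
    # Txn 1: mark the selected indexes work-in-progress; any existing
    # 'e' is left alone until the command succeeds.
    for idx in target:
        states[idx] = "w"
    # ... one transaction per partition does the actual clustering ...
    if fail_before_finish:
        return states             # 'w' markers survive for a later retry
    # Final txn: switch 'w' to 'e', resetting any previously clustered index.
    for idx, st in states.items():
        if st == "e" and idx not in target:
            states[idx] = "d"
    for idx in target:
        states[idx] = "e"
    return states

# i1/i2 were clustered; CLUSTER ... USING j switches the tree to j1/j2.
s = cluster_partitioned({"i1": "e", "i2": "e", "j1": "d", "j2": "d"},
                        ["j1", "j2"])
assert s == {"i1": "d", "i2": "d", "j1": "e", "j2": "e"}
# An interrupted run leaves 'w' markers and keeps the old 'e' intact,
# so a retry can find its candidates.
s = cluster_partitioned({"i1": "e", "i2": "e", "j1": "d", "j2": "d"},
                        ["j1", "j2"], fail_before_finish=True)
assert s == {"i1": "e", "i2": "e", "j1": "w", "j2": "w"}
```

The point of the third state is visible in the second call: after a failure the catalog still says unambiguously which indexes a retry should pick up.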
Attachments
On Tue, Oct 06, 2020 at 11:42:55AM +0900, Michael Paquier wrote: > On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote: > > It means that we might do N catalog updates for a partition heirarchy that's N > > levels deep. Normally, N=2, and we'd clear indisclustered for the index as > > well as its parent. This is not essential, though. > > Hmm. I got to think more about this one, and being able to ensure a > consistent state of indisclustered for partitioned tables and all > their partitions across multiple transactions at all times would be a > nice property, as we could easily track down if an operation has > failed in-flight. The problem here is that we are limited by > indisclustered being a boolean, so we may set indisclustered one way > or another on some partitions, or on some parent partitioned tables, > but we are not sure if what's down is actually clustered for the > leaves of a partitioned index, or not. Or maybe we even have an > inconsistent state, so this does not provide a good user experience. > The consistency of a partition tree is a key thing, and we have such > guarantees with REINDEX (with or without concurrently), so what I'd > like to think that what makes sense for indisclustered on a > partitioned index is to have the following set of guarantees, and > relying on indisvalid as being true iff an index partition tree is > complete on a given table partition tree is really important: > - If indisclustered is set to true for a partitioned index, then we > have the guarantee that all its partition and partitioned indexes have > indisclustered set to true, across all the layers down to the leaves. > - If indisclustered is set to false for a partitioned index, then we > have the guarantee that all its partition and partitioned indexes have > indisclustered set to false. 
> > If we happen to attach a new partition to a partitioned table of such > a tree, I guess that it is then logic to have indisclustered set to > the same value as the partitioned index when the partition inherits an > index. So, it seems to me that with the current catalogs we are > limited by indisclustered as being a boolean to maintain some kind of > consistency across transactions of CLUSTER, as we would use one > transaction per leaf for the work. In order to make things better > here, we could switch indisclustered to a char, able to use three > states: > - 'e' or enabled, equivalent to indisclustered == true. There should > be only one index on a table with 'e' set at a given time. > - 'd' or disabled, equivalent to indisclustered == false. > - a new third state, for an equivalent of work-in-progress, let's call > it 'w', or whatever. > > Then, when we begin a CLUSTER on a partitioned table, I would imagine > the following: > - Switch all the indexes selected to 'w' in a first transaction, and > do not reset the state of other indexes if there is one 'e'. For > CLUSTER without USING, we switch the existing 'e' to 'w' if there is > such an index. If there are no indexes select-able, issue an error. > If we find an index with 'w', we have a candidate from a previous > failed command, so this gets used. I don't think that this design > requires a new DDL either as we could reset all 'w' and 'e' to 'd' if > using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table. For > CLUSTER with USING, the partitioned index selected and its leaves are > switched to 'w', similarly. > - Then do the work for all the partitions, one partition per > transaction, keeping 'w'. > - In a last transaction, switch all the partitions from 'w' to 'e', > resetting any existing 'e'. Honestly, I think you're over-thinking and over-engineering indisclustered. 
If "clusteredness" was something we offered to maintain across DML, I think that might be important to provide stronger guarantees. As it is now, I don't think this patch is worth changing the catalog definition. > ALTER TABLE CLUSTER ON should also be a supported operation, where 'e' > gets switched for all the partitions in a single transaction. Of > course, this should not work for an invalid index. Even with such a > design, I got to wonder if there could be cases where a user does > *not* want to cluster the leaves of a partition tree with the same > index. If that were to happen, the partition tree may need a redesign > so making things work so as we force partitions to inherit what's > wanted for the partitioned table makes the most sense to me. I wondered the same thing: should clustering a parent index cause the child indexes to be marked as clustered? Or should it be possible for an intermediate child to stay (marked as) clustered on some other index. I don't have strong feelings, but the patch currently sets indisclustered as a natural consequence of the implementation, so I tentatively think that's what's right. After all, CLUSTER sets indisclustered without having to also say "ALTER..CLUSTER ON". This is relevant to the question I raised about unsetting indisclustered for each index's *parent* when clustering on a different index. I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition just because a different partitioned index was set clustered. We'd clear that, like always, and then (in my proposal) also clear its parent's "indisclustered". I still don't think that's essential, though.
Do > you mind if I change the contents of the CF app so as the existing > item is renamed for REINDEX, that got committed? Then we could create > a new entry for CIC/DIC (it may make sense to split these two as > well, but I am not if there are any overlaps between the APIs of the > two), and a new entry for CLUSTER, depending on the state of the work. > The original subject of the thread is also unrelated, this makes the > review process unnecessarily complicated, and that's really > confusing. Each discussion should go into its own, independent, > thread. I didn't think it's worth the overhead of closing and opening more CFs. But I don't mind. -- Justin
On Mon, Oct 05, 2020 at 10:07:33PM -0500, Justin Pryzby wrote: > Honestly, I think you're over-thinking and over-engineering indisclustered. > > If "clusteredness" was something we offered to maintain across DML, I think > that might be important to provide stronger guarantees. As it is now, I don't > think this patch is worth changing the catalog definition. Well, this use case is new because we are discussing the relationship of indisclustered across multiple transactions for multiple indexes, so I'd rather have this discussion than not, and I have learnt the hard way with REINDEX that we should care a lot about the consistency of partition trees at any step of the operation. Let's imagine a simple example here, take this partition tree: p (parent), and two partitions p1 and p2. p has two partitioned indexes i and j, indexes also present in p1 and p2 as i1, i2, j1 and j2. Let's assume that the user has done a CLUSTER on p USING i that completes, meaning that i, i1 and i2 have indisclustered set. Now let's assume that the user does a CLUSTER on p USING j this time, and that this command fails while processing p2, meaning that indisclustered is set for j1, i2, and perhaps i or j depending on what the patch does. Per the latest arguments, j would be the one set to indisclustered. From this inconsistent state come a couple of interesting consequences:
- A database-wide CLUSTER would finish by using j1 and i2 for the operation on the partitions, while the intention was to use j2 for the second partition as the previous command failed.
- With CLUSTER p, without USING. Logically, I would assume that we would rely on the value of indisclustered as of j, meaning that we would *enforce* p2 to use j2. But it could also be seen as incorrect by the user because we would not use the index originally marked as such.
So keeping this consistent has the advantage of giving us clear rules here.
> I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition > just because a different partitioned index was set clustered. We'd clear that, > like always, and then (in my proposal) also clear its parent's "indisclustered". > I still don't think that's essential, though. Why? Preventing a partition, which may itself be partitioned, from switching to a different index if its partitioned parent uses something else sounds logical to me in the end, because the user originally intended to use CLUSTER with a specific index on this tree. So I would say that the partitioned table takes priority, and this should be released with a WITHOUT CLUSTER from the partitioned table. > I didn't think it's worth the overhead of closing and opening more CFs. > But I don't mind. Thanks, I'll do some cleanup. -- Michael