Обсуждение: how to create index concurrently on paritioned table

Поиск
Список
Период
Сортировка

how to create index concurrently on paritioned table

От
"李杰(慎追)"
Дата:
Hi hackers,

 Partitioning is necessary for very large tables.
 However, I found that postgresql does not support create index concurrently on partitioned tables.
The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently. 
This is undoubtedly a complex operation for DBA, especially when there are many partitions. 

Therefore, I wonder why pg does not support concurrent index creation on partitioned tables? 
What are the difficulties of this function? 
If I want to implement it, what should I pay attention?

Sincerely look forward to your reply. 

Regards & Thanks Adger

Re: how to create index concurrently on paritioned table

От
Justin Pryzby
Дата:
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> Partitioning is necessary for very large tables.
>  However, I found that postgresql does not support create index concurrently on partitioned tables.
> The document show that we need to create an index on each partition individually and then finally create the
partitionedindex non-concurrently. 
 
> This is undoubtedly a complex operation for DBA, especially when there are many partitions. 

> Therefore, I wonder why pg does not support concurrent index creation on partitioned tables? 
> What are the difficulties of this function? 
> If I want to implement it, what should I pay attention?

Maybe I'm wrong, but I don't think there's any known difficulty - just that
nobody did it yet.  You should pay attention to what happens on error, but
hopefully you wouldn't need to add much code and can rely on existing code to
paths to handle that right.

I think you'd look at the commits and code implementing indexes on partitioned
tables and CREATE INDEX CONCURRENTLY.  And maybe any following commits with
fixes.

You'd first loop around all children (recursively if there are partitions which
are themselves partitioned) and create indexes concurrently.  

-- 
Justin



Re: how to create index concurrently on paritioned table

От
Justin Pryzby
Дата:
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > Partitioning is necessary for very large tables.
> >  However, I found that postgresql does not support create index concurrently on partitioned tables.
> > The document show that we need to create an index on each partition individually and then finally create the
partitionedindex non-concurrently. 
 
> > This is undoubtedly a complex operation for DBA, especially when there are many partitions. 
> 
> > Therefore, I wonder why pg does not support concurrent index creation on partitioned tables? 
> > What are the difficulties of this function? 
> > If I want to implement it, what should I pay attention?
> 
> Maybe I'm wrong, but I don't think there's any known difficulty - just that
> nobody did it yet.

I said that but I was actually thinking about the code for "REINDEX
CONCURRENTLY" (which should also handle partitioned tables).

I looked at CIC now and came up with the attached.  All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand().  There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.

-- 
Justin

Вложения

Re: how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Sun, Jun 07, 2020 at 01:04:48PM -0500, Justin Pryzby wrote:
> On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > > Partitioning is necessary for very large tables.  However, I found that
> > > postgresql does not support create index concurrently on partitioned
> > > tables.  The document show that we need to create an index on each
> > > partition individually and then finally create the partitioned index
> > > non-concurrently.  This is undoubtedly a complex operation for DBA,
> > > especially when there are many partitions. 

I added functionality for C-I-C, REINDEX-CONCURRENTLY, and CLUSTER of
partitioned tables.  We already recursively handle VACUUM and ANALYZE since
v10.

And added here:
https://commitfest.postgresql.org/28/2584/

Adger, if you're familiar with compilation and patching, do you want to try the
patch ?

Note, you could do this now using psql like:
SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
\gexec

-- 
Justin

Вложения

Re: how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Thu, Jun 11, 2020 at 10:35:02AM -0500, Justin Pryzby wrote:
> Note, you could do this now using psql like:
> SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
> \gexec

I have skimmed quickly through the patch set, and something has caught
my attention.

>  drop table idxpart;
> --- Some unsupported features
> +-- CIC on partitioned table
>  create table idxpart (a int, b int, c text) partition by range (a);
>  create table idxpart1 partition of idxpart for values from (0) to (10);
>  create index concurrently on idxpart (a);
> -ERROR:  cannot create index on partitioned table "idxpart" concurrently
> +\d idxpart1

When it comes to test behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
partitions.
- Test the commands on the top-most parent, a member in the middle of
the partition tree, the partition with no leaves, and one leaf, making
sure that relfilenode changes where it should and that partition trees
remain intact (you can use pg_partition_tree() for that.)

That's to say that the amount of regression tests added here is not
sufficient in my opinion.
--
Michael

Вложения

回复:how to create index concurrently on paritioned table

От
"李杰(慎追)"
Дата:

>On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > > Partitioning is necessary for very large tables.
> > >  However, I found that postgresql does not support create index concurrently on partitioned tables.
> > > The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently. 
> > This is undoubtedly a complex operation for DBA, especially when there are many partitions. 
> > 
> > > Therefore, I wonder why pg does not support concurrent index creation on partitioned tables? 
> > > What are the difficulties of this function? 
> > > If I want to implement it, what should I pay attention?
> > 
> > Maybe I'm wrong, but I don't think there's any known difficulty - just that
> > nobody did it yet.

> I said that but I was actually thinking about the code for "REINDEX
> CONCURRENTLY" (which should also handle partitioned tables).

> I looked at CIC now and came up with the attached.  All that's needed to allow
> this case is to close the relation before recursing to partitions - it needs to
> be closed before calling CommitTransactionCommand().  There's probably a better
> way to write this, but I can't see that there's anything complicated about
> handling partitioned tables.

Hi, Justin Pryzby

I'm so sorry about getting back late.
Thank you very much for helping me consider this issue.

I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
I got a coredump that was easy to reproduce. As follows:

#0  PopActiveSnapshot () at snapmgr.c:822

#1  0x00000000005ca687 in DefineIndex (relationId=relationId@entry=16400, 

stmt=stmt@entry=0x1aa5e28, indexRelationId=16408, indexRelationId@entry=0, 

parentIndexId=parentIndexId@entry=16406, 

    parentConstraintId=0, is_alter_table=is_alter_table@entry=false,

check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) 

at indexcmds.c:1426

#2  0x00000000005ca5ab in DefineIndex (relationId=relationId@entry=16384, 

stmt=stmt@entry=0x1b35278, indexRelationId=16406, indexRelationId@entry=0,

parentIndexId=parentIndexId@entry=0, 

    parentConstraintId=parentConstraintId@entry=0, is_alter_table=

is_alter_table@entry=false, check_rights=true, check_not_in_use=true,

skip_build=false, quiet=false) at indexcmds.c:1329

#3  0x000000000076bf80 in ProcessUtilitySlow (pstate=pstate@entry=0x1b350c8, 

pstmt=pstmt@entry=0x1a2bf40, 

    queryString=queryString@entry=0x1a2b2c8 "create index CONCURRENTLY 

idxpart_a_idx on idxpart (a);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, 

params=params@entry=0x0, 

    queryEnv=queryEnv@entry=0x0, qc=0x7ffc86cc7630, dest=0x1a2c200) at 

utility.c:1474

#4  0x000000000076afeb in standard_ProcessUtility (pstmt=0x1a2bf40, 

queryString=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);",

context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 

    queryEnv=0x0, dest=0x1a2c200, qc=0x7ffc86cc7630) at utility.c:1069

#5  0x0000000000768992 in PortalRunUtility (portal=0x1a8d1f8, pstmt=0x1a2bf40, 

isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>,

qc=0x7ffc86cc7630) at pquery.c:1157

#6  0x00000000007693f3 in PortalRunMulti (portal=portal@entry=0x1a8d1f8, 

isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, 

dest=dest@entry=0x1a2c200, 

    altdest=altdest@entry=0x1a2c200, qc=qc@entry=0x7ffc86cc7630) at pquery.c:1310

#7  0x0000000000769ed3 in PortalRun (portal=portal@entry=0x1a8d1f8, count=count

@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once

@entry=true, dest=dest@entry=0x1a2c200, 

    altdest=altdest@entry=0x1a2c200, qc=0x7ffc86cc7630) at pquery.c:779

#8  0x0000000000765b06 in exec_simple_query (query_string=0x1a2b2c8 "create 

index CONCURRENTLY idxpart_a_idx on idxpart (a);") at postgres.c:1239

#9  0x0000000000767de5 in PostgresMain (argc=<optimized out>, argv=argv@entry

=0x1a552c8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4315

#10 0x00000000006f2b23 in BackendRun (port=0x1a4d1e0, port=0x1a4d1e0) at postmaster.c:4523

#11 BackendStartup (port=0x1a4d1e0) at postmaster.c:4215

#12 ServerLoop () at postmaster.c:1727

#13 0x00000000006f3a1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1a25ea0) 

at postmaster.c:1400

#14 0x00000000004857f9 in main (argc=3, argv=0x1a25ea0) at main.c:210



You can re-produce it like this:

```

create table idxpart (a int, b int, c text) partition by range (a);

create table idxpart1 partition of idxpart for values from (0) to (10);

create table idxpart2 partition of idxpart for values from (10) to (20);

create index CONCURRENTLY idxpart_a_idx on idxpart (a);


````
I have been trying to get familiar with the source code of create index.
Can you solve this bug first? I will try my best to implement CIC with you.
Next, I will read your patchs v2-002 and v2-003.

Thank you very much,

 Regards,  Adger




回复:how to create index concurrently on paritioned table

От
"李杰(慎追)"
Дата:
Hi Justin,

> Maybe I'm wrong, but I don't think there's any known difficulty - just that
> nobody did it yet.  You should pay attention to what happens on error, but
> hopefully you wouldn't need to add much code and can rely on existing code to
> paths to handle that right.

yes,  I am trying to learn the code of index definition.

> I think you'd look at the commits and code implementing indexes on partitioned
> tables and CREATE INDEX CONCURRENTLY.  And maybe any following commits with
> fixes.

> You'd first loop around all children (recursively if there are partitions which
> are themselves partitioned) and create indexes concurrently. 

As we all know, CIC has three transactions. If we recursively in n partitioned tables, 
it will become 3N transactions. If an error occurs in these transactions, 
we have too many things to deal...
 
If an error occurs when an index is created in one of the partitions, 
what should we do with our new index?

Thank you very much,

 Regards,  Adger




      



Re: how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
> As we all know, CIC has three transactions. If we recursively in n partitioned tables, 
> it will become 3N transactions. If an error occurs in these transactions, we have too many things to deal...
> 
> If an error occurs when an index is created in one of the partitions, 
> what should we do with our new index?

My (tentative) understanding is that these types of things should use a
"subtransaction" internally..  So if the toplevel transaction rolls back, its
changes are lost.  In some cases, it might be desirable to not roll back, in
which case the user(client) should first create indexes (concurrently if
needed) on every child, and then later create index on parent (that has the
advtantage of working on older servers, too).

postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG:  building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG:  index "t1_i_idx" can safely use deduplication
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR:  canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT:  CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL:  terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT:  CREATE INDEX ON t(i);
Cancel request sent

If the index creation is interrupted at this point, no indexes will exist.

On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
> >On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > I looked at CIC now and came up with the attached.  All that's needed to allow
> > this case is to close the relation before recursing to partitions - it needs to
> > be closed before calling CommitTransactionCommand().  There's probably a better
> > way to write this, but I can't see that there's anything complicated about
> > handling partitioned tables.
> 
> I'm so sorry about getting back late.
> Thank you very much for helping me consider this issue.
> I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
> I got a coredump that was easy to reproduce. As follows:

> I have been trying to get familiar with the source code of create index.
> Can you solve this bug first? I will try my best to implement CIC with you.
> Next, I will read your patchs v2-002 and v2-003.

Thanks, fixed.

On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
> When it comes to test behaviors specific to partitioning, there are in
> my experience three things to be careful about and stress in the tests:
> - Use at least two layers of partitioning.
> - Include into the partition tree a partition that has no leaf
> partitions.
> - Test the commands on the top-most parent, a member in the middle of
> the partition tree, the partition with no leaves, and one leaf, making
> sure that relfilenode changes where it should and that partition trees
> remain intact (you can use pg_partition_tree() for that.)

Added, thanks for looking.

-- 
Justin

Вложения

回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> My (tentative) understanding is that these types of things should use a
> "subtransaction" internally..  So if the toplevel transaction rolls back, its
> changes are lost.  In some cases, it might be desirable to not roll back, in
> which case the user(client) should first create indexes (concurrently if
> needed) on every child, and then later create index on parent (that has the
> advtantage of working on older servers, too).

Hi Justin, 
I have a case here, you see if it meets your expectations.

`````

postgres=# CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);

CREATE TABLE

postgres=# CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (25);

CREATE TABLE

postgres=# CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (25) TO (50);

CREATE TABLE

postgres=# CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (50) TO (75);

CREATE TABLE

postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;

INSERT 0 75

postgres=# insert into prt1 values (26,1,'FM0026');

INSERT 0 1

postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);

ERROR:  could not create unique index "prt1_p2_a_idx"

DETAIL:  Key (a)=(26) is duplicated.

postgres=# \d+ prt1

                                   Partitioned table "public.prt1"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition key: RANGE (a)

Indexes:

    "idexpart_cic" UNIQUE, btree (a) INVALID

Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),

            prt1_p2 FOR VALUES FROM (25) TO (50),

            prt1_p3 FOR VALUES FROM (50) TO (75)


postgres=# \d+ prt1_p1

                                       Table "public.prt1_p1"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (0) TO (25)

Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))

Indexes:

    "prt1_p1_a_idx" UNIQUE, btree (a)

Access method: heap


postgres=# \d+ prt1_p2

                                       Table "public.prt1_p2"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (25) TO (50)

Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))

Indexes:

    "prt1_p2_a_idx" UNIQUE, btree (a) INVALID

Access method: heap


postgres=# \d+ prt1_p3

                                       Table "public.prt1_p3"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (50) TO (75)

Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))

Access method: heap

```````
As shown above, an error occurred while creating an index in the second partition. 
It can be clearly seen that the index of the partitioned table is invalid 
and the index of the first partition is normal, the second partition is invalid, 
and the Third Partition index does not exist at all.

But we do the following tests again:
```

postgres=# truncate table prt1;

TRUNCATE TABLE

postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;

INSERT 0 75

postgres=# insert into prt1 values (51,1,'FM0051');

INSERT 0 1

postgres=# drop index idexpart_cic;

DROP INDEX

postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);

ERROR:  could not create unique index "prt1_p3_a_idx"

DETAIL:  Key (a)=(51) is duplicated.

postgres=# \d+ prt1

                                   Partitioned table "public.prt1"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition key: RANGE (a)

Indexes:

    "idexpart_cic" UNIQUE, btree (a) INVALID

Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),

            prt1_p2 FOR VALUES FROM (25) TO (50),

            prt1_p3 FOR VALUES FROM (50) TO (75)


postgres=# \d+ prt1_p1

                                       Table "public.prt1_p1"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (0) TO (25)

Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))

Indexes:

    "prt1_p1_a_idx" UNIQUE, btree (a)

Access method: heap


postgres=# \d+ prt1_p2

                                       Table "public.prt1_p2"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (25) TO (50)

Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))

Indexes:

    "prt1_p2_a_idx" UNIQUE, btree (a)

Access method: heap


postgres=# \d+ prt1_p3

                                       Table "public.prt1_p3"

 Column |       Type        | Collation | Nullable | Default | Storage  | Stats target | Description 

--------+-------------------+-----------+----------+---------+----------+--------------+-------------

 a      | integer           |           |          |         | plain    |              | 

 b      | integer           |           |          |         | plain    |              | 

 c      | character varying |           |          |         | extended |              | 

Partition of: prt1 FOR VALUES FROM (50) TO (75)

Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))

Access method: heap

```


Now we can see that the first two partitions have indexes, 

but the third partition has no indexes due to an error. 

Therefore, in our first case, it should not be what we expected that the third partition has no index.

That is to say, when our CIC goes wrong, either roll back all or go down instead of stopping in the middle.

This is my shallow opinion, please take it as your reference.


Thank you very much,

 Regards,  Adger



------------------------------------------------------------------
发件人:Justin Pryzby <pryzby@telsasoft.com>
发送时间:2020年6月13日(星期六) 02:15
收件人:Michael Paquier <michael@paquier.xyz>; 李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: how to create index concurrently on partitioned table

On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
> As we all know, CIC has three transactions. If we recursively in n partitioned tables, 
> it will become 3N transactions. If an error occurs in these transactions, we have too many things to deal...

> If an error occurs when an index is created in one of the partitions, 
> what should we do with our new index?

My (tentative) understanding is that these types of things should use a
"subtransaction" internally..  So if the toplevel transaction rolls back, its
changes are lost.  In some cases, it might be desirable to not roll back, in
which case the user(client) should first create indexes (concurrently if
needed) on every child, and then later create index on parent (that has the
advtantage of working on older servers, too).

postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG:  building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG:  index "t1_i_idx" can safely use deduplication
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR:  canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT:  CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL:  terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT:  CREATE INDEX ON t(i);
Cancel request sent

If the index creation is interrupted at this point, no indexes will exist.

On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
> >On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > I looked at CIC now and came up with the attached.  All that's needed to allow
> > this case is to close the relation before recursing to partitions - it needs to
> > be closed before calling CommitTransactionCommand().  There's probably a better
> > way to write this, but I can't see that there's anything complicated about
> > handling partitioned tables.

> I'm so sorry about getting back late.
> Thank you very much for helping me consider this issue.
> I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
> I got a coredump that was easy to reproduce. As follows:

> I have been trying to get familiar with the source code of create index.
> Can you solve this bug first? I will try my best to implement CIC with you.
> Next, I will read your patchs v2-002 and v2-003.

Thanks, fixed.

On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
> When it comes to test behaviors specific to partitioning, there are in
> my experience three things to be careful about and stress in the tests:
> - Use at least two layers of partitioning.
> - Include into the partition tree a partition that has no leaf
> partitions.
> - Test the commands on the top-most parent, a member in the middle of
> the partition tree, the partition with no leaves, and one leaf, making
> sure that relfilenode changes where it should and that partition trees
> remain intact (you can use pg_partition_tree() for that.)

Added, thanks for looking.

-- 
Justin

Re: 回复:how to create indexconcurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating an index in the second partition.
> It can be clearly seen that the index of the partitioned table is invalid
> and the index of the first partition is normal, the second partition is invalid,
> and the Third Partition index does not exist at all.

That's a problem.  I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction.  Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael

Вложения

回复:回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> That's a problem.  I really think that we should make the steps of the
> concurrent operation consistent across all relations, meaning that all
> the indexes should be created as invalid for all the parts of the
> partition tree, including partitioned tables as well as their
> partitions, in the same transaction.  Then a second new transaction
> gets used for the index build, followed by a third one for the
> validation that switches the indexes to become valid.

This is a good idea. 
We should maintain the consistency of the entire partition table.
However, this is not a small change in the code.
May be we need to make a new design for DefineIndex function....

But most importantly, if we use three steps to implement CIC, 
we will spend more time than ordinary tables, especially in high concurrency cases.
To wait for one of partitions which the users to use  frequently, 
the creation of other partition indexes is delayed.
Is it worth doing this?

 Regards,  Adger




------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月15日(星期一) 20:37
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> As shown above, an error occurred while creating an index in the second partition. 
> It can be clearly seen that the index of the partitioned table is invalid 
> and the index of the first partition is normal, the second partition is invalid, 
> and the Third Partition index does not exist at all.

That's a problem.  I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction.  Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael

Re: 回复:回复:how to createindex concurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partition table.
> However, this is not a small change in the code.
> May be we need to make a new design for DefineIndex function....

Indeed.  I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+       /*
+        * CIC needs to mark a partitioned table as VALID, which itself
+        * requires setting READY, which is unset for CIC (even though
+        * it's meaningless for an index without storage).
+        */
+       if (concurrent)
+       {
+           PopActiveSnapshot();
+           CommitTransactionCommand();
+           StartTransactionCommand();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+           CommandCounterIncrement();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
        }

I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees.  That's the point you
raised.  This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC,
> we will spend more time than ordinary tables, especially in high
> concurrency cases.  To wait for one of partitions which the users to
> use  frequently, the creation of other partition indexes is delayed.
> Is it worth doing this?

CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time.  Now it
is true that it may not scale well so we should be careful with the
approach taken.  Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael

Вложения

回复:回复:回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> I am afraid that this makes the error handling more complicated, with
> risks of having inconsistent partition trees.  That's the point you
> raised.  This one is going to need more thoughts.
> CIC is an operation that exists while allowing read and writes to
> still happen in parallel, so that's fine IMO if it takes time.  Now it
> is true that it may not scale well so we should be careful with the
> approach taken.  Also, I think that the case of REINDEX would require
> less work overall because we already have some code in place to gather
> multiple indexes from one or more relations and work on these
> consistently, all at once.


I'm with you on that. 
(Scheme 1)
We can refer to the implementation in the ReindexRelationConcurrently,
 in the three phases of the REINDEX CONCURRENTLY,
all indexes of the partitions are operated one by one in each phase.
In this way, we can maintain the consistency of the entire partitioned table index.
After we implement CIC in this way, we can also complete reindex partitioned table index concurrently (this is not done now.)

Looking back, let's talk about our original schema.
(Scheme 2)
If CIC is performed one by one on each partition, 
how can we make it on subsequent partitions continue when an error occurs in the second partition?
 If this problem can be solved, It's not that I can't accept it.
Because a partition CIC error is acceptable, you can reindex it later.
Pseudo indexes on partitioned tables are useless for real queries, 
but the indexes on partitions are really useful.

Scheme 1, more elegant, can also implement reindex concurrently on partitioned table, 
but the implementation is more complex.
Scheme 2: simple implementation, but not so friendly.

Hi Jsutin,
Which scheme do you think is more helpful to realize CIC?





------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月16日(星期二) 09:02
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:how to create index concurrently on partitioned table

On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
> This is a good idea.
> We should maintain the consistency of the entire partition table.
> However, this is not a small change in the code.
> May be we need to make a new design for DefineIndex function....

Indeed.  I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+       /*
+        * CIC needs to mark a partitioned table as VALID, which itself
+        * requires setting READY, which is unset for CIC (even though
+        * it's meaningless for an index without storage).
+        */
+       if (concurrent)
+       {
+           PopActiveSnapshot();
+           CommitTransactionCommand();
+           StartTransactionCommand();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+           CommandCounterIncrement();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
        }

I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees.  That's the point you
raised.  This one is going to need more thoughts.

> But most importantly, if we use three steps to implement CIC, 
> we will spend more time than ordinary tables, especially in high
> concurrency cases.  To wait for one of partitions which the users to
> use  frequently, the creation of other partition indexes is delayed. 
> Is it worth doing this?

CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time.  Now it
is true that it may not scale well so we should be careful with the
approach taken.  Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael

回复:回复:回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> We can refer to the implementation in the ReindexRelationConcurrently,
> in the three phases of the REINDEX CONCURRENTLY,
> all indexes of the partitions are operated one by one in each phase.
> In this way, we can maintain the consistency of the entire partitioned table index.
> After we implement CIC in this way, we can also complete reindex partitioned table index concurrently (this is not done now.)

 After careful analysis, I found that there were two choices that embarrassed me.
 Although we can handle the entire partition tree with one transaction 
 in each of the three phases of CIC, just like ordinary tables.

However, I found a problem. If there are many partitions, 
we may need to handle too many missing index entries when validate_index() .
Especially for the first partition, the time may have been long and many entries are missing.
In this case, why don't we put the second and third phase together into a transaction for each partition?

So, which schema do you think is better?
Choose to maintain consistency in all three phases, 
or just maintain consistency in the first phase?

Thank you very much,

 Regards,  Adger




Re: 回复:回复:回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions,
> we may need to handle too many missing index entries when
> validate_index().  Especially for the first partition, the time may
> have been long and many entries are missing.  In this case, why
> don't we put the second and third phase together into a transaction
> for each partition?

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael

Вложения

回复:回复:回复:回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> Not sure I am following.  In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree.  See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create().  It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Let's look at this example:  
A partition table has five partitions,
    parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
     phase 1:  index_create(),   it is  only registered in catalogs.
    phase 2: index_concurrently_build(), Build the indexes.
    phase 3: validate_index(), insert any missing index entries, mark the index as valid.

(schema 1)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two
parttable  phase 2
part 1   phase 2  
part 2   phase 2  
part 3   phase 2  (error occurred )
part 4   phase 2  
part 5   phase 2  
CommitTransaction

StartTransaction three
parttable  phase 3
part 1   phase 3
part 2   phase 3 
part 3   phase 3  
part 4   phase 4  
part 5   phase 5  
CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two  roll back, Transaction three haven't started.
All of our indexes are invalid.  In this way, 
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.

(schema 2)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two 
part 1   phase 2  
part 1   phase 3
CommitTransaction

StartTransaction three 
part 2   phase 2  
part 2   phase 3
CommitTransaction

StartTransaction four
part 3   phase 2  (error  occurred )
part 3   phase 3
CommitTransaction

StartTransaction five 
part 4   phase 2  
part 4   phase 3

StartTransaction six
part 5   phase 2  
part 5   phase 3
CommitTransaction

StartTransaction seven
parttable   phase 2  
parttable  phase 3
CommitTransaction

```

Now, the following steps cannot continue due to an error  in Transaction four .
so, Transaction four roll back, Transactions behind Transaction 3 have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2.
This seems better, although it seems to be inconsistent.

Do you think that scheme is more suitable for CIC?


Thank you very much,

 Regards,  Adger








------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月18日(星期四) 10:41
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:回复:how to create index concurrently on partitioned table

On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions, 
> we may need to handle too many missing index entries when
> validate_index().  Especially for the first partition, the time may
> have been long and many entries are missing.  In this case, why
> don't we put the second and third phase together into a transaction
> for each partition? 

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael

回复:回复:回复:回复:how to create index concurrently on partitioned table

От
"李杰(慎追)"
Дата:
> Not sure I am following.  In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree.  See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create().  It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Some errors in the last email were not clearly expressed.

Let's look at this example:  
A partition table has five partitions,
    parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
     phase 1:  index_create(),   it is  only registered in catalogs.
    phase 2: index_concurrently_build(), Build the indexes.
    phase 3: validate_index(), insert any missing index entries, mark the index as valid.

(scheme 1)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two
parttable  phase 2
part 1   phase 2  
part 2   phase 2  
part 3   phase 2  (error occurred )
part 4   phase 2  
part 5   phase 2  
CommitTransaction

StartTransaction three
parttable  phase 3
part 1   phase 3
part 2   phase 3 
part 3   phase 3  
part 4   phase 4  
part 5   phase 5  
CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two  roll back, Transaction three haven't started.
All of our indexes are invalid.  In this way, 
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.

(scheme 2)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two 
part 1   phase 2  
CommitTransaction
StartTransaction three
part 1   phase 3
CommitTransaction

StartTransaction four
part 2   phase 2  
CommitTransaction
StartTransaction five
part 2   phase 3
CommitTransaction

StartTransaction six
part 3   phase 2  (error  occurred )
CommitTransaction
StartTransaction seven
part 3   phase 3
CommitTransaction

StartTransaction xx
part 4   phase 2  
CommitTransaction
StartTransaction xx
part 4   phase 3
CommitTransaction

StartTransaction xx
part 5   phase 2  
CommitTransaction
StartTransaction xx
part 5   phase 3
CommitTransaction

StartTransaction xx
parttable   phase 2  
CommitTransaction
StartTransaction xx
parttable  phase 3
CommitTransaction

```

Now, the following steps cannot continue due to an error  in Transaction six .
so, Transaction six roll back, Transactions behind Transaction six have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2. 
Especially when there are many partitions, 
this can save the rebuild of some partition indexes,
This seems better, although it seems to be inconsistent.

Do you think that scheme is more suitable for CIC?


Thank you very much,

 Regards,  Adger



------------------------------------------------------------------
发件人:李杰(慎追) <adger.lj@alibaba-inc.com>
发送时间:2020年6月18日(星期四) 14:37
收件人:Michael Paquier <michael@paquier.xyz>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:回复:回复:回复:回复:how to create index concurrently on partitioned table

> Not sure I am following.  In the case of REINDEX, it seems to me that
> the calls to validate_index() and index_concurrently_build() can
> happen in a separate transaction for each index, as long as all the
> calls to index_concurrently_swap() are grouped together in the same
> transaction to make sure that index partition trees are switched
> consistently when all entries are swapped from an invalid state to a
> valid state, because the swapping phase is also when we attach a fresh
> index to a partition tree.  See also index_concurrently_create_copy()
> where we don't set parentIndexRelid for the lower call to
> index_create().  It would be good of course to check that when
> swapping we have the code to handle that for a lot of indexes at
> once.

Let's look at this example:  
A partition table has five partitions,
    parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
     phase 1:  index_create(),   it is  only registered in catalogs.
    phase 2: index_concurrently_build(), Build the indexes.
    phase 3: validate_index(), insert any missing index entries, mark the index as valid.

(schema 1)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two
parttable  phase 2
part 1   phase 2  
part 2   phase 2  
part 3   phase 2  (error occurred )
part 4   phase 2  
part 5   phase 2  
CommitTransaction

StartTransaction three
parttable  phase 3
part 1   phase 3
part 2   phase 3 
part 3   phase 3  
part 4   phase 4  
part 5   phase 5  
CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two  roll back, Transaction three haven't started.
All of our indexes are invalid.  In this way, 
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.

(schema 2)
```
StartTransaction  one
parttable  phase 1
part 1       phase 1
part 2       phase 1
part 3       phase 1
part 4       phase 1
part 5       phase 1
CommitTransaction

StartTransaction two 
part 1   phase 2  
part 1   phase 3
CommitTransaction

StartTransaction three 
part 2   phase 2  
part 2   phase 3
CommitTransaction

StartTransaction four
part 3   phase 2  (error  occurred )
part 3   phase 3
CommitTransaction

StartTransaction five 
part 4   phase 2  
part 4   phase 3

StartTransaction six
part 5   phase 2  
part 5   phase 3
CommitTransaction

StartTransaction seven
parttable   phase 2  
parttable  phase 3
CommitTransaction

```

Now, the following steps cannot continue due to an error  in Transaction four .
so, Transaction four roll back, Transactions behind Transaction 3 have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2.
This seems better, although it seems to be inconsistent.

Do you think that scheme is more suitable for CIC?


Thank you very much,

 Regards,  Adger








------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月18日(星期四) 10:41
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:回复:how to create index concurrently on partitioned table

On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
> However, I found a problem. If there are many partitions, 
> we may need to handle too many missing index entries when
> validate_index().  Especially for the first partition, the time may
> have been long and many entries are missing.  In this case, why
> don't we put the second and third phase together into a transaction
> for each partition? 

Not sure I am following.  In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree.  See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create().  It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier wrote:
> On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
> > As shown above, an error occurred while creating an index in the second partition. 
> > It can be clearly seen that the index of the partitioned table is invalid 
> > and the index of the first partition is normal, the second partition is invalid, 
> > and the Third Partition index does not exist at all.
> 
> That's a problem.  I really think that we should make the steps of the
> concurrent operation consistent across all relations, meaning that all
> the indexes should be created as invalid for all the parts of the
> partition tree, including partitioned tables as well as their
> partitions, in the same transaction.  Then a second new transaction
> gets used for the index build, followed by a third one for the
> validation that switches the indexes to become valid.

Note that the mentioned problem wasn't serious: there was missing index on
child table, therefor the parent index was invalid, as intended.  However I
agree that it's not nice that the command can fail so easily and leave behind
some indexes created successfully and some failed some not created at all.

But I took your advice initially creating invalid inds.

On Tue, Jun 16, 2020 at 10:02:21AM +0900, Michael Paquier wrote:
> CIC is an operation that exists while allowing read and writes to
> still happen in parallel, so that's fine IMO if it takes time.  Now it
> is true that it may not scale well so we should be careful with the
> approach taken.  Also, I think that the case of REINDEX would require
> less work overall because we already have some code in place to gather
> multiple indexes from one or more relations and work on these
> consistently, all at once.

I'm not sure if by reindex you mean my old 0002.  That's now 0001, so if it can
be simplified somehow, that's great..

That gave me the idea to layer CIC on top of Reindex, since I think it does
exactly what's needed.

-- 
Justin

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote:
> That gave me the idea to layer CIC on top of Reindex, since I think it does
> exactly what's needed.

For now, I would recommend to focus first on 0001 to add support for
partitioned tables and indexes to REINDEX.  CIC is much more
complicated btw, but I am not entering in the details now.

-   /*
-    * This may be useful when implemented someday; but that day is not today.
-    * For now, avoid erroring out when called in a multi-table context
-    * (REINDEX SCHEMA) and happen to come across a partitioned table.  The
-    * partitions may be reindexed on their own anyway.
-    */
+   /* Avoid erroring out */
    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
This comment does not help, and actually this becomes incorrect as
reindex for this relkind becomes supported once 0001 is done.

+       case RELKIND_INDEX:
+           reindex_index(inhrelid, false, get_rel_persistence(inhrelid),
+                         options | REINDEXOPT_REPORT_PROGRESS);
+           break;
+       case RELKIND_RELATION:
+           (void) reindex_relation(inhrelid,
+                                     REINDEX_REL_PROCESS_TOAST |
+                                     REINDEX_REL_CHECK_CONSTRAINTS,
+                                     options | REINDEXOPT_REPORT_PROGRESS);
ReindexPartitionedRel() fails to consider the concurrent case here for
partition indexes and tables, as reindex_index()/reindex_relation()
are the APIs used in the non-concurrent case.  Once you consider the
concurrent case correctly, we also need to be careful with partitions
that have a temporary persistency (note that we don't allow partition
trees to mix persistency types, all partitions have to be temporary or
permanent).

I think that you are right to make the entry point to handle
partitioned index in ReindexIndex() and partitioned table in
ReindexTable(), but the structure of the patch should be different:
- The second portion of ReindexMultipleTables() should be moved into a
separate routine, taking in input a list of relation OIDs.  This needs
to be extended a bit so as reindex_index() gets called for an index
relkind if the relpersistence is temporary or if we have a
non-concurrent reindex.  The idea is that we finish with a single code
path able to work on a list of relations.  And your patch adds more of
that as of ReindexPartitionedRel().
- We should *not* handle directly partitioned index and/or table in
ReindexRelationConcurrently() to not complicate the logic where we
gather all the indexes of a table/matview.  So I think that the list
of partition indexes/tables to work on should be built directly in
ReindexIndex() and ReindexTable(), and then this should call the
second part of ReindexMultipleTables() refactored in the previous
point.  This way, each partition index gets done individually in its
own transaction.  For a partition table, all indexes of this partition
are rebuilt in the same set of transactions.  For the concurrent case,
we have already reindex_concurrently_swap that it able to switch the
dependencies of two indexes within a partition tree, so we can rely on
that so as a failure in the middle of the operation never leaves the
a partition structure in an inconsistent state.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
Thanks for looking.

On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:
> > exactly what's needed.
> 
> For now, I would recommend to focus first on 0001 to add support for
> partitioned tables and indexes to REINDEX.  CIC is much more
> complicated btw, but I am not entering in the details now.
> 
> +   /* Avoid erroring out */
>     if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>     {
> This comment does not help, and actually this becomes incorrect as
> reindex for this relkind becomes supported once 0001 is done.

I made a minimal change to avoid forgetting to eventually change that part.

> ReindexPartitionedRel() fails to consider the concurrent case here for
> partition indexes and tables, as reindex_index()/reindex_relation()
> are the APIs used in the non-concurrent case.  Once you consider the
> concurrent case correctly, we also need to be careful with partitions
> that have a temporary persistency (note that we don't allow partition
> trees to mix persistency types, all partitions have to be temporary or
> permanent).

Fixed.

> I think that you are right to make the entry point to handle
> partitioned index in ReindexIndex() and partitioned table in
> ReindexTable(), but the structure of the patch should be different:
> - The second portion of ReindexMultipleTables() should be moved into a
> separate routine, taking in input a list of relation OIDs.  This needs
> to be extended a bit so as reindex_index() gets called for an index
> relkind if the relpersistence is temporary or if we have a
> non-concurrent reindex.  The idea is that we finish with a single code
> path able to work on a list of relations.  And your patch adds more of
> that as of ReindexPartitionedRel().

It's a good idea.

> - We should *not* handle directly partitioned index and/or table in
> ReindexRelationConcurrently() to not complicate the logic where we
> gather all the indexes of a table/matview.  So I think that the list
> of partition indexes/tables to work on should be built directly in
> ReindexIndex() and ReindexTable(), and then this should call the
> second part of ReindexMultipleTables() refactored in the previous
> point.

I think I addressed these mostly as you intended.

> This way, each partition index gets done individually in its
> own transaction.  For a partition table, all indexes of this partition
> are rebuilt in the same set of transactions.  For the concurrent case,
> we have already reindex_concurrently_swap that it able to switch the
> dependencies of two indexes within a partition tree, so we can rely on
> that so as a failure in the middle of the operation never leaves the
> a partition structure in an inconsistent state.

-- 
Justin

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:
> On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:
>> For now, I would recommend to focus first on 0001 to add support for
>> partitioned tables and indexes to REINDEX.  CIC is much more
>> complicated btw, but I am not entering in the details now.
>>
>> +   /* Avoid erroring out */
>>     if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>>     {
>> This comment does not help, and actually this becomes incorrect as
>> reindex for this relkind becomes supported once 0001 is done.
>
> I made a minimal change to avoid forgetting to eventually change
> that part.

Why not changing it then?  We already filter out per relkind in all
the code paths calling reindex_relation(), be it in indexcmds.c for
schema-level reindex or even tablecmds.c, so I have switched this part
to an elog().

>> - We should *not* handle directly partitioned index and/or table in
>> ReindexRelationConcurrently() to not complicate the logic where we
>> gather all the indexes of a table/matview.  So I think that the list
>> of partition indexes/tables to work on should be built directly in
>> ReindexIndex() and ReindexTable(), and then this should call the
>> second part of ReindexMultipleTables() refactored in the previous
>> point.
>
> I think I addressed these mostly as you intended.

Mostly.  I have been hacking on this patch, and basically rewrote it
as the attached.  The handling of the memory context used to keep the
list of partitions intact across transactions was rather clunky: the
context was not reset when we are done, and we would call more APIs
than necessary while switching to it, like find_all_inheritors() which
could do much more allocations.  I have fixed that by minimizing the
areas where the private context is used, switching to it only when
saving a new OID in the list of partitions, or a session lock (see
below for this part).

While on it, I found that the test coverage was not enough, so I have
extended the set of tests to make sure any concurrent and
non-concurrent operation for partitioned tables and indexes change the
correct set of relfilenodes for each operation.  I have written some
custom functions to minimize the duplication (the whole thing cannot
be grouped as those commands cannot run in a transaction block).

Speaking of which, the patch missed that REINDEX INDEX/TABLE should
not run in a transaction block when working on a partitioned
relation.  And the documentation needs to be clear about the
limitation of each operation, so I have written more about all that.
The patch also has commented out areas with slashes or such, and I
have added some elog() and some asserts to make sure that we don't
cross any areas that should not work with partitioned relations.

While hacking on this patch, I have found an old bug in the REINDEX
logic: we build a list of relations to reindex in
ReindexMultipleTables() for schema and database reindexes, but it
happens that we don't recheck if the relations listed actually exists
or not, so dropping a relation during a large reindex can cause
sparse failures because of relations that cannot be found anymore.  In
the case of this thread, the problem is different though (the proposed
patch was full of holes regarding that) and we need to use session
locks on the parent *table* partitions (not the indexes) to avoid any
issues within the first transaction building the list of relations to
work on, similarly to REINDEX CONCURRENTLY.  So I fixed this problem
this way. For the schema and database cases, I think that we would
need to do something similar to VACUUM, aka have an extra code path to
skip relations not defined.  I'll leave that for another thread.

One last thing.  I think that the patch is in a rather good shape, but
there is one error message I am not happy with when running some
commands in a transaction block.  Say, this sequence:
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE INDEX parent_index ON parent_tab (id);
BEGIN;
REINDEX INDEX parent_index; -- error
ERROR:  25001: REINDEX INDEX cannot run inside a transaction block
LOCATION:  PreventInTransactionBlock, xact.c:3386

This error can be confusing, because we don't tell directly that the
relation involved here is partitioned, and REINDEX INDEX/TABLE are
fine when doing their stuff on non-partitions.  For other code paths,
we have leveraged such errors to use the grammar specific to
partitions, for example "CREATE TABLE .. PARTITION OF" or such as
these don't cause translation issues, but we don't have a specific
syntax of REINDEX for partitioned relations, and I don't think that we
need more grammar just for that.  The simplest idea I have here is to
just use an error callback to set an errcontext(), saying roughly:
"while reindexing partitioned table/index %s" while we go through
PreventInTransactionBlock().  I have done nothing about that yet but
adding an errcallback is simple enough.  Perhaps somebody has a
different idea here?
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
Thanks for helping with this.

On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:
> +++ b/src/backend/catalog/index.c
> @@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
> +        elog(ERROR, "unsupported relation kind for relation \"%s\"",
> +             RelationGetRelationName(rel));

I guess it should show the relkind(%c) in the message, like these:

src/backend/commands/tablecmds.c:                       elog(ERROR, "unexpected relkind: %d", (int) relkind);
src/backend/tcop/utility.c:                                                             elog(ERROR, "unexpected relkind
\"%c\"on partition \"%s\"",
 

ISTM reindex_index is missing that, too:

8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
+       if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+               elog(ERROR, "unsupported relation kind for index \"%s\"",
+                        RelationGetRelationName(iRel));


> diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
> @@ -259,8 +263,12 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { IN
>    </para>
>  
>    <para>
> -   Reindexing partitioned tables or partitioned indexes is not supported.
> -   Each individual partition can be reindexed separately instead.
> +   Reindexing partitioned indexes or partitioned tables is supported
> +   with respectively <command>REINDEX INDEX</command> or
> +   <command>REINDEX TABLE</command>.

Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".

> + Each partition of the partitioned
> +   relation defined is rebuilt in its own transaction.

=> Each partition of the specified partitioned relation is reindexed in a
separate transaction.

-- 
Justin



Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Wed, Aug 12, 2020 at 12:28:20AM -0500, Justin Pryzby wrote:
> On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:
>> +++ b/src/backend/catalog/index.c
>> @@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
>> +        elog(ERROR, "unsupported relation kind for relation \"%s\"",
>> +             RelationGetRelationName(rel));
>
> I guess it should show the relkind(%c) in the message, like these:

Sure, but I don't see much the point in adding the relkind here
knowing that we know which one it is.

> ISTM reindex_index is missing that, too:
>
> 8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
> +       if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
> +               elog(ERROR, "unsupported relation kind for index \"%s\"",
> +                        RelationGetRelationName(iRel));

The error string does not follow the usual project style either, so I
have updated both.

>>    <para>
>> -   Reindexing partitioned tables or partitioned indexes is not supported.
>> -   Each individual partition can be reindexed separately instead.
>> +   Reindexing partitioned indexes or partitioned tables is supported
>> +   with respectively <command>REINDEX INDEX</command> or
>> +   <command>REINDEX TABLE</command>.
>
> Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".
>
>> + Each partition of the partitioned
>> +   relation defined is rebuilt in its own transaction.
>
> => Each partition of the specified partitioned relation is reindexed in a
> separate transaction.

Thanks, good idea.

I have been able to work more on this patch today, and finally added
an error context for the transaction block check, as that's cleaner.
In my manual tests, I have also bumped into a case that failed with
the original patch (where there were no session locks), and created
an isolation test based on it: drop of a partition leaf concurrently
to REINDEX done on the parent.  One last thing I have spotted is that
we failed to discard properly foreign tables defined as leaves of a
partition tree, causing a reindex to fail, so reindex_partitions()
ought to just use RELKIND_HAS_STORAGE() to do its filtering work.  I
am leaving this patch alone for a couple of days now, and I'll try to
come back to it after and potentially commit it.  The attached has
been indented by the way.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Wed, Aug 12, 2020 at 10:37:08PM +0900, Michael Paquier wrote:
> I have been able to work more on this patch today, and finally added
> an error context for the transaction block check, as that's cleaner.
> In my manual tests, I have also bumped into a case that failed with
> the original patch (where there were no session locks), and created
> an isolation test based on it: drop of a partition leaf concurrently
> to REINDEX done on the parent.  One last thing I have spotted is that
> we failed to discard properly foreign tables defined as leaves of a
> partition tree, causing a reindex to fail, so reindex_partitions()
> ought to just use RELKIND_HAS_STORAGE() to do its filtering work.  I
> am leaving this patch alone for a couple of days now, and I'll try to
> come back to it after and potentially commit it.  The attached has
> been indented by the way.

I got to think more about the locking strategy used in this patch, and
I am afraid that we should fix the bug with REINDEX SCHEMA/DATABASE
first.  What we have here is rather similar to a REINDEX SCHEMA with
all the partitions on the same schema, so it would be better to apply
the same set of assumptions and logic for the reindex of partitioned
relations as we do for the others.  This makes the whole logic more
consistent, but it also reduces the surface of bugs.  I have created a
separate thread for the problem, and posted a patch:
https://www.postgresql.org/message-id/20200813043805.GE11663@paquier.xyz

Once this gets done, we should then be able to get rid of the extra
session locking taken when building the list of partitions, limiting
session locks to only be taken during the concurrent reindex of a
single partition (the table itself for a partition table, and the
parent table for a partition index), making the whole operation less
invasive.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:
> Thanks for looking.

The REINDEX patch is progressing its way, so I have looked a bit at
the part for CIC.

Visibly, the case of multiple partition layers is not handled
correctly.  Here is a sequence that gets broken:
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE TABLE child_0_10 PARTITION OF parent_tab
     FOR VALUES FROM (0) TO (10);
CREATE TABLE child_10_20 PARTITION OF parent_tab
     FOR VALUES FROM (10) TO (20);
CREATE TABLE child_20_30 PARTITION OF parent_tab
     FOR VALUES FROM (20) TO (30);
INSERT INTO parent_tab VALUES (generate_series(0,29));
CREATE TABLE child_30_40 PARTITION OF parent_tab
     FOR VALUES FROM (30) TO (40)
     PARTITION BY RANGE(id);
CREATE TABLE child_30_35 PARTITION OF child_30_40
     FOR VALUES FROM (30) TO (35);
CREATE TABLE child_35_40 PARTITION OF child_30_40
     FOR VALUES FROM (35) TO (40);
INSERT INTO parent_tab VALUES (generate_series(30,39));
CREATE INDEX CONCURRENTLY parent_index ON parent_tab (id);

This fails as follows:
ERROR: XX000: unrecognized node type: 2139062143
LOCATION:  copyObjectImpl, copyfuncs.c:5718

And the problem visibly comes from some handling with the second level
of partitions, child_30_40 in my example above.  Even with that, this
outlines a rather severe problem in the patch: index_set_state_flags()
flips indisvalid on/off using a non-transactional update (see the use
of heap_inplace_update), meaning that if we fail in the middle of the
operation we may finish with a partition index tree where some of the
indexes are valid and some of them are invalid.  In order to make this
logic consistent, I am afraid that we will need to do two things:
- Change index_set_state_flags() so as it uses a transactional
update.  That's something I would like to change for other reasons,
like making sure that the REPLICA IDENTITY of a parent relation is
correctly reset when dropping a replica index.
- Make all the indexes of the partition tree valid in the *same*
sub-transaction.
You can note that this case is different than a concurrent REINDEX,
because in this case we just do an in-place change between the old and
new index, meaning that even if there is a failure happening while
processing, we may have some invalid indexes, but there are still
valid indexes attached to the partition tree, at any time.

+     MemoryContext   oldcontext = MemoryContextSwitchTo(ind_context);
      PartitionDesc partdesc = RelationGetPartitionDesc(rel);
      int         nparts = partdesc->nparts;
+     char        *relname = pstrdup(RelationGetRelationName(rel));
Er, no.  We should not play with the relation cache calls in a private
memory context.  I think that this needs much more thoughts.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Fri, Aug 14, 2020 at 09:29:45AM +0900, Michael Paquier wrote:
> Once this gets done, we should then be able to get rid of the extra
> session locking taken when building the list of partitions, limiting
> session locks to only be taken during the concurrent reindex of a
> single partition (the table itself for a partition table, and the
> parent table for a partition index), making the whole operation less
> invasive.

The problem with dropped relations in REINDEX has been addressed by
1d65416, so I have gone through this patch again and simplified the
use of session locks, these being taken only when doing a REINDEX
CONCURRENTLY for a given partition.  This part is in a rather
committable shape IMO, so I would like to get it done first, before
looking more at the other cases with CIC and CLUSTER.  I am still
planning to go through it once again.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Anastasia Lubennikova
Дата:
On 02.09.2020 04:39, Michael Paquier wrote:

The problem with dropped relations in REINDEX has been addressed by
1d65416, so I have gone through this patch again and simplified the
use of session locks, these being taken only when doing a REINDEX
CONCURRENTLY for a given partition.  This part is in a rather
committable shape IMO, so I would like to get it done first, before
looking more at the other cases with CIC and CLUSTER.  I am still
planning to go through it once again.
--
Michael

Thank you for advancing this work.

I was reviewing the previous version, but the review became outdated before I sent it. Overall design is fine, but I see a bunch of things that need to be fixed before commit.

First of all, this patch fails at cfbot:

indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used [-Werror=unused-but-set-variable]
Oid parentoid; ^


It seems to be just a typo. With this minor fix the patch compiles and passes tests.

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 75008eebde..f5b3c53a83 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2864,7 +2864,7 @@ reindex_partitions(Oid relid, int options, bool concurrent,                 /* Save partition OID */                old_context = MemoryContextSwitchTo(reindex_context);
-               partitions = lappend_oid(partitions, partoid);
+               partitions = lappend_oid(partitions, parentoid);                MemoryContextSwitchTo(old_context);        }

If this guessed fix is correct, I see the problem in the patch logic. In reindex_partitions() we collect parent relations to pass them to reindex_multiple_internal(). It implicitly changes the logic from REINDEX INDEX to REINDEX RELATION, which is not the same, if table has more than one index. For example, if I add one more index to a partitioned table from a create_index.sql test:
create index idx on concur_reindex_part_0_2 using btree (c2);

and call

REINDEX INDEX CONCURRENTLY concur_reindex_part_index;
idx will be reindexed as well. I doubt that it is the desired behavior.


A few more random issues, that I noticed at first glance:

1) in create_index.sql

Are these two lines intentional checks that must fail? If so, I propose to add a comment.

REINDEX TABLE concur_reindex_part_index;
REINDEX TABLE CONCURRENTLY concur_reindex_part_index;

A few lines around also look like they were copy-pasted and need a second look.

2)  This part of ReindexRelationConcurrently() is misleading.

        case RELKIND_PARTITIONED_TABLE:
        case RELKIND_PARTITIONED_INDEX:
        default:
            /* Return error if type of relation is not supported */
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot reindex this type of relation concurrently")));


Maybe assertion is enough. It seems, that we should never reach this code because both ReindexTable and ReindexIndex handle those relkinds separately.  Which leads me to the next question.

3) Is there a reason, why reindex_partitions() is not inside ReindexRelationConcurrently() ? I think that logically it belongs there.

4) I haven't tested multi-level partitions yet. In any case, it would be good to document this behavior explicitly.


I need a bit more time to review this patch more thoroughly. Please, wait for it, before committing.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Thu, Sep 03, 2020 at 10:02:58PM +0300, Anastasia Lubennikova wrote:
> First of all, this patch fails at cfbot:
>
> indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used
> [-Werror=unused-but-set-variable]
> Oid parentoid;^

Missed this warning, thanks for pointing it out.  This is an incorrect
rebase: this variable was used as a workaround to take a session lock
on the parent table to be reindexed, session lock logic existing to
prevent cache lookup errors when dropping some portions of the tree
concurrently (1d65416 as you already know).  But we don't need that
anymore.

> If this guessed fix is correct, I see the problem in the patch logic. In
> reindex_partitions() we collect parent relations to pass them to
> reindex_multiple_internal(). It implicitly changes the logic from REINDEX
> INDEX to REINDEX RELATION, which is not the same, if table has more than one
> index.

Incorrect guess here.  parentoid refers to the parent table of an
index, so by saving its OID in the list of things to reindex, you
would finish by reindexing all the indexes of a partition.  We need to
use partoid, as returned by find_all_inheritors() for all the members
of the partition tree (relid can be a partitioned index or partitioned
table in reindex_partitions).

> 1) in create_index.sql
>
> Are these two lines intentional checks that must fail? If so, I propose to
> add a comment.
>
> REINDEX TABLE concur_reindex_part_index;
> REINDEX TABLE CONCURRENTLY concur_reindex_part_index;
>
> A few lines around also look like they were copy-pasted and need a second
> look.

You can note some slight differences though.  These are test cases for
REINDEX INDEX with tables, and REINDEX TABLE with indexes.  What you
are quoting here is the part for indexes with REINDEX TABLE.  And
there are already comments, see:
"-- REINDEX INDEX fails for partitioned tables"
"-- REINDEX TABLE fails for partitioned indexes"

Perhaps that's not enough, so I have added some more comments to
outline that these commands fail (8 commands in total).

> 2)  This part of ReindexRelationConcurrently() is misleading.
>
> Maybe assertion is enough. It seems, that we should never reach this code
> because both ReindexTable and ReindexIndex handle those relkinds
> separately.  Which leads me to the next question.

Yes, we could use an assert, but I did not feel any strong need to
change that either for this patch.

> 3) Is there a reason, why reindex_partitions() is not inside
> ReindexRelationConcurrently() ? I think that logically it belongs there.

Yes, it simplifies the build of the index list indexIds, as there is
no need to loop back into a different routine if working on a table or
a matview.

> 4) I haven't tested multi-level partitions yet. In any case, it would be
> good to document this behavior explicitly.

Not sure what addition we could do here.  The patch states that each
partition of the partitioned relation defined gets reindexed, which
implies that this handles multiple layers automatically.

> I need a bit more time to review this patch more thoroughly. Please, wait
> for it, before committing.

Glad to hear that, please take the time you need.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote:
> Glad to hear that, please take the time you need.

Attached is a rebased version to address the various conflicts after
844c05a.
--
Michael


Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Anastasia Lubennikova
Дата:
On 07.09.2020 04:58, Michael Paquier wrote:
> On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote:
>> Glad to hear that, please take the time you need.
> Attached is a rebased version to address the various conflicts after
> 844c05a.
> --
> Michael
>
Thank you.

With the fix for the cycle in reindex_partitions() and new comments 
added, I think this patch is ready for commit.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
Thanks for completing and pushing the REINDEX patch and others.

Here's a rebasified + fixed version of the others.

On Tue, Sep 01, 2020 at 02:51:58PM +0900, Michael Paquier wrote:
> The REINDEX patch is progressing its way, so I have looked a bit at
> the part for CIC.
> 
> Visibly, the case of multiple partition layers is not handled
> correctly.  Here is a sequence that gets broken:
..
> This fails as follows:
> ERROR: XX000: unrecognized node type: 2139062143
> LOCATION:  copyObjectImpl, copyfuncs.c:5718

Because copyObject needed to be called within a longlived context.

Also, my previous revision failed to implement your suggestion to first build
catalog entries with INVALID indexes and to then reindex them.  Fixed.

-- 
Justin

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Sep 07, 2020 at 09:39:16PM -0500, Justin Pryzby wrote:
> Also, my previous revision failed to implement your suggestion to first build
> catalog entries with INVALID indexes and to then reindex them.  Fixed.

-   childStmt->oldCreateSubid = InvalidSubTransactionId;
-   childStmt->oldFirstRelfilenodeSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldCreateSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
In the CIC patch, what is that about?  It is hard to follow what you
are trying to achieve here.

+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+           CommandCounterIncrement();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
Anyway, this part of the logic is still not good here.  If we fail in
the middle here, we may run into problems with a single partition
index that has only a portion of its flags set.  As this gets called
for each partitioned index, it also means that we could still finish
in a state where we have only a partially-valid partition tree.  For
example, imagine a partition tree with 2 levels (as reported by
pg_partition_tree), then the following happens if an index is created
concurrently from the partitioned table of level 0:
1) indexes are created in level 2 first
2) partitioned table of level 1 is switched to be ready and valid
3) indexes of level 1 are created.
4) partitioned table of level 0 is switched to be ready and valid
If the command has a failure, say between 2 and 3, we would have as
result a command that has partially succeeded in creating a partition
tree, while the user was looking for an index for the full tree.  This
comes back to my previous points, where we should make
index_set_state_flags() first, and I have begun a separate thread
about that:
https://commitfest.postgresql.org/30/2725/

Second, it also means that the patch should really switch all the
indexes to be valid in one single transaction, and that we also need
more careful refactoring of DefineIndex().

I also had a quick look at the patch for CLUSTER, and it does a much
better job, still it has issues.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   foreach(lc, inhoids)
The area where a private memory context is used should be minimized.
In this case, you just need to hold the context while saving the
relation and clustered index information in the list of RelToCluster
items.  As a whole, this case is more simple than CIC, so I'd like to
think that it would be good to work on that as next target.

Coming to my last point..  This thread has dealt since the beginning
with three different problems:
- REINDEX on partitioned relations.
- CLUSTER on partitioned relations.
- CIC on partitioned relations.  (Should we also care about DROP INDEX
CONCURRENTLY as well?)

The first problem has been solved, not the two others yet.  Do you
think that it could be possible to begin two new separate threads for
the remaining issues, with dedicated CF entries?  We could also mark
the existing one as committed, retitled for REINDEX as a matter of
clarity.  Also, please note that I am not sure if I will be able to
look more at this thread in this CF.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:
> - CIC on partitioned relations.  (Should we also care about DROP INDEX
> CONCURRENTLY as well?)

Do you have any idea what you think that should look like for DROP INDEX
CONCURRENTLY ?

-- 
Justin



Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:
> On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:
>> - CIC on partitioned relations.  (Should we also care about DROP INDEX
>> CONCURRENTLY as well?)
>
> Do you have any idea what you think that should look like for DROP INDEX
> CONCURRENTLY ?

Making the maintenance of the partition tree consistent to the user is
the critical part here, so my guess on this matter is:
1) Remove each index from the partition tree and mark the indexes as
invalid in the same transaction.  This makes sure that after commit no
indexes would get used for scans, and the partition dependency tree
pis completely removed with the parent table.  That's close to what we
do in index_concurrently_swap() except that we just want to remove the
dependencies with the partitions, and not just swap them of course.
2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
should be fine as that prevents inserts.
3) Finish the index drop.

Step 2) and 3) could be completely done for each index as part of
index_drop().  The tricky part is to integrate 1) cleanly within the
existing dependency machinery while still knowing about the list of
partitions that can be removed.  I think that this part is not that
straight-forward, but perhaps we could just make this logic part of
RemoveRelations() when listing all the objects to remove.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Sat, Sep 12, 2020 at 10:35:34AM +0900, Michael Paquier wrote:
> On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:
> > On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:
> >> - CIC on partitioned relations.  (Should we also care about DROP INDEX
> >> CONCURRENTLY as well?)
> > 
> > Do you have any idea what you think that should look like for DROP INDEX
> > CONCURRENTLY ?
> 
> Making the maintenance of the partition tree consistent to the user is
> the critical part here, so my guess on this matter is:
> 1) Remove each index from the partition tree and mark the indexes as
> invalid in the same transaction.  This makes sure that after commit no
> indexes would get used for scans, and the partition dependency tree
> pis completely removed with the parent table.  That's close to what we
> do in index_concurrently_swap() except that we just want to remove the
> dependencies with the partitions, and not just swap them of course.
> 2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
> should be fine as that prevents inserts.
> 3) Finish the index drop.
> 
> Step 2) and 3) could be completely done for each index as part of
> index_drop().  The tricky part is to integrate 1) cleanly within the
> existing dependency machinery while still knowing about the list of
> partitions that can be removed.  I think that this part is not that
> straight-forward, but perhaps we could just make this logic part of
> RemoveRelations() when listing all the objects to remove.

Thanks.

I see three implementation ideas..

1. I think your way has an issue that the dependencies are lost.  If there's an
interruption, the user is maybe left with hundreds or thousands of detached
indexes to clean up.  This is strange since there's actually no detach command
for indexes (but they're implicitly "attached" when a matching parent index is
created).  A 2nd issue is that DETACH currently requires an exclusive lock (but
see Alvaro's WIP patch).

2. Maybe the easiest way is to mark all indexes invalid and then drop all
partitions (concurrently) and then the partitioned parent.  If interrupted,
this would leave a parent index marked "invalid", and some child tables with no
indexes.  I think this may be "ok".  The same thing is possible if a concurrent
build is interrupted, right ?

3. I have a patch which changes index_drop() to "expand" a partitioned index into
its list of children.  Each of these becomes a List:
| indexId, heapId, userIndexRelation, userHeapRelation, heaplocktag, heaprelid, indexrelid
The same process is followed as for a single index, but handling all partitions
at once in two transactions total.  Arguably, this is bad since that function
currently takes a single Oid but would now ends up operating on a list of indexes.

Anyway, for now this is rebased on 83158f74d.

-- 
Justin

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> Anyway, for now this is rebased on 83158f74d.

I have not thought yet about all the details of CIC and DIC and what
you said upthread, but I have gone through CLUSTER for now, as a
start.  So, the goal of the patch, as I would define it, is to give a
way to CLUSTER to work on a partitioned table using a given
partitioned index.  Hence, we would perform CLUSTER automatically from
the top-most parent for any partitions that have an index on the same
partition tree as the partitioned index defined in USING, switching
indisclustered automatically depending on the index used.

+CREATE TABLE clstrpart3 PARTITION OF clstrpart DEFAULT PARTITION BY RANGE(a);
+CREATE TABLE clstrpart33 PARTITION OF clstrpart3 DEFAULT;
 CREATE INDEX clstrpart_idx ON clstrpart (a);
-ALTER TABLE clstrpart CLUSTER ON clstrpart_idx
So..  For any testing of partitioned trees, we should be careful to
check if relfilenodes have been changed or not as part of an
operation, to see if the operation has actually done something.

From what I can see, attempting to use a CLUSTER on a top-most
partitioned table fails to work on child tables, but isn't the goal of
the patch to make sure that if we attempt to do the operation on a
partitioned table using a partitioned index, then the operation should
be done as well on the partitions with the partition index part of the
same partition tree as the parent partitioned index?  If using CLUSTER
on a new partitioned index with USING, it seems to me that we may want
to check that indisclustered is correctly set for all the partitions
concerned in the regression tests.  It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   MemoryContextSwitchTo(old_context);
Er, isn't that incorrect?  I would have assumed that what should be
saved in the context of cluster is the list of RelToCluster items.
And we should not do any lookup of the partitions in a different
context, because this may do allocations of things we don't really
need to keep around for the context dedicated to CLUSTER.  Isn't
NoLock unsafe here, even knowing that an exclusive lock is taken on
the parent?  It seems to me that at least schema changes should be
prevented with a ShareLock in the first transaction building the list
of elements to cluster.

+       /*
+        * We have a full list of direct and indirect children, so skip
+        * partitioned tables and just handle their children.
+        */
+       if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
+           continue;
It would be better to use RELKIND_HAS_STORAGE here.

All this stuff needs to be documented clearly.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:
> start.  So, the goal of the patch, as I would define it, is to give a
> way to CLUSTER to work on a partitioned table using a given
> partitioned index.  Hence, we would perform CLUSTER automatically from
> the top-most parent for any partitions that have an index on the same
> partition tree as the partitioned index defined in USING, switching
> indisclustered automatically depending on the index used.

I think that's right, except there's no need to look for a compatible
partitioned index: we just use the child index.

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propogate that to their parent indexes, if any ?

> From what I can see, attempting to use a CLUSTER on a top-most
> partitioned table fails to work on child tables,

Oops - it looks like this patch never worked right, due to the RECHECK on
indisclustered.  I think maybe I returned to the CIC patch and never finishing
with cluster.

> It would be good also to check if
> we have a partition index tree that maps partially with a partition
> table tree (aka no all table partitions have a partition index), where
> these don't get clustered because there is no index to work on.

This should not happen, since a incomplete partitioned index is "invalid".

> Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on
> the parent?  It seems to me that at least schema changes should be
> prevented with a ShareLock in the first transaction building the list
> of elements to cluster.

Thanks for noticing.  I chose ShareUpdateExclusiveLock since it's
set-exclusive.

-- 
Justin

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> Also, if a partitioned index is clustered, when we clear indisclustered for
> other indexes, should we also propogate that to their parent indexes, if any ?

I am not sure what you mean here.  Each partition's cluster runs in
its own individual transaction based on the patch you sent.  Are you
suggesting to update indisclustered for the partitioned index of a
partitioned table and all its parent partitioned in the same
transaction, aka a transaction working on the partitioned table?
Doesn't that mean that if we have a partition tree with multiple
layers then we finish by doing multiple time the same operation for
the parents?

>> It would be good also to check if
>> we have a partition index tree that maps partially with a partition
>> table tree (aka no all table partitions have a partition index), where
>> these don't get clustered because there is no index to work on.
>
> This should not happen, since a incomplete partitioned index is "invalid".

Indeed, I did not know this property.  I can see also that you have
added a test for this case, so that's good if we can rely on that.  I
am still in the process of reviewing this patch, all this handling
around indisclustered makes it rather complex.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Mon, Oct 05, 2020 at 05:46:27PM +0900, Michael Paquier wrote:
> On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> > Also, if a partitioned index is clustered, when we clear indisclustered for
> > other indexes, should we also propogate that to their parent indexes, if any ?
> 
> I am not sure what you mean here.  Each partition's cluster runs in
> its own individual transaction based on the patch you sent.  Are you
> suggesting to update indisclustered for the partitioned index of a
> partitioned table and all its parent partitioned in the same
> transaction, aka a transaction working on the partitioned table?

No, I mean that if a partition is no longer clustered on some index, then its
parent isn't clustered on that indexes parent, either.

It means that we might do N catalog updates for a partition heirarchy that's N
levels deep.  Normally, N=2, and we'd clear indisclustered for the index as
well as its parent.  This is not essential, though.

> >> It would be good also to check if
> >> we have a partition index tree that maps partially with a partition
> >> table tree (aka no all table partitions have a partition index), where
> >> these don't get clustered because there is no index to work on.
> >
> > This should not happen, since a incomplete partitioned index is "invalid".
> 
> Indeed, I did not know this property.

I think that's something we can apply for CIC/DIC, too.
It's not essential to avoid leaving an "invalid" or partial index if
interrupted.  It's only essential that a partial, partitioned index is not
"valid".

For DROP IND CONCURRENTLY, I wrote:

On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> 2. Maybe the easiest way is to mark all indexes invalid and then drop all
> partitions (concurrently) and then the partitioned parent.  If interrupted,
> this would leave a parent index marked "invalid", and some child tables with no
> indexes.  I think this may be "ok".  The same thing is possible if a concurrent
> build is interrupted, right ?


-- 
Justin



Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote:
> It means that we might do N catalog updates for a partition heirarchy that's N
> levels deep.  Normally, N=2, and we'd clear indisclustered for the index as
> well as its parent.  This is not essential, though.

Hmm.  I got to think more about this one, and being able to ensure a
consistent state of indisclustered for partitioned tables and all
their partitions across multiple transactions at all times would be a
nice property, as we could easily track down if an operation has
failed in-flight.  The problem here is that we are limited by
indisclustered being a boolean, so we may set indisclustered one way
or another on some partitions, or on some parent partitioned tables,
but we are not sure if what's down is actually clustered for the
leaves of a partitioned index, or not.  Or maybe we even have an
inconsistent state, so this does not provide a good user experience.
The consistency of a partition tree is a key thing, and we have such
guarantees with REINDEX (with or without concurrently), so what I'd
like to think that what makes sense for indisclustered on a
partitioned index is to have the following set of guarantees, and
relying on indisvalid as being true iff an index partition tree is
complete on a given table partition tree is really important:
- If indisclustered is set to true for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to true, across all the layers down to the leaves.
- If indisclustered is set to false for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to false.

If we happen to attach a new partition to a partitioned table of such
a tree, I guess that it is then logic to have indisclustered set to
the same value as the partitioned index when the partition inherits an
index.  So, it seems to me that with the current catalogs we are
limited by indisclustered as being a boolean to maintain some kind of
consistency across transactions of CLUSTER, as we would use one
transaction per leaf for the work.  In order to make things better
here, we could switch indisclustered to a char, able to use three
states:
- 'e' or enabled, equivalent to indisclustered == true.  There should
be only one index on a table with 'e' set at a given time.
- 'd' or disabled, equivalent to indisclustered == false.
- a new third state, for an equivalent of work-in-progress, let's call
it 'w', or whatever.

Then, when we begin a CLUSTER on a partitioned table, I would imagine
the following:
- Switch all the indexes selected to 'w' in a first transaction, and
do not reset the state of other indexes if there is one 'e'.  For
CLUSTER without USING, we switch the existing 'e' to 'w' if there is
such an index.  If there are no indexes select-able, issue an error.
If we find an index with 'w', we have a candidate from a previous
failed command, so this gets used.  I don't think that this design
requires a new DDL either as we could reset all 'w' and 'e' to 'd' if
using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table.  For
CLUSTER with USING, the partitioned index selected and its leaves are
switched to 'w', similarly.
- Then do the work for all the partitions, one partition per
transaction, keeping 'w'.
- In a last transaction, switch all the partitions from 'w' to 'e',
resetting any existing 'e'.

ALTER TABLE CLUSTER ON should also be a supported operation, where 'e'
gets switched for all the partitions in a single transaction.  Of
course, this should not work for an invalid index.  Even with such a
design, I got to wonder if there could be cases where a user does
*not* want to cluster the leaves of a partition tree with the same
index.  If that were to happen, the partition tree may need a redesign
so making things work so as we force partitions to inherit what's
wanted for the partitioned table makes the most sense to me.

By the way, I mentioned that previously, but this thread got used for
REINDEX that is finished, and now we have a discussion going on with
CLUSTER.  There is also stuff related to CIC and DIC.  There is also
only one CF entry for all that.  I really think that this work had
better be split into separate threads, with separate CF entries. Do
you mind if I change the contents of the CF app so as the existing
item is renamed for REINDEX, that got committed?  Then we could create
a new entry for CIC/DIC (it may make sense to split these two as
well, but I am not if there are any overlaps between the APIs of the
two), and a new entry for CLUSTER, depending on the state of the work.
The original subject of the thread is also unrelated, this makes the
review process unnecessarily complicated, and that's really
confusing.  Each discussion should go into its own, independent,
thread.
--
Michael

Вложения

Re: 回复:how to create index concurrently on partitioned table

От
Justin Pryzby
Дата:
On Tue, Oct 06, 2020 at 11:42:55AM +0900, Michael Paquier wrote:
> On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote:
> > It means that we might do N catalog updates for a partition heirarchy that's N
> > levels deep.  Normally, N=2, and we'd clear indisclustered for the index as
> > well as its parent.  This is not essential, though.
> 
> Hmm.  I got to think more about this one, and being able to ensure a
> consistent state of indisclustered for partitioned tables and all
> their partitions across multiple transactions at all times would be a
> nice property, as we could easily track down if an operation has
> failed in-flight.  The problem here is that we are limited by
> indisclustered being a boolean, so we may set indisclustered one way
> or another on some partitions, or on some parent partitioned tables,
> but we are not sure if what's down is actually clustered for the
> leaves of a partitioned index, or not.  Or maybe we even have an
> inconsistent state, so this does not provide a good user experience.
> The consistency of a partition tree is a key thing, and we have such 
> guarantees with REINDEX (with or without concurrently), so what I'd
> like to think that what makes sense for indisclustered on a
> partitioned index is to have the following set of guarantees, and
> relying on indisvalid as being true iff an index partition tree is
> complete on a given table partition tree is really important:
> - If indisclustered is set to true for a partitioned index, then we
> have the guarantee that all its partition and partitioned indexes have
> indisclustered set to true, across all the layers down to the leaves.
> - If indisclustered is set to false for a partitioned index, then we
> have the guarantee that all its partition and partitioned indexes have
> indisclustered set to false.
> 
> If we happen to attach a new partition to a partitioned table of such
> a tree, I guess that it is then logic to have indisclustered set to
> the same value as the partitioned index when the partition inherits an
> index.  So, it seems to me that with the current catalogs we are
> limited by indisclustered as being a boolean to maintain some kind of
> consistency across transactions of CLUSTER, as we would use one
> transaction per leaf for the work.  In order to make things better
> here, we could switch indisclustered to a char, able to use three
> states:
> - 'e' or enabled, equivalent to indisclustered == true.  There should
> be only one index on a table with 'e' set at a given time.
> - 'd' or disabled, equivalent to indisclustered == false.
> - a new third state, for an equivalent of work-in-progress, let's call
> it 'w', or whatever.
> 
> Then, when we begin a CLUSTER on a partitioned table, I would imagine
> the following:
> - Switch all the indexes selected to 'w' in a first transaction, and
> do not reset the state of other indexes if there is one 'e'.  For
> CLUSTER without USING, we switch the existing 'e' to 'w' if there is
> such an index.  If there are no indexes select-able, issue an error.
> If we find an index with 'w', we have a candidate from a previous
> failed command, so this gets used.  I don't think that this design
> requires a new DDL either as we could reset all 'w' and 'e' to 'd' if
> using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table.  For
> CLUSTER with USING, the partitioned index selected and its leaves are
> switched to 'w', similarly.
> - Then do the work for all the partitions, one partition per
> transaction, keeping 'w'.
> - In a last transaction, switch all the partitions from 'w' to 'e',
> resetting any existing 'e'.

Honestly, I think you're over-thinking and over-engineering indisclustered.

If "clusteredness" was something we offered to maintain across DML, I think
that might be important to provide stronger guarantees.  As it is now, I don't
think this patch is worth changing the catalog definition.

> ALTER TABLE CLUSTER ON should also be a supported operation, where 'e'
> gets switched for all the partitions in a single transaction.  Of
> course, this should not work for an invalid index.  Even with such a
> design, I got to wonder if there could be cases where a user does
> *not* want to cluster the leaves of a partition tree with the same
> index.  If that were to happen, the partition tree may need a redesign
> so making things work so as we force partitions to inherit what's
> wanted for the partitioned table makes the most sense to me.

I wondered the same thing: should clustering a parent index cause the the child
indexes to be marked as clustered ?  Or should it be possible for an
intermediate child to say (marked as) clustered on some other index.  I don't
have strong feelings, but the patch currently sets indisclustered as a natural
consequence of the implementation, so I tentatively think that's what's right.
After all, CLUSTER sets indisclustered without having to also say
"ALTER..CLUSTER ON".

This is relevant to the question I raised about unsetting indisclustered for
each indexes *parent* when clustering on a different index.

I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition
just because a different partitioned index was set clustered.  We'd clear that,
like always, and then (in my proposal) also clear its parents "indisclustered".
I still don't think that's essential, though.

> By the way, I mentioned that previously, but this thread got used for
> REINDEX that is finished, and now we have a discussion going on with
> CLUSTER.  There is also stuff related to CIC and DIC.  There is also
> only one CF entry for all that.  I really think that this work had
> better be split into separate threads, with separate CF entries. Do
> you mind if I change the contents of the CF app so as the existing 
> item is renamed for REINDEX, that got committed?  Then we could create
> a new entry for CIC/DIC (it may make sense to split these two as
> well, but I am not if there are any overlaps between the APIs of the
> two), and a new entry for CLUSTER, depending on the state of the work.
> The original subject of the thread is also unrelated, this makes the
> review process unnecessarily complicated, and that's really
> confusing.  Each discussion should go into its own, independent,
> thread.

I didn't think it's worth the overhead of closing and opening more CFs.
But I don't mind.

-- 
Justin



Re: 回复:how to create index concurrently on partitioned table

От
Michael Paquier
Дата:
On Mon, Oct 05, 2020 at 10:07:33PM -0500, Justin Pryzby wrote:
> Honestly, I think you're over-thinking and over-engineering indisclustered.
>
> If "clusteredness" was something we offered to maintain across DML, I think
> that might be important to provide stronger guarantees.  As it is now, I don't
> think this patch is worth changing the catalog definition.

Well, this use case is new because we are discussing the relationship
of indisclustered across multiple transactions for multiple indexes,
so I'd rather have this discussion than not, and I have learnt
the hard way with REINDEX that we should care a lot about the
consistency of partition trees at any step of the operation.  Let's
imagine a simple example here, take this partition tree: p (parent),
and two partitions p1 and p2.  p has two partitioned indexes i and j,
indexes also present in p1 and p2 as i1, i2, j1 and j2.  Let's assume
that the user has done a CLUSTER on p USING i that completes, meaning
that i, i1 and i2 have indisclustered set.  Now let's assume that the
user does a CLUSTER on p USING j this time, and that this command
fails while processing p2, meaning that indisclustered is set for j1,
i2, and perhaps i or j depending on what the patch does.  Per the
latest arguments, j would be the one set to indisclustered.  From this
inconsistent state comes a couple of interesting things:
- A database-wide CLUSTER would finish by using j1 and i2 for the
operation on the partitions, while the intention was to use j2 for the
second partition as the previous command failed.
- With CLUSTER p, without USING.  Logically, I would assume that we
would rely on the value of indisclustered as of j, meaning that we
would *enforce* p2 to use j2.  But it could also be seen as incorrect
by the user because we would not use the index originally marked as
such.

So keeping this consistent has the advantage to have clear rules
here.

> I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition
> just because a different partitioned index was set clustered.  We'd clear that,
> like always, and then (in my proposal) also clear its parents "indisclustered".
> I still don't think that's essential, though.

Why?  Blocking a partition, which may be itself partitioned, to switch
to a different index if its partitioned parent uses something else
sounds kind of logic to me, at the end, because the user originally
intended to use CLUSTER with a specific index on this tree.  So I
would say that the partitioned table takes priority, and this should
be released with a WITHOUT CLUSTER from the partitioned table.

> I didn't think it's worth the overhead of closing and opening more CFs.
> But I don't mind.

Thanks, I'll do some cleanup.
--
Michael

Вложения