Обсуждение: Include sequence relation support in logical replication

Поиск
Список
Период
Сортировка

Include sequence relation support in logical replication

От
Cary Huang
Дата:
Hi

From the PG logical replication documentation, I see that there is a listed limitation that sequence relation is not replicated logically. After some examination, I see that retrieving the next value from a sequence using the nextval() call will emits a WAL update every 32 calls to nextval(). In fact, when it emits a WAL update, it will write a future value 32 increments from now, and maintain a internal cache for delivering sequence numbers. It is done this way to minimize the write operation to WAL record at a risk of losing some values during a crash. So if we were to replicate the sequence, the subscriber will receive a future value (32 calls to nextval()) from now, and it obviously does not reflect current status. Sequence changes caused by other sequence-related SQL functions like setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so replicating changes caused by these should not be a problem. 

I have shared a patch that allows sequence relation to be supported in logical replication via the decoding plugin ( test_decoding for example ); it does not support sequence relation in logical replication between a PG publisher and a PG subscriber via pgoutput plugin as it will require much more work. For the replication to make sense, the patch actually disables the WAL update at every 32 nextval() calls, so every call to nextval() will emit a WAL update for proper replication. This is done by setting SEQ_LOG_VALS to 0 in sequence.c

I think the question is that should we minimize WAL update frequency (every 32 calls) for getting next value in a sequence at a cost of losing values during crash or being able to replicate a sequence relation properly at a cost or more WAL updates?


Cary Huang
-------------
HighGo Software Inc. (Canada)

Вложения

Re: Include sequence relation support in logical replication

От
Andres Freund
Дата:
On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
> Hi
> 
> 
> 
> From the PG logical replication documentation, I see that there is a
> listed limitation that sequence relation is not replicated
> logically. After some examination, I see that retrieving the next
> value from a sequence using the nextval() call will emits a WAL update
> every 32 calls to nextval(). In fact, when it emits a WAL update, it
> will write a future value 32 increments from now, and maintain a
> internal cache for delivering sequence numbers. It is done this way to
> minimize the write operation to WAL record at a risk of losing some
> values during a crash. So if we were to replicate the sequence, the
> subscriber will receive a future value (32 calls to nextval()) from
> now, and it obviously does not reflect current status. Sequence
> changes caused by other sequence-related SQL functions like setval()
> or ALTER SEQUENCE xxx, will always emit a WAL update, so replicating
> changes caused by these should not be a problem. 
> 
> 
> 
> I have shared a patch that allows sequence relation to be supported in logical replication via the decoding plugin (
test_decodingfor example ); it does not support sequence relation in logical replication between a PG publisher and a
PGsubscriber via pgoutput plugin as it will require much more work. For the replication to make sense, the patch
actuallydisables the WAL update at every 32 nextval() calls, so every call to nextval() will emit a WAL update for
properreplication. This is done by setting SEQ_LOG_VALS to 0 in sequence.c
 
> 
> 
> 
> I think the question is that should we minimize WAL update frequency (every 32 calls) for getting next value in a
sequenceat a cost of losing values during crash or being able to replicate a sequence relation properly at a cost or
moreWAL updates?
 
> 
> 
> 
> 
> 
> Cary Huang
> 
> -------------
> 
> HighGo Software Inc. (Canada)
> 
> mailto:cary.huang@highgo.ca
> 
> http://www.highgo.ca

> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
> index 93c948856e..7a7e572d6c 100644
> --- a/contrib/test_decoding/test_decoding.c
> +++ b/contrib/test_decoding/test_decoding.c
> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>                                      &change->data.tp.oldtuple->tuple,
>                                      true);
>              break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
> +                    if (change->data.sequence.newtuple == NULL)
> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
> +                    else
> +                        tuple_to_stringinfo(ctx->out, tupdesc,
> +                                            &change->data.sequence.newtuple->tuple,
> +                                            false);
> +                    break;
>          default:
>              Assert(false);
>      }
> diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
> index 6aab73bfd4..941015e4aa 100644
> --- a/src/backend/commands/sequence.c
> +++ b/src/backend/commands/sequence.c
> @@ -49,11 +49,10 @@
>  
>  
>  /*
> - * We don't want to log each fetching of a value from a sequence,
> - * so we pre-log a few fetches in advance. In the event of
> - * crash we can lose (skip over) as many values as we pre-logged.
> + * Sequence replication is now supported and we will now need to log each sequence
> + * update to WAL such that the standby can properly receive the sequence change
>   */
> -#define SEQ_LOG_VALS    32
> +#define SEQ_LOG_VALS    0
>  
>  /*
>   * The "special area" of a sequence's buffer page looks like this.
> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index c2e5e3abf8..3dc14ead08 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -42,6 +42,7 @@
>  #include "replication/reorderbuffer.h"
>  #include "replication/snapbuild.h"
>  #include "storage/standby.h"
> +#include "commands/sequence.h"
>  
>  typedef struct XLogRecordBuffer
>  {
> @@ -70,9 +71,11 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
>                           xl_xact_parsed_commit *parsed, TransactionId xid);
>  static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
>                          xl_xact_parsed_abort *parsed, TransactionId xid);
> +static void DecodeSequence(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
>  
>  /* common function to decode tuples */
>  static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tup);
> +static void DecodeSeqTuple(char *data, Size len, ReorderBufferTupleBuf *tuple);
>  
>  /*
>   * Take every XLogReadRecord()ed record and perform the actions required to
> @@ -130,6 +133,10 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
>              DecodeLogicalMsgOp(ctx, &buf);
>              break;
>  
> +        case RM_SEQ_ID:
> +            DecodeSequence(ctx, &buf);
> +            break;
> +
>              /*
>               * Rmgrs irrelevant for logical decoding; they describe stuff not
>               * represented in logical decoding. Add new rmgrs in rmgrlist.h's
> @@ -145,7 +152,6 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
>          case RM_HASH_ID:
>          case RM_GIN_ID:
>          case RM_GIST_ID:
> -        case RM_SEQ_ID:
>          case RM_SPGIST_ID:
>          case RM_BRIN_ID:
>          case RM_COMMIT_TS_ID:
> @@ -1052,3 +1058,80 @@ DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
>      header->t_infomask2 = xlhdr.t_infomask2;
>      header->t_hoff = xlhdr.t_hoff;
>  }
> +
> +/*
> + * Decode Sequence Tuple
> + */
> +static void
> +DecodeSeqTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
> +{
> +    int            datalen = len - sizeof(xl_seq_rec) - SizeofHeapTupleHeader;
> +
> +    Assert(datalen >= 0);
> +
> +    tuple->tuple.t_len = datalen + SizeofHeapTupleHeader;;
> +
> +    ItemPointerSetInvalid(&tuple->tuple.t_self);
> +
> +    tuple->tuple.t_tableOid = InvalidOid;
> +
> +    memcpy(((char *) tuple->tuple.t_data),
> +           data + sizeof(xl_seq_rec),
> +           SizeofHeapTupleHeader);
> +
> +    memcpy(((char *) tuple->tuple.t_data) + SizeofHeapTupleHeader,
> +           data + sizeof(xl_seq_rec) + SizeofHeapTupleHeader,
> +           datalen);
> +}
> +
> +/*
> + * Handle sequence decode
> + */
> +static void
> +DecodeSequence(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> +{
> +    ReorderBufferChange *change;
> +    RelFileNode target_node;
> +    XLogReaderState *r = buf->record;
> +    char       *tupledata = NULL;
> +    Size        tuplelen;
> +    Size        datalen = 0;
> +    uint8        info = XLogRecGetInfo(buf->record) & ~XLR_INFO_MASK;
> +
> +    /* only decode changes flagged with XLOG_SEQ_LOG  */
> +    if (info != XLOG_SEQ_LOG)
> +        return;
> +
> +    /* only interested in our database */
> +    XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL);
> +    if (target_node.dbNode != ctx->slot->data.database)
> +        return;
> +
> +    /* output plugin doesn't look for this origin, no need to queue */
> +    if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
> +        return;
> +
> +    change = ReorderBufferGetChange(ctx->reorder);
> +    change->action = REORDER_BUFFER_CHANGE_SEQUENCE;
> +    change->origin_id = XLogRecGetOrigin(r);
> +
> +    memcpy(&change->data.sequence.relnode, &target_node, sizeof(RelFileNode));
> +
> +    tupledata = XLogRecGetData(r);
> +    datalen = XLogRecGetDataLen(r);
> +
> +    if(!datalen || !tupledata)
> +        return;
> +
> +    tuplelen = datalen - SizeOfHeapHeader - sizeof(xl_seq_rec);
> +
> +    change->data.sequence.newtuple =
> +        ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
> +
> +    DecodeSeqTuple(tupledata, datalen, change->data.sequence.newtuple);
> +
> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);
> +
> +    ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r), buf->origptr, change);
> +
> +}
> diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
> index 481277a1fd..24f2cdf51d 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -474,6 +474,13 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
>          case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
>          case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
>              break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +            if (change->data.sequence.newtuple)
> +            {
> +                ReorderBufferReturnTupleBuf(rb, change->data.sequence.newtuple);
> +                change->data.sequence.newtuple = NULL;
> +            }
> +            break;
>      }
>  
>      pfree(change);
> @@ -1833,6 +1840,38 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>                  case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
>                      elog(ERROR, "tuplecid value in changequeue");
>                      break;
> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    Assert(snapshot_now);
> +
> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
> +                                                change->data.sequence.relnode.relNode);
> +
> +                    if (reloid == InvalidOid &&
> +                        change->data.sequence.newtuple == NULL)
> +                        goto change_done;
> +                    else if (reloid == InvalidOid)
> +                        elog(ERROR, "could not map filenode \"%s\" to relation OID",
> +                             relpathperm(change->data.tp.relnode,
> +                                         MAIN_FORKNUM));
> +
> +                    relation = RelationIdGetRelation(reloid);
> +
> +                    if (!RelationIsValid(relation))
> +                        elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
> +                             reloid,
> +                             relpathperm(change->data.sequence.relnode,
> +                                         MAIN_FORKNUM));
> +
> +                    if (!RelationIsLogicallyLogged(relation))
> +                        goto change_done;
> +
> +                    /* user-triggered change */
> +                    if (!IsToastRelation(relation))
> +                    {
> +                        ReorderBufferToastReplace(rb, txn, relation, change);
> +                        rb->apply_change(rb, txn, relation, change);
> +                    }
> +                    break;
>              }
>          }
>  
> @@ -2516,15 +2555,23 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
>          case REORDER_BUFFER_CHANGE_UPDATE:
>          case REORDER_BUFFER_CHANGE_DELETE:
>          case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
>              {
>                  char       *data;
>                  ReorderBufferTupleBuf *oldtup,
>                             *newtup;
>                  Size        oldlen = 0;
>                  Size        newlen = 0;
> -
> -                oldtup = change->data.tp.oldtuple;
> -                newtup = change->data.tp.newtuple;
> +                if (change->action == REORDER_BUFFER_CHANGE_SEQUENCE)
> +                {
> +                    oldtup = NULL;
> +                    newtup = change->data.sequence.newtuple;
> +                }
> +                else
> +                {
> +                    oldtup = change->data.tp.oldtuple;
> +                    newtup = change->data.tp.newtuple;
> +                }
>  
>                  if (oldtup)
>                  {
> @@ -2707,14 +2754,23 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
>          case REORDER_BUFFER_CHANGE_UPDATE:
>          case REORDER_BUFFER_CHANGE_DELETE:
>          case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
>              {
>                  ReorderBufferTupleBuf *oldtup,
>                             *newtup;
>                  Size        oldlen = 0;
>                  Size        newlen = 0;
>  
> -                oldtup = change->data.tp.oldtuple;
> -                newtup = change->data.tp.newtuple;
> +                if (change->action == REORDER_BUFFER_CHANGE_SEQUENCE)
> +                {
> +                    oldtup = NULL;
> +                    newtup = change->data.sequence.newtuple;
> +                }
> +                else
> +                {
> +                    oldtup = change->data.tp.oldtuple;
> +                    newtup = change->data.tp.newtuple;
> +                }
>  
>                  if (oldtup)
>                  {
> @@ -3048,6 +3104,32 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
>          case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
>          case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
>              break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +            if (change->data.sequence.newtuple)
> +            {
> +                /* here, data might not be suitably aligned! */
> +                uint32        tuplelen;
> +
> +                memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len),
> +                       sizeof(uint32));
> +
> +                change->data.sequence.newtuple =
> +                    ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
> +
> +                /* restore ->tuple */
> +                memcpy(&change->data.sequence.newtuple->tuple, data,
> +                       sizeof(HeapTupleData));
> +                data += sizeof(HeapTupleData);
> +
> +                /* reset t_data pointer into the new tuplebuf */
> +                change->data.sequence.newtuple->tuple.t_data =
> +                    ReorderBufferTupleBufData(change->data.tp.newtuple);
> +
> +                /* restore tuple data itself */
> +                memcpy(change->data.sequence.newtuple->tuple.t_data, data, tuplelen);
> +                data += tuplelen;
> +            }
> +            break;
>      }
>  
>      dlist_push_tail(&txn->changes, &change->node);
> diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
> index 626ecf4dc9..cf3fd45c5f 100644
> --- a/src/include/replication/reorderbuffer.h
> +++ b/src/include/replication/reorderbuffer.h
> @@ -62,7 +62,8 @@ enum ReorderBufferChangeType
>      REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
>      REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
>      REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
> -    REORDER_BUFFER_CHANGE_TRUNCATE
> +    REORDER_BUFFER_CHANGE_TRUNCATE,
> +    REORDER_BUFFER_CHANGE_SEQUENCE,
>  };
>  
>  /* forward declaration */
> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>              CommandId    cmax;
>              CommandId    combocid;
>          }            tuplecid;
> +        /*
> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
> +         * set of relations to be truncated.
> +         */
> +        struct
> +        {
> +            RelFileNode relnode;
> +            ReorderBufferTupleBuf *newtuple;
> +        }            sequence;
>      }            data;
>  
>      /*

Greetings,

Andres Freund



Re: Include sequence relation support in logical replication

От
Andres Freund
Дата:
Hi,

On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
> I have shared a patch that allows sequence relation to be supported in
> logical replication via the decoding plugin ( test_decoding for
> example ); it does not support sequence relation in logical
> replication between a PG publisher and a PG subscriber via pgoutput
> plugin as it will require much more work.

Could you expand on "much more work"? Once decoding support is there,
that shouldn't be that much?


> Sequence changes caused by other sequence-related SQL functions like
> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
> replicating changes caused by these should not be a problem.

I think this really would need to handle at the very least setval to
make sense.


> For the replication to make sense, the patch actually disables the WAL
> update at every 32 nextval() calls, so every call to nextval() will
> emit a WAL update for proper replication. This is done by setting
> SEQ_LOG_VALS to 0 in sequence.c

Why is that needed? ISTM updating in increments of 32 is fine for
replication purposes? It's good imo, because sending out more granular
increments would increase the size of the WAL stream?



> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
> index 93c948856e..7a7e572d6c 100644
> --- a/contrib/test_decoding/test_decoding.c
> +++ b/contrib/test_decoding/test_decoding.c
> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>                                      &change->data.tp.oldtuple->tuple,
>                                      true);
>              break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
> +                    if (change->data.sequence.newtuple == NULL)
> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
> +                    else
> +                        tuple_to_stringinfo(ctx->out, tupdesc,
> +                                            &change->data.sequence.newtuple->tuple,
> +                                            false);
> +                    break;
>          default:
>              Assert(false);
>      }

You should also add tests - the main purpose of contrib/test_decoding is
to facilitate writing those...


> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);

Huh, why are you doing this? That's going to increase overhead of logical
decoding by many times?


> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    Assert(snapshot_now);
> +
> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
> +                                                change->data.sequence.relnode.relNode);
> +
> +                    if (reloid == InvalidOid &&
> +                        change->data.sequence.newtuple == NULL)
> +                        goto change_done;

I don't think this path should be needed? There's afaict no valid ways
we should be able to end up here without a tuple?


> +                    if (!RelationIsLogicallyLogged(relation))
> +                        goto change_done;

Similarly, this seems superflous and should perhaps be an assertion?

> +                    /* user-triggered change */
> +                    if (!IsToastRelation(relation))
> +                    {
> +                        ReorderBufferToastReplace(rb, txn, relation, change);
> +                        rb->apply_change(rb, txn, relation, change);
> +                    }
> +                    break;
>              }
>          }
>

This doesn't make sense either.



> diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
> index 626ecf4dc9..cf3fd45c5f 100644
> --- a/src/include/replication/reorderbuffer.h
> +++ b/src/include/replication/reorderbuffer.h
> @@ -62,7 +62,8 @@ enum ReorderBufferChangeType
>      REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
>      REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
>      REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
> -    REORDER_BUFFER_CHANGE_TRUNCATE
> +    REORDER_BUFFER_CHANGE_TRUNCATE,
> +    REORDER_BUFFER_CHANGE_SEQUENCE,
>  };
>
>  /* forward declaration */
> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>              CommandId    cmax;
>              CommandId    combocid;
>          }            tuplecid;
> +        /*
> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
> +         * set of relations to be truncated.
> +         */

What?

> +        struct
> +        {
> +            RelFileNode relnode;
> +            ReorderBufferTupleBuf *newtuple;
> +        }            sequence;
>      }            data;
>
>      /*

I don't think we should expose sequence changes via their tuples -
that'll unnecessarily expose a lot of implementation details.

Greetings,

Andres Freund



Re: Include sequence relation support in logical replication

От
Michael Paquier
Дата:
On Wed, Mar 25, 2020 at 12:27:28PM -0700, Andres Freund wrote:
> On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c
>
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

Once upon a time, I was looking at the effects of playing with the
limit of a WAL record generated every 32 increments for a sequence,
and the performance difference is huge and noticeable.
--
Michael

Вложения

Re: Include sequence relation support in logical replication

От
Cary Huang
Дата:
Hi Andres

thanks for your reply and your patch review. Please see my comments below

>On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> I have shared a patch that allows sequence relation to be supported in
>> logical replication via the decoding plugin ( test_decoding for
>> example ); it does not support sequence relation in logical
>> replication between a PG publisher and a PG subscriber via pgoutput
>> plugin as it will require much more work.
>
> Could you expand on "much more work"? Once decoding support is there,
> that shouldn't be that much?

By much more work, I meant more source files will need to be changed to have sequence replication
supported between a PG subscriber and publisher using pgoutput plugin. About 10 more source file changes.
Idea is similar though.

>> Sequence changes caused by other sequence-related SQL functions like
>> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
>> replicating changes caused by these should not be a problem.
>
> I think this really would need to handle at the very least setval to
> make sense.

yes, sure

>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c 
>
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

yes, updating WAL at every 32 increment is good and have huge performance benefits according 
to Michael, but when it is replicated logically to subscribers, the sequence value they receive would not
make much sense to them.
For example, 

If i have a Sequence called "seq" with current value = 100 and increment = 5. The nextval('seq') call will
return 105 to the client but it will write 260 to WAL record ( 100 + (5*32) ), because that is the value after 32
increments and internally it is also maintaining a "log_cnt" counter that tracks how many nextval() calls have been invoked
since the last WAL write, so it could kind of derive backwards to find the proper next value to return to client. 

But the subscriber for this sequence will receive a change of 260 instead of 105, and it does not represent the current
sequence status. Subscriber is not able to derive backwards because it does not know the increment size in its schema.

Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this 32 increment behavior and makes WAL update at every nextval() call
and therefore the subscriber to this sequence will receive the same value (105) as the client, as a cost of more WAL writes.

I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without the need to
disabling the 32 increment behavior?

>> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
>> index 93c948856e..7a7e572d6c 100644
>> --- a/contrib/test_decoding/test_decoding.c
>> +++ b/contrib/test_decoding/test_decoding.c
>> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>>                                     &change->data.tp.oldtuple->tuple,
>>                                     true);
>>             break;
>> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
>> +                    if (change->data.sequence.newtuple == NULL)
>> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
>> +                    else
>> +                        tuple_to_stringinfo(ctx->out, tupdesc,
>> +                                            &change->data.sequence.newtuple->tuple,
>> +                                            false);
>> +                    break;
>>         default:
>>             Assert(false);
>>     } 
>
> You should also add tests - the main purpose of contrib/test_decoding is
> to facilitate writing those...

thanks, I will add

>> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);
>
> Huh, why are you doing this? That's going to increase overhead of logical
> decoding by many times?

This is needed to allow snapshot to be built inside DecodeCommit() function. Regular changes caused by INSERT also has this
function called so I assume it is needed to ensure proper decoding. Without this, a snapshot will not be built and the change
transaction will not be logged

>> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +                    Assert(snapshot_now);
>> +
>> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
>> +                                                change->data.sequence.relnode.relNode);
>> +
>> +                    if (reloid == InvalidOid &&
>> +                        change->data.sequence.newtuple == NULL)
>> +                        goto change_done;
>
> I don't think this path should be needed? There's afaict no valid ways
> we should be able to end up here without a tuple?

yeah you are right, I can remove the tuple check

>> +                    if (!RelationIsLogicallyLogged(relation))
>> +                        goto change_done;
>
> Similarly, this seems superflous and should perhaps be an assertion?

I think it should be ok to check a relation like this, because it also will check the persistence of the relation and whether
wal_level is set to 'logical'. It is commonly used in the regular INSERT cases so I thought it would make sense to use it
for sequence.

>> +                    /* user-triggered change */
>> +                    if (!IsToastRelation(relation))
>> +                    {
>> +                        ReorderBufferToastReplace(rb, txn, relation, change);
>> +                        rb->apply_change(rb, txn, relation, change);
>> +                    }
>> +                    break;
>>             }
>>         }
>>
>
> This doesn't make sense either.

agreed, it should not be here.

>> /* forward declaration */
>> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>>             CommandId    cmax;
>>             CommandId    combocid;
>>         }            tuplecid;
>> +        /*
>> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
>> +         * set of relations to be truncated.
>> +         */
>
> What?

Will fix the comment

>> +        struct
>> +        {
>> +            RelFileNode relnode;
>> +            ReorderBufferTupleBuf *newtuple;
>> +        }            sequence;
>>     }            data;
>>
>>     /*
>
> I don't think we should expose sequence changes via their tuples -
> that'll unnecessarily expose a lot of implementation details.

Can you elaborate more on this? Sequence writes its tuple data in WAL and triggers a change
event to logical decoding logic. What else can I use to expose a sequence change?

Best

Cary Huang
-------------
HighGo Software Inc. (Canada)




---- On Wed, 25 Mar 2020 12:27:28 -0700 Andres Freund <andres@anarazel.de> wrote ----

Hi,

On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
> I have shared a patch that allows sequence relation to be supported in
> logical replication via the decoding plugin ( test_decoding for
> example ); it does not support sequence relation in logical
> replication between a PG publisher and a PG subscriber via pgoutput
> plugin as it will require much more work.

Could you expand on "much more work"? Once decoding support is there,
that shouldn't be that much?


> Sequence changes caused by other sequence-related SQL functions like
> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
> replicating changes caused by these should not be a problem.

I think this really would need to handle at the very least setval to
make sense.


> For the replication to make sense, the patch actually disables the WAL
> update at every 32 nextval() calls, so every call to nextval() will
> emit a WAL update for proper replication. This is done by setting
> SEQ_LOG_VALS to 0 in sequence.c

Why is that needed? ISTM updating in increments of 32 is fine for
replication purposes? It's good imo, because sending out more granular
increments would increase the size of the WAL stream?



> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
> index 93c948856e..7a7e572d6c 100644
> --- a/contrib/test_decoding/test_decoding.c
> +++ b/contrib/test_decoding/test_decoding.c
> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>                                     &change->data.tp.oldtuple->tuple,
>                                     true);
>             break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
> +                    if (change->data.sequence.newtuple == NULL)
> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
> +                    else
> +                        tuple_to_stringinfo(ctx->out, tupdesc,
> +                                            &change->data.sequence.newtuple->tuple,
> +                                            false);
> +                    break;
>         default:
>             Assert(false);
>     }

You should also add tests - the main purpose of contrib/test_decoding is
to facilitate writing those...


> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);

Huh, why are you doing this? That's going to increase overhead of logical
decoding by many times?


> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    Assert(snapshot_now);
> +
> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
> +                                                change->data.sequence.relnode.relNode);
> +
> +                    if (reloid == InvalidOid &&
> +                        change->data.sequence.newtuple == NULL)
> +                        goto change_done;

I don't think this path should be needed? There's afaict no valid ways
we should be able to end up here without a tuple?


> +                    if (!RelationIsLogicallyLogged(relation))
> +                        goto change_done;

Similarly, this seems superflous and should perhaps be an assertion?

> +                    /* user-triggered change */
> +                    if (!IsToastRelation(relation))
> +                    {
> +                        ReorderBufferToastReplace(rb, txn, relation, change);
> +                        rb->apply_change(rb, txn, relation, change);
> +                    }
> +                    break;
>             }
>         }
>

This doesn't make sense either.



> diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
> index 626ecf4dc9..cf3fd45c5f 100644
> --- a/src/include/replication/reorderbuffer.h
> +++ b/src/include/replication/reorderbuffer.h
> @@ -62,7 +62,8 @@ enum ReorderBufferChangeType
>     REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
>     REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
>     REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
> -    REORDER_BUFFER_CHANGE_TRUNCATE
> +    REORDER_BUFFER_CHANGE_TRUNCATE,
> +    REORDER_BUFFER_CHANGE_SEQUENCE,
> };
>
> /* forward declaration */
> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>             CommandId    cmax;
>             CommandId    combocid;
>         }            tuplecid;
> +        /*
> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
> +         * set of relations to be truncated.
> +         */

What?

> +        struct
> +        {
> +            RelFileNode relnode;
> +            ReorderBufferTupleBuf *newtuple;
> +        }            sequence;
>     }            data;
>
>     /*

I don't think we should expose sequence changes via their tuples -
that'll unnecessarily expose a lot of implementation details.

Greetings,

Andres Freund




Re: Include sequence relation support in logical replication

От
Cary Huang
Дата:
Hi

I would like to share a v2 of the sequence replication patch that allows logical replication of sequence relation via the decoding plugins at the moment. I have restored the original logic where sequence emits a WAL update every 32 increment, but instead of writing a future value to the WAL, it writes the current value, so the decoding plugin can receive the current sequence value. Regression tests for test_decoding have been updated to reflect the sequence changes and a new sequence test is added as well.

Best

Cary Huang
-------------
HighGo Software Inc. (Canada)

---- On Thu, 26 Mar 2020 15:33:33 -0700 Cary Huang <cary.huang@highgo.ca> wrote ----

Hi Andres

thanks for your reply and your patch review. Please see my comments below

>On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> I have shared a patch that allows sequence relation to be supported in
>> logical replication via the decoding plugin ( test_decoding for
>> example ); it does not support sequence relation in logical
>> replication between a PG publisher and a PG subscriber via pgoutput
>> plugin as it will require much more work.
>
> Could you expand on "much more work"? Once decoding support is there,
> that shouldn't be that much?

By much more work, I meant more source files will need to be changed to have sequence replication
supported between a PG subscriber and publisher using pgoutput plugin. About 10 more source file changes.
Idea is similar though.

>> Sequence changes caused by other sequence-related SQL functions like
>> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
>> replicating changes caused by these should not be a problem.
>
> I think this really would need to handle at the very least setval to
> make sense.

yes, sure

>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c 
>
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

yes, updating WAL at every 32 increment is good and have huge performance benefits according 
to Michael, but when it is replicated logically to subscribers, the sequence value they receive would not
make much sense to them.
For example, 

If i have a Sequence called "seq" with current value = 100 and increment = 5. The nextval('seq') call will
return 105 to the client but it will write 260 to WAL record ( 100 + (5*32) ), because that is the value after 32
increments and internally it is also maintaining a "log_cnt" counter that tracks how many nextval() calls have been invoked
since the last WAL write, so it could kind of derive backwards to find the proper next value to return to client. 

But the subscriber for this sequence will receive a change of 260 instead of 105, and it does not represent the current
sequence status. Subscriber is not able to derive backwards because it does not know the increment size in its schema.

Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this 32 increment behavior and makes WAL update at every nextval() call
and therefore the subscriber to this sequence will receive the same value (105) as the client, as a cost of more WAL writes.

I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without the need to
disabling the 32 increment behavior?

>> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
>> index 93c948856e..7a7e572d6c 100644
>> --- a/contrib/test_decoding/test_decoding.c
>> +++ b/contrib/test_decoding/test_decoding.c
>> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>>                                     &change->data.tp.oldtuple->tuple,
>>                                     true);
>>             break;
>> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
>> +                    if (change->data.sequence.newtuple == NULL)
>> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
>> +                    else
>> +                        tuple_to_stringinfo(ctx->out, tupdesc,
>> +                                            &change->data.sequence.newtuple->tuple,
>> +                                            false);
>> +                    break;
>>         default:
>>             Assert(false);
>>     } 
>
> You should also add tests - the main purpose of contrib/test_decoding is
> to facilitate writing those...

thanks, I will add

>> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);
>
> Huh, why are you doing this? That's going to increase overhead of logical
> decoding by many times?

This is needed to allow snapshot to be built inside DecodeCommit() function. Regular changes caused by INSERT also has this
function called so I assume it is needed to ensure proper decoding. Without this, a snapshot will not be built and the change
transaction will not be logged

>> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +                    Assert(snapshot_now);
>> +
>> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
>> +                                                change->data.sequence.relnode.relNode);
>> +
>> +                    if (reloid == InvalidOid &&
>> +                        change->data.sequence.newtuple == NULL)
>> +                        goto change_done;
>
> I don't think this path should be needed? There's afaict no valid ways
> we should be able to end up here without a tuple?

yeah you are right, I can remove the tuple check

>> +                    if (!RelationIsLogicallyLogged(relation))
>> +                        goto change_done;
>
> Similarly, this seems superflous and should perhaps be an assertion?

I think it should be ok to check a relation like this, because it also will check the persistence of the relation and whether
wal_level is set to 'logical'. It is commonly used in the regular INSERT cases so I thought it would make sense to use it
for sequence.

>> +                    /* user-triggered change */
>> +                    if (!IsToastRelation(relation))
>> +                    {
>> +                        ReorderBufferToastReplace(rb, txn, relation, change);
>> +                        rb->apply_change(rb, txn, relation, change);
>> +                    }
>> +                    break;
>>             }
>>         }
>>
>
> This doesn't make sense either.

agreed, it should not be here.

>> /* forward declaration */
>> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>>             CommandId    cmax;
>>             CommandId    combocid;
>>         }            tuplecid;
>> +        /*
>> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
>> +         * set of relations to be truncated.
>> +         */
>
> What?

Will fix the comment

>> +        struct
>> +        {
>> +            RelFileNode relnode;
>> +            ReorderBufferTupleBuf *newtuple;
>> +        }            sequence;
>>     }            data;
>>
>>     /*
>
> I don't think we should expose sequence changes via their tuples -
> that'll unnecessarily expose a lot of implementation details.

Can you elaborate more on this? Sequence writes its tuple data in WAL and triggers a change
event to logical decoding logic. What else can I use to expose a sequence change?

Best

Cary Huang
-------------
HighGo Software Inc. (Canada)




---- On Wed, 25 Mar 2020 12:27:28 -0700 Andres Freund <andres@anarazel.de> wrote ----




Hi,

On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
> I have shared a patch that allows sequence relation to be supported in
> logical replication via the decoding plugin ( test_decoding for
> example ); it does not support sequence relation in logical
> replication between a PG publisher and a PG subscriber via pgoutput
> plugin as it will require much more work.

Could you expand on "much more work"? Once decoding support is there,
that shouldn't be that much?


> Sequence changes caused by other sequence-related SQL functions like
> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
> replicating changes caused by these should not be a problem.

I think this really would need to handle at the very least setval to
make sense.


> For the replication to make sense, the patch actually disables the WAL
> update at every 32 nextval() calls, so every call to nextval() will
> emit a WAL update for proper replication. This is done by setting
> SEQ_LOG_VALS to 0 in sequence.c

Why is that needed? ISTM updating in increments of 32 is fine for
replication purposes? It's good imo, because sending out more granular
increments would increase the size of the WAL stream?



> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
> index 93c948856e..7a7e572d6c 100644
> --- a/contrib/test_decoding/test_decoding.c
> +++ b/contrib/test_decoding/test_decoding.c
> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>                                     &change->data.tp.oldtuple->tuple,
>                                     true);
>             break;
> +        case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    appendStringInfoString(ctx->out, " SEQUENCE:");
> +                    if (change->data.sequence.newtuple == NULL)
> +                        appendStringInfoString(ctx->out, " (no-tuple-data)");
> +                    else
> +                        tuple_to_stringinfo(ctx->out, tupdesc,
> +                                            &change->data.sequence.newtuple->tuple,
> +                                            false);
> +                    break;
>         default:
>             Assert(false);
>     }

You should also add tests - the main purpose of contrib/test_decoding is
to facilitate writing those...


> +    ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);

Huh, why are you doing this? That's going to increase overhead of logical
decoding by many times?


> +                case REORDER_BUFFER_CHANGE_SEQUENCE:
> +                    Assert(snapshot_now);
> +
> +                    reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
> +                                                change->data.sequence.relnode.relNode);
> +
> +                    if (reloid == InvalidOid &&
> +                        change->data.sequence.newtuple == NULL)
> +                        goto change_done;

I don't think this path should be needed? There's afaict no valid ways
we should be able to end up here without a tuple?


> +                    if (!RelationIsLogicallyLogged(relation))
> +                        goto change_done;

Similarly, this seems superflous and should perhaps be an assertion?

> +                    /* user-triggered change */
> +                    if (!IsToastRelation(relation))
> +                    {
> +                        ReorderBufferToastReplace(rb, txn, relation, change);
> +                        rb->apply_change(rb, txn, relation, change);
> +                    }
> +                    break;
>             }
>         }
>

This doesn't make sense either.



> diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
> index 626ecf4dc9..cf3fd45c5f 100644
> --- a/src/include/replication/reorderbuffer.h
> +++ b/src/include/replication/reorderbuffer.h
> @@ -62,7 +62,8 @@ enum ReorderBufferChangeType
>     REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
>     REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
>     REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
> -    REORDER_BUFFER_CHANGE_TRUNCATE
> +    REORDER_BUFFER_CHANGE_TRUNCATE,
> +    REORDER_BUFFER_CHANGE_SEQUENCE,
> };
>
> /* forward declaration */
> @@ -149,6 +150,15 @@ typedef struct ReorderBufferChange
>             CommandId    cmax;
>             CommandId    combocid;
>         }            tuplecid;
> +        /*
> +         * Truncate data for REORDER_BUFFER_CHANGE_SEQUENCE representing one
> +         * set of relations to be truncated.
> +         */

What?

> +        struct
> +        {
> +            RelFileNode relnode;
> +            ReorderBufferTupleBuf *newtuple;
> +        }            sequence;
>     }            data;
>
>     /*

I don't think we should expose sequence changes via their tuples -
that'll unnecessarily expose a lot of implementation details.

Greetings,

Andres Freund




Вложения

Re: Include sequence relation support in logical replication

От
Andres Freund
Дата:
Hi,

On 2020-03-26 15:33:33 -0700, Cary Huang wrote:

> >> For the replication to make sense, the patch actually disables the WAL 
> 
> >> update at every 32 nextval() calls, so every call to nextval() will 
> 
> >> emit a WAL update for proper replication. This is done by setting 
> 
> >> SEQ_LOG_VALS to 0 in sequence.c 
> 
> >
> 
> > Why is that needed? ISTM updating in increments of 32 is fine for 
> 
> > replication purposes? It's good imo, because sending out more granular 
> 
> > increments would increase the size of the WAL stream? 
> 
> 
> 
> yes, updating WAL at every 32 increment is good and have huge performance benefits according 
> 
> to Michael, but when it is replicated logically to subscribers, the sequence value they receive would not 
> 
> make much sense to them.
> 
> For example, 
> 
> 
> 
> If i have a Sequence called "seq" with current value = 100 and increment = 5. The nextval('seq') call will
> 
> return 105 to the client but it will write 260 to WAL record ( 100 + (5*32) ), because that is the value after 32
> 
> increments and internally it is also maintaining a "log_cnt" counter that tracks how many nextval() calls have been
invoked
> 
> since the last WAL write, so it could kind of derive backwards to find the proper next value to return to client. 
> 
> 
> 
> But the subscriber for this sequence will receive a change of 260 instead of 105, and it does not represent the
current
> 
> sequence status. Subscriber is not able to derive backwards because it does not know the increment size in its
schema.

What is the problem with the subscriber seeing 260? This already can
happen on the primary today, if there is a crash / immediate
restart. But that is fine - sequences don't guarantee that they are free
of gaps, just that each call will return a bigger value than before.


> 
> Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this 32 increment behavior and makes WAL update at every
nextval()call
 
> 
> and therefore the subscriber to this sequence will receive the same value (105) as the client, as a cost of more WAL
writes.
> 
> 
> 
> I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without
theneed to
 
> 
> disabling the 32 increment behavior?

It simply shouldn't expect that to be the case.  What do you need it
for?

As far as I can tell replicating sequence values is useful to allow
failover, by ensuring all sequences will return sensible values going
forward. That does not require to now precise values.


Greetings,

Andres Freund



Re: Include sequence relation support in logical replication

От
Craig Ringer
Дата:
On Thu, 16 Apr 2020 at 07:44, Andres Freund <andres@anarazel.de> wrote:

> I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without the need to
>
> disabling the 32 increment behavior?

It simply shouldn't expect that to be the case.  What do you need it
for?

As far as I can tell replicating sequence values is useful to allow
failover, by ensuring all sequences will return sensible values going
forward. That does not require to now precise values.

Totally agree. Code that relies on getting specific sequence values is broken code. Alas, very common, but still broken.

Cary, by way of background a large part of why this wasn't supported by logical decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of transactional and non-transactional state in a very messy way. This made it very hard to achieve sensible behaviour for logical decoding.

As it is, make sure your regression tests carefully cover the following cases, as TAP tests in src/test/recovery, probably a new module for logical decoding of sequences:

1.

* Begin txn
* Create sequence
* Call nextval() on sequence over generate_series() and discard results
* Rollback
* Issue a dummy insert+commit to some other table to force logical decoding to send something
* Ensure subscription catches up successfully

This checks that we cope with advances for a sequence that doesn't get created.

2.
 
* Begin 1st txn
* Create a sequence
* Use the sequence to populate a temp table with enough rows to ensure sequence updates are written
* Begin a 2nd txn
* Issue a dummy insert+commit to some other table to force logical decoding to send something
* Commit the 2nd txn
* Commit the 1st txn
* Wait for subscription catchup
* Check that the sequence value on the subscriber reflects the value after sequence advance, not the value at creation time

This makes sure that sequence advances are handled sensibly when they arrive for a sequence that does not yet exist in the catalogs.

You'll need to run psql in an IPC::Run background session for that. We should really have a helper for this. I'll see if I'm allowed to post the one I use for my own TAP tests to the list.

Re: Include sequence relation support in logical replication

От
Cary Huang
Дата:
Hi Craig, Andres

Thank you guys so much for your reviews and comments. Really helpful. Yes you guys are right, Sequence does not guarantee free of gaps and replicating sequence is useful for failover cases, then there will be no problem for a subscriber to get a future value 32 increments after. I will do more analysis on my end based on your comments and refine the patch with better test cases. Much appreciated of your help.

Best regards

Cary Huang
-------------
HighGo Software Inc. (Canada)

---- On Wed, 15 Apr 2020 22:18:28 -0700 Craig Ringer <craig@2ndquadrant.com> wrote ----

On Thu, 16 Apr 2020 at 07:44, Andres Freund <andres@anarazel.de> wrote:

> I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without the need to
>
> disabling the 32 increment behavior?

It simply shouldn't expect that to be the case.  What do you need it
for?

As far as I can tell replicating sequence values is useful to allow
failover, by ensuring all sequences will return sensible values going
forward. That does not require to now precise values.

Totally agree. Code that relies on getting specific sequence values is broken code. Alas, very common, but still broken.

Cary, by way of background a large part of why this wasn't supported by logical decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of transactional and non-transactional state in a very messy way. This made it very hard to achieve sensible behaviour for logical decoding.

As it is, make sure your regression tests carefully cover the following cases, as TAP tests in src/test/recovery, probably a new module for logical decoding of sequences:

1.

* Begin txn
* Create sequence
* Call nextval() on sequence over generate_series() and discard results
* Rollback
* Issue a dummy insert+commit to some other table to force logical decoding to send something
* Ensure subscription catches up successfully

This checks that we cope with advances for a sequence that doesn't get created.

2.
 
* Begin 1st txn
* Create a sequence
* Use the sequence to populate a temp table with enough rows to ensure sequence updates are written
* Begin a 2nd txn
* Issue a dummy insert+commit to some other table to force logical decoding to send something

* Commit the 2nd txn
* Commit the 1st txn
* Wait for subscription catchup
* Check that the sequence value on the subscriber reflects the value after sequence advance, not the value at creation time

This makes sure that sequence advances are handled sensibly when they arrive for a sequence that does not yet exist in the catalogs.

You'll need to run psql in an IPC::Run background session for that. We should really have a helper for this. I'll see if I'm allowed to post the one I use for my own TAP tests to the list.


Re: Include sequence relation support in logical replication

От
Cary Huang
Дата:
Hi Craig

I have added more regression test cases to the sequence replication patch with emphasis on transactions and rollback per your suggestions. I find that when a transaction is aborted with rollback, the decoder plugin will not receive the change but the sequence value will in fact advance if nextval() or setval() were called. I have also made sequence replication an optional parameter in test_decoding so other test_decoding regression test cases will not need an update due to the new sequence replication function. The sequence update in this patch will emit an wal update every 32 increment, and each update is a future value 32 increments after like it was originally, so it is no longer required getting a specific value of sequence.

Could you elaborate more on your second case where it requires a psql in an IPC::Run background session and check if regression needs more coverage on certain cases?
thank you!


Cary Huang
-------------
HighGo Software Inc. (Canada)

---- On Thu, 16 Apr 2020 09:45:06 -0700 Cary Huang <cary.huang@highgo.ca> wrote ----

Hi Craig, Andres

Thank you guys so much for your reviews and comments. Really helpful. Yes you guys are right, Sequence does not guarantee free of gaps and replicating sequence is useful for failover cases, then there will be no problem for a subscriber to get a future value 32 increments after. I will do more analysis on my end based on your comments and refine the patch with better test cases. Much appreciated of your help.

Best regards

Cary Huang
-------------
HighGo Software Inc. (Canada)

---- On Wed, 15 Apr 2020 22:18:28 -0700 Craig Ringer <craig@2ndquadrant.com> wrote ----




On Thu, 16 Apr 2020 at 07:44, Andres Freund <andres@anarazel.de> wrote:

> I would like to ask if you have some suggestions or ideas that can make subscriber receives the current value without the need to
>
> disabling the 32 increment behavior?

It simply shouldn't expect that to be the case.  What do you need it
for?

As far as I can tell replicating sequence values is useful to allow
failover, by ensuring all sequences will return sensible values going
forward. That does not require to now precise values.

Totally agree. Code that relies on getting specific sequence values is broken code. Alas, very common, but still broken.

Cary, by way of background a large part of why this wasn't supported by logical decoding back in 9.4 is that until the pg_catalog.pg_sequence relation was introduced in PostgreSQL 10, the sequence relfilenode intermixed a bunch of transactional and non-transactional state in a very messy way. This made it very hard to achieve sensible behaviour for logical decoding.

As it is, make sure your regression tests carefully cover the following cases, as TAP tests in src/test/recovery, probably a new module for logical decoding of sequences:

1.

* Begin txn
* Create sequence
* Call nextval() on sequence over generate_series() and discard results
* Rollback
* Issue a dummy insert+commit to some other table to force logical decoding to send something
* Ensure subscription catches up successfully

This checks that we cope with advances for a sequence that doesn't get created.

2.
 
* Begin 1st txn
* Create a sequence
* Use the sequence to populate a temp table with enough rows to ensure sequence updates are written
* Begin a 2nd txn
* Issue a dummy insert+commit to some other table to force logical decoding to send something

* Commit the 2nd txn
* Commit the 1st txn
* Wait for subscription catchup
* Check that the sequence value on the subscriber reflects the value after sequence advance, not the value at creation time

This makes sure that sequence advances are handled sensibly when they arrive for a sequence that does not yet exist in the catalogs.

You'll need to run psql in an IPC::Run background session for that. We should really have a helper for this. I'll see if I'm allowed to post the one I use for my own TAP tests to the list.


Вложения

Re: Include sequence relation support in logical replication

От
Andres Freund
Дата:
Hi,

On 2020-05-08 16:32:38 -0700, Cary Huang wrote:
> I have added more regression test cases to the sequence replication
> patch with emphasis on transactions and rollback per your
> suggestions. I find that when a transaction is aborted with rollback,
> the decoder plugin will not receive the change but the sequence value
> will in fact advance if nextval() or setval() were called.

Right. The sequence advances shouldn't be treated
transactionally. That's already (optionally) done similarly for
messages, so you should be able to copy that code.

Greetings,

Andres Freund