Re: Skipping logical replication transactions on subscriber side

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: Skipping logical replication transactions on subscriber side
Дата
Msg-id CAD21AoAGpOTDAQeEndjxZyxxcwcQCs8zH+k0uO_gFxhiS8ctcQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Skipping logical replication transactions on subscriber side  (vignesh C <vignesh21@gmail.com>)
Список pgsql-hackers
On Wed, Nov 17, 2021 at 3:52 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 16, 2021 at 12:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Nov 15, 2021 at 11:43 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, Nov 15, 2021 at 2:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, Nov 15, 2021 at 4:49 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > > > >
> > > > > On Mon, Nov 15, 2021 at 1:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > I've attached an updated patch that incorporates all comments I got so
> > > > > > far. Please review it.
> > > > > >
> > > > >
> > > > > Thanks for the updated patch.
> > > > > A few minor comments:
> > > > >
> > > > > doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > > > >
> > > > > (1) tab in doc updates
> > > > >
> > > > > There's a tab before "Otherwise,":
> > > > >
> > > > > +        copy of the relation with <parameter>relid</parameter>.
> > > > >         Otherwise,
> > > >
> > > > Fixed.
> > > >
> > > > >
> > > > > src/backend/utils/adt/pgstatfuncs.c
> > > > >
> > > > > (2) The function comment for "pg_stat_reset_subscription_worker_sub"
> > > > > seems a bit long and I expected it to be multi-line (did you run
> > > > > pg_indent?)
> > > >
> > > > I ran pg_indent on pgstatfuncs.c but it didn't become a multi-line comment.
> > > >
> > > > >
> > > > > src/include/pgstat.h
> > > > >
> > > > > (3) Remove PgStat_StatSubWorkerEntry.dbid?
> > > > >
> > > > > The "dbid" member of the new PgStat_StatSubWorkerEntry struct doesn't
> > > > > seem to be used, so I think it should be removed.
> > > > > (I could remove it and everything builds OK and tests pass).
> > > > >
> > > >
> > > > Fixed.
> > > >
> > > > Thank you for the comments! I've updated an updated version patch.
> > >
> > > Thanks for the updated patch.
> > > I found one issue:
> > > This Assert can fail in few cases:
> > > +void
> > > +pgstat_report_subworker_error(Oid subid, Oid subrelid, Oid relid,
> > > +
> > > LogicalRepMsgType command, TransactionId xid,
> > > +                                                         const char *errmsg)
> > > +{
> > > +       PgStat_MsgSubWorkerError msg;
> > > +       int                     len;
> > > +
> > > +       Assert(strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN);
> > > +       len = offsetof(PgStat_MsgSubWorkerError, m_message[0]) +
> > > strlen(errmsg) + 1;
> > > +
> > >
> > > I could reproduce the problem with the following scenario:
> > > Publisher:
> > > create table t1 (c1 varchar);
> > > create publication pub1 for table t1;
> > > insert into t1 values(repeat('abcd', 5000));
> > >
> > > Subscriber:
> > > create table t1(c1 smallint);
> > > create subscription sub1 connection 'dbname=postgres port=5432'
> > > publication pub1 with ( two_phase = true);
> > > postgres=# select * from pg_stat_subscription_workers;
> > > WARNING:  terminating connection because of crash of another server process
> > > DETAIL:  The postmaster has commanded this server process to roll back
> > > the current transaction and exit, because another server process
> > > exited abnormally and possibly corrupted shared memory.
> > > HINT:  In a moment you should be able to reconnect to the database and
> > > repeat your command.
> > > server closed the connection unexpectedly
> > >    This probably means the server terminated abnormally
> > >    before or while processing the request.
> > > The connection to the server was lost. Attempting reset: Failed.
> > >
> > > Subscriber logs:
> > > 2021-11-15 19:27:56.380 IST [15685] LOG:  logical replication apply
> > > worker for subscription "sub1" has started
> > > 2021-11-15 19:27:56.384 IST [15687] LOG:  logical replication table
> > > synchronization worker for subscription "sub1", table "t1" has started
> > > TRAP: FailedAssertion("strlen(errmsg) < PGSTAT_SUBWORKERERROR_MSGLEN",
> > > File: "pgstat.c", Line: 1946, PID: 15687)
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (ExceptionalCondition+0xd0)[0x55a18f3c727f]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (pgstat_report_subworker_error+0x7a)[0x55a18f126417]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (ApplyWorkerMain+0x493)[0x55a18f176611]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (StartBackgroundWorker+0x23c)[0x55a18f11f7e2]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (+0x54efc0)[0x55a18f134fc0]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (+0x54f3af)[0x55a18f1353af]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (+0x54e338)[0x55a18f134338]
> > > /lib/x86_64-linux-gnu/libpthread.so.0(+0x141f0)[0x7feef84371f0]
> > > /lib/x86_64-linux-gnu/libc.so.6(__select+0x57)[0x7feef81e3ac7]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (+0x5498c2)[0x55a18f12f8c2]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (PostmasterMain+0x134c)[0x55a18f12f1dd]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (+0x43c3d4)[0x55a18f0223d4]
> > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xd5)[0x7feef80fd565]
> > > postgres: logical replication worker for subscription 16387 sync 16384
> > > (_start+0x2e)[0x55a18ecaf4fe]
> > > 2021-11-15 19:27:56.483 IST [15645] LOG:  background worker "logical
> > > replication worker" (PID 15687) was terminated by signal 6: Aborted
> > > 2021-11-15 19:27:56.483 IST [15645] LOG:  terminating any other active
> > > server processes
> > > 2021-11-15 19:27:56.485 IST [15645] LOG:  all server processes
> > > terminated; reinitializing
> > >
> > > Here it fails because of a long error message ""invalid input syntax
> > > for type smallint:
> >
> > Good catch!
> >
> > >
\"abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabc...."
> > > because we try to insert varchar type data into smallint type.  Maybe
> > > we should trim the error message in this case.
> >
> > Right. I've fixed this issue and attached an updated patch.
>
> Thanks for the updated patch. The issue is fixed in the patch provided.
> I found that in one of the scenarios the statistics is getting lost:

Thank you for the tests!!

>
> Step 3:
> postgres=# select * from pg_stat_subscription_workers order by subid;
> subid | subname | subrelid | relid | command | xid | error_count |
> error_message | first_error_time | last_error_time
>
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+----------------------------------
> 16389 | sub1 | 16384 | 16384 | | | 17 | duplicate key value violates
> unique constraint "t1_pkey" | 2021-11-17 12:01:46.141086+05:30 |
> 2021-11-17 12:03:13.175698+05:30
> 16395 | sub2 | 16390 | 16390 | | | 16 | duplicate key value violates
> unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 |
> 2021-11-17 12:03:15.512249+05:30
> 16401 | sub3 | 16396 | 16396 | | | 16 | duplicate key value violates
> unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 |
> 2021-11-17 12:03:15.802225+05:30
> 16407 | sub4 | 16402 | 16402 | | | 16 | duplicate key value violates
> unique constraint "t4_pkey" | 2021-11-17 12:01:51.390638+05:30 |
> 2021-11-17 12:03:14.709496+05:30
> 16413 | sub5 | 16408 | 16408 | | | 16 | duplicate key value violates
> unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 |
> 2021-11-17 12:03:15.257235+05:30
>
> Step 4:
> Then restart the publisher
>
> Step 5:
> postgres=# select * from pg_stat_subscription_workers order by subid;
> subid | subname | subrelid | relid | command | xid | error_count |
> error_message |
> first_error_time | last_error_time
>
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-----
> -----------------------------+----------------------------------
> 16389 | sub1 | 16384 | 16384 | | | 1 | could not create replication
> slot "pg_16389_sync_16384_7031422794938304519": FATAL: terminating
> connection due to administrator command+| 2021
> -11-17 12:03:28.201247+05:30 | 2021-11-17 12:03:28.201247+05:30
> | | | | | | | server closed the connection unexpectedly +|
> |
> | | | | | | | This probably means the server terminated abnormally +|
> |
> | | | | | | | before or while proce |
> |
> 16395 | sub2 | 16390 | 16390 | | | 18 | duplicate key value violates
> unique constraint "t2_pkey" | 2021
> -11-17 12:01:51.337055+05:30 | 2021-11-17 12:03:23.832585+05:30
> 16401 | sub3 | 16396 | 16396 | | | 18 | duplicate key value violates
> unique constraint "t3_pkey" | 2021
> -11-17 12:01:51.352157+05:30 | 2021-11-17 12:03:26.567873+05:30
> 16407 | sub4 | 16402 | 16402 | | | 1 | could not create replication
> slot "pg_16407_sync_16402_7031422794938304519": FATAL: terminating
> connection due to administrator command+| 2021
> -11-17 12:03:28.196958+05:30 | 2021-11-17 12:03:28.196958+05:30
> | | | | | | | server closed the connection unexpectedly +|
> |
> | | | | | | | This probably means the server terminated abnormally +|
> |
> | | | | | | | before or while proce |
> |
> 16413 | sub5 | 16408 | 16408 | | | 18 | duplicate key value violates
> unique constraint "t5_pkey" | 2021
> -11-17 12:01:51.418825+05:30 | 2021-11-17 12:03:25.595697+05:30
>
> Step 6:
> postgres=# select * from pg_stat_subscription_workers order by subid;
> subid | subname | subrelid | relid | command | xid | error_count |
> error_message | first_error_time | last_error_time
>
-------+---------+----------+-------+---------+-----+-------------+------------------------------------------------------------+----------------------------------+----------------------------------
> 16389 | sub1 | 16384 | 16384 | | | 1 | duplicate key value violates
> unique constraint "t1_pkey" | 2021-11-17 12:03:33.346514+05:30 |
> 2021-11-17 12:03:33.346514+05:30
> 16395 | sub2 | 16390 | 16390 | | | 19 | duplicate key value violates
> unique constraint "t2_pkey" | 2021-11-17 12:01:51.337055+05:30 |
> 2021-11-17 12:03:33.437505+05:30
> 16401 | sub3 | 16396 | 16396 | | | 19 | duplicate key value violates
> unique constraint "t3_pkey" | 2021-11-17 12:01:51.352157+05:30 |
> 2021-11-17 12:03:33.482954+05:30
> 16407 | sub4 | 16402 | 16402 | | | 1 | duplicate key value violates
> unique constraint "t4_pkey" | 2021-11-17 12:03:33.327489+05:30 |
> 2021-11-17 12:03:33.327489+05:30
> 16413 | sub5 | 16408 | 16408 | | | 19 | duplicate key value violates
> unique constraint "t5_pkey" | 2021-11-17 12:01:51.418825+05:30 |
> 2021-11-17 12:03:33.374522+05:30
>
> We can see that sub1 and sub4 statistics are lost, old error_count
> value is lost. I'm not sure if this behavior is ok or not. Thoughts?
>

Looking at the outputs of steps 3, 5, and 6, the error messages are
different. In the current design, error_count is incremented only when
the exact same error (i.g., xid, command, relid, error message are the
same) comes. Since some different kinds of errors happened on the
subscription the error_count was reset. Similarly, the
first_error_time value was also reset.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: Parallel vacuum workers prevent the oldest xmin from advancing
Следующее
От: Aleksander Alekseev
Дата:
Сообщение: Re: Anything I can contribute?