Re: Subscription stuck at initialize state

Поиск
Список
Период
Сортировка
От Abhishek Bhola
Тема Re: Subscription stuck at initialize state
Дата
Msg-id CAEDsCzgLHLVQXbrorugGek0iVA17W=YNz4H_uDV7jb2GBWV4_w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Subscription stuck at initialize state  (Vijaykumar Jain <vijaykumarjain.github@gmail.com>)
Список pgsql-general
Hi Vijaykumar

Thank you so much for taking out so much of your time to recreate the bug.
I checked the max_logical_replication_workers on both the  nodes and they are set at 4 at the moment.
The reason why it is failing is that there are 3 existing replications on the target node already and when I create this 4th one, it is not able to create that temporary replication worker, needed to copy the initial data, and therefore it never moves ahead of that state.
Since these are production databases, I can't restart them during the weekdays, so I will try to change them during the weekend.
But I did find the warning message in the target DB logs.

"WARNING,53400,"out of logical replication worker slots",,"You might need to increase max_logical_replication_workers.",,,,,,,"","logical replication worker"
So I am sure this will work.

THANK YOU SO MUCH.


On Fri, Feb 4, 2022 at 3:04 AM Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:
trimming the email, to avoid noise.

I spent a lot of time trying multiple options/combinations and finally managed to replicate your debug output.
this is when i have , 

postgres=# show max_logical_replication_workers;
 max_logical_replication_workers
---------------------------------
 2
(1 row)


on publisher
postgres@controller:~$ psql -p 5001
psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
Type "help" for help.

postgres=# create table t_bytea(id int primary key, somebytea bytea);
CREATE TABLE
postgres=# create table t1(id int);
CREATE TABLE
postgres=# create table t2(id int);
CREATE TABLE
postgres=# create table t3(id int);
CREATE TABLE
postgres=# create publication mypub1 for table t1;
CREATE PUBLICATION
postgres=# create publication mypub2 for table t2;
CREATE PUBLICATION
postgres=# create publication mypub3 for table t3;
CREATE PUBLICATION
postgres=# create publication mypub4 for table t3;
CREATE PUBLICATION
postgres=# create publication mypub5 for table t_bytea;
CREATE PUBLICATION
postgres=# insert into t_bytea  select x,repeat(repeat('xxx', 100), 1000)::bytea from generate_series(1, 1000) x;
INSERT 0 1000



on subscriber
postgres@controller:~$ psql -p 5002
psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
Type "help" for help.

postgres=# create table t_bytea(id int primary key, somebytea bytea);
CREATE TABLE
postgres=#  create table t1(id int);
CREATE TABLE
postgres=#  create table t2(id int);
CREATE TABLE
postgres=#  create table t3(id int);
CREATE TABLE
postgres=#  create table t4(id int);
CREATE TABLE
postgres=#  create table t5(id int);
CREATE TABLE
postgres=# create subscription mysub1 connection 'port=5001' publication mypub1;
NOTICE:  created replication slot "mysub1" on publisher
CREATE SUBSCRIPTION
postgres=# create subscription mysub2 connection 'port=5001' publication mypub2;
NOTICE:  created replication slot "mysub2" on publisher
CREATE SUBSCRIPTION
postgres=# create subscription mysub3 connection 'port=5001' publication mypub3;
NOTICE:  created replication slot "mysub3" on publisher
CREATE SUBSCRIPTION
postgres=# create subscription mysub4 connection 'port=5001' publication mypub4;
NOTICE:  created replication slot "mysub4" on publisher
CREATE SUBSCRIPTION
postgres=# create subscription mysub5 connection 'port=5001' publication mypub5;
NOTICE:  created replication slot "mysub5" on publisher
CREATE SUBSCRIPTION
postgres=# select count(1) from t_bytea;
 count
-------
     0
(1 row)

postgres=# table pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16406 |   16391 | r          | 0/1722838
   16407 |   16394 | i          |
   16408 |   16397 | i          |
   16409 |   16397 | i          |
   16410 |   16384 | i          |
(5 rows)

# as expected no data in t_bytea (as it could not get any worker)
postgres=# select count(1) from t_bytea;
 count
-------
     0
(1 row)



but logs clearly state the problem
2022-02-03 23:18:31.107 IST [3430] LOG:  logical replication table synchronization worker for subscription "mysub1", table "t1" has started
2022-02-03 23:18:31.138 IST [3430] LOG:  logical replication table synchronization worker for subscription "mysub1", table "t1" has finished
2022-02-03 23:18:40.730 IST [3433] LOG:  logical replication apply worker for subscription "mysub2" has started
2022-02-03 23:18:40.737 IST [3433] WARNING:  out of logical replication worker slots
2022-02-03 23:18:40.737 IST [3433] HINT:  You might need to increase max_logical_replication_workers.
2022-02-03 23:18:45.865 IST [3433] WARNING:  out of logical replication worker slots


#publisher logs
2022-02-03 23:18:31.096 IST [3427] STATEMENT:  CREATE_REPLICATION_SLOT "mysub1" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2022-02-03 23:18:31.106 IST [3429] LOG:  starting logical decoding for slot "mysub1"
2022-02-03 23:18:31.106 IST [3429] DETAIL:  Streaming transactions committing after 0/1722800, reading WAL from 0/17227C8.
2022-02-03 23:18:31.106 IST [3429] STATEMENT:  START_REPLICATION SLOT "mysub1" LOGICAL 0/0 (proto_version '2', publication_names '"mypub1"')
2022-02-03 23:18:31.106 IST [3429] LOG:  logical decoding found consistent point at 0/17227C8
2022-02-03 23:18:31.106 IST [3429] DETAIL:  There are no running transactions.
2022-02-03 23:18:31.106 IST [3429] STATEMENT:  START_REPLICATION SLOT "mysub1" LOGICAL 0/0 (proto_version '2', publication_names '"mypub1"')
2022-02-03 23:18:31.129 IST [3431] LOG:  logical decoding found consistent point at 0/1722800
2022-02-03 23:18:31.129 IST [3431] DETAIL:  There are no running transactions.
2022-02-03 23:18:31.129 IST [3431] STATEMENT:  CREATE_REPLICATION_SLOT "pg_16406_sync_16391_7060540926182153512" LOGICAL pgoutput USE_SNAPSHOT
2022-02-03 23:18:31.135 IST [3431] LOG:  starting logical decoding for slot "pg_16406_sync_16391_7060540926182153512"
2022-02-03 23:18:31.135 IST [3431] DETAIL:  Streaming transactions committing after 0/1722838, reading WAL from 0/1722800.
2022-02-03 23:18:31.135 IST [3431] STATEMENT:  START_REPLICATION SLOT "pg_16406_sync_16391_7060540926182153512" LOGICAL 0/1722838 (proto_version '2', publication_names '"mypub1"')
2022-02-03 23:18:35.718 IST [3432] LOG:  logical decoding found consistent point at 0/1722838
2022-02-03 23:18:35.718 IST [3432] DETAIL:  There are no running transactions.
2022-02-03 23:18:35.718 IST [3432] STATEMENT:  CREATE_REPLICATION_SLOT "mysub2" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2022-02-03 23:18:40.737 IST [3434] LOG:  starting logical decoding for slot "mysub2"
2022-02-03 23:18:40.737 IST [3434] DETAIL:  Streaming transactions committing after 0/1722870, reading WAL from 0/1722838.
2022-02-03 23:18:40.737 IST [3434] STATEMENT:  START_REPLICATION SLOT "mysub2" LOGICAL 0/0 (proto_version '2', publication_names '"mypub2"')
2022-02-03 23:18:40.737 IST [3434] LOG:  logical decoding found consistent point at 0/1722838
2022-02-03 23:18:40.737 IST [3434] DETAIL:  There are no running transactions.
2022-02-03 23:18:40.737 IST [3434] STATEMENT:  START_REPLICATION SLOT "mysub2" LOGICAL 0/0 (proto_version '2', publication_names '"mypub2"')
2022-02-03 23:18:40.857 IST [3435] LOG:  logical decoding found consistent point at 0/1722870
2022-02-03 23:18:40.857 IST [3435] DETAIL:  There are no running transactions.
2022-02-03 23:18:40.857 IST [3435] STATEMENT:  CREATE_REPLICATION_SLOT "mysub3" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2022-02-03 23:18:46.091 IST [3437] LOG:  logical decoding found consistent point at 0/17228A8
2022-02-03 23:18:46.091 IST [3437] DETAIL:  There are no running transactions.
2022-02-03 23:18:46.091 IST [3437] STATEMENT:  CREATE_REPLICATION_SLOT "mysub4" LOGICAL pgoutput NOEXPORT_SNAPSHOT
2022-02-03 23:18:51.941 IST [3439] LOG:  logical decoding found consistent point at 0/17228E0
2022-02-03 23:18:51.941 IST [3439] DETAIL:  There are no running transactions.
2022-02-03 23:18:51.941 IST [3439] STATEMENT:  CREATE_REPLICATION_SLOT "mysub5" LOGICAL pgoutput NOEXPORT_SNAPSHOT



# solution
so i bump the  max_logical_replication_workers = 10 and restart the db servers.
as there are workers available, replication catches up and tables in sync.

# on subscriber
psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
Type "help" for help.

postgres=# select count(1) from t_bytea;
 count
-------
     0
(1 row)

postgres=# table pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+-----------
   16406 |   16391 | r          | 0/1722838
   16407 |   16394 | r          | 0/2000110
   16408 |   16397 | r          | 0/2000148
   16409 |   16397 | r          | 0/2000148
   16410 |   16384 | r          | 0/20001B8
(5 rows)

postgres=# select count(1) from t_bytea;
 count
-------
  1000
(1 row)


so two things,
can you check logs for warning messages  of out of  logical replication worker slots on subscribers ?
can you try bumping them, and check if it catches up (srsubstate should either be r(READY) or d (INITIAL COPY))
 
also monitor logs for both publisher and subscriber.

I also figured out all the debugging steps I requested earlier were useless. 



--

This correspondence (including any attachments) is for the intended recipient(s) only. It may contain confidential or privileged information or both. No confidentiality or privilege is waived or lost by any mis-transmission. If you receive this correspondence by mistake, please contact the sender immediately, delete this correspondence (and all attachments) and destroy any hard copies. You must not use, disclose, copy, distribute or rely on any part of this correspondence (including any attachments) if you are not the intended recipient(s).本メッセージに記載および添付されている情報(以下、総称して「本情報」といいます。)は、本来の受信者による使用のみを意図しています。誤送信等により本情報を取得された場合でも、本情報に係る秘密、または法律上の秘匿特権が失われるものではありません。本電子メールを受取られた方が、本来の受信者ではない場合には、本情報及びそのコピーすべてを削除・破棄し、本電子メールが誤って届いた旨を発信者宛てにご通知下さいますようお願いします。本情報の閲覧、発信または本情報に基づくいかなる行為も明確に禁止されていることをご了承ください。

В списке pgsql-general по дате отправления:

Предыдущее
От: Michael Lewis
Дата:
Сообщение: Re: increasing effective_cache_size slows down join queries by a factor of 4000x
Следующее
От: aditya desai
Дата:
Сообщение: Fwd: Increase fetch fize of oracl_fdw(ALTER SERVER)