Re: [HACKERS] Issues with logical replication

Поиск
Список
Период
Сортировка
От Petr Jelinek
Тема Re: [HACKERS] Issues with logical replication
Дата
Msg-id d0588e65-289c-9768-7a43-2e8ef4d05620@2ndquadrant.com
обсуждение исходный текст
Ответ на [HACKERS] Issues with logical replication  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Ответы Re: [HACKERS] Issues with logical replication  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Re: [HACKERS] Issues with logical replication  (Stas Kelvich <s.kelvich@postgrespro.ru>)
Список pgsql-hackers
On 02/10/17 17:49, Konstantin Knizhnik wrote:
> I have faced two issues with logical replication.
> Reproducing scenario:
> 
> 1. start two Postgres instances (I start both at local machine).
> 2. Initialize pgbench tables at both instances:
>     pgbench -i -s 10 postgres
> 3. Create publication of pgbench_accounts table at one node:
>     create publication pub for table pgbench_accounts;
> 4. Create subscription at another node:
>     delete from pgbench_accounts;
>     CREATE SUBSCRIPTION sub connection 'dbname=postgres host=localhost
> port=5432 sslmode=disable' publication pub;
>     CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS
> TRIGGER AS $$
>     BEGIN
>     --  RETURN NEW;
>     END $$ LANGUAGE plpgsql;
>    CREATE TRIGGER replication_trigger BEFORE INSERT OR UPDATE OR DELETE
> ON pgbench_accounts FOR EACH ROW EXECUTE PROCEDURE
> replication_trigger_proc();
>    ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER replication_trigger;
> 5. Start pgbench at primary node:
>     pgbench -T 1000 -P 2 -c 10 postgres
> 
> 
> Please notice commented "return new" statement. Invocation of this
> function cause error and in log we see repeated messages:
> 
> 2017-10-02 16:47:46.764 MSK [32129] LOG:  logical replication table
> synchronization worker for subscription "sub", table "pgbench_accounts"
> has started
> 2017-10-02 16:47:46.771 MSK [32129] ERROR:  control reached end of
> trigger procedure without RETURN
> 2017-10-02 16:47:46.771 MSK [32129] CONTEXT:  PL/pgSQL function
> replication_trigger_proc()
>     COPY pgbench_accounts, line 1: "1    1    0 "
> 2017-10-02 16:47:46.772 MSK [28020] LOG:  worker process: logical
> replication worker for subscription 17264 sync 17251 (PID 32129) exited
> with exit code 1
> ...
> 
> 
> After few minutes of work primary node (with publication) is crashed
> with the following stack trace:
> 
> Program terminated with signal SIGABRT, Aborted.
> #0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56    ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x00007f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f3608f92028 in __GI_abort () at abort.c:89
> #2  0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30
> "!(((xid) != ((TransactionId) 0)))",
>     errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
> lineNumber=582) at assert.c:54
> #3  0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
> oper=XLTW_None) at lmgr.c:582
> #4  0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
> cutoff=898498) at snapbuild.c:1400
> #5  0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
> lsn=798477760, running=0x2749198) at snapbuild.c:1311
> #6  0x000000000081c339 in SnapBuildProcessRunningXacts
> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106


Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
transaction ids but we specifically skip those in
GetRunningTransactionData(). Can you check pg_waldump output for that
record (lsn is shown in the backtrace)?

> 
> Now fix the trigger function:
> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
> BEGIN
>   RETURN NEW;
> END $$ LANGUAGE plpgsql;
> 
> And manually perform at master two updates inside one transaction:
> 
> postgres=# begin;
> BEGIN
> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
> UPDATE 1
> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
> UPDATE 1
> postgres=# commit;
> <hangs>
> 
> and in replica log we see:
> 2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply
> worker for subscription "sub" has started
> 2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible
> tuple
> 2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical
> replication worker for subscription 16403 (PID 2954) exited with exit
> code 1
> 
> Error happens in trigger.c:
> 
> #3  0x000000000069bddb in GetTupleForTrigger (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
>     lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
> trigger.c:3103
> #4  0x000000000069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
>     fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
> #5  0x00000000006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
>     at execReplication.c:461
> #6  0x0000000000820894 in apply_handle_update (s=0x7ffc442163b0) at
> worker.c:736

We have locked the same tuple in RelationFindReplTupleByIndex() just
before this gets called and didn't get the same error. I guess we do
something wrong with snapshots. Will need to investigate more.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: [HACKERS][BUG] Cache invalidation for queries that contains constof temporary composite type