Обсуждение: rare crash - FailedAssertion snapbuild.c Line: 580
Hello, To test postgres 11, I still regularly run series of short sessions of pgbench-over-logical-replication (basically the same thing that I used last year [1] - now in a perl incarnation). Most of the time the replication is stable and finishes correctly but sometimes (rarely) I get: TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", File: "snapbuild.c", Line: 580) This will probably be difficult to reproduce and to act upon but I wanted to report it anyway as in the course of the last few months I have seen it several times, on several machines. Always rarely, always postgres 11 (I did not try other versions). Erik Rijkers [1] https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl source/version: bf2d0462cd73 / REL_11_STABLE (compiled 20180828_1629) 1 primary, 4 replicas (on 1 host). Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres then: wait for log-repl sync. (I also added extra data type columns to the basic pgbench tables to test all the different data types) Below is a grep for the assert message, with 15 surrounding lines from the latest occurrence. $ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B 15 -Ei 'trap|assert' 65*/logfile.65* | less 6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG: logical decoding found consistent point at 0/89E038E0 6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG: received replication command: CREATE_REPLICATION_SLOT "sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG: logical decoding found initial starting point at 0/8AF5D590 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL: Waiting for transactions (approximately 1) older than 886297 to end. 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG: logical decoding found consistent point at 0/8B018310 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG: received replication command: START_REPLICATION SLOT "sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1', publication_names '"pub1_6516"') 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG: starting logical decoding for slot "sub2_6517_6517_18834_sync_18793" 6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL: Streaming transactions committing after 0/89E03918, reading WAL from 0/841EE960. 6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG: logical decoding found consistent point at 0/841EF3E0 6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL: Logical decoding will begin using saved snapshot. 6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG: received replication command: CREATE_REPLICATION_SLOT "sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG: logical decoding found consistent point at 0/8C6BA408 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL: There are no running transactions. 6516JTq_E8/logfile.6516:TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", File: "snapbuild.c", Line: 580) 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG: server process (PID 139541) was terminated by signal 6: Aborted 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL: Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG: terminating any other active server processes 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING: terminating connection because of crash of another server process 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT: In a moment you should be able to reconnect to the database and repeat your command. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT: slot "sub2_6517_6517", output plugin "pgoutput", in the change callback, associated LSN 0/8B11CF00 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING: terminating connection because of crash of another server process 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] HINT: In a moment you should be able to reconnect to the database and repeat your command. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] WARNING: terminating connection because of crash of another server process 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] HINT: In a moment you should be able to reconnect to the database and repeat your command. 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.116 CEST [130398] LOG: all server processes terminated; reinitializing 6516JTq_E8/logfile.6516-2018-08-29 14:28:20.515 CEST [139548] LOG: database system was interrupted; last known up at 2018-08-29 14:26:14 CEST
Hi, On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote: > To test postgres 11, I still regularly run series of short sessions of > pgbench-over-logical-replication (basically the same thing that I used last > year [1] - now in a perl incarnation). Most of the time the replication is > stable and finishes correctly but sometimes (rarely) I get: > > TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, > snap->xmin))", File: "snapbuild.c", Line: 580) > > This will probably be difficult to reproduce and to act upon but I wanted to > report it anyway as in the course of the last few months I have seen it > several times, on several machines. Always rarely, always postgres 11 (I did > not try other versions). Thanks for testing! Could you possibly run the tests with core files enabled, so we at get a backtrace in case of trouble? Knowing what the values here are would be tremendously helpful... Greetings, Andres Freund
On 2018-08-29 21:15, Andres Freund wrote: > Hi, > > On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote: >> To test postgres 11, I still regularly run series of short sessions of >> pgbench-over-logical-replication (basically the same thing that I used >> last >> year [1] - now in a perl incarnation). Most of the time the >> replication is >> stable and finishes correctly but sometimes (rarely) I get: >> >> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, >> snap->xmin))", File: "snapbuild.c", Line: 580) >> >> This will probably be difficult to reproduce and to act upon but I >> wanted to >> report it anyway as in the course of the last few months I have seen >> it >> several times, on several machines. Always rarely, always postgres 11 >> (I did >> not try other versions). > > Thanks for testing! Could you possibly run the tests with core files > enabled, so we at get a backtrace in case of trouble? Knowing what the > values here are would be tremendously helpful... ok, is this any use? $ gdb --quiet /var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres /var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core Reading symbols from /var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done. [New LWP 147484] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `postgres: walsender rijkers [local] idle in transaction '. Program terminated with signal SIGABRT, Aborted. #0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x00007f0fd20e8448 in __GI_abort () at abort.c:89 #2 0x00000000008880bf in ExceptionalCondition (conditionName=conditionName@entry=0xa417f8 "!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", errorType=errorType@entry=0x8d365d "FailedAssertion", fileName=fileName@entry=0xa41223 "snapbuild.c", lineNumber=lineNumber@entry=580) at assert.c:54 #3 0x000000000072676e in SnapBuildInitialSnapshot () at snapbuild.c:580 #4 0x000000000072ed54 in CreateReplicationSlot (cmd=0x223bae0) at walsender.c:951 #5 exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 "CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527 #6 0x000000000077e8ee in PostgresMain (argc=<optimized out>, argv=argv@entry=0x21cbbc8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4155 #7 0x0000000000704fde in BackendRun (port=0x21c4520) at postmaster.c:4361 #8 BackendStartup (port=0x21c4520) at postmaster.c:4033 #9 ServerLoop () at postmaster.c:1706 #10 0x0000000000705e0f in PostmasterMain (argc=argc@entry=12, argv=argv@entry=0x219c470) at postmaster.c:1379 #11 0x0000000000478d80 in main (argc=12, argv=0x219c470) at main.c:228 (gdb)
On 2018-Aug-30, Erik Rijkers wrote: > ok, is this any use? Seems mostly good, but the Xids are not printed. Could you please do "bt full"? Also: frame 3 print *snap Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-08-30 16:44, Alvaro Herrera wrote: > On 2018-Aug-30, Erik Rijkers wrote: > >> ok, is this any use? > > Seems mostly good, but the Xids are not printed. Could you please do > "bt full"? Also: > > frame 3 > print *snap See the attached.