Обсуждение: rare crash - FailedAssertion snapbuild.c Line: 580

Поиск
Список
Период
Сортировка

rare crash - FailedAssertion snapbuild.c Line: 580

От
Erik Rijkers
Дата:
Hello,

To test postgres 11, I still regularly run series of short sessions of 
pgbench-over-logical-replication (basically the same thing that I used 
last year [1] - now in a perl incarnation).  Most of the time the 
replication is stable and finishes correctly but sometimes (rarely) I 
get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, 
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I 
wanted to report it anyway as in the course of the last few months I 
have seen it several times, on several machines. Always rarely, always 
postgres 11 (I did not try other versions).


Erik Rijkers


[1] 
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl


source/version: bf2d0462cd73 / REL_11_STABLE  (compiled 20180828_1629)

1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
   then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
   then: wait for log-repl sync.
     (I also added extra data type columns to the basic pgbench tables to 
test all the different data types)

Below is a grep for the assert message, with 15 surrounding lines from 
the latest occurrence.

$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B 
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:  
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:  
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:  
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:  
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:  
received replication command: START_REPLICATION SLOT 
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1', 
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:  
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:  
Streaming transactions committing after 0/89E03918, reading WAL from 
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:  
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:  
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:  
received replication command: CREATE_REPLICATION_SLOT 
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:  
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:  
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP: 
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:  
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE 
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:  
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:  
The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT:  In 
a moment you should be able to reconnect to the database and repeat your 
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:  
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback, 
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:  
The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] HINT:  In 
a moment you should be able to reconnect to the database and repeat your 
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] WARNING:  
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] DETAIL:  
The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] HINT:  In 
a moment you should be able to reconnect to the database and repeat your 
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.116 CEST [130398] LOG:  all 
server processes terminated; reinitializing
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.515 CEST [139548] LOG:  
database system was interrupted; last known up at 2018-08-29 14:26:14 
CEST





Re: rare crash - FailedAssertion snapbuild.c Line: 580

От
Andres Freund
Дата:
Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
> To test postgres 11, I still regularly run series of short sessions of
> pgbench-over-logical-replication (basically the same thing that I used last
> year [1] - now in a perl incarnation).  Most of the time the replication is
> stable and finishes correctly but sometimes (rarely) I get:
> 
> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
> snap->xmin))", File: "snapbuild.c", Line: 580)
> 
> This will probably be difficult to reproduce and to act upon but I wanted to
> report it anyway as in the course of the last few months I have seen it
> several times, on several machines. Always rarely, always postgres 11 (I did
> not try other versions).

Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble?  Knowing what the
values here are would be tremendously helpful...

Greetings,

Andres Freund


Re: rare crash - FailedAssertion snapbuild.c Line: 580

От
Erik Rijkers
Дата:
On 2018-08-29 21:15, Andres Freund wrote:
> Hi,
> 
> On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:
>> To test postgres 11, I still regularly run series of short sessions of
>> pgbench-over-logical-replication (basically the same thing that I used 
>> last
>> year [1] - now in a perl incarnation).  Most of the time the 
>> replication is
>> stable and finishes correctly but sometimes (rarely) I get:
>> 
>> TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
>> snap->xmin))", File: "snapbuild.c", Line: 580)
>> 
>> This will probably be difficult to reproduce and to act upon but I 
>> wanted to
>> report it anyway as in the course of the last few months I have seen 
>> it
>> several times, on several machines. Always rarely, always postgres 11 
>> (I did
>> not try other versions).
> 
> Thanks for testing! Could you possibly run the tests with core files
> enabled, so we at get a backtrace in case of trouble?  Knowing what the
> values here are would be tremendously helpful...

ok, is this any use?

$ gdb --quiet 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres  
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core
Reading symbols from 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done.
[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in 
transaction                       '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.
(gdb) bt
#0  0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f0fd20e8448 in __GI_abort () at abort.c:89
#2  0x00000000008880bf in ExceptionalCondition 
(conditionName=conditionName@entry=0xa417f8 
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
errorType=errorType@entry=0x8d365d "FailedAssertion", 
fileName=fileName@entry=0xa41223 "snapbuild.c", 
lineNumber=lineNumber@entry=580) at assert.c:54
#3  0x000000000072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
#4  0x000000000072ed54 in CreateReplicationSlot (cmd=0x223bae0) at 
walsender.c:951
#5  exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY 
LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527
#6  0x000000000077e8ee in PostgresMain (argc=<optimized out>, 
argv=argv@entry=0x21cbbc8, dbname=<optimized out>, username=<optimized 
out>) at postgres.c:4155
#7  0x0000000000704fde in BackendRun (port=0x21c4520) at 
postmaster.c:4361
#8  BackendStartup (port=0x21c4520) at postmaster.c:4033
#9  ServerLoop () at postmaster.c:1706
#10 0x0000000000705e0f in PostmasterMain (argc=argc@entry=12, 
argv=argv@entry=0x219c470) at postmaster.c:1379
#11 0x0000000000478d80 in main (argc=12, argv=0x219c470) at main.c:228
(gdb)





Re: rare crash - FailedAssertion snapbuild.c Line: 580

От
Alvaro Herrera
Дата:
On 2018-Aug-30, Erik Rijkers wrote:

> ok, is this any use?

Seems mostly good, but the Xids are not printed.  Could you please do
"bt full"?  Also:

frame 3
print *snap

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: rare crash - FailedAssertion snapbuild.c Line: 580

От
Erik Rijkers
Дата:
On 2018-08-30 16:44, Alvaro Herrera wrote:
> On 2018-Aug-30, Erik Rijkers wrote:
> 
>> ok, is this any use?
> 
> Seems mostly good, but the Xids are not printed.  Could you please do
> "bt full"?  Also:
> 
> frame 3
> print *snap

See the attached.




Вложения