Re: BUG #16331: segfault in checkpointer with full disk
From | Julien Rouhaud |
---|---|
Subject | Re: BUG #16331: segfault in checkpointer with full disk |
Date | |
Msg-id | CAOBaU_a0-FkNp4YHO_7nN7=NDN2R_xb-Ya-e3w9bB1SHEstYCQ@mail.gmail.com |
In reply to | Re: BUG #16331: segfault in checkpointer with full disk (Jozef Mlich <jmlich83@gmail.com>) |
List | pgsql-bugs |
On Wed, Apr 1, 2020 at 11:51 AM Jozef Mlich <jmlich83@gmail.com> wrote:
>
> On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> > Hi,
> >
> > On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form
> > wrote:
> > >
> > > I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7
> > > (from
> > > yum.postgresql.org). I am using multiple extensions (cstore,
> > > postgres_fdw,
> > > pgcrypto,dblink, etc.). It seems crash is related to disk run out
> > > of space
> > > (I am using separate partion for / and for /var/lib/pgsql). It
> > > occurs few
> > > times a day. According to backtrace it seems to be related to
> > > checkpointer.
> > > Replication is not configured.
> > >
> > >
> > > [New LWP 26290]
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > Core was generated by `postgres:
> > > checkpointer
> > > '.
> > > Program terminated with signal 6, Aborted.
> > > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> > > ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > > 55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> > >
> > > Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
> > > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> > > ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > >         resultvar = 0
> > >         pid = 26290
> > >         selftid = 26290
> > > #1 0x00007fe4604c28f8 in __GI_abort () at abort.c:90
> > >         save_stage = 2
> > >         act = {__sigaction_handler = {sa_handler = 0x0,
> > > sa_sigaction = 0x0},
> > > sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717,
> > > 39808819211026438, 20126216749056, 70394513997832, 9268713,
> > > 70403103920719,
> > > 17316096998686159616, 20134806683648, 140618848608704,
> > > 140618848592800}},
> > > sa_flags = 1615828275, sa_restorer = 0x0}
> > >         sigs = {__val = {32, 0 <repeats 15 times>}}
> > > #2 0x000000000087840a in errfinish (dummy=<optimized out>) at
> > > elog.c:552
> > >         edata = 0xd47040 <errordata>
> > >         elevel = 22
> > >         oldcontext = 0x171a6d0
> > >         econtext = 0x0
> > >         __func__ = "errfinish"
> > > #3 0x0000000000706b24 in CheckPointReplicationOrigin () at
> > > origin.c:562
> > >         tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
> > >         path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
> > >         tmpfd = <optimized out>
> > >         i = <optimized out>
> > >         magic = 307747550
> > >         crc = 4294967295
> > >         __func__ = "CheckPointReplicationOrigin"
> >
> > That's not a bug (nor a segfault) but the expected behavior if the
> > checkpointer is not able to do its work. As data durability can't be
> > guaranteed in such case, the checkpointer raises a PANIC level
> > message, which raises an abort so that the whole instance do an
> > emergency restart cycle.
> >
> > Do you have monitoring for this filesystem? Do you see spikes in
> > disk usage or other strange behavior?
>
> Then it is clear. Thanks for explanation and applogize for false bug
> report.
>
> I have probably misunderstood how is segfault distinguished from abort.
> I need to fix my kernel.core_pattern script.
>
> In attachment is screenshot from monitoring grafana with information
> about space on /var/lib/pgsql partition.

The main filesystem is full or almost full most of the time? That's
unfortunately a good way to trigger that kind of outage. Is that
because most of the data is on a different tablespace? Even in that
case you need to ensure that you still have at least a reasonable
amount of free space.
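
For anyone who wants a quick check of the "reasonable amount of free space" point, here is a minimal sketch in Python. It is an illustration only, not something posted in this thread: the function names and the 10% default threshold are assumptions, and the right threshold depends on the workload.

```python
import shutil


def free_space_fraction(path: str) -> float:
    """Return the free fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero; treat them as full.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True when the free fraction is below the threshold.

    The 0.10 default is an arbitrary illustrative value, not a PostgreSQL
    recommendation.
    """
    return free_space_fraction(path) < min_free_fraction
```

Calling `is_low_on_space("/var/lib/pgsql")` from a cron job or monitoring agent, and once more for each additional tablespace location, would flag the condition before the checkpointer fails to write its files.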