Re: stress test for parallel workers

From Thomas Munro
Subject Re: stress test for parallel workers
Date
Msg-id CA+hUKGLch1bNWdG-G8YaeJbyVsper6hG86Ugx9tSWG3=a1R89Q@mail.gmail.com
In reply to Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: stress test for parallel workers  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: stress test for parallel workers  (Thomas Munro <thomas.munro@gmail.com>)
Re: stress test for parallel workers  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> #2  0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
>         edata = <value optimized out>
>         elevel = 22
>         oldcontext = 0x27e15d0
>         econtext = 0x0
>         __func__ = "errfinish"
> #3  0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
>         save_errno = <value optimized out>
>         tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
>         path = 0x9c4300 "pg_logical/replorigin_checkpoint"
>         tmpfd = 64
>         i = <value optimized out>
>         magic = 307747550
>         crc = 4294967295
>         __func__ = "CheckPointReplicationOrigin"

> Supposedly it's trying to do this:
>
> |       ereport(PANIC,
> |                       (errcode_for_file_access(),
> |                        errmsg("could not write to file \"%s\": %m",
> |                                       tmppath)));
>
> And since there's consistently nothing in logs, I'm guessing there's a
> legitimate write error (legitimate from PG perspective).  Storage here is ext4
> plus zfs tablespace on top of LVM on top of vmware thin volume.

If you have that core, it might be interesting to go to frame 2 and
print *edata or edata->saved_errno.  If the errno is EIO, it's a bit
strange that nothing shows up in the kernel logs or dmesg; if it's
ENOSPC, I guess it'd be normal for it not to show up there, and
there'd also be nothing in the PostgreSQL logs if they're on the same
full filesystem -- but then you would probably already have mentioned
that your filesystem was out of space.  Could it have been fleetingly
full due to some other process on the system that rapidly expands and
contracts?

I'm confused by the evidence, though.  If this PANIC is the origin of
the problem, how do we get to postmaster-death based exit in a
parallel leader*, rather than quickdie() (the kind of exit that
happens when the postmaster sends SIGQUIT to every process and they
say "terminating connection because of crash of another server
process", because some backend crashed or panicked)?  Perhaps it would
be clearer what's going on if you could put the PostgreSQL log onto a
different filesystem, so we get a better chance of collecting
evidence?  But then... the parallel leader process was apparently able
to log something -- maybe it was just lucky, but you said this
happened this way more than once.  I'm wondering how it could be that
you got some kind of IO failure and weren't able to log the PANIC
message AND your postmaster was killed, and you were able to log a
message about that.  Perhaps we're looking at evidence from two
unrelated failures.

*I suspect that the only thing implicating parallelism in this failure
is that parallel leaders happen to print out that message if the
postmaster dies while they are waiting for workers; most other places
(probably every other backend in your cluster) just quietly exit.
That tells us something about what's happening, but on its own doesn't
tell us that parallelism plays an important role in the failure mode.


--
Thomas Munro
https://enterprisedb.com


