Discussion: backend stuck in DataFileExtend

backend stuck in DataFileExtend

From:
Justin Pryzby
Date:
In March, I noticed that a backend got stuck overnight doing:

backend_start    | 2024-03-27 22:34:12.774195-07
xact_start       | 2024-03-27 22:34:39.741933-07
query_start      | 2024-03-27 22:34:41.997276-07
state_change     | 2024-03-27 22:34:41.997307-07
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 330896991
backend_xmin     | 330896991
query_id         | -3255287420265634540
query            | PREPARE mml_1 AS INSERT INTO child.un...
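
(For reference, that output is from pg_stat_activity, captured with roughly this kind of query - a sketch, not the exact command:)

psql -d ts -x -c "
  SELECT backend_start, xact_start, query_start, state_change,
         wait_event_type, wait_event, state,
         backend_xid, backend_xmin, query_id, left(query, 40) AS query
  FROM pg_stat_activity
  WHERE wait_event = 'DataFileExtend';"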

The backend was spinning at 100% CPU:

[pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
   PID WCHAN  %CPU S TTY          TIME COMMAND
  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT

This was postgres 16 STABLE compiled at 14e991db8.

It's possible that this is a rare issue that we haven't hit before.
It's also possible that this is a recent regression.  We previously
compiled at b2c9936a7 without hitting any issue (possibly by chance).

I could neither strace the process nor attach a debugger; they got
stuck attaching.  Maybe there's a kernel issue.  This is a
VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.

$ awk '{print $14, $15}' /proc/7881/stat # usr / sys
229 3088448
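
(Those values are clock ticks; to convert to seconds, divide by USER_HZ, which is normally 100 and is reported by getconf CLK_TCK:)

$ awk -v hz="$(getconf CLK_TCK)" '{printf "usr=%.0fs sys=%.0fs\n", $14/hz, $15/hz}' /proc/7881/stat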

When I tried to shut down postgres (hoping to finally be able to attach
a debugger), instead it got stuck:

$ ps -fu postgres
UID         PID   PPID  C STIME TTY          TIME CMD
postgres   7881 119674 99 mar27 ?        08:38:06 postgres: telsasoft ts [local] INSERT
postgres 119674      1  0 mar25 ?        00:07:13 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
postgres 119676 119674  0 mar25 ?        00:00:11 postgres: logger 
postgres 119679 119674  0 mar25 ?        00:11:56 postgres: checkpointer 

This first occurred on Mar 27, but I see today that it's now recurring
at a different customer:

backend_start   | 2024-05-05 22:19:17.009477-06
xact_start      | 2024-05-05 22:20:18.129305-06
query_start     | 2024-05-05 22:20:19.409555-06
state_change    | 2024-05-05 22:20:19.409556-06
pid             | 18468
wait_event_type | IO
wait_event      | DataFileExtend
state           | active
backend_xid     | 4236249136
backend_xmin    | 4236221661
query_id        | 2601062835886299431
left            | PREPARE mml_1 AS INSERT INTO chil

This time it's running v16.2 (REL_16_STABLE compiled at b78fa8547),
under centos7 / 3.10.0-1160.66.1.el7.x86_64.

The symptoms are the same: backend stuck using 100% CPU in user mode:
[pryzbyj@telsasoft-centos7 ~]$ awk '{print $14, $15}' /proc/18468/stat # usr / sys
364 2541168

This seems to have affected other backends, which are now waiting on
RegisterSyncRequest, frozenid, and CheckpointerComm.
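
(For reference, a summary like that can be pulled from pg_stat_activity with something along these lines:)

psql -d ts -c "SELECT wait_event_type, wait_event, count(*)
               FROM pg_stat_activity GROUP BY 1, 2 ORDER BY count(*) DESC;"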

In both cases, the backend got stuck after 10pm, which is when a backup
job kicks off, followed by other DB maintenance.  Our backup job uses
pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
would've finished and the snapshot closed at 2024-05-05 22:15.  The backup
job did some more pg_dumps involving historic tables without snapshots
and finished at 01:11:40, at which point a reindex job started; but it
looks like the DB was already stuck as far as the reindex was concerned, so
the script ended after a handful of commands were "[canceled] due to
statement timeout".
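
The overall shape of that backup flow, as a sketch (the names, paths, and
snapshot ID below are placeholders, not our actual script):

# a long-lived session exports a snapshot and keeps its transaction open:
#   BEGIN ISOLATION LEVEL REPEATABLE READ;
#   SELECT pg_export_snapshot();   -- returns something like '00000003-000001BC-1'
# then, while that session stays connected, pg_dump reuses the snapshot:
pg_dump -d ts -Fc --snapshot='00000003-000001BC-1' -f /backup/ts-nightly.dump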

Full disclosure: the VM that hit this issue today has had storage-level
errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
days ago.

Maybe more importantly, this VM also seems to suffer from some memory
leak, and the leaky process was Killed shortly after the stuck query
started.  This makes me suspect a race condition which was triggered
while swapping:

May  5 22:24:05 localhost kernel: Out of memory: Kill process 17157 (python3.8) score 134 or sacrifice child

Our logs from March aren't as good, and I'm not aware of any killed
processes on the VM where we hit this then, though it's true that the
I/O there is not fast.

Also, I fibbed when I said these were compiled at 16 STABLE - I'd
backpatched a small number of patches from master:

a97bbe1f1df Reduce branches in heapgetpage()'s per-tuple loop
98f320eb2ef Increase default vacuum_buffer_usage_limit to 2MB.
44086b09753 Preliminary refactor of heap scanning functions
959b38d770b Invent --transaction-size option for pg_restore.
a45c78e3284 Rearrange pg_dump's handling of large objects for better efficiency.
9d1a5354f58 Fix costing bug in MergeAppend
a5cf808be55 Read include/exclude commands for dump/restore from file
8c16ad3b432 Allow using syncfs() in frontend utilities.
cccc6cdeb32 Add support for syncfs() in frontend support functions.
3ed19567198 Make enum for sync methods available to frontend code.
f39b265808b Move PG_TEMP_FILE* macros to file_utils.h.
a14354cac0e Add GUC parameter "huge_pages_status"

I will need to restart services this morning, but if someone wants to
suggest diagnostic measures, I will see whether the command gets stuck
or not.

-- 
Justin



Re: backend stuck in DataFileExtend

From:
Andres Freund
Date:
Hi,

On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> In March, I noticed that a backend got stuck overnight doing:
> 
> backend_start    | 2024-03-27 22:34:12.774195-07
> xact_start       | 2024-03-27 22:34:39.741933-07
> query_start      | 2024-03-27 22:34:41.997276-07
> state_change     | 2024-03-27 22:34:41.997307-07
> wait_event_type  | IO
> wait_event       | DataFileExtend
> state            | active
> backend_xid      | 330896991
> backend_xmin     | 330896991
> query_id         | -3255287420265634540
> query            | PREPARE mml_1 AS INSERT INTO child.un...
> 
> The backend was spinning at 100% CPU:
> 
> [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
>    PID WCHAN  %CPU S TTY          TIME COMMAND
>   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> 
> This was postgres 16 STABLE compiled at 14e991db8.
> 
> It's possible that this is a rare issue that we haven't hit before.
> It's also possible this this is a recent regression.  We previously
> compiled at b2c9936a7 without hitting any issue (possibly by chance).
> 
> I could neither strace the process nor attach a debugger.  They got
> stuck attaching.  Maybe it's possible there's a kernel issue.  This is a
> VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.

> $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> 229 3088448
> 
> When I tried to shut down postgres (hoping to finally be able to attach
> a debugger), instead it got stuck:

That strongly indicates either a kernel bug or storage having an issue. It
can't be postgres' fault if an IO never completes.

What do /proc/$pid/stack say?


> In both cases, the backend got stuck after 10pm, which is when a backup
> job kicks off, followed by other DB maintenance.  Our backup job uses
> pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> would've finished and snapshot closed at 2023-05-05 22:15.  The backup
> job did some more pg_dumps involving historic tables without snapshots
> and finished at 01:11:40, at which point a reindex job started, but it
> looks like the DB was already stuck for the purpose of reindex, and so
> the script ended after a handful of commands were "[canceled] due to
> statement timeout".

Is it possible that you're "just" waiting for very slow IO? Is there a lot of
dirty memory? Particularly on these old kernels, that can lead to very extreme
delays.

grep -Ei 'dirty|writeback' /proc/meminfo


> [...]
> Full disclosure: the VM that hit this issue today has had storage-level
> errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> days ago.

So indeed, my suspicion from above is confirmed.

Greetings,

Andres Freund



Re: backend stuck in DataFileExtend

From:
Justin Pryzby
Date:
On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > In March, I noticed that a backend got stuck overnight doing:
> > 
> > backend_start    | 2024-03-27 22:34:12.774195-07
> > xact_start       | 2024-03-27 22:34:39.741933-07
> > query_start      | 2024-03-27 22:34:41.997276-07
> > state_change     | 2024-03-27 22:34:41.997307-07
> > wait_event_type  | IO
> > wait_event       | DataFileExtend
> > state            | active
> > backend_xid      | 330896991
> > backend_xmin     | 330896991
> > query_id         | -3255287420265634540
> > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > 
> > The backend was spinning at 100% CPU:
> > 
> > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> >    PID WCHAN  %CPU S TTY          TIME COMMAND
> >   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > 
> > This was postgres 16 STABLE compiled at 14e991db8.
> > 
> > It's possible that this is a rare issue that we haven't hit before.
> > It's also possible this this is a recent regression.  We previously
> > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > 
> > I could neither strace the process nor attach a debugger.  They got
> > stuck attaching.  Maybe it's possible there's a kernel issue.  This is a
> > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> 
> > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > 229 3088448
> > 
> > When I tried to shut down postgres (hoping to finally be able to attach
> > a debugger), instead it got stuck:
> 
> That strongly indicates either a kernel bug or storage having an issue. It
> can't be postgres' fault if an IO never completes.

Is that for sure, even though wchan is "?" (which I take to mean "not in a
system call") and the process is stuck in user mode?

> What do /proc/$pid/stack say?

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0
 

> > In both cases, the backend got stuck after 10pm, which is when a backup
> > job kicks off, followed by other DB maintenance.  Our backup job uses
> > pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> > would've finished and snapshot closed at 2023-05-05 22:15.  The backup
> > job did some more pg_dumps involving historic tables without snapshots
> > and finished at 01:11:40, at which point a reindex job started, but it
> > looks like the DB was already stuck for the purpose of reindex, and so
> > the script ended after a handful of commands were "[canceled] due to
> > statement timeout".
> 
> Is it possible that you're "just" waiting for very slow IO? Is there a lot of
> dirty memory? Particularly on these old kernels that can lead to very extreme
> delays.
> 
> grep -Ei 'dirty|writeback' /proc/meminfo

[pryzbyj@telsasoft-centos7 ~]$ grep -Ei 'dirty|writeback' /proc/meminfo
Dirty:                28 kB
Writeback:             0 kB
WritebackTmp:          0 kB

> > Full disclosure: the VM that hit this issue today has had storage-level
> > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > days ago.
> 
> So indeed, my suspicion from above is confirmed.

I'd be fine with that conclusion (as in the earlier thread), except this
has now happened on 2 different VMs, and the first one has no I/O
issues.  If this were another symptom of a storage failure, and hadn't
previously happened on another VM, I wouldn't be re-reporting it.

-- 
Justin



Re: backend stuck in DataFileExtend

From:
Andres Freund
Date:
Hi,

On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > In March, I noticed that a backend got stuck overnight doing:
> > >
> > > backend_start    | 2024-03-27 22:34:12.774195-07
> > > xact_start       | 2024-03-27 22:34:39.741933-07
> > > query_start      | 2024-03-27 22:34:41.997276-07
> > > state_change     | 2024-03-27 22:34:41.997307-07
> > > wait_event_type  | IO
> > > wait_event       | DataFileExtend
> > > state            | active
> > > backend_xid      | 330896991
> > > backend_xmin     | 330896991
> > > query_id         | -3255287420265634540
> > > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > >
> > > The backend was spinning at 100% CPU:
> > >
> > > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> > >    PID WCHAN  %CPU S TTY          TIME COMMAND
> > >   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > >
> > > This was postgres 16 STABLE compiled at 14e991db8.
> > >
> > > It's possible that this is a rare issue that we haven't hit before.
> > > It's also possible this this is a recent regression.  We previously
> > > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > >
> > > I could neither strace the process nor attach a debugger.  They got
> > > stuck attaching.  Maybe it's possible there's a kernel issue.  This is a
> > > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> >
> > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > 229 3088448
> > >
> > > When I tried to shut down postgres (hoping to finally be able to attach
> > > a debugger), instead it got stuck:
> >
> > That strongly indicates either a kernel bug or storage having an issue. It
> > can't be postgres' fault if an IO never completes.
>
> Is that for sure even though wchan=? (which I take to mean "not in a system
> call"), and the process is stuck in user mode ?

Postgres doesn't do anything to prevent a debugger from working, so this is
just indicative that the kernel is stuck somewhere that it didn't set up
information about being blocked - because it's busy doing something.
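
(Both of those can be read straight out of /proc, e.g.:)

grep '^State' /proc/$pid/status   # R (running) rather than S/D (sleeping/blocked)
cat /proc/$pid/wchan; echo        # prints the wait-channel symbol, or 0 when the task isn't blocked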


> > What do /proc/$pid/stack say?
>
> [pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0
 

stack, not stat...


> > > Full disclosure: the VM that hit this issue today has had storage-level
> > > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > > days ago.
> >
> > So indeed, my suspicion from above is confirmed.
>
> I'd be fine with that conclusion (as in the earlier thread), except this
> has now happened on 2 different VMs, and the first one has no I/O
> issues.  If this were another symptom of a storage failure, and hadn't
> previously happened on another VM, I wouldn't be re-reporting it.

Is it the same VM hosting environment? And the same old distro?

Greetings,

Andres Freund



Re: backend stuck in DataFileExtend

От
Justin Pryzby
Дата:
On Mon, May 06, 2024 at 10:51:08AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> > On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > > In March, I noticed that a backend got stuck overnight doing:
> > > >
> > > > backend_start    | 2024-03-27 22:34:12.774195-07
> > > > xact_start       | 2024-03-27 22:34:39.741933-07
> > > > query_start      | 2024-03-27 22:34:41.997276-07
> > > > state_change     | 2024-03-27 22:34:41.997307-07
> > > > wait_event_type  | IO
> > > > wait_event       | DataFileExtend
> > > > state            | active
> > > > backend_xid      | 330896991
> > > > backend_xmin     | 330896991
> > > > query_id         | -3255287420265634540
> > > > query            | PREPARE mml_1 AS INSERT INTO child.un...
> > > >
> > > > The backend was spinning at 100% CPU:
> > > >
> > > > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> > > >    PID WCHAN  %CPU S TTY          TIME COMMAND
> > > >   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > > >
> > > > This was postgres 16 STABLE compiled at 14e991db8.
> > > >
> > > > It's possible that this is a rare issue that we haven't hit before.
> > > > It's also possible this this is a recent regression.  We previously
> > > > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > > >
> > > > I could neither strace the process nor attach a debugger.  They got
> > > > stuck attaching.  Maybe it's possible there's a kernel issue.  This is a
> > > > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> > >
> > > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > > 229 3088448
> > > >
> > > > When I tried to shut down postgres (hoping to finally be able to attach
> > > > a debugger), instead it got stuck:
> > >
> > > That strongly indicates either a kernel bug or storage having an issue. It
> > > can't be postgres' fault if an IO never completes.
> >
> > Is that for sure even though wchan=? (which I take to mean "not in a system
> > call"), and the process is stuck in user mode ?
> 
> Postgres doesn't do anything to prevent a debugger from working, so this is
> just indicative that the kernel is stuck somewhere that it didn't set up
> information about being blocked - because it's busy doing something.
> 
> 
> > > What do /proc/$pid/stack say?
> >
> > [pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> > 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0
 
> 
> stack, not stat...

Ah, that is illuminating - thanks.

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stack 
[<ffffffffaa2d7856>] __cond_resched+0x26/0x30
[<ffffffffc10af35e>] dbuf_rele+0x1e/0x40 [zfs]
[<ffffffffc10bb730>] dmu_buf_rele_array.part.6+0x40/0x70 [zfs]
[<ffffffffc10bd96a>] dmu_write_uio_dnode+0x11a/0x180 [zfs]
[<ffffffffc10bda24>] dmu_write_uio_dbuf+0x54/0x70 [zfs]
[<ffffffffc11abd1b>] zfs_write+0xb9b/0xfb0 [zfs]
[<ffffffffc11ed202>] zpl_aio_write+0x152/0x1a0 [zfs]
[<ffffffffaa44dadb>] do_sync_readv_writev+0x7b/0xd0
[<ffffffffaa44f62e>] do_readv_writev+0xce/0x260
[<ffffffffaa44f855>] vfs_writev+0x35/0x60
[<ffffffffaa44fc12>] SyS_pwritev+0xc2/0xf0
[<ffffffffaa999f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.

It's surely possible that there's an issue that affects older kernels
but not more recent ones.

> > > > Full disclosure: the VM that hit this issue today has had storage-level
> > > > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > > > days ago.
> > >
> > > So indeed, my suspicion from above is confirmed.
> >
> > I'd be fine with that conclusion (as in the earlier thread), except this
> > has now happened on 2 different VMs, and the first one has no I/O
> > issues.  If this were another symptom of a storage failure, and hadn't
> > previously happened on another VM, I wouldn't be re-reporting it.
> 
> Is it the same VM hosting environment? And the same old distro?

Yes, they're running centos7 with the indicated kernels.

dmidecode shows they're both running:

        Product Name: VMware Virtual Platform
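
(That's the system section of dmidecode output, i.e. something like:)

$ sudo dmidecode -t system | grep 'Product Name'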

But they're different customers, so I'd be somewhat surprised if they're
running the same version of the hypervisor.

-- 
Justin



Re: backend stuck in DataFileExtend

From:
Thomas Munro
Date:
On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
...
> Yes, they're running centos7 with the indicated kernels.

So far we've got:

* spurious EIO when opening a file (your previous report)
* hanging with CPU spinning (?) inside pwritev()
* old kernel, bleeding edge ZFS

From an (uninformed) peek at the ZFS code, if it really is spinning
there, it seems like a pretty low-level problem: it has finished the write
and is now just trying to release (something like our unpin) and
unlock the buffers, which involves various code paths that might touch
various mutexes and spinlocks.  To get stuck like that, I guess it has
either corrupted itself or it is deadlocking against something else,
but against what?  Do you see any other processes (including kernel threads)
with any stuck stacks that might be a deadlock partner?
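
Something like this, run as root, would dump every task's kernel stack to
compare against (just a sketch):

# print every task's kernel stack (threads included); look for anything else
# parked in zfs/spl locking code
for t in /proc/[0-9]*/task/[0-9]*; do
    printf '== %s %s\n' "$t" "$(cat "$t/comm" 2>/dev/null)"
    cat "$t/stack" 2>/dev/null
done | less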

While looking around for reported issues I found your abandoned report
against an older ZFS version from a few years ago, same old Linux
version:

https://github.com/openzfs/zfs/issues/11641

I don't know enough to say anything useful about that but it certainly
smells similar...

I see you've been busy reporting lots of issues, which seem to
involve big data, big "recordsize" (= ZFS block sizes), compression
and PostgreSQL:

https://github.com/openzfs/zfs/issues?q=is%3Aissue+author%3Ajustinpryzby



Re: backend stuck in DataFileExtend

From:
Justin Pryzby
Date:
On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
> ...
> > Yes, they're running centos7 with the indicated kernels.
> 
> So far we've got:
> 
> * spurious EIO when opening a file (your previous report)
> * hanging with CPU spinning (?) inside pwritev()
> * old kernel, bleeding edge ZFS
> 
> From an (uninformed) peek at the ZFS code, if it really is spinning
> there is seems like a pretty low level problem: it's finish the write,
> and now is just trying to release (something like our unpin) and
> unlock the buffers, which involves various code paths that might touch
> various mutexes and spinlocks, and to get stuck like that I guess it's
> either corrupted itself or it is deadlocking against something else,
> but what?  Do you see any other processes (including kernel threads)
> with any stuck stacks that might be a deadlock partner?

Sorry, but even after forgetting several times, by the time I finally
remembered to go back to this issue, I had already rebooted the VM, as
needed to kill the stuck process.

But .. it seems to have recurred again this AM.  Note that yesterday,
I'd taken the opportunity to upgrade to zfs-2.2.4.

These two procs are the oldest active postgres procs, and also (now)
adjacent in ps -ef --sort start_time.

-[ RECORD 1 ]----+----------------------------------------------------------------
backend_start    | 2024-05-07 09:45:06.228528-06
application_name | 
xact_start       | 2024-05-07 09:55:38.409549-06
query_start      | 2024-05-07 09:55:38.409549-06
state_change     | 2024-05-07 09:55:38.409549-06
pid              | 27449
backend_type     | autovacuum worker
wait_event_type  | BufferPin
wait_event       | BufferPin
state            | active
backend_xid      | 
backend_xmin     | 4293757489
query_id         | 
left             | autovacuum: VACUUM ANALYZE child.
-[ RECORD 2 ]----+----------------------------------------------------------------
backend_start    | 2024-05-07 09:49:24.686314-06
application_name | MasterMetricsLoader -n -m Xml
xact_start       | 2024-05-07 09:50:30.387156-06
query_start      | 2024-05-07 09:50:32.744435-06
state_change     | 2024-05-07 09:50:32.744436-06
pid              | 5051
backend_type     | client backend
wait_event_type  | IO
wait_event       | DataFileExtend
state            | active
backend_xid      | 4293757489
backend_xmin     | 4293757429
query_id         | 2230046159014513529
left             | PREPARE mml_0 AS INSERT INTO chil

The earlier proc is doing:
strace: Process 27449 attached
epoll_wait(11, ^Cstrace: Process 27449 detached
 <detached ...>

The later process is stuck, with:
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/5051/stack 
[<ffffffffffffffff>] 0xffffffffffffffff

For good measure:
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27433/stack 
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27434/stack 
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff

[pryzbyj@telsasoft-centos7 ~]$ ps -O wchan================ 5051 27449
  PID =============== S TTY          TIME COMMAND
 5051 ?               R ?        02:14:27 postgres: telsasoft ts ::1(53708) INSERT
27449 ep_poll         S ?        00:05:16 postgres: autovacuum worker ts

The interesting procs might be:

ps -eO wchan===============,lstart --sort start_time
...
15632 worker_thread  Mon May  6 23:51:34 2024 S ?        00:00:00 [kworker/2:2H]
27433 taskq_thread   Tue May  7 09:35:59 2024 S ?        00:00:56 [z_wr_iss]
27434 taskq_thread   Tue May  7 09:35:59 2024 S ?        00:00:57 [z_wr_iss]
27449 ep_poll        Tue May  7 09:45:05 2024 S ?        00:05:16 postgres: autovacuum worker ts
 5051 ?              Tue May  7 09:49:23 2024 R ?        02:23:04 postgres: telsasoft ts ::1(53708) INSERT
 7861 ep_poll        Tue May  7 09:51:25 2024 S ?        00:03:04 /usr/local/bin/python3.8 -u /home/telsasoft/server/alarms/core/pr...
 7912 ep_poll        Tue May  7 09:51:27 2024 S ?        00:00:00 postgres: telsasoft ts ::1(53794) idle
24518 futex_wait_que Tue May  7 10:42:56 2024 S ?        00:00:55 java -jar /home/telsasoft/server/alarms/alcatel_lucent/jms/jms2rm...
...

> While looking around for reported issues I found your abandoned report
> against an older ZFS version from a few years ago, same old Linux
> version:
> 
> https://github.com/openzfs/zfs/issues/11641
> 
> I don't know enough to say anything useful about that but it certainly
> smells similar...

Wow - I'd completely forgotten about that problem report.
The symptoms are the same, even with a zfs version 3+ years newer.
I wish the ZFS people would do more with their problem reports.

BTW, we'll be upgrading this VM to a newer kernel, if not a newer OS
(for some reason, these projects take a very long time).  With any luck,
it'll either recur, or not.

I'm not sure if any of that is useful, or interesting.

-- 
Justin



Re: backend stuck in DataFileExtend

From:
Thomas Munro
Date:
On Wed, May 8, 2024 at 6:54 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> > https://github.com/openzfs/zfs/issues/11641
> >
> > I don't know enough to say anything useful about that but it certainly
> > smells similar...
>
> Wow - I'd completely forgotten about that problem report.
> The symptoms are the same, even with a zfs version 3+ years newer.
> I wish the ZFS people would do more with their problem reports.

If I had to guess, my first idea would be that your 1MB or ginormous
16MB recordsize (a relatively new option) combined with PostgreSQL's
8KB block-at-a-time random order I/O patterns is tickling strange
corners and finding a bug that no one has seen before.  I would
imagine that almost everyone in the galaxy who uses very large records
does so with 'settled' data that gets streamed out once sequentially
(for example, I think some of the OpenZFS maintainers are at Lawrence
Livermore National Lab, where I guess they might pump around petabytes
of data produced by particle physics research or whatever it might be,
which is probably why they are also adding direct I/O to avoid caches
completely...).  But for us, if we have lots of backends reading,
writing and extending random 8KB fragments of a 16MB record concurrently
(2048 pages per record!), maybe we hit some broken edge...  I'd be
sure to include that sort of detail in any future reports.
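
For example, something along these lines (the dataset name is a placeholder):

zfs version
zfs get recordsize,compression,primarycache,logbias tank/pgdata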

Normally I suppress urges to blame problems on kernels, file systems,
etc., and in the past, accusations that ZFS was buggy turned out to be
bugs in PostgreSQL IIRC - but user space sure seems to be off the hook
for this one...