Re: Non-reproducible AIO failure

Поиск
Список
Период
Сортировка
От Matthias van de Meent
Тема Re: Non-reproducible AIO failure
Дата
Msg-id CAEze2WhTcBjgzOPzD0xYdw-F-HjYzGNKEyQB9=FP9XneFZWESg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Non-reproducible AIO failure  (Alexander Lakhin <exclusion@gmail.com>)
Ответы Re: Non-reproducible AIO failure
Список pgsql-hackers
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Thomas and Andres,
>
> 04.06.2025 23:32, Thomas Munro wrote:
> > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
> >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> >>> 2, ioh->generation: 21694
> >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> >> "normal" reason for that. We know that the IO wasn't actually started as
> >> otherwise pgaio_io_start_readv() would have logged that fact.
> > A cheap/easy thing to try, maybe: give that enumeration more
> > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> > a wild 1 here?

I have a very wild guess that's probably wrong in a weird way, but
here goes anyway:

Did anyone test if interleaving the enum-typed bitfield fields of
PgAioHandle with the uint8 fields might solve the issue? I.e. {state,
flags, target, num_callbacks, op} instead of {state, target, op,
flags, num_callbacks}?

I've looked at some decompiled code generated for
pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv
looks rather normal, but in pgaio_io_before_start it's clear that
pgaio_io_before_start does some weird things as a response to the
bitfields of PgAioHandle:

(manually annotated)

_pgaio_io_before_start:
00000001002eaba8    stp    x20, x19, [sp, #-0x20]!     /* save registers */
00000001002eabac    stp    x29, x30, [sp, #0x10]       /* save registers */
00000001002eabb0    add    x29, sp, #0x10              /* x29 := stack+10 */
00000001002eabb4    ldrb    w8, [x0]                /* w8 := ioh->state */
00000001002eabb8    cmp    w8, #0x1                    /* w8 ??
PGAIO_HS_HANDED_OUT */
00000001002eabbc    b.ne    0x1002eac38

w8 here will contain the value 0x1.

00000001002eabc0    mov    x19, x0                     /* x19 := ioh */
00000001002eabc4    adrp    x8, 1461 ; 0x10089f000  /* x8 := 0x10089f000 */
00000001002eabc8    add    x8, x8, #0xd40              /* x8 += 0xd40
(0x10089fd40) */
00000001002eabcc    ldr    x8, [x8]                    /* x8 :=
pgaio_my_backend */
00000001002eabd0    ldr    x8, [x8, #0x20]             /* x8 :=
pgaio_my_backend->handed_out_io */
00000001002eabd4    cmp    x8, x0                      /* x8
(pgaio_my_backend->handed_out_io) ?? x19 (ioh) */
00000001002eabd8    b.ne    0x1002eac50
00000001002eabdc    mov    x0, x19                     /* x0 := ioh */
00000001002eabe0    bl    _pgaio_io_has_target
00000001002eabe4    tbz    w0, #0x0, 0x1002eac68       /*
_pgaio_io_has_target(ioh)? */
00000001002eabe8    ldrb    w8, [x19, #0x2]         /* w8 := ioh->op */

w8 now contains ioh->op, which is presumed to be 0x0.

00000001002eabec    ldrh    w9, [x19]               /* w9 :=
ioh->[target | state] */
00000001002eabf0    orr    w8, w9, w8, lsl #16         /* w8 := w9 |
(w8 << 16) // [ op | [target | state] ]*/

But here more operations are applied, pushing in the preceding
bitfields of PgAioHandle, mixing in various values into w8.

00000001002eabf4    cmp    w8, #0x10, lsl #12          /* w8 ??
0x1'00'00 // op'state'target ?? 0x1'00'00 */

And finally the full set of bitfields is compared against a target
value, rather than just ioh->op. This means we do 2 memory accesses
for this field.

00000001002eabf8    b.hs    0x1002eac80             /* >= ?? ERROR : continue */

And finally we branch into the setup for _ExceptionalCondition.

This shows that b->op is not exactly a stand-alone field, and might be
influenced by the other sibling bitfield fields. If we changed the
order of the fields to not have continuous bitfields, that might be
able to convince the compiler that there are no data dependencies
between state/target/op, and remove the two-phase load operation.

(checks more decompiled code)

Oh, the bitfields are not just read with multiple accesses, but also
*written* using multiple accesses. See pgaio_io_set_target:

_pgaio_io_set_target:
00000001002eafdc    stp    x29, x30, [sp, #-0x10]!
00000001002eafe0    mov    x29, sp
00000001002eafe4    ldrb    w8, [x0, #0x2]
00000001002eafe8    ldrh    w9, [x0]
00000001002eafec    orr    w8, w9, w8, lsl #16
00000001002eaff0    and    w9, w8, #0xff
00000001002eaff4    cmp    w9, #0x1
00000001002eaff8    b.ne    0x1002eb01c
00000001002eaffc    tst    w8, #0xff00
00000001002eb000    b.ne    0x1002eb034
00000001002eb004    lsr    w9, w8, #16
00000001002eb008    bfi    w8, w1, #8, #24
00000001002eb00c    strb    w9, [x0, #0x2] /* ioh->op = w9 */
00000001002eb010    strh    w8, [x0]           /* ioh->[state, target] = w8 */

... and pgaio_io_update_state:

_pgaio_io_update_state:
00000001002e7f08    sub    sp, sp, #0x70
00000001002e7f0c    stp    x24, x23, [sp, #0x30]
00000001002e7f10    stp    x22, x21, [sp, #0x40]
00000001002e7f14    stp    x20, x19, [sp, #0x50]
00000001002e7f18    stp    x29, x30, [sp, #0x60]
00000001002e7f1c    add    x29, sp, #0x60
00000001002e7f20    mov    x19, x1
00000001002e7f24    mov    x20, x0
00000001002e7f28    mov    w0, #0xa
00000001002e7f2c    mov    x1, #0x0
00000001002e7f30    bl    _errstart
00000001002e7f34    cbz    w0, 0x1002e8010
[...] /* skip pgaio_debug_io log handling */
00000001002e8010    dmb    ish
00000001002e8014    ldurh    w8, [x20, #0x1]
00000001002e8018    bfi    w19, w8, #8, #16
00000001002e801c    lsr    w8, w8, #8
00000001002e8020    strb    w8, [x20, #0x2] /* store ioh->op */
00000001002e8024    strh    w19, [x20]        /* store ioh->{state, target} */

I'm sending this now without looking much deeper into this due to
$err_date_changed and other time-related constraints, but might this
be a candidate for further debugging research?


Kind regards,

Matthias van de Meent
Databricks Inc.



В списке pgsql-hackers по дате отправления: