Re: Non-reproducible AIO failure

From Konstantin Knizhnik
Subject Re: Non-reproducible AIO failure
Date
Msg-id 7235a473-e949-404e-a85c-ccefd81c2efa@garret.ru
In response to Re: Non-reproducible AIO failure  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 06/06/2025 2:31 am, Tom Lane wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> I have a very wild guess that's probably wrong in a weird way, but
>> here goes anyway:
>> Did anyone test if interleaving the enum-typed bitfield fields of
>> PgAioHandle with the uint8 fields might solve the issue?
> Ugh.  I think you probably nailed it.
>
> IMO all those struct fields better be declared uint8.
>
>             regards, tom lane

I also think that the problem may be in the compiler. Bitfields with
different enum types look really exotic, so it is no wonder that the
optimizer can do something strange here.
I failed to reproduce the problem with an old version of clang (15.0).
Also, as far as I understand, nobody was able to reproduce the problem
with optimizations disabled (-O0).
That definitely doesn't mean that there is a bug in the optimizer - the
timing may simply have changed.
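
A minimal sketch of the two layouts being compared, to make the suspicion concrete. All names here (`DemoState`, `DemoOp`, `DemoHandleBitfields`, `DemoHandleBytes`) are made up for illustration and only loosely mirror `PgAioHandle`. Enum-typed bit-fields are implementation-defined in C, although gcc and clang accept them. A correct compiler must leave the neighbouring members untouched in both layouts:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real enums; the values are illustrative. */
typedef enum { DEMO_HS_IDLE, DEMO_HS_HANDED_OUT, DEMO_HS_DEFINED } DemoState;
typedef enum { DEMO_OP_INVALID, DEMO_OP_READV } DemoOp;

/* Layout A: enum-typed bit-fields packed next to a plain uint8_t member. */
typedef struct
{
    DemoState state:8;
    DemoOp    op:8;
    uint8_t   flags;
} DemoHandleBitfields;

/* Layout B: every member is a plain byte, as suggested upthread. */
typedef struct
{
    uint8_t state;
    uint8_t op;
    uint8_t flags;
} DemoHandleBytes;

/* Layout B has no implementation-defined packing: three bytes, no padding. */
static_assert(sizeof(DemoHandleBytes) == 3, "three byte-sized members");

/* Store to op in layout A; returns 1 if the neighbours kept their values. */
int
demo_bitfield_store_keeps_neighbours(void)
{
    DemoHandleBitfields h = { DEMO_HS_HANDED_OUT, DEMO_OP_INVALID, 0x5a };

    h.op = DEMO_OP_READV;
    return h.state == DEMO_HS_HANDED_OUT && h.op == DEMO_OP_READV && h.flags == 0x5a;
}

/* Same store in layout B; here the compiler can use a plain byte store. */
int
demo_byte_store_keeps_neighbours(void)
{
    DemoHandleBytes h = { DEMO_HS_HANDED_OUT, DEMO_OP_INVALID, 0x5a };

    h.op = DEMO_OP_READV;
    return h.state == DEMO_HS_HANDED_OUT && h.op == DEMO_OP_READV && h.flags == 0x5a;
}
```

Comparing the code generated for the two functions at -O0 and -O2 would show whether the bit-field store is compiled as a wider read-modify-write. The sketch itself does not reproduce the failure.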

Still, it is not quite clear to me how `PGAIO_OP_READV` managed to be
written.
There is just one place in the code where it is assigned:

```
void
pgaio_io_start_readv(PgAioHandle *ioh,
                      int fd, int iovcnt, uint64 offset)
{
     ...

     pgaio_io_stage(ioh, PGAIO_OP_READV);
}
```

and `pgaio_io_stage` should update both `state` and `op`:

```
     ioh->op = op;
     ioh->result = 0;

     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
```

But as we see from the trace state is still PGAIO_HS_HANDED_OUT, so it 
was not updated.
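
The ordering argument can be written down as a small sketch. It is not the PostgreSQL code: `DemoHandle`, `demo_stage` and `demo_update_state` are hypothetical stand-ins, the entry assertion is an assumption of the sketch, and a C11 release fence stands in for the write barrier. After a completed stage call, the combination seen in the trace (op set, state still handed out) cannot be observed by the same thread:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real enums; the values are illustrative. */
typedef enum { DEMO_HS_IDLE, DEMO_HS_HANDED_OUT, DEMO_HS_DEFINED } DemoState;
typedef enum { DEMO_OP_INVALID, DEMO_OP_READV } DemoOp;

typedef struct
{
    uint8_t state;
    uint8_t op;
    int32_t result;
} DemoHandle;

/* Stand-in for the state update: barrier first, then the state store. */
static void
demo_update_state(DemoHandle *h, DemoState new_state)
{
    atomic_thread_fence(memory_order_release);
    h->state = (uint8_t) new_state;
}

/* Stand-in for staging: op and result are written before the state. */
void
demo_stage(DemoHandle *h, DemoOp op)
{
    assert(h->state == DEMO_HS_HANDED_OUT);   /* assumption of this sketch */
    h->op = (uint8_t) op;
    h->result = 0;
    demo_update_state(h, DEMO_HS_DEFINED);
}

/* The combination reported in the trace: op set, state not advanced. */
int
demo_matches_trace(const DemoHandle *h)
{
    return h->op == DEMO_OP_READV && h->state == DEMO_HS_HANDED_OUT;
}
```

Calling `demo_stage(&h, DEMO_OP_READV)` on a handed-out handle leaves `demo_matches_trace(&h)` false, which is why the traced state looks like a stage call that stopped halfway or an `op` left over from earlier.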

If there is some bug in the optimizer which incorrectly constructs the
mask for a bitfield assignment, it is still not clear where it managed to
get this PGAIO_OP_READV from.
And we can be sure that it is really PGAIO_OP_READV and not just
arbitrary garbage, because Alexander has replaced its value with 0xaa and
we see in the logs that it is really stored.
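
To illustrate what a wrongly constructed mask would and would not explain, here is a sketch that models a bit-field store as an explicit read-modify-write on a 32-bit word. The layout (state in bits 0-7, op in bits 8-15) and all `demo_` names are assumptions made for this example, not what any particular compiler emits:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: state in bits 0-7, op in bits 8-15. */
#define DEMO_OP_SHIFT 8
#define DEMO_OP_MASK  ((uint32_t) 0xff << DEMO_OP_SHIFT)

static_assert(DEMO_OP_MASK == 0xff00u, "op occupies the second byte");

/* A correct bit-field store to op, written as a read-modify-write. */
uint32_t
demo_store_op(uint32_t word, uint8_t op)
{
    return (word & ~DEMO_OP_MASK) | ((uint32_t) op << DEMO_OP_SHIFT);
}

/* Hypothetical miscompile: the clearing mask also covers the state byte. */
uint32_t
demo_store_op_bad_mask(uint32_t word, uint8_t op)
{
    const uint32_t bad_mask = 0xffff;

    return (word & ~bad_mask) | ((uint32_t) op << DEMO_OP_SHIFT);
}

/* Accessors for the two fields of the packed word. */
uint8_t
demo_get_state(uint32_t word)
{
    return (uint8_t) (word & 0xff);
}

uint8_t
demo_get_op(uint32_t word)
{
    return (uint8_t) ((word >> DEMO_OP_SHIFT) & 0xff);
}
```

In this model a bad mask damages the neighbouring `state` byte, but the value that lands in `op` is still the one passed in. A marker such as 0xaa can therefore only appear in `op` if some store actually wrote it.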

If there is a race condition in `pgaio_io_update_state` (which enforces
a memory barrier before updating state), then, for example, inserting
some sleep between the assignment of `op` and the state update should
increase the probability of the error. But that doesn't happen. Also, as
far as I understand, op is updated and read by the same backend. So it
should not be a synchronization issue.

So most likely it is a bug in the optimizer, which generates incorrect
code. Can Alexander or somebody else who was able to reproduce the
problem share the assembler code of the `pgaio_io_reclaim` function?
I am not sure that the bug is in this function, but it is the prime
suspect. Only `pgaio_io_start_readv` can set PGAIO_OP_READV, but we are
almost sure that it was not called.
So it looks like `op` was not cleared, despite what we see in the logs.
But if there were incorrect code in `pgaio_io_reclaim`, then it should
always work incorrectly (not clearing "op"), yet in most cases it works...
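
The last point can be sketched as a toy handle lifecycle. Everything here is hypothetical (`DemoHandle` and the `demo_` functions are not the PostgreSQL code). The sketch only shows that a reclaim routine which deterministically skipped the `op` reset would break every reuse of the handle, not just a rare one:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real enums; the values are illustrative. */
typedef enum { DEMO_HS_IDLE, DEMO_HS_HANDED_OUT, DEMO_HS_DEFINED } DemoState;
typedef enum { DEMO_OP_INVALID, DEMO_OP_READV } DemoOp;

typedef struct
{
    uint8_t state;
    uint8_t op;
} DemoHandle;

/* Stand-in for the reset done at reclaim time: clears op, then goes idle. */
void
demo_reclaim(DemoHandle *h)
{
    h->op = DEMO_OP_INVALID;
    h->state = DEMO_HS_IDLE;
}

/* Variant that forgets op; being deterministic, it would fail on every reuse. */
void
demo_reclaim_buggy(DemoHandle *h)
{
    h->state = DEMO_HS_IDLE;
}

/* Hand an idle handle out again. */
void
demo_acquire(DemoHandle *h)
{
    assert(h->state == DEMO_HS_IDLE);
    h->state = DEMO_HS_HANDED_OUT;
}

/* Invariant expected before a handed-out handle is staged again. */
int
demo_ready_for_start(const DemoHandle *h)
{
    return h->state == DEMO_HS_HANDED_OUT && h->op == DEMO_OP_INVALID;
}
```

With `demo_reclaim` the handle passes `demo_ready_for_start` after every cycle, and with `demo_reclaim_buggy` it fails after the first one. An intermittent failure does not fit either pattern, which is the puzzle described above.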







