Обсуждение: Rework SLRU I/O errors handle

Поиск

Список

Период

Сортировка

Rework SLRU I/O errors handle

От

Maxim Orlov

Дата:

26 февраля, 17:26:05

Hi!

Beginning of the discussion is here [0].

Historically, the SLRU module was designed to handle 32-bit
transactions. However, it is now utilised for handling a variety of
object types, like TransactionId, MultixactId, MultiXactOffset,
QueuePosition, and so on. But the IO error reporting system is still
designed to support 32-bit XIDs exclusively.

The proposed patchset allows us to define a "custom" callback to
improve error messages.

The first two commits add a callback and test case. The subsequent ones
improve I/O error messages. The last one adds the XID epoch to the error
message. It's purely optional, but I think it would be useful.

[0] https://www.postgresql.org/message-id/CACG%3Dezbwy1zargXDNPeYXxZwRW3jXu_aD%3DrcG-7dc4fw7Y9Ojw%40mail.gmail.com

Best regards,

Maxim Orlov.

Вложения

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

06 марта, 17:43:24

On 26/02/2026 16:26, Maxim Orlov wrote:
> Beginning of the discussion is here [0].
> 
> Historically, the SLRU module was designed to handle 32-bit
> transactions. However, it is now utilised for handling a variety of
> object types, like TransactionId, MultixactId, MultiXactOffset,
> QueuePosition, and so on. But the IO error reporting system is still
> designed to support 32-bit XIDs exclusively.
> 
> The proposed patchset allows us to define a "custom" callback to
> improve error messages.
> 
> The first two commits add a callback and test case. The subsequent ones
> improve I/O error messages. The last one adds the XID epoch to the error
> message. It's purely optional, but I think it would be useful.
> 
> [0] https://www.postgresql.org/message-id/ 
> CACG%3Dezbwy1zargXDNPeYXxZwRW3jXu_aD%3DrcG-7dc4fw7Y9Ojw%40mail.gmail.com 
> <https://www.postgresql.org/message-id/ 
> CACG%3Dezbwy1zargXDNPeYXxZwRW3jXu_aD%3DrcG-7dc4fw7Y9Ojw%40mail.gmail.com>

Thanks, looks reasonable.

I'm -1 on the last patch, "Expand xact SLRU IO-error to show epoch" 
though. The epoch isn't used in addressing the SLRU, the patch just 
expands the 32-bit XID into a full 64-bit XID using the current epoch. 
That seems misleading.

- Heikki

Re: Rework SLRU I/O errors handle

От

Álvaro Herrera

Дата:

06 марта, 18:33:23

On 2026-Mar-06, Heikki Linnakangas wrote:

> I'm -1 on the last patch, "Expand xact SLRU IO-error to show epoch" though.
> The epoch isn't used in addressing the SLRU, the patch just expands the
> 32-bit XID into a full 64-bit XID using the current epoch. That seems
> misleading.

I agree on that.

I'm not a fan of the split.  I think it all these patches should be
pushed as a single commit, and avoid introducing xact_errmsg_for_io_error
as an exposed function.  I think that doesn't make a lot of sense.  Each
SLRU should have a correct and appropriate error reporting callback.

The comment added in 0005 is bogus too.  It mentions InvalidTransactionId,
but the problem is not that the value is 0 but rather that we get no
pointer.  Also, in all other callbacks the pointer is asserted to not be
NULL, so why don't we do the same here, and avoid an error message
that's not going to help anyone?  I see however that in the patch we're
passing a NULL to SlruReportIOError(), which means if you get an IO
error with any SLRU other than xact, you're going to get either a crash
on the assertion, or (on non-debug builds) a crash on dereferencing the
NULL pointer.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If you want to have good ideas, you must have many ideas.  Most of them
will be wrong, and what you have to learn is which ones to throw away."
                                                         (Linus Pauling)

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

06 марта, 19:22:42

On 06/03/2026 17:33, Álvaro Herrera wrote:
> I'm not a fan of the split.  I think it all these patches should be
> pushed as a single commit, and avoid introducing xact_errmsg_for_io_error
> as an exposed function.  I think that doesn't make a lot of sense.  Each
> SLRU should have a correct and appropriate error reporting callback.

Agreed.

> The comment added in 0005 is bogus too.  It mentions InvalidTransactionId,
> but the problem is not that the value is 0 but rather that we get no
> pointer.  Also, in all other callbacks the pointer is asserted to not be
> NULL, so why don't we do the same here, and avoid an error message
> that's not going to help anyone?  I see however that in the patch we're
> passing a NULL to SlruReportIOError(), which means if you get an IO
> error with any SLRU other than xact, you're going to get either a crash
> on the assertion, or (on non-debug builds) a crash on dereferencing the
> NULL pointer.

Hmm, I thought we could just never pass a NULL pointer, but if a Write 
fails, slru.c has no context where to pull that opaque_data. Perhaps we 
should just not call the callback in that case.

I'm starting to feel that what SlruReportIOError() currently uses as the 
errdetail, could well be the main error message, and the callback could 
provide the errdetail. I.e. swap the errmsg and errdetail we have now. 
That way, we can just leave out the errdetail for Write failures. The 
current errmsg we have for Write failures is pretty bad: "ERROR: could 
not access status of transaction 0". What's currently the errdetail, 
e.g. "Could not read from file \"%s\" at offset %d: %m", is a lot more 
informative.

- Heikki

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

06 марта, 19:38:26

On 06/03/2026 18:22, Heikki Linnakangas wrote:
> On 06/03/2026 17:33, Álvaro Herrera wrote:
>> I'm not a fan of the split.  I think it all these patches should be
>> pushed as a single commit, and avoid introducing xact_errmsg_for_io_error
>> as an exposed function.  I think that doesn't make a lot of sense.  Each
>> SLRU should have a correct and appropriate error reporting callback.
> 
> Agreed.
> 
>> The comment added in 0005 is bogus too.  It mentions 
>> InvalidTransactionId,
>> but the problem is not that the value is 0 but rather that we get no
>> pointer.  Also, in all other callbacks the pointer is asserted to not be
>> NULL, so why don't we do the same here, and avoid an error message
>> that's not going to help anyone?  I see however that in the patch we're
>> passing a NULL to SlruReportIOError(), which means if you get an IO
>> error with any SLRU other than xact, you're going to get either a crash
>> on the assertion, or (on non-debug builds) a crash on dereferencing the
>> NULL pointer.
> 
> Hmm, I thought we could just never pass a NULL pointer, but if a Write 
> fails, slru.c has no context where to pull that opaque_data. Perhaps we 
> should just not call the callback in that case.
> 
> I'm starting to feel that what SlruReportIOError() currently uses as the 
> errdetail, could well be the main error message, and the callback could 
> provide the errdetail. I.e. swap the errmsg and errdetail we have now. 
> That way, we can just leave out the errdetail for Write failures. The 
> current errmsg we have for Write failures is pretty bad: "ERROR: could 
> not access status of transaction 0". What's currently the errdetail, 
> e.g. "Could not read from file \"%s\" at offset %d: %m", is a lot more 
> informative.

Attached version that does that.

As another data point, we have a lot of error messages like "could not 
open file \"%s\": %m" in the source tree. slru.c was the only place 
where that was in the errdetail() part.

- Heikki

Вложения

v2-0001-Add-a-callback-for-I-O-error-messages-in-SLRUs.patch

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

06 марта, 19:49:22

On 06/03/2026 18:38, Heikki Linnakangas wrote:
> On 06/03/2026 18:22, Heikki Linnakangas wrote:
>> On 06/03/2026 17:33, Álvaro Herrera wrote:
>>> The comment added in 0005 is bogus too.  It mentions 
>>> InvalidTransactionId,
>>> but the problem is not that the value is 0 but rather that we get no
>>> pointer.  Also, in all other callbacks the pointer is asserted to not be
>>> NULL, so why don't we do the same here, and avoid an error message
>>> that's not going to help anyone?  I see however that in the patch we're
>>> passing a NULL to SlruReportIOError(), which means if you get an IO
>>> error with any SLRU other than xact, you're going to get either a crash
>>> on the assertion, or (on non-debug builds) a crash on dereferencing the
>>> NULL pointer.
>>
>> Hmm, I thought we could just never pass a NULL pointer, but if a Write 
>> fails, slru.c has no context where to pull that opaque_data. Perhaps 
>> we should just not call the callback in that case.
>>
>> I'm starting to feel that what SlruReportIOError() currently uses as 
>> the errdetail, could well be the main error message, and the callback 
>> could provide the errdetail. I.e. swap the errmsg and errdetail we 
>> have now. That way, we can just leave out the errdetail for Write 
>> failures. The current errmsg we have for Write failures is pretty bad: 
>> "ERROR: could not access status of transaction 0". What's currently 
>> the errdetail, e.g. "Could not read from file \"%s\" at offset %d: 
>> %m", is a lot more informative.
> 
> Attached version that does that.

That version was a little rushed, with wrong message styles etc. 
leftovers. Sorry about that, here's yet another, more cleaned-up version.
- Heikki

Вложения

v4-0001-Add-a-callback-for-I-O-error-messages-in-SLRUs.patch

Re: Rework SLRU I/O errors handle

От

Maxim Orlov

Дата:

06 марта, 20:08:15

I don't know what's happening with my mail, but I haven't

received the previous letters.

Anyway, v4 looks good to me.

Perhaps the extra double line following clog_errdetail_for_io_error

is unnecessary? But as always, to your taste.

Best regards,

Maxim Orlov.

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

11 марта, 13:51:53

On 06/03/2026 19:08, Maxim Orlov wrote:
> I don't know what's happening with my mail, but I haven't
> received the previous letters.
> 
> Anyway, v4 looks good to me.
> Perhaps the extra double line following clog_errdetail_for_io_error
> is unnecessary? But as always, to your taste.

Thanks. I did one more iteration on this: I realized that the error we 
now printed for errors on pg_multixact/members always printed the 
failing offset, whereas before this patch we usually printed the failing 
*multixid* that the member is part of. Printing the multixid might 
actually be more useful; the offset can more easily be deduced from the 
segment filename and physical offset that is printed anyway, but it's 
harder to know which multixid it belongs to. This printing the 
originating multixid seems useful. If things go badly wrong and you need 
to do manual debugging of a corrupted database, the multixid can more 
easily be compared with the xmax in the heap and with pg_waldump output, 
for example.

We can print both, per attached, which is even better. This is perhaps 
overkill, but then again, if you hit an error like this, you really 
appreciate any extra information you can get.

- Heikki

Вложения

Re: Rework SLRU I/O errors handle

От

Heikki Linnakangas

Дата:

13 марта, 17:32:10

On 11/03/2026 12:51, Heikki Linnakangas wrote:
> On 06/03/2026 19:08, Maxim Orlov wrote:
>> I don't know what's happening with my mail, but I haven't
>> received the previous letters.
>>
>> Anyway, v4 looks good to me.
>> Perhaps the extra double line following clog_errdetail_for_io_error
>> is unnecessary? But as always, to your taste.
> 
> Thanks. I did one more iteration on this: I realized that the error we 
> now printed for errors on pg_multixact/members always printed the 
> failing offset, whereas before this patch we usually printed the failing 
> *multixid* that the member is part of. Printing the multixid might 
> actually be more useful; the offset can more easily be deduced from the 
> segment filename and physical offset that is printed anyway, but it's 
> harder to know which multixid it belongs to. This printing the 
> originating multixid seems useful. If things go badly wrong and you need 
> to do manual debugging of a corrupted database, the multixid can more 
> easily be compared with the xmax in the heap and with pg_waldump output, 
> for example.
> 
> We can print both, per attached, which is even better. This is perhaps 
> overkill, but then again, if you hit an error like this, you really 
> appreciate any extra information you can get.

I hear no objections, so committed that. Thanks!

- Heikki

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Rework SLRU I/O errors handle

Вложения

Вложения

Вложения

Вложения