Re: Random pg_upgrade test failure on drongo

From	Jim Nasby
Subject	Re: Random pg_upgrade test failure on drongo
Date	January 8, 19:06:40
Msg-id	9897f89d-3d77-40fe-b05f-ac7b492e8160@gmail.com
In response to	Re: Random pg_upgrade test failure on drongo (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Random pg_upgrade test failure on drongo
List	pgsql-hackers

On 1/4/24 10:19 PM, Amit Kapila wrote:
> On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>> 03.01.2024 14:42, Amit Kapila wrote:
>>>
>>
>>>> And the internal process is ... background writer (BgBufferSync()).
>>>>
>>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>>>> got 20 x 10 tests passing.
>>>>
>>>> Thus, if we just want to get rid of the test failure, maybe it's enough to
>>>> add this to the test's config...
>>>>
>>> What about checkpoints? Can't it do the same while writing the buffers?
>>
>> As we deal here with pg_upgrade/pg_restore, it must not be very easy to get
>> the desired effect, but I think it's not impossible in principle.
>> More details below.
>> What happens during the pg_upgrade execution is essentially:
>> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
>> -- this command flushes file buffers as well
>> 2) ALTER DATABASE postgres OWNER TO ...
>> 3) COMMENT ON DATABASE "postgres" IS ...
>> 4)     -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>>       SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>>       SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>>       TRUNCATE pg_catalog.pg_largeobject;
>> --  ^^^ here we can get the error "could not create file "base/5/2683": File exists"
>> ...
>>
>> We get the effect discussed when the background writer process decides to
>> flush a file buffer for pg_largeobject during stage 1.
>> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
>> the result must be the same.)
>> And another important factor is shared_buffers = 1MB (set during the test).
>> With the default setting of 128MB I couldn't see the failure.
>>
>> It can be reproduced easily (on old Windows versions) just by running
>> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
>> default cluster)).
>> If an old cluster contains dozens of databases, this increases the failure
>> probability significantly (with 10 additional databases I've got failures
>> on iterations 4, 1, 6).
>>
> 
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
> 
> [1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and 
ReFS) under the carpet? I realize that synthetic test workloads like 
pg_upgrade in a loop aren't themselves real-world scenarios, but what 
about other cases? Even if we're certain it's not possible for these 
issues to wedge a server, it's still not a good experience for users to 
get random, unexplained IO-related errors...
-- 
Jim Nasby, Data Architect, Austin TX
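
For reference, the test-only workaround discussed in the quoted text could be
sketched as the postgresql.conf fragment below. This is a minimal sketch, not
a committed patch: the two settings and their values come from the messages
above, while where and how the test harness would apply them is an assumption.
Setting bgwriter_lru_maxpages to 0 disables background writing but leaves
checkpoints unaffected, which is why the checkpoint interval is raised as well.

```conf
# Test-only overrides (sketch; placement in the test's config is an assumption)
bgwriter_lru_maxpages = 0     # disable background writer buffer flushes
checkpoint_timeout = 1h       # make a timed checkpoint unlikely during the test
# The test already sets shared_buffers = 1MB, which makes the failure reproducible.
```

These overrides only reduce the chance of hitting the failure in the test; they
do not change the underlying file-system behavior questioned above.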
