Re: Random pg_upgrade test failure on drongo
From: Jim Nasby
Subject: Re: Random pg_upgrade test failure on drongo
Date:
Msg-id: 9897f89d-3d77-40fe-b05f-ac7b492e8160@gmail.com
In reply to: Re: Random pg_upgrade test failure on drongo (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Random pg_upgrade test failure on drongo
List: pgsql-hackers
On 1/4/24 10:19 PM, Amit Kapila wrote:
> On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>> 03.01.2024 14:42, Amit Kapila wrote:
>>>>
>>>> And the internal process is ... background writer (BgBufferSync()).
>>>>
>>>> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
>>>> got 20 x 10 tests passing.
>>>>
>>>> Thus, if we want just to get rid of the test failure, maybe it's enough to
>>>> add this to the test's config...
>>>>
>>> What about checkpoints? Can't it do the same while writing the buffers?
>>
>> As we deal here with pg_upgrade/pg_restore, it may not be very easy to get
>> the desired effect, but I think it's not impossible in principle.
>> More details below.
>> What happens during the pg_upgrade execution is essentially:
>> 1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...;
>>    -- this command flushes file buffers as well
>> 2) ALTER DATABASE postgres OWNER TO ...
>> 3) COMMENT ON DATABASE "postgres" IS ...
>> 4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
>>    SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
>>    SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
>>    TRUNCATE pg_catalog.pg_largeobject;
>>    -- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
>> ...
>>
>> We get the effect discussed when the background writer process decides to
>> flush a file buffer for pg_largeobject during stage 1.
>> (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
>> the result would be the same.)
>> Another important factor is shared_buffers = 1MB (set during the test).
>> With the default setting of 128MB I couldn't see the failure.
>>
>> It can be reproduced easily (on old Windows versions) just by running
>> pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
>> default cluster)).
>> If the old cluster contains a dozen databases, this increases the failure
>> probability significantly (with 10 additional databases I've got failures
>> on iterations 4, 1, 6).
>>
>
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what we should do about these new
> random BF failures. I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem, and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1], as
> pointed out by you.
>
> [1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and ReFS)
under the carpet? I realize that synthetic test workloads like pg_upgrade
in a loop aren't themselves real-world scenarios, but what about other
cases? Even if we're certain these issues can't wedge a server, it's still
not a good experience for users to hit random, unexplained I/O-related
errors...
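For concreteness, the test-side change proposed above would presumably amount
to just these two settings (both names and values are taken from this thread;
exactly where the test injects them into postgresql.conf is my assumption):

    # Stop the background writer's LRU scan from flushing pg_largeobject's
    # buffers while CREATE DATABASE is still copying template0.
    bgwriter_lru_maxpages = 0
    # Push checkpoints far out so one can't fire mid-CREATE DATABASE either.
    checkpoint_timeout = 1h

And a minimal sketch, in Python, of the "pg_upgrade in a loop" reproduction
Alexander describes (hypothetical harness: all paths, the iteration count,
and the restore-from-pristine-copy step are illustrative, not from the
thread; the old cluster is assumed to be initialized with the test's small
shared_buffers = 1MB):

    import shutil, subprocess, sys

    for i in range(1, 101):
        # pg_upgrade can't be rerun against an already-upgraded cluster,
        # so start each iteration from a pristine copy of the new cluster.
        shutil.rmtree(r"C:\new\data", ignore_errors=True)
        shutil.copytree(r"C:\new\data.pristine", r"C:\new\data")
        result = subprocess.run(
            ["pg_upgrade",
             "--old-datadir", r"C:\old\data",
             "--new-datadir", r"C:\new\data",
             "--old-bindir", r"C:\old\bin",
             "--new-bindir", r"C:\new\bin"],
            capture_output=True, text=True)
        if result.returncode != 0:
            # On affected systems this is where "could not create file
            # "base/5/2683": File exists" would surface.
            print(f"failed on iteration {i}")
            print(result.stdout + result.stderr)
            sys.exit(1)
    print("no failure in 100 iterations")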
--
Jim Nasby, Data Architect, Austin TX