On Fri, Nov 10, 2023 at 9:44 AM Jeff Janes <jeff.janes@gmail.com> wrote:
>
> I was looking into a possible scalability problem with GIN indexes under
> concurrent insert, but instead I found an uncharacterized bug. One of the
> processes will occasionally throw an error "ERROR: buffer 10112 is not owned
> by resourceowner Portal" where the buffer number changes from run to run.
>
> I've verified this with both 14.9 and 16.1, on ubuntu 22.04. I use an AWS
> m5.4xlarge machine, and haven't tried to verify it on anything else. I don't
> currently have any real hardware with enough CPUs to do a meaningful test.
>
> I've attached the "user data" file I feed to AWS to run the test; this one is
> for v14.9. The v16.1 one is similar except I compile PostgreSQL myself
> (without JIT) rather than getting it from apt. I stand up an ubuntu 22.04
> m5.4xlarge machine with all the defaults, except changing the storage from
> 8GB to 80GB, and feed it the attached user data cloud init file.
>
> If you don't want to parse the meat out of the file, the core of the test is
> to run this command with some escalating level of concurrency in a loop. Each
> call just inserts one JSONB object with highly redundant keys (the same 10
> keys present in every row) but a more distinctive value for each key.
>
> insert into j (j) select jsonb_object_agg(x::text, left(md5(random()::text),5)) from generate_series(1,10) f(x);
>
> I've never seen the error occur until the concurrency reaches at least 4, but
> sample size is too low for that to be definitive.
>
> Unless someone has some better idea, my next step will be to switch the
> column from jsonb to text[] and see if it exists there as well.
>
> I assume the synchronous_commit=off is needed because without it you couldn't
> accumulate enough trials to spot the bug, even though it would exist in that
> setting. I guess I could run the test on a machine with very fast SSD and
> leave synchronous_commit=on, but I'm not looking forward to the cost of
> renting a machine that can do that or figuring out how to configure it. I
> also haven't tried it with fastupdate on. I assume the test would not work
> because the pending list would grow without bound at high concurrencies (it
> would grow faster than a single-threaded cleaner could clean it) and so not
> seeing the bug would not mean it wasn't present.
>
> The test loops the insert for one minute, at each concurrency from 1 to 10,
> then starts over at -c 1 again. It seems like if you don't see the bug within
> the first 20 minutes (the first two 1-to-10 concurrency cycles) you are
> unlikely to see it at all. But that is more a hunch than a formal analysis.
>
> Cheers,
>
> Jeff

I can reproduce this after checking out
e9f075f9a15593fe31c610e15cfc71a5fa281ede, but master seems OK, since Heikki
has committed a ResourceOwner-related patch after that commit.
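
In case it helps anyone reproduce this without the cloud-init file, here is a
rough sketch of the loop as described in the quoted mail. It is not the
attached script. Using pgbench as the driver, the file name insert.sql (which
would hold the single INSERT statement quoted above), and the database name
"test" are all my assumptions, and it assumes table j and its GIN index already
exist. By default it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch only. Assumed, not taken from the original mail: pgbench as the
# driver, the script name insert.sql, and the database name "test".
DB="${DB:-test}"
SCRIPT="${SCRIPT:-insert.sql}"
SECONDS_PER_STEP=60    # one minute per concurrency level

# Print the pgbench command for one concurrency level.
pgbench_cmd() {
    # -n: skip vacuuming pgbench's own tables, -f: custom script file,
    # -c/-j: number of clients and worker threads, -T: run time in seconds
    echo "pgbench -n -f $SCRIPT -c $1 -j $1 -T $SECONDS_PER_STEP $DB"
}

# Print one full cycle: concurrency 1 through 10.
one_cycle() {
    c=1
    while [ "$c" -le 10 ]; do
        pgbench_cmd "$c"
        c=$((c + 1))
    done
}

if [ "${RUN:-0}" = "1" ]; then
    # libpq passes PGOPTIONS to the server, so every pgbench session
    # runs with synchronous_commit=off.
    export PGOPTIONS="-c synchronous_commit=off"
    # Repeat cycles until interrupted, starting over at -c 1 each time.
    while :; do
        one_cycle | sh
    done
else
    # Dry run: just show what would be executed.
    one_cycle
fi
```

While it runs, watch the pgbench output and the server log for the
"is not owned by resourceowner" error.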
--
Regards
Junwang Zhao