Thread: Weird test mixup


Weird test mixup

From
Heikki Linnakangas
Date:
I got a weird test failure while testing my forking refactor patches on 
Cirrus CI 
(https://cirrus-ci.com/task/5880724448870400?logs=test_running#L121):

> [16:52:39.753] Summary of Failures:
> [16:52:39.753] 
> [16:52:39.753] 66/73 postgresql:intarray-running / intarray-running/regress                       ERROR  6.27s  exit status 1
> [16:52:39.753] 
> [16:52:39.753] Ok:                 72  
> [16:52:39.753] Expected Fail:      0   
> [16:52:39.753] Fail:               1   
> [16:52:39.753] Unexpected Pass:    0   
> [16:52:39.753] Skipped:            0   
> [16:52:39.753] Timeout:            0   
> [16:52:39.753] 
> [16:52:39.753] Full log written to /tmp/cirrus-ci-build/build/meson-logs/testlog-running.txt

And:

> diff -U3 /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out
> --- /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out    2024-03-14 16:48:48.690367000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out    2024-03-14 16:52:05.759444000 +0000
> @@ -804,6 +804,7 @@
>  
>  DROP INDEX text_idx;
>  CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
> +ERROR:  error triggered for injection point gin-leave-leaf-split-incomplete
>  SELECT count(*) from test__int WHERE a && '{23,50}';
>   count 
>  -------
> @@ -877,6 +878,7 @@
>  (1 row)
>  
>  DROP INDEX text_idx;
> +ERROR:  index "text_idx" does not exist
>  -- Repeat the same queries with an extended data set. The data set is the
>  -- same that we used before, except that each element in the array is
>  -- repeated three times, offset by 1000 and 2000. For example, {1, 5}

Somehow the 'gin-leave-leaf-split-incomplete' injection point was active 
in the 'intarray' test. That makes no sense. That injection point is 
only used by the test in src/test/modules/gin/. Perhaps that ran at the 
same time as the intarray test? But they run in separate instances, with 
different data directories. And the 'gin' test passed.

I'm completely stumped. Anyone have a theory?

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Weird test mixup

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Somehow the 'gin-leave-leaf-split-incomplete' injection point was active 
> in the 'intarray' test. That makes no sense. That injection point is 
> only used by the test in src/test/modules/gin/. Perhaps that ran at the 
> same time as the intarray test? But they run in separate instances, with 
> different data directories.

Do they?  It'd be fairly easy to explain this if these things were
being run in "installcheck" style.  I'm not sure about CI, but from
memory, the buildfarm does use installcheck for some things.

I wonder if it'd be wise to adjust the injection point stuff so that
it's active in only the specific database the injection point was
activated in.

            regards, tom lane



Re: Weird test mixup

From
Thomas Munro
Date:
On Fri, Mar 15, 2024 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
> > in the 'intarray' test. That makes no sense. That injection point is
> > only used by the test in src/test/modules/gin/. Perhaps that ran at the
> > same time as the intarray test? But they run in separate instances, with
> > different data directories.
>
> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.

Right, as mentioned here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com

That's the "running" test, which is like the old installcheck.



Re: Weird test mixup

From
Tom Lane
Date:
I wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Somehow the 'gin-leave-leaf-split-incomplete' injection point was active
>> in the 'intarray' test. That makes no sense. That injection point is
>> only used by the test in src/test/modules/gin/. Perhaps that ran at the
>> same time as the intarray test? But they run in separate instances, with
>> different data directories.

> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.

Hmm, Munro's comment yesterday[1] says that current CI does use
installcheck mode in some cases.

            regards, tom lane

[1] https://www.postgresql.org/message-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==AVnW84f-+8yamQ@mail.gmail.com



Re: Weird test mixup

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 15, 2024 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Do they?  It'd be fairly easy to explain this if these things were
>> being run in "installcheck" style.  I'm not sure about CI, but from
>> memory, the buildfarm does use installcheck for some things.

> Right, as mentioned here:
> https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com
> That's the "running" test, which is like the old installcheck.

Hmm.  Seems like maybe we need to institute a rule that anything
using injection points has to be marked NO_INSTALLCHECK.  That's
kind of a big hammer though.

            regards, tom lane



Re: Weird test mixup

From
Michael Paquier
Date:
On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
> Do they?  It'd be fairly easy to explain this if these things were
> being run in "installcheck" style.  I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.
>
> I wonder if it'd be wise to adjust the injection point stuff so that
> it's active in only the specific database the injection point was
> activated in.

It can be made optional by extending InjectionPointAttach() to
specify a database OID or a database name.  Note that
041_checkpoint_at_promote.pl wants an injection point to run in the
checkpointer, where we don't have a database requirement.

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
less times.
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Fri, Mar 15, 2024 at 07:53:57AM +0900, Michael Paquier wrote:
> It can be made optional by extending InjectionPointAttach() to
> specify a database OID or a database name.  Note that
> 041_checkpoint_at_promote.pl wants an injection point to run in the
> checkpointer, where we don't have a database requirement.

Slight correction here.  It is also possible to not touch
InjectionPointAttach() at all: just tweak the callbacks to do that as
long as the database that should be used is tracked in shmem with its
point name, say with new fields in InjectionPointSharedState.  That
keeps the backend APIs in a cleaner state.
--
Michael

Attachments

Re: Weird test mixup

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
>> I wonder if it'd be wise to adjust the injection point stuff so that
>> it's active in only the specific database the injection point was
>> activated in.

> It can be made optional by extending InjectionPointAttach() to
> specify a database OID or a database name.  Note that
> 041_checkpoint_at_promote.pl wants an injection point to run in the
> checkpointer, where we don't have a database requirement.

> Or we could just disable runningcheck because of the concurrency
> requirement in this test.  The test would still be able to run, just
> less times.

No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.

I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in.  Then only tests that require a global injection
point need to be NO_INSTALLCHECK.

            regards, tom lane



Re: Weird test mixup

From
Michael Paquier
Date:
On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Or we could just disable runningcheck because of the concurrency
>> requirement in this test.  The test would still be able to run, just
>> less times.
>
> No, actually we *must* mark all these tests NO_INSTALLCHECK if we
> stick with the current definition of injection points.  The point
> of installcheck mode is that the tests are supposed to be safe to
> run in a live installation.  Side-effects occurring in other
> databases are completely not OK.

I really don't want to plug any runtime conditions into the backend
APIs, because there can be so much more that can be done there than
only restricting a callback to a database.  I can imagine process type
restrictions, process PID restrictions, etc.  So this knowledge should
stick into the test module itself, and be expanded there.  That's
easier ABI-wise, as well.

> I can see that some tests would want to be able to inject code
> cluster-wide, but I bet that's going to be a small minority.
> I suggest that we invent a notion of "global" vs "local"
> injection points, where a "local" one only fires in the DB it
> was defined in.  Then only tests that require a global injection
> point need to be NO_INSTALLCHECK.

Attached is a POC of what could be done.  I have extended the module
injection_points so that it is possible to register what I'm calling a
"condition" in the module, defined with a new SQL function.

The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks.  With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name.  Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, the whole happening while
holding the spinlock of the module's shmem state only once.

By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.  Adding a pg_sleep of a
few seconds after 'gin-leave-leaf-split-incomplete' is registered
enlarges the window, and the patch avoids failures when running
installcheck in parallel for modules/gin/ and something else using
gin, like contrib/btree_gin/:
while make USE_MODULE_DB=1 installcheck; do :; done

0001 is the condition facility for the module, 0002 is a fix for the
GIN test.  Thoughts are welcome.
--
Michael

Attachments

Re: Weird test mixup

From
Heikki Linnakangas
Date:
On 15/03/2024 09:39, Michael Paquier wrote:
> On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
>> I can see that some tests would want to be able to inject code
>> cluster-wide, but I bet that's going to be a small minority.
>> I suggest that we invent a notion of "global" vs "local"
>> injection points, where a "local" one only fires in the DB it
>> was defined in.  Then only tests that require a global injection
>> point need to be NO_INSTALLCHECK.
> 
> Attached is a POC of what could be done.  I have extended the module
> injection_points so as it is possible to register what I'm calling a
> "condition" in the module that can be defined with a new SQL function.
>
> The condition is stored in shared memory with the point name, then at
> runtime the conditions are cross-checked in the callbacks.  With the
> interface of this patch, the condition should be registered *before* a
> point is attached, but this stuff could also be written so as
> injection_points_attach() takes an optional argument with a database
> name.  Or this could use a different, new SQL function, say a
> injection_points_attach_local() that registers a condition with
> MyDatabaseId on top of attaching the point, making the whole happening
> while holding once the spinlock of the shmem state for the module.

For the gin test, a single "SELECT injection_points_attach_local()" at 
the top of the test file would be most convenient.

If I have to do "SELECT 
injection_points_condition('gin-finish-incomplete-split', :'datname');" 
for every injection point in the test, I will surely forget it sometimes.

In the 'gin' test, they could actually be scoped to the same backend.


Wrt. the spinlock and shared memory handling, I think this would be 
simpler if you could pass some payload in the InjectionPointAttach() 
call, which would be passed back to the callback function:

  void
  InjectionPointAttach(const char *name,
                       const char *library,
-                     const char *function)
+                     const char *function,
+                     uint64 payload)

In this case, the payload would be the "slot index" in shared memory.

Or perhaps always allocate, say, 1024 bytes of working area for every 
attached injection point that the test module can use any way it wants. 
Like for storing extra conditions, or for the wakeup counter stuff in 
injection_wait(). A fixed size working area is a little crude, but would 
be very handy in practice.

> By the way, modules/gin/ was missing a detach, so the test was
> not repeatable with successive installchecks.

Oops.

It would be nice to automatically detach all the injection points on 
process exit. You wouldn't always want that, but I think most tests hold 
a session open throughout the test, and for those it would be handy.

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Weird test mixup

From
Heikki Linnakangas
Date:
On 15/03/2024 01:13, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Or we could just disable runningcheck because of the concurrency
>> requirement in this test.  The test would still be able to run, just
>> less times.
> 
> No, actually we *must* mark all these tests NO_INSTALLCHECK if we
> stick with the current definition of injection points.  The point
> of installcheck mode is that the tests are supposed to be safe to
> run in a live installation.  Side-effects occurring in other
> databases are completely not OK.

I committed a patch to do that, to put out the fire.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Weird test mixup

From
Heikki Linnakangas
Date:
On 15/03/2024 13:09, Heikki Linnakangas wrote:
> On 15/03/2024 01:13, Tom Lane wrote:
>> Michael Paquier <michael@paquier.xyz> writes:
>>> Or we could just disable runningcheck because of the concurrency
>>> requirement in this test.  The test would still be able to run, just
>>> less times.
>>
>> No, actually we *must* mark all these tests NO_INSTALLCHECK if we
>> stick with the current definition of injection points.  The point
>> of installcheck mode is that the tests are supposed to be safe to
>> run in a live installation.  Side-effects occurring in other
>> databases are completely not OK.
> 
> I committed a patch to do that, to put out the fire.

That's turning the buildfarm quite red. Many, but not all animals are 
failing like this:

> --- /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out    2024-03-15 12:41:16.363286975 +0100
> +++ /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out    2024-03-15 12:53:11.528159615 +0100
> @@ -1,118 +1,111 @@
>  CREATE EXTENSION injection_points;
> +ERROR:  extension "injection_points" is not available
> +DETAIL:  Could not open extension control file "/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control": No such file or directory.
> +HINT:  The extension must first be installed on the system where PostgreSQL is running.
> ...

Looks like adding NO_INSTALLCHECK somehow affected how the modules are 
installed in tmp_install. I'll investigate..

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Weird test mixup

From
Heikki Linnakangas
Date:
On 15/03/2024 14:10, Heikki Linnakangas wrote:
> On 15/03/2024 13:09, Heikki Linnakangas wrote:
>> I committed a patch to do that, to put out the fire.
> 
> That's turning the buildfarm quite red. Many, but not all animals are
> failing like this:
> 
>> --- /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out    2024-03-15 12:41:16.363286975 +0100
>> +++ /home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out    2024-03-15 12:53:11.528159615 +0100
>> @@ -1,118 +1,111 @@
>>  CREATE EXTENSION injection_points;
>> +ERROR:  extension "injection_points" is not available
>> +DETAIL:  Could not open extension control file "/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control": No such file or directory.
>> +HINT:  The extension must first be installed on the system where PostgreSQL is running.
>> ...
> 
> Looks like adding NO_INSTALLCHECK somehow affected how the modules are
> installed in tmp_install. I'll investigate..

I think this is a bug in the buildfarm client. In the make_misc_check 
step, it does this (reduced to just the interesting parts):

> # run the modules that can't be run with installcheck
> sub make_misc_check
> {
>     ...
>     my @dirs = glob("$pgsql/src/test/modules/* $pgsql/contrib/*");
>     foreach my $dir (@dirs)
>     {
>         next unless -e "$dir/Makefile";
>         my $makefile = file_contents("$dir/Makefile");
>         next unless $makefile =~ /^NO_INSTALLCHECK/m;
>         my $test = basename($dir);
> 
>         # skip redundant TAP tests which are called elsewhere
>         my @out = run_log("cd $dir && $make $instflags TAP_TESTS= check");
>         ...
>     }

So it scans src/test/modules, and runs "make check" for all 
subdirectories that have NO_INSTALLCHECK in the makefile. But the 
injection point tests are also conditional on enable_injection_points
in the parent Makefile:

> ifeq ($(enable_injection_points),yes)
> SUBDIRS += injection_points gin
> else
> ALWAYS_SUBDIRS += injection_points gin
> endif

The buildfarm client doesn't pay any attention to that, and runs the 
test anyway.

I committed an ugly hack to the subdirectory Makefiles, to turn "make 
check" into a no-op if injection points are disabled. Normally when you 
run "make check" at the parent level, it doesn't even recurse to the 
directories, but this works around the buildfarm script. I hope...

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Weird test mixup

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 15/03/2024 13:09, Heikki Linnakangas wrote:
>> I committed a patch to do that, to put out the fire.

> That's turning the buildfarm quite red. Many, but not all animals are 
> failing like this:

It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention.  Why would this test have done that,
if the module failed to load?

            regards, tom lane



Re: Weird test mixup

From
Heikki Linnakangas
Date:
On 15/03/2024 16:00, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 15/03/2024 13:09, Heikki Linnakangas wrote:
>>> I committed a patch to do that, to put out the fire.
> 
>> That's turning the buildfarm quite red. Many, but not all animals are
>> failing like this:
> 
> It may be even worse than it appears from the buildfarm status page.
> My animals were stuck in infinite loops that required a manual "kill"
> to get out of, and it's reasonable to assume there are others that
> will require owner intervention.  Why would this test have done that,
> if the module failed to load?

The gin_incomplete_split test inserts rows until it hits the injection 
point, at page split. There is a backstop: it should give up after 10000
iterations, but that was broken. Fixed that, thanks for the report!

Hmm, don't we have any timeout that would kill tests if they get stuck?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Weird test mixup

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 15/03/2024 16:00, Tom Lane wrote:
>> It may be even worse than it appears from the buildfarm status page.
>> My animals were stuck in infinite loops that required a manual "kill"
>> to get out of, and it's reasonable to assume there are others that
>> will require owner intervention.  Why would this test have done that,
>> if the module failed to load?

> The gin_incomplete_split test inserts rows until it hits the injection 
> point, at page split. There is a backstop, it should give up after 10000 
> iterations, but that was broken. Fixed that, thanks for the report!

Duh ...

> Hmm, don't we have any timeout that would kill tests if they get stuck?

AFAIK, the only constraint on a buildfarm animal's runtime is the
wait_timeout setting, which is infinite by default, and was on my
machines.  (Not anymore ;-).)  We do have timeouts in (most?) TAP
tests, but this wasn't a TAP test.

If this is a continuous-insertion loop, presumably it will run the
machine out of disk space eventually, which could be unpleasant
if there are other services running besides the buildfarm.
I think I'll go notify the buildfarm owners list to check for
trouble.

Are there limits on the runtime of CI or cfbot jobs?  Maybe
somebody should go check those systems.

            regards, tom lane



Re: Weird test mixup

From
Thomas Munro
Date:
On Sat, Mar 16, 2024 at 7:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Are there limits on the runtime of CI or cfbot jobs?  Maybe
> somebody should go check those systems.

Those get killed at a higher level after 60 minutes (configurable but
we didn't change it AFAIK):

https://cirrus-ci.org/faq/#instance-timed-out

It's a fresh virtual machine for each run, and after that it's gone
(well the ccache directory survives but only by being
uploaded/downloaded in explicit steps to transmit it between runs).



Re: Weird test mixup

From
Michael Paquier
Date:
On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
> For the gin test, a single "SELECT injection_points_attach_local()" at the
> top of the test file would be most convenient.
>
> If I have to do "SELECT
> injection_points_condition('gin-finish-incomplete-split', :'datname');" for
> every injection point in the test, I will surely forget it sometimes.

So will I, most likely..  The odds never play in favor of hackers.  I
have a few more tests in mind that can be linked to a specific
backend with SQL queries, but I've not been able to get back to it
yet.

> Wrt. the spinlock and shared memory handling, I think this would be simpler
> if you could pass some payload in the InjectionPointAttach() call, which
> would be passed back to the callback function:
>
> In this case, the payload would be the "slot index" in shared memory.
>
> Or perhaps always allocate, say, 1024 bytes of working area for every
> attached injection point that the test module can use any way it wants. Like
> for storing extra conditions, or for the wakeup counter stuff in
> injection_wait(). A fixed size working area is a little crude, but would be
> very handy in practice.

Perhaps.  I am not sure that we need more than the current signature;
all that can just be handled in some module-specific shmem area.  The
key is to be able to link a point name to some state related to it.
Using a hash table would be more efficient, but performance-wise an
array is not going to matter, as there will most likely never be more
than 8 points.  4 is already a lot; 8 just doubles that on safety
grounds.

> It would be nice to automatically detach all the injection points on process
> exit. You wouldn't always want that, but I think most tests hold a session
> open throughout the test, and for those it would be handy.

Linking all the points to a PID with an injection_points_attach_local()
that switches a static flag while registering a before_shmem_exit() to
do an automated cleanup sounds like the simplest approach to me based
on what I'm reading on this thread.

(Just saw the buildfarm storm, wow.)
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
> Linking all the points to a PID with a injection_points_attach_local()
> that switches a static flag while registering a before_shmem_exit() to
> do an automated cleanup sounds like the simplest approach to me based
> on what I'm reading on this thread.

Please find a patch to do exactly that, without touching the backend
APIs.  0001 adds a new function, injection_points_local(), that can
be added on top of a SQL test to make it concurrent-safe.  0002 is the
fix for the GIN tests.

I am going to add an open item to not forget about all that.

Comments are welcome.
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 18 Mar 2024, at 06:04, Michael Paquier <michael@paquier.xyz> wrote:
>
> new function call injection_points_local() that can
> be added on top of a SQL test to make it concurrent-safe.

Maybe consider a function injection_points_attach_local('point name') instead of a static switch?
Or even injection_points_attach_global('point name'), while the function injection_points_attach('point name') would be
local? This would favour writing concurrent tests by default…


Best regards, Andrey Borodin.


Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, Mar 18, 2024 at 10:50:25AM +0500, Andrey M. Borodin wrote:
> Maybe consider a function injection_points_attach_local('point name')
> instead of a static switch?
> Or even injection_points_attach_global('point name'), while the function
> injection_points_attach('point name') would be local? This would
> favour writing concurrent tests by default…

The point is to limit accidents like the one of this thread.  So, for
cases already in the tree, not giving the point name in the SQL
function would be simple enough.

What you are suggesting can be simply done, as well, though I'd rather
wait for a reason to justify doing so.
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, Mar 18, 2024 at 10:04:45AM +0900, Michael Paquier wrote:
> Please find a patch to do exactly that, without touching the backend
> APIs.  0001 adds a new function call injection_points_local() that can
> be added on top of a SQL test to make it concurrent-safe.  0002 is the
> fix for the GIN tests.
>
> I am going to add an open item to not forget about all that.

It's been a couple of weeks since this has been sent, and this did not
get any reviews.  I'd still be happy with the simplicity of a single
injection_points_local() that can be used to link all the injection
points created in a single process to it, discarding them once the
process exits, via a shmem exit callback.  And I don't really see an
argument to tweak the backend-side routines, as well.  Comments and/or
objections?
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 5 Apr 2024, at 07:19, Michael Paquier <michael@paquier.xyz> wrote:
>
> It's been a couple of weeks since this has been sent, and this did not
> get any reviews.  I'd still be happy with the simplicity of a single
> injection_points_local() that can be used to link all the injection
> points created in a single process to it, discarding them once the
> process exits, via a shmem exit callback.

OK, makes sense.
I find the name "injection_points_local()" strange, because there is no verb in it. How about
"injection_points_set_local"?

>  And I don't really see an
> argument to tweak the backend-side routines, as well.
>  Comments and/or
> objections?

I'm not sure if we should refactor anything here, but InjectionPointSharedState has a singular name, a plural wait_counts,
and a singular condition.
InjectionPointSharedState is already an array of injection points; maybe let's add an optional pid there instead of
inventing a separate array of pids?

Can we set a global point to 'notice', but the same local one to 'wait'? Looks like now we can't, but allowing that would
make the code simpler.
Besides this opportunity to simplify stuff, both patches look good to me.


Best regards, Andrey Borodin.


Re: Weird test mixup

From
Michael Paquier
Date:
On Sat, Apr 06, 2024 at 10:34:46AM +0500, Andrey M. Borodin wrote:
> I find name of the function "injection_points_local()" strange,
> because there is no verb in the name. How about
> "injection_points_set_local"?

That makes sense.

> I'm not sure if we should refactor anything here, but
> InjectionPointSharedState has singular name, plural wait_counts and
> singular condition.
> InjectionPointSharedState is already an array of injection points,
> maybe let's add there optional pid instead of inventing separate
> array of pids?

Perhaps we could unify these two concepts, indeed, with a "kind" added
to InjectionPointCondition.  Now waits/wakeups are a different beast
than the conditions that could be assigned to a point to filter if it
should be executed.  More runtime conditions coming immediately into
my mind, that could be added to this structure relate mostly to global
objects, like:
- Specific database name and/or OID.
- Specific role(s).
So that's mostly cross-checking states coming from miscadmin.h for
now.

> Can we set global point to 'notice', but same local to 'wait'? Looks
> like now we can't, but allowing to do so would make code simpler.

You mean using the same point name with more than one callback?  Not
sure we'd want to be able to do that.  Perhaps you're right, though,
if there is a use case that justifies it.

> Besides this opportunity to simplify stuff, both patches looks good
> to me.

Yeah, this module can be always tweaked more if necessary.  Saying
that, naming the new thing "condition" in InjectionPointSharedState
felt strange, as you said, because it is an array of multiple
conditions.

For now I have applied 997db123c054 to make the GIN tests with
injection points repeatable as it was an independent issue, and
f587338dec87 to add the local function pieces.

Attached is the last piece to switch the GIN test to use local
injection points.  85f65d7a26fc should keep the buildfarm at bay,
but I'd rather not take a bet and accidentally freeze the buildfarm, as
it would impact folks who aim at getting patches committed just before
the finish line.  So I am holding on this one for a few more days
until we're past the freeze and the buildfarm is more stable.
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, Apr 08, 2024 at 10:22:40AM +0900, Michael Paquier wrote:
> For now I have applied 997db123c054 to make the GIN tests with
> injection points repeatable as it was an independent issue, and
> f587338dec87 to add the local function pieces.

Bharath has reported to me offlist that one of the new tests has a race
condition when doing the reconnection.  When the backend creating the
local points is very slow to exit, the backend created after the
reconnection may detect that a local point previously created still
exists, causing a failure.  The failure can be reproduced with a sleep
in the shmem exit callback, like:
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -163,6 +163,8 @@ injection_points_cleanup(int code, Datum arg)
     if (!injection_point_local)
         return;

+    pg_usleep(1000000 * 1L);
+
     SpinLockAcquire(&inj_state->lock);
     for (int i = 0; i < INJ_MAX_CONDITION; i++)
     {

At first I was looking at a loop with a scan of pg_stat_activity, but
I've noticed that regress.so includes a wait_pid() that we can use to
make sure that a given process exits before moving on to the next
parts of a test, so I propose to just reuse that here.  This requires
tweaks with --dlpath for meson and ./configure, nothing new.  The CI
is clean.  Patch attached.

Thoughts?
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 8 Apr 2024, at 10:33, Michael Paquier <michael@paquier.xyz> wrote:
> 
> Thoughts?

As an alternative we can make local injection points mutually exclusive.


Best regards, Andrey Borodin.



Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, Apr 08, 2024 at 10:42:08AM +0300, Andrey M. Borodin wrote:
> As an alternative we can make local injection points mutually exclusive.

Sure.  Now, the point of the test is to make sure that the local
cleanup happens, so I'd rather keep it as-is and use the same names
across reloads.
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 8 Apr 2024, at 11:55, Michael Paquier <michael@paquier.xyz> wrote:
>
>  the point of the test is to make sure that the local
> cleanup happens
Uh, I did not understand this, because the commit message was about
stabilizing tests, not extending coverage.
Also, should we drop function wait_pid() at the end of a test?
Given that the --dlpath tweaks are nothing new, I think the patch looks good.


Best regards, Andrey Borodin.


Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, Apr 08, 2024 at 12:29:43PM +0300, Andrey M. Borodin wrote:
> Uh, I did not understand this, because the commit message was about
> stabilizing tests, not extending coverage.

Okay, it is about stabilizing an existing test.

> Also, should we drop function wait_pid() at the end of a test?

Sure.

> Given that the --dlpath tweaks are nothing new, I think the patch looks good.

Applied that after a second check.  And thanks to Bharath for the
poke.
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Tue, Apr 09, 2024 at 12:41:57PM +0900, Michael Paquier wrote:
> Applied that after a second check.  And thanks to Bharath for the
> poke.

And now that the buildfarm is cooler, I've also applied the final
patch in the series as of 5105c9079681 to make the GIN module
concurrent-safe using injection_points_set_local().
--
Michael

Attachments

Re: Weird test mixup

From
Noah Misch
Date:
While writing an injection point test, I encountered a variant of the race
condition that f4083c4 fixed.  It had three sessions and this sequence of
events:

s1: local-attach to POINT
s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
s3: detach POINT, deleting the InjectionPointCondition record
s2: wake up and run POINT as though it had been non-local

On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
> > Wrt. the spinlock and shared memory handling, I think this would be simpler
> > if you could pass some payload in the InjectionPointAttach() call, which
> > would be passed back to the callback function:
> > 
> > In this case, the payload would be the "slot index" in shared memory.
> > 
> > Or perhaps always allocate, say, 1024 bytes of working area for every
> > attached injection point that the test module can use any way it wants. Like
> > for storing extra conditions, or for the wakeup counter stuff in
> > injection_wait(). A fixed size working area is a little crude, but would be
> > very handy in practice.

That would be one good way to solve it.  (Storing a slot index has the same
race condition, but it fixes the race to store a struct containing the PID.)

The best alternative I see is to keep an InjectionPointCondition forever after
creating it.  Give it a "bool valid" field that we set on detach.  I don't see
a major reason to prefer one of these over the other.  One puts a negligible
amount of memory pressure on the main segment, but it simplifies the module
code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
opinions?


Separately, injection_points_cleanup() breaks the rules by calling
InjectionPointDetach() while holding a spinlock.  The latter has an
elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
given as much thought to solutions for this one.

Thanks,
nm



Re: Weird test mixup

From
Michael Paquier
Date:
On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> While writing an injection point test, I encountered a variant of the race
> condition that f4083c4 fixed.  It had three sessions and this sequence of
> events:
>
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s3: detach POINT, deleting the InjectionPointCondition record
> s2: wake up and run POINT as though it had been non-local

Fun.  One thing I would ask is why it makes sense to be able to detach
a local point from a different session than the one who defined it as
local.  Shouldn't the operation of s3 be restricted rather than
authorized as a safety measure, instead?

> On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
>> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
>>> Wrt. the spinlock and shared memory handling, I think this would be simpler
>>> if you could pass some payload in the InjectionPointAttach() call, which
>>> would be passed back to the callback function:
>>>
>>> In this case, the payload would be the "slot index" in shared memory.
>>>
>>> Or perhaps always allocate, say, 1024 bytes of working area for every
>>> attached injection point that the test module can use any way it wants. Like
>>> for storing extra conditions, or for the wakeup counter stuff in
>>> injection_wait(). A fixed size working area is a little crude, but would be
>>> very handy in practice.
>
> That would be one good way to solve it.  (Storing a slot index has the same
> race condition, but it fixes the race to store a struct containing the PID.)
>
> The best alternative I see is to keep an InjectionPointCondition forever after
> creating it.  Give it a "bool valid" field that we set on detach.  I don't see
> a major reason to prefer one of these over the other.  One puts a negligible
> amount of memory pressure on the main segment, but it simplifies the module
> code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
> opinions?

If more state data is needed, the fixed area in injection_point.c
would be better.  Still, I am not sure that this is required here,
either.

> Separately, injection_points_cleanup() breaks the rules by calling
> InjectionPointDetach() while holding a spinlock.  The latter has an
> elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
> given as much thought to solutions for this one.

Indeed.  That's a brain fade.  This one could be fixed by collecting
the point names when cleaning up the conditions and detach after
releasing the spinlock.  This opens a race condition between the
moment when the spinlock is released and the detach, where another
backend could come in and detach a point before the shmem_exit
callback has the time to do its cleanup, even if detach() is
restricted for local points.  So we could do the callback cleanup in
three steps in the shmem exit callback:
- Collect the names of the points to detach, while holding the
spinlock.
- Do the Detach.
- Take again the spinlock, clean up the conditions.

Please see the attached.
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 2 May 2024, at 12:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
>> While writing an injection point test, I encountered a variant of the race
>> condition that f4083c4 fixed.  It had three sessions and this sequence of
>> events:
>>
>> s1: local-attach to POINT
>> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
>> s3: detach POINT, deleting the InjectionPointCondition record
>> s2: wake up and run POINT as though it had been non-local
>
> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

That seems to prevent a meaningful use case.  If we want exactly one session to be waiting just before some specific
point, the only way to achieve this is to create a local injection point.  But the session must be resumable from another
session.
Without this, local waiting injection points are meaningless.


Best regards, Andrey Borodin.


Re: Weird test mixup

From
Michael Paquier
Date:
On Thu, May 02, 2024 at 01:33:45PM +0500, Andrey M. Borodin wrote:
> That seems to prevent meaningful use case. If we want exactly one
> session to be waiting just before some specific point, the only way
> to achieve this is to create local injection point. But the session
> must be resumable from another session.
> Without this local waiting injection points are meaningless.

I am not quite sure to follow your argument here.  It is still
possible to attach a local injection point with a wait callback that
can be awaken by a different backend:
s1: select injection_points_set_local();
s1: select injection_points_attach('popo', 'wait');
s1: select injection_points_run('popo'); -- waits
s2: select injection_points_wakeup('popo');
s1: -- ready for action.

A detach is not a wakeup.
--
Michael

Attachments

Re: Weird test mixup

From
"Andrey M. Borodin"
Date:

> On 2 May 2024, at 13:43, Michael Paquier <michael@paquier.xyz> wrote:
>
> A detach is not a wakeup.

Oh, now I see. Sorry for the noise.

Detaching a local injection point of another backend seems to be useless and can be forbidden.
As far as I understand, your patch is already doing this in
+    if (!injection_point_allowed(name))
+        elog(ERROR, "cannot detach injection point \"%s\" not allowed to run",
+             name);
+

As far as I understand, this will effectively forbid calling injection_points_detach() for a local injection point of
another backend.  Do I get it right?


Best regards, Andrey Borodin.


Re: Weird test mixup

From
Noah Misch
Date:
On Thu, May 02, 2024 at 04:27:12PM +0900, Michael Paquier wrote:
> On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> > While writing an injection point test, I encountered a variant of the race
> > condition that f4083c4 fixed.  It had three sessions and this sequence of
> > events:
> > 
> > s1: local-attach to POINT
> > s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> > s3: detach POINT, deleting the InjectionPointCondition record
> > s2: wake up and run POINT as though it had been non-local

I should have given a simpler example:

s1: local-attach to POINT
s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
s1: exit
s2: wake up and run POINT as though it had been non-local

> Fun.  One thing I would ask is why it makes sense to be able to detach
> a local point from a different session than the one who defined it as
> local.  Shouldn't the operation of s3 be restricted rather than
> authorized as a safety measure, instead?

(That's orthogonal to the race condition.)  When s1 would wait at the
injection point multiple times in one SQL statement, I like issuing the detach
from s3 so s1 waits at just the first encounter with the injection point.
This mimics setting a gdb breakpoint and deleting that breakpoint before
"continue".  The alternative, waking s1 repeatedly until it finishes the SQL
statement, is less convenient.  (I also patched _detach() to wake the waiter,
and I plan to propose that.)

> > Separately, injection_points_cleanup() breaks the rules by calling
> > InjectionPointDetach() while holding a spinlock.  The latter has an
> > elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
> > given as much thought to solutions for this one.
> 
> Indeed.  That's a brain fade.  This one could be fixed by collecting
> the point names when cleaning up the conditions and detach after
> releasing the spinlock.  This opens a race condition between the
> moment when the spinlock is released and the detach, where another
> backend could come in and detach a point before the shmem_exit
> callback has the time to do its cleanup, even if detach() is
> restricted for local points.  So we could do the callback cleanup in

That race condition seems fine.  The test can be expected to control the
timing of backend exits vs. detach calls.  Unlike the InjectionPointRun()
race, it wouldn't affect backends unrelated to the test.

> three steps in the shmem exit callback: 
> - Collect the names of the points to detach, while holding the
> spinlock.
> - Do the Detach.
> - Take again the spinlock, clean up the conditions.
> 
> Please see the attached.

The injection_points_cleanup() parts look good.  Thanks.

> @@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
>  {
>      char       *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
>  
> +    if (!injection_point_allowed(name))
> +        elog(ERROR, "cannot detach injection point \"%s\" not allowed to run",
> +             name);
> +

As above, I disagree with the injection_points_detach() part.



Re: Weird test mixup

From
Michael Paquier
Date:
On Thu, May 02, 2024 at 01:52:20PM +0500, Andrey M. Borodin wrote:
> As far as I understand this will effectively forbid calling
> injection_points_detach() for local injection point of other
> backend. Do I get it right?

Yes, that would be the intention.  Noah has other use cases in mind
with this interface that I did not think about supporting, hence he's
objecting to this restriction.
--
Michael

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Thu, May 02, 2024 at 12:35:55PM -0700, Noah Misch wrote:
> I should have given a simpler example:
>
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s1: exit
> s2: wake up and run POINT as though it had been non-local

Hmm.  Even if you were to emulate that in a controlled manner, you
would need a second injection point that does a wait in s2, which is
something that would happen before injection_callback() and before
scanning the local entry.  This relies on the fact that we're holding
CPU in s2 between the backend shmem hash table lookup and the callback
being called.

>> Fun.  One thing I would ask is why it makes sense to be able to detach
>> a local point from a different session than the one who defined it as
>> local.  Shouldn't the operation of s3 be restricted rather than
>> authorized as a safety measure, instead?
>
> (That's orthogonal to the race condition.)  When s1 would wait at the
> injection point multiple times in one SQL statement, I like issuing the detach
> from s3 so s1 waits at just the first encounter with the injection point.
> This mimics setting a gdb breakpoint and deleting that breakpoint before
> "continue".  The alternative, waking s1 repeatedly until it finishes the SQL
> statement, is less convenient.  (I also patched _detach() to wake the waiter,
> and I plan to propose that.)

Okay.

>> Indeed.  That's a brain fade.  This one could be fixed by collecting
>> the point names when cleaning up the conditions and detach after
>> releasing the spinlock.  This opens a race condition between the
>> moment when the spinlock is released and the detach, where another
>> backend could come in and detach a point before the shmem_exit
>> callback has the time to do its cleanup, even if detach() is
>> restricted for local points.  So we could do the callback cleanup in
>
> That race condition seems fine.  The test can be expected to control the
> timing of backend exits vs. detach calls.  Unlike the InjectionPointRun()
> race, it wouldn't affect backends unrelated to the test.

Sure.  The fact that there are two spinlocks, one in the backend code
and one in the module, opens up concurrency issues.  Making that more
robust if there is a case for it is OK by me, but I'd rather avoid
making the backend-side more complicated than need be.

>> three steps in the shmem exit callback:
>> - Collect the names of the points to detach, while holding the
>> spinlock.
>> - Do the Detach.
>> - Take again the spinlock, clean up the conditions.
>>
>> Please see the attached.
>
> The injection_points_cleanup() parts look good.  Thanks.

Thanks for the check.

>> @@ -403,6 +430,10 @@ injection_points_detach(PG_FUNCTION_ARGS)
>>  {
>>      char       *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
>>
>> +    if (!injection_point_allowed(name))
>> +        elog(ERROR, "cannot detach injection point \"%s\" not allowed to run",
>> +             name);
>> +
>
> As above, I disagree with the injection_points_detach() part.

Okay, noted.  Fine by me to expand this stuff as you see fit; the code
has been written to be extended depending on what people want to
support.  There should be tests in the tree that rely on any
new behavior, though.

I've applied the patch to fix the spinlock logic in the exit callback
for now.
--
Michael

Attachments

Re: Weird test mixup

From
Noah Misch
Date:
On Mon, May 06, 2024 at 10:03:37AM +0900, Michael Paquier wrote:
> On Thu, May 02, 2024 at 12:35:55PM -0700, Noah Misch wrote:
> > I should have given a simpler example:
> > 
> > s1: local-attach to POINT
> > s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> > s1: exit
> > s2: wake up and run POINT as though it had been non-local

Here's how I've patched it locally.  It does avoid changing the backend-side,
which has some attraction.  Shall I just push this?

> Hmm.  Even if you were to emulate that in a controlled manner, you
> would need a second injection point that does a wait in s2, which is
> something that would happen before injection_callback() and before
> scanning the local entry.  This relies on the fact that we're holding
> CPU in s2 between the backend shmem hash table lookup and the callback
> being called.

Right.  We would need "second-level injection points" to write a test for that
race in the injection point system.

Attachments

Re: Weird test mixup

From
Michael Paquier
Date:
On Mon, May 06, 2024 at 02:23:24PM -0700, Noah Misch wrote:
> Here's how I've patched it locally.  It does avoid changing the backend-side,
> which has some attraction.  Shall I just push this?

It looks like you did not rebase on top of HEAD to avoid the spinlock
taken with InjectionPointDetach() in the shmem callback.  I think that
you meant the attached, once rebased (apologies, I've reset the author
field).

+ *    s1: local-attach to POINT
+ *    s2: yield CPU before InjectionPointRun(POINT) calls injection_callback()
+ *    s1: exit()
+ *    s2: run POINT as though it had been non-local

I see.  So you are taking a shortcut in the shape of never resetting
the name of a condition, so that the point of step 4 can check the
runtime condition with a condition still stored even though the point
has been detached and removed from the hash table.

         if (strcmp(condition->name, name) == 0)
         {
+            condition->valid = false;
             condition->pid = 0;
-            condition->name[0] = '\0';
         }
     }

As of HEAD, we rely on InjectionPointCondition->name to be set to
check if a condition is valid.  Your patch adds a different variable
to do mostly the same thing, and never clears the name in any existing
conditions.  A side effect is that this causes the conditions to pile
up on a running server when running installcheck, and assuming that
many test suites are run on a server left running this could cause
spurious failures when failing to find a new slot.  Always resetting
condition->name when detaching a point is a simpler flow and saner
IMO.

Overall, this switches from one detach behavior to a different one,
which may or may not be intuitive depending on what one is looking
for.  FWIW, I see InjectionPointCondition as something that should be
around as long as its injection point exists, with the condition
entirely gone once the point is detached because it should not exist
anymore on the server running, with no information left in shmem.

Through your patch, you make conditions have a different meaning, with
a mix of "local" definition, but it is kind of permanent as it keeps a
trace of the point's name in shmem.  I find the behavior of the patch
less intuitive.  Perhaps it would be interesting to see first the bug
and/or problem you are trying to tackle with this different behavior
as I feel like we could do something even with the module as-is.  As
far as I understand, the implementation of the module on HEAD allows
one to emulate a breakpoint with a wait/wake, which can avoid the
window mentioned in step 2.  Even if a wait point is detached
concurrently, it can be awaken with its traces in shmem removed.
--
Michael

Attachments

Re: Weird test mixup

From
Noah Misch
Date:
On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> On Mon, May 06, 2024 at 02:23:24PM -0700, Noah Misch wrote:
> > Here's how I've patched it locally.  It does avoid changing the backend-side,
> > which has some attraction.  Shall I just push this?
> 
> It looks like you did not rebase on top of HEAD

Yes, the base was 713cfaf (Sunday).

> A side effect is that this causes the conditions to pile
> up on a running server when running installcheck, and assuming that
> many test suites are run on a server left running this could cause
> spurious failures when failing to find a new slot.

Yes, we'd be raising INJ_MAX_CONDITION more often under this approach.

> Always resetting
> condition->name when detaching a point is a simpler flow and saner
> IMO.
> 
> Overall, this switches from one detach behavior to a different one,

Can you say more about that?  The only behavior change known to me is that a
given injection point workload uses more of INJ_MAX_CONDITION.  If there's
another behavior change, it was likely unintended.

> which may or may not be intuitive depending on what one is looking
> for.  FWIW, I see InjectionPointCondition as something that should be
> around as long as its injection point exists, with the condition
> entirely gone once the point is detached because it should not exist
> anymore on the server running, with no information left in shmem.
> 
> Through your patch, you make conditions have a different meaning, with
> a mix of "local" definition, but it is kind of permanent as it keeps a
> trace of the point's name in shmem.  I find the behavior of the patch
> less intuitive.  Perhaps it would be interesting to see first the bug
> and/or problem you are trying to tackle with this different behavior
> as I feel like we could do something even with the module as-is.  As
> far as I understand, the implementation of the module on HEAD allows
> one to emulate a breakpoint with a wait/wake, which can avoid the
> window mentioned in step 2.  Even if a wait point is detached
> concurrently, it can be awaken with its traces in shmem removed.

The problem I'm trying to tackle in this thread is to make
src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
that work, having seen the intarray test suite break when run concurrently
with the injection_points test suite.  That combination still does break at
the exit-time race condition.  To reproduce, apply this attachment to add
sleeps, and run:

make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C contrib/intarray installcheck USE_MODULE_DB=1

Separately, I see injection_points_attach() populates InjectionPointCondition
after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
avoid the same sort of race?  I've not tried to reproduce that one.

Attachments

Re: Weird test mixup

From
Noah Misch
Date:
On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> > Overall, this switches from one detach behavior to a different one,
> 
> Can you say more about that?  The only behavior change known to me is that a
> given injection point workload uses more of INJ_MAX_CONDITION.  If there's
> another behavior change, it was likely unintended.

I see patch inplace030-inj-exit-race-v1.patch does not fix the race seen with
repro-inj-exit-race-v1.patch.  I withdraw inplace030-inj-exit-race-v1.patch,
and I withdraw the above question.

> To reproduce, apply [repro-inj-exit-race-v1.patch] to add
> sleeps, and run:
> 
> make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C contrib/intarray installcheck USE_MODULE_DB=1
> 
> Separately, I see injection_points_attach() populates InjectionPointCondition
> after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
> avoid the same sort of race?  I've not tried to reproduce that one.



Re: Weird test mixup

From
Michael Paquier
Date:
On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
>> Always resetting
>> condition->name when detaching a point is a simpler flow and saner
>> IMO.
>>
>> Overall, this switches from one detach behavior to a different one,
>
> Can you say more about that?  The only behavior change known to me is that a
> given injection point workload uses more of INJ_MAX_CONDITION.  If there's
> another behavior change, it was likely unintended.

As far as I read the previous patch, the conditions stored in
InjectionPointSharedState would never be really gone, even if the
points are removed from InjectionPointHash.

> The problem I'm trying to tackle in this thread is to make
> src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
> that work, having seen the intarray test suite break when run concurrently
> with the injection_points test suite.  That combination still does break at
> the exit-time race condition.  To reproduce, apply this attachment to add
> sleeps, and run:
>
> make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C contrib/intarray installcheck USE_MODULE_DB=1

Thanks for that.  I am not really sure how to protect against that
without a small cost in flexibility for the detach vs run paths.  This
comes down to the fact that a custom routine could be run while it is
concurrently detached, removing any stuff a callback could depend on
in the module.

It was mentioned upthread to add to InjectionPointCacheEntry a fixed
area of memory that modules could use to store some "status" data, but
it would not close the run/detach race because a backend could still
hold a pointer to a callback, with concurrent backends playing with
the contents of InjectionPointCacheEntry (concurrent detaches and
attaches that would cause the same entries to be reused).

One way to close entirely the window would be to hold
InjectionPointLock longer in InjectionPointRun() until the callback
finishes or until it triggers an ERROR.  This would mean that the case
you've mentioned in [1] would change, by blocking the detach() of s3
until the callback of s2 finishes.  We don't have tests in the tree
that do any of that, so holding InjectionPointLock longer would not
break anything on HEAD.  A detach being possible while the callback is
run is something I've considered as valid in d86d20f0ba79, but perhaps
that was too cute of me, even more with the use case of local points.

> Separately, I see injection_points_attach() populates InjectionPointCondition
> after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
> avoid the same sort of race?  I've not tried to reproduce that one.

Good point.  You could run into the case of a concurrent backend
running an injection point that should be local, if the attaching
backend waits between InjectionPointAttach() and the condition getting
registered in injection_points_attach().  That should be reversed.

[1] https://www.postgresql.org/message-id/20240501231214.40@rfd.leadboat.com
--
Michael

Attachments

Re: Weird test mixup

From
Noah Misch
Date:
On Thu, May 09, 2024 at 09:37:54AM +0900, Michael Paquier wrote:
> On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> > The problem I'm trying to tackle in this thread is to make
> > src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
> > that work, having seen the intarray test suite break when run concurrently
> > with the injection_points test suite.  That combination still does break at
> > the exit-time race condition.  To reproduce, apply this attachment to add
> > sleeps, and run:
> > 
> > make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C contrib/intarray installcheck USE_MODULE_DB=1
> 
> Thanks for that.  I am not really sure how to protect that without a
> small cost in flexibility for the cases of detach vs run paths.  This
> comes down to the fact that a custom routine could be run while it
> could be detached concurrently, removing any stuff a callback could
> depend on in the module.
> 
> It was mentioned upthread to add to InjectionPointCacheEntry a fixed
> area of memory that modules could use to store some "status" data, but
> it would not close the run/detach race because a backend could still
> hold a pointer to a callback, with concurrent backends playing with
> the contents of InjectionPointCacheEntry (concurrent detaches and
> attaches that would cause the same entries to be reused).

> One way to close entirely the window would be to hold
> InjectionPointLock longer in InjectionPointRun() until the callback
> finishes or until it triggers an ERROR.  This would mean that the case
> you've mentioned in [1] would change, by blocking the detach() of s3
> until the callback of s2 finishes.

> [1] https://www.postgresql.org/message-id/20240501231214.40@rfd.leadboat.com

Yes, that would be a loss for test readability.  Also, I wouldn't be surprised
if some test will want to attach POINT-B while POINT-A is in injection_wait().
Various options avoiding those limitations:

1. The data area formerly called a "status" area is immutable after attach.
   The core code copies it into a stack buffer to pass in a const callback
   argument.

2. InjectionPointAttach() returns an attachment serial number, and the
   callback receives that as an argument.  injection_points.c would maintain
   an InjectionPointCondition for every still-attached serial number,
   including global attachments.  Finding no condition matching the serial
   number argument means detach already happened and callback should do
   nothing.

3. Move the PID check into core code.

4. Separate the concept of "make ineligible to fire" from "detach", with
   stronger locking for detach.  v1 pointed in this direction, though not
   using that terminology.

5. Injection point has multiple callbacks.  At least one runs with the lock
   held and can mutate the "status" data.  At least one runs without the lock.

(1) is, I think, simple and sufficient.  How about that?



Re: Weird test mixup

From
Michael Paquier
Date:
On Wed, May 08, 2024 at 08:15:53PM -0700, Noah Misch wrote:
> Yes, that would be a loss for test readability.  Also, I wouldn't be surprised
> if some test will want to attach POINT-B while POINT-A is in injection_wait().

Not impossible, still annoying with more complex scenarios.

> Various options avoiding those limitations:
>
> 1. The data area formerly called a "status" area is immutable after attach.
>    The core code copies it into a stack buffer to pass in a const callback
>    argument.
>
> 3. Move the PID check into core code.

The PID checks are specific to the module, and there could be much
more conditions like running only in specific backend types (first
example coming into mind), so I want to avoid that kind of knowledge
in the backend.

> (1) is, I think, simple and sufficient.  How about that?

That sounds fine to do in the end.  I'm not sure how large the
chunk area added to each InjectionPointEntry should be, though.  Would
128B stored in each InjectionPointEntry be more than enough?  Or more
like 1024?  The in-core module does not need much currently, but
larger is likely better for pluggability.
--
Michael

Attachments

Re: Weird test mixup

From:
Michael Paquier
Date:
On Thu, May 09, 2024 at 01:47:45PM +0900, Michael Paquier wrote:
> That sounds fine to do in the end.  I'm not sure how large the
> chunk area added to each InjectionPointEntry should be, though.  Would
> 128B stored in each InjectionPointEntry be more than enough?  Or more
> like 1024?  The in-core module does not need much currently, but
> larger is likely better for pluggability.

Actually, this is leading to simplifications in the module, giving me
the attached:
4 files changed, 117 insertions(+), 134 deletions(-)

So I like your suggestion.  This version closes the race window
between the shmem exit detach in backend A and a concurrent backend B
running a point local to A, so that B will never run the local point of
A.  However, it can cause a failure in the shmem exit callback of
backend A if backend B detaches a point local to A, because A
tracks its local points with a static list in its TopMemoryContext, at
least in the attached.  The second case could be solved by tracking
the list of local points in the module's InjectionPointSharedState,
but is this case really worth the complications it adds to the module,
knowing that the first case would be solid enough?  Perhaps not.

Another idea I have for the second case is to make
InjectionPointDetach() a bit "softer", by returning a boolean status
rather than failing if the detach cannot be done, so that the shmem exit
callback can still loop through the entries it has in store.  It would
always be possible that a concurrent backend does a detach followed by
an attach with the same name, causing the shmem exit callback to drop
a point it should not, but that's not really a plausible case IMO :)

This stuff can be adjusted in subtle ways depending on the cases you
are most interested in.  What do you think?
--
Michael

Attachments

Re: Weird test mixup

From:
Noah Misch
Date:
On Thu, May 09, 2024 at 04:40:43PM +0900, Michael Paquier wrote:
> So I like your suggestion.  This version closes the race window
> between the shmem exit detach in backend A and a concurrent backend B
> running a point local to A, so that B will never run the local point of
> A.  However, it can cause a failure in the shmem exit callback of
> backend A if backend B detaches a point local to A, because A
> tracks its local points with a static list in its TopMemoryContext, at
> least in the attached.  The second case could be solved by tracking
> the list of local points in the module's InjectionPointSharedState,
> but is this case really worth the complications it adds to the module,
> knowing that the first case would be solid enough?  Perhaps not.
> 
> Another idea I have for the second case is to make
> InjectionPointDetach() a bit "softer", by returning a boolean status
> rather than failing if the detach cannot be done, so that the shmem exit
> callback can still loop through the entries it has in store.

The return-bool approach sounds fine.  Up to you whether to do in this patch,
else I'll do it when I add the test.

> It could
> always be possible that a concurrent backend does a detach followed by
> an attach with the same name, causing the shmem exit callback to drop
> a point it should not, but that's not really a plausible case IMO :)

Agreed.  It's reasonable to expect test cases to serialize backend exits,
attach calls, and detach calls.  If we need to fix that later, we can use
attachment serial numbers.

> --- a/src/test/modules/injection_points/injection_points.c
> +++ b/src/test/modules/injection_points/injection_points.c

> +typedef enum InjectionPointConditionType
> +{
> +    INJ_CONDITION_INVALID = 0,

I'd name this INJ_CONDITION_UNCONDITIONAL or INJ_CONDITION_ALWAYS.  INVALID
sounds like a can't-happen event or an injection point that never runs.
Otherwise, the patch looks good and makes src/test/modules/gin safe for
installcheck.  Thanks.



Re: Weird test mixup

From:
Michael Paquier
Date:
On Thu, May 09, 2024 at 04:39:00PM -0700, Noah Misch wrote:

Thanks for the feedback.

> The return-bool approach sounds fine.  Up to you whether to do in this patch,
> else I'll do it when I add the test.

I see no reason not to change the signature of the routine now if we
know that we're going to do it anyway in the future.  I was briefly
wondering if doing the same for InjectionPointAttach() would make
sense, but it has more error states, so I'm not really tempted without
an actual reason (I cannot think of a case where I'd want to put more
control into a module after a failed attach).

>> It could
>> always be possible that a concurrent backend does a detach followed by
>> an attach with the same name, causing the shmem exit callback to drop
>> a point it should not, but that's not really a plausible case IMO :)
>
> Agreed.  It's reasonable to expect test cases to serialize backend exits,
> attach calls, and detach calls.  If we need to fix that later, we can use
> attachment serial numbers.

Okay by me.

> I'd name this INJ_CONDITION_UNCONDITIONAL or INJ_CONDITION_ALWAYS.  INVALID
> sounds like a can't-happen event or an injection point that never runs.
> Otherwise, the patch looks good and makes src/test/modules/gin safe for
> installcheck.  Thanks.

INJ_CONDITION_ALWAYS sounds like a good compromise here.

Attached is an updated patch for now, indented with a happy CI.  I am
still planning to look at that a second time on Monday with a fresher
mind, in case I'm missing something now.
--
Michael

Attachments

Re: Weird test mixup

From:
"Andrey M. Borodin"
Date:

> On 10 May 2024, at 06:04, Michael Paquier <michael@paquier.xyz> wrote:
>
> Attached is an updated patch for now

Can you please add some more comments regarding the purpose of private data?
I somewhat lost track of the discussion for a week or so, and I hoped to grasp the idea of private_data from the
resulting code.  But I cannot do so from the current patch version...

I see that you store the condition in private_data.  So "private" means that this is data specific to the extension,
do I understand it right?

As long as I started anyway, I also want to ask some more stupid questions:
1. Where is the border between the responsibility of an extension and the core part?  I mean, can we define in simple
words what functionality must be in an extension?
2. If we have some concurrency issues, why can't we just protect everything with one giant LWLock/SpinLock?  We have
some locking model instead of serializing access from enter until exit.

Most probably, this was discussed somewhere, but I could not find it.

Thanks!


Best regards, Andrey Borodin.


Re: Weird test mixup

From:
Noah Misch
Date:
On Fri, May 10, 2024 at 10:04:17AM +0900, Michael Paquier wrote:
> Attached is an updated patch for now, indented with a happy CI.  I am
> still planning to look at that a second time on Monday with a fresher
> mind, in case I'm missing something now.

This looks correct, and it works well in my tests.  Thanks.



Re: Weird test mixup

From:
Michael Paquier
Date:
On Sun, May 12, 2024 at 10:48:51AM -0700, Noah Misch wrote:
> This looks correct, and it works well in my tests.  Thanks.

Thanks for looking.  While looking at it yesterday, I decided to
split the change into two commits, one for the infra and one for the
module.  While doing so, I noticed that the case of a private area
passed as NULL was not completely correct, as memcpy would be
undefined.

The open item has been marked as fixed.
--
Michael

Attachments

Re: Weird test mixup

From:
Michael Paquier
Date:
On Sat, May 11, 2024 at 11:45:33AM +0500, Andrey M. Borodin wrote:
> I see that you store the condition in private_data.  So "private" means
> that this is data specific to the extension, do I understand it right?

Yes, it is possible to pass down some custom data, generated in a
module, to the registered callbacks.  One example would be a more
complex condition grammar, like a JSON-based thing.  I don't really
see the need for this amount of complexity in the tree yet, but one
could do that outside the tree easily.

> As long as I started anyway, I also want to ask some more stupid
> questions:
> 1. Where is the border between responsibility of an extension and
> the core part? I mean can we define in simple words what
> functionality must be in extension?

Rule 0 I've been using here: keep the footprint on the backend as
simple as possible.  These have as an absolute minimum requirement:
- A function name.
- A library name.
- A point name.

The private area contents and size are added to address the
concurrency cases with runtime checks.  I didn't see a strong use for
that at first, but Noah has been convincing enough with his use cases
and the fact that the race between detach and run was not completely
closed, because we lacked consistency with the shmem hash table lookup.

> 2. If we have some concurrency issues, why can't we just protect
> everything with one giant LWLock\SpinLock. We have some locking
> model instead of serializing access from enter until exit.

This reduces the test infrastructure's flexibility, because one may want
to attach or detach injection points while a point is running.  So it
is by design that the LWLock protecting the shmem hash table is not held
while a point is running.  This has been covered a bit upthread, and I
want to be able to do that as well.
--
Michael

Attachments