Thread: Interrupts vs signals

Interrupts vs signals

From
Thomas Munro
Date:
Hi,

I wonder if we really need signals to implement interrupts.  Given
that they are non-preemptive/cooperative (work happens at the next
CFI()), why not just use shared memory flags and latches?  That skips
a bunch of code, global variables and scary warnings about programming
in signal handlers.
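
To illustrate the shape I have in mind (a rough sketch only; the names
here are placeholders, not exactly what the attached patch uses):

    /* Sketch: per-backend pending-interrupt bitmap in shared memory. */
    static void
    SendInterrupt(int reason, ProcNumber pgprocno)
    {
        PGPROC *proc = GetPGProcByNumber(pgprocno);

        /* Publish the interrupt bit; the atomic op is a full barrier. */
        pg_atomic_fetch_or_u32(&proc->pending_interrupts, 1U << reason);

        /* Wake the recipient if it is sleeping. */
        SetLatch(&proc->procLatch);
    }

CHECK_FOR_INTERRUPTS() then just reads the bitmap and services whatever
bits it finds set, with no signal handler involved at all.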

I sketched out some code to try that a few months back, while
speculating about bite-sized subproblems that would come up if each
backend is, one day, a thread.

There are several other conditions that are also handled by
CHECK_FOR_INTERRUPTS(), but are not triggered by other backends
sending signals, or are set by other signal handlers (SIGALRM,
SIGQUIT).  One idea is to convert those into "procsignals" too, for
consistency.  In the attached, they can be set (ie by the same
backend) with ProcSignalRaise(), but it's possible that in future we
might have a reason for another backend to set them too, so it seems
like a good idea to have a single system, effectively merging the
concepts of "procsignals" and "interrupts".

There are still a few more ad hoc (non-ProcSignal) uses of SIGUSR1 in
the tree.  For one thing, we don't allow the postmaster to set
latches; if we gave up on that rule, we wouldn't need the bgworker
please-signal-me thing.  Also the logical replication launcher does
the same sort of thing for no apparent reason.  Changed in the
attached -- mainly so I could demonstrate that check-world passes with
SIGUSR1 ignored.

The attached is only experiment grade code: in particular, I didn't
quite untangle the recovery conflict flags properly.  It's also doing
function calls where some kind of fast inlined magic is probably
required, and I probably have a few other details wrong, but I figured
it was good enough to demonstrate the concept.

Attachments

Re: Interrupts vs signals

From
Andres Freund
Date:
Hi,

On 2021-10-21 07:55:54 +1300, Thomas Munro wrote:
> I wonder if we really need signals to implement interrupts.  Given
> that they are non-preemptive/cooperative (work happens at the next
> CFI()), why not just use shared memory flags and latches?  That skips
> a bunch of code, global variables and scary warnings about programming
> in signal handlers.

Depending on how you implement them, one difference could be whether / when
"slow" system calls (recv, poll, etc) are interrupted.

Another is that signal handling provides a memory barrier in the
receiving process. For things that rarely change (like most interrupts), it
can be more efficient to move that cost out-of-line, instead of
incurring it on every check.


One nice thing about putting the state variables into shared memory is that
it would allow seeing the pending interrupts of other backends for debugging
purposes.


> One idea is to convert those into "procsignals" too, for
> consistency.  In the attached, they can be set (ie by the same
> backend) with ProcSignalRaise(), but it's possible that in future we
> might have a reason for another backend to set them too, so it seems
> like a good idea to have a single system, effectively merging the
> concepts of "procsignals" and "interrupts".

This seems a bit confusing to me. For one, we need to have interrupts working
before we have a proc, IIRC. But leaving details like that aside, it just
seems a bit backwards to me. I'm on board with other backends directly setting
interrupt flags, but it seems to me that the procsignal stuff should be
"client" of the process-local interrupt infrastructure, rather than the other
way round.


> +bool
> +ProcSignalAnyPending(void)
> +{
> +    volatile ProcSignalSlot *slot = MyProcSignalSlot;
>  
> -    if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
> -        RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
> +    /* XXX make this static inline? */
> +    /* XXX point to a dummy entry instead of using NULL to avoid a branch */
> +    return slot && slot->pss_signaled;
> +}

ISTM it might be easier to make this stuff efficiently race-free if we made
this a count of pending operations.


> @@ -3131,12 +3124,13 @@ ProcessInterrupts(void)
>      /* OK to accept any interrupts now? */
>      if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
>          return;
> -    InterruptPending = false;
> +    ProcSignalClearAnyPending();
> +
> +    pg_read_barrier();
>  
> -    if (ProcDiePending)
> +    if (ProcSignalConsume(PROCSIG_DIE))
>      {

I think making all of these checks into function calls isn't great. How about
making the set of pending signals a bitmask? That'd allow checking a bunch of
interrupts together efficiently, and even where not, it'd just be a single test
of the mask, likely already in a register.
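
Something like this, I mean (hypothetical names, just to sketch the
shape):

    #define INTERRUPT_DIE           (1U << 0)
    #define INTERRUPT_QUERY_CANCEL  (1U << 1)

    /* Testing a whole group of interrupts is one load and one AND. */
    static inline bool
    InterruptPending(uint32 mask)
    {
        return (pg_atomic_read_u32(MyPendingInterrupts) & mask) != 0;
    }

    /* Atomically clear a single bit, reporting whether it was set. */
    static inline bool
    ConsumeInterrupt(uint32 bit)
    {
        return (pg_atomic_fetch_and_u32(MyPendingInterrupts, ~bit) & bit) != 0;
    }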


Greetings,

Andres Freund



Re: Interrupts vs signals

From
Thomas Munro
Date:
On Thu, Oct 21, 2021 at 8:27 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-10-21 07:55:54 +1300, Thomas Munro wrote:
> > I wonder if we really need signals to implement interrupts.  Given
> > that they are non-preemptive/cooperative (work happens at the next
> > CFI()), why not just use shared memory flags and latches?  That skips
> > a bunch of code, global variables and scary warnings about programming
> > in signal handlers.
>
> Depending on how you implement them, one difference could be whether / when
> "slow" system calls (recv, poll, etc) are interrupted.

Hopefully by now all such waits are implemented with latch.c facilities?

> Another is that signal handling provides a memory barrier in the
> receiving process. For things that rarely change (like most interrupts), it
> can be more efficient to move that cost out-of-line, instead of
> incurring it on every check.

Agreed, but in this experiment I was trying out the idea that a memory
barrier is not really needed at all, unless you're about to go to
sleep.  We already insert one of those before a latch wait.  That is,
if we see !set->latch->is_set, we do pg_memory_barrier() and check
again, before sleeping, so your next CFI must see the flag.  For
computation loops (sort, hash, query execution, ...), I speculate that
a relaxed read of memory is fine... you'll see the flag pretty soon,
and you certainly won't be allowed to finish your computation and go
to sleep.
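
In other words, the wait side already does something like this
(simplified sketch of latch.c's existing protocol):

    for (;;)
    {
        if (set->latch->is_set)
            break;              /* work to do, don't sleep */

        /*
         * Barrier, then re-check, so a flag set just before this point
         * cannot be missed; only the sleep path pays for the barrier.
         */
        pg_memory_barrier();
        if (set->latch->is_set)
            break;

        /* ... really sleep, in epoll/kqueue/poll ... */
    }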

> One nice thing about putting the state variables into shared memory is that
> it would allow seeing the pending interrupts of other backends for debugging
> purposes.

+1

> > One idea is to convert those into "procsignals" too, for
> > consistency.  In the attached, they can be set (ie by the same
> > backend) with ProcSignalRaise(), but it's possible that in future we
> > might have a reason for another backend to set them too, so it seems
> > like a good idea to have a single system, effectively merging the
> > concepts of "procsignals" and "interrupts".
>
> This seems a bit confusing to me. For one, we need to have interrupts working
> before we have a proc, IIRC. But leaving details like that aside, it just
> seems a bit backwards to me. I'm on board with other backends directly setting
> interrupt flags, but it seems to me that the procsignal stuff should be
> "client" of the process-local interrupt infrastructure, rather than the other
> way round.

Hmm.  Yeah, I see your point.  But I can also think of some arguments
for merging the concepts of local and shared interrupts; see below.

In this new sketch, I tried doing it the other way around.  That is,
completely removing the concept of "ProcSignal", leaving only
"Interrupts".  Initially, MyPendingInterrupts points to something
private, and once you're connected to shared memory it points to
MyProc->pending_interrupts.

> > +bool
> > +ProcSignalAnyPending(void)
> > +{
> > +     volatile ProcSignalSlot *slot = MyProcSignalSlot;
> >
> > -     if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
> > -             RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
> > +     /* XXX make this static inline? */
> > +     /* XXX point to a dummy entry instead of using NULL to avoid a branch */
> > +     return slot && slot->pss_signaled;
> > +}
>
> ISTM it might be easier to make this stuff efficiently race-free if we made
> this a count of pending operations.

Hmm, with a unified interrupt system and a bitmap it's not necessary
to have a separate flag/counter at all.

> > @@ -3131,12 +3124,13 @@ ProcessInterrupts(void)
> >       /* OK to accept any interrupts now? */
> >       if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
> >               return;
> > -     InterruptPending = false;
> > +     ProcSignalClearAnyPending();
> > +
> > +     pg_read_barrier();
> >
> > -     if (ProcDiePending)
> > +     if (ProcSignalConsume(PROCSIG_DIE))
> >       {
>
> I think making all of these checks into function calls isn't great. How about
> making the set of pending signals a bitmask? That'd allow checking a bunch of
> interrupts together efficiently, and even where not, it'd just be a single test
> of the mask, likely already in a register.

+1.

Some assorted notes:

1.  Aside from doing interrupts in this new way, I also have the
postmaster setting latches (!) instead of sending ad hoc SIGUSR1 here
and there.  My main reason for doing that was to be able to chase out
all reasons to register a SIGUSR1 handler, so I could prove that
check-world passes.  I like it, though. But maybe it's really a
separate topic.

2.  I moved this stuff into interrupt.{h,c}.  There is nothing left in
procsignal.c except code relating to ProcSignalBarrier.  I guess that
thing could use another name, anyway.  It's a ...
SystemInterruptBarrier?

3.  Child-level SIGINT and SIGTERM handlers probably aren't really
necessary, either: maybe the sender could do
InterruptSend(INTERRUPT_{DIE,QUERY_CANCEL}, pgprocno) instead?  But
perhaps people are attached to being able to send those signals from
external programs directly to backends.

4.  Like the above, a SIGALRM handler might need to do eg
InterruptRaise(INTERRUPT_STATEMENT_TIMEOUT).  That's a problem for
systems using spinlocks (self-deadlock against user context in
InterruptRaise()), so I'd need to come up with some flag protocol for
dinosaurs to make that safe, OR revert to having these "local only"
interrupts done with separate flags, as you were getting at earlier.

5.  The reason I prefer to put currently "local only" interrupts into
the same atomic system is that I speculate that ultimately all of the
backend-level signal handlers won't be needed.  They all fall into
three categories: (1) could be replaced with these interrupts
directly, (2) could be replaced by the new timer infrastructure that
multithreaded postgres would need to have to deliver interrupts to the
right recipients, (3) are quickdie, which can be handled at the
containing process level.  Then the only signal handlers left are top
level external ones.

But perhaps you're right and I should try reintroducing separate local
interrupts for now.  I dunno, I like the simplicity of the unified
system; if only it weren't for those spinlock-backed atomics.

Attachments

Re: Interrupts vs signals

From
Robert Haas
Date:
On Thu, Nov 11, 2021 at 12:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Depending on how you implement them, one difference could be whether / when
> > "slow" system calls (recv, poll, etc) are interrupted.
>
> Hopefully by now all such waits are implemented with latch.c facilities?

Do read(), write(), etc. count? Because we certainly have raw calls to
those functions in lots of places.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Interrupts vs signals

From
Andres Freund
Date:
Hi,

On 2021-11-11 09:06:01 -0500, Robert Haas wrote:
> On Thu, Nov 11, 2021 at 12:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Depending on how you implement them, one difference could be whether / when
> > > "slow" system calls (recv, poll, etc) are interrupted.
> >
> > Hopefully by now all such waits are implemented with latch.c facilities?
> 
> Do read(), write(), etc. count? Because we certainly have raw calls to
> those functions in lots of places.

They can count, but only when used for network sockets or pipes ("slow
devices" or whatever the posix language is). Disk IO doesn't count as that. So
I don't think it'd be a huge issue.

Greetings,

Andres Freund



Re: Interrupts vs signals

From
Robert Haas
Date:
On Thu, Nov 11, 2021 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:
> They can count, but only when used for network sockets or pipes ("slow
> devices" or whatever the posix language is). Disk IO doesn't count as that. So
> I don't think it'd be a huge issue.

Somehow the idea that the network is a slow device and the disk a fast
one does not seem like it's necessarily accurate on modern hardware,
but I guess the spec is what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Interrupts vs signals

From
Thomas Munro
Date:
On Fri, Nov 12, 2021 at 9:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 11, 2021 at 2:50 PM Andres Freund <andres@anarazel.de> wrote:
> > They can count, but only when used for network sockets or pipes ("slow
> > devices" or whatever the posix language is). Disk IO doesn't count as that. So
> > I don't think it'd be a huge issue.
>
> Somehow the idea that the network is a slow device and the disk a fast
> one does not seem like it's necessarily accurate on modern hardware,
> but I guess the spec is what it is.

[Somehow I managed to reply to Robert only; let me try that again,
this time to the list...]

Network filesystems have in the past been confusing because they're
both disk-like and network-like, and also slow as !@#$, which is why
there have been mount point options like "intr", "nointr" (now ignored
on Linux) to control what happens if you receive an async signal
during a sleepy read/write.  But even if you had some kind of
Deathstation 9000 that had a switch on the front panel that ignores
SA_RESTART and produces EINTR for disk I/O when a signal arrives,
PostgreSQL already doesn't work today.  Our pread() and pwrite() paths
for data and WAL don't have EINTR retry loops or
CHECK_FOR_INTERRUPTS() calls (we just can't take interrupts in the middle of
eg a synchronous write), so I think we'd produce an ERROR or PANIC.
So I think disk I/O is irrelevant, and network/pipe I/O is already
handled everywhere via latch.c facilities.

If there are any eg blocking reads on a socket in PostgreSQL, we
should fix that to use latch.c non-blocking techniques, because such a
place is already a place that ignores postmaster death and interrupts.
To be more precise: such a place could of course wake up for EINTR on
SIGUSR1 from procsignal.c, and that would no longer happen with my
patch, but if we're relying on that anywhere, it's dangerous and
unreliable.  If SIGUSR1 is delivered right before you enter a blocking
read(), you'll sleep waiting for the socket or whatever.  That's
precisely the problem that latch.c solves, and why it's already a bug
if there are such places.



Re: Interrupts vs signals

From
Thomas Munro
Date:
Here's an updated version of this patch.

The main idea is that SendProcSignal(pid, PROCSIGNAL_XXX, procno)
becomes SendInterrupt(INTERRUPT_XXX, procno), and all the pending
interrupt global variables and pss_procsignalFlags[] go away, along
with the SIGUSR1 handler.  The interrupts are compressed into a single
bitmap.  See commit message for more details.
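
For example, a typical call site changes roughly like this (sketch):

    /* Before: SIGUSR1 plus a per-reason flag in the procsignal slot. */
    SendProcSignal(proc->pid, PROCSIG_CATCHUP_INTERRUPT, procNumber);

    /*
     * After: atomically OR a bit into the recipient's interrupt bitmap
     * and set its latch; no signal handler runs anywhere.
     */
    SendInterrupt(INTERRUPT_SINVAL_CATCHUP, procNumber);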

The patch is failing on Windows CI for reasons I haven't debugged yet,
but I wanted to share what I have so far.  Work in progress!

Here is my attempt to survey the use of signals and write down what I
think we might do about them all so far, to give the context for this
patch:

https://wiki.postgresql.org/wiki/Signals

Comments, corrections, edits very welcome.

Attachments

Re: Interrupts vs signals

From
Heikki Linnakangas
Date:
On 08/07/2024 05:56, Thomas Munro wrote:
> Here's an updated version of this patch.
> 
> The main idea is that SendProcSignal(pid, PROCSIGNAL_XXX, procno)
> becomes SendInterrupt(INTERRUPT_XXX, procno), and all the pending
> interrupt global variables and pss_procsignalFlags[] go away, along
> with the SIGUSR1 handler.  The interrupts are compressed into a single
> bitmap.  See commit message for more details.
> 
> The patch is failing on Windows CI for reasons I haven't debugged yet,
> but I wanted to share what I have so far.  Work in progress!
> 
> Here is my attempt to survey the use of signals and write down what I
> think we might do about them all so far, to give the context for this
> patch:
> 
> https://wiki.postgresql.org/wiki/Signals
> 
> Comments, corrections, edits very welcome.

Nice, thanks!

> Background worker state notifications are also changed from raw
> kill(SIGUSR1) to SetLatch().  That means that SetLatch() is now called
> from the postmaster.  The main purpose of including that change is to be
> able to remove procsignal_sigusr1_handler completely and set SIGUSR1 to
> SIG_IGN, and show the system working.
> 
> XXX Do we need to invent SetLatchRobust() that doesn't trust anything in
> shared memory, to be able to set latches from the postmaster?

The patch actually does both: it still does kill(SIGUSR1) and also sets 
the latch.

I think it would be nice if RegisterDynamicBackgroundWorker() had a 
"bool notify_me" argument, instead of requiring the caller to set 
"bgw_notify_pid = MyProcPid" before the call. That's a 
backwards-compatibility break, but maybe we should bite the bullet and 
do it. Or we could do this in RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
     worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything other than 0 or 
MyProcPid.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it 
can do any damage if you called it on a pointer to garbage, except that if 
the pointer itself is bogus, just dereferencing it can cause a 
segfault. So it would be nice to have a version specifically designed 
with that in mind. For example, it could assume that the Latch's pid is 
never legally equal to MyProcPid, because postmaster cannot own any latches.

Another approach would be to move the responsibility of background 
worker state notifications out of postmaster completely. When a new 
background worker is launched, the worker process itself could send the 
notification that it has started. And similarly, when a worker exits, it 
could send the notification just before exiting. There's a little race 
condition with exiting: if a process is waiting for the bgworker to 
exit, and launches a new worker immediately when the old one exits, 
there will be a brief period when the old and new process are alive at 
the same time. The old worker wouldn't be doing anything interesting 
anymore since it's exiting, but it still counts towards 
max_worker_processes, so launching the new process might fail because of 
hitting the limit. Maybe we should just bump up max_worker_processes. Or 
postmaster could check PMChildFlags and not count processes that have 
already deregistered from PMChildFlags towards the limit.

> -volatile uint32 InterruptHoldoffCount = 0;
> -volatile uint32 QueryCancelHoldoffCount = 0;
> -volatile uint32 CritSectionCount = 0;
> +uint32 InterruptHoldoffCount = 0;
> +uint32 QueryCancelHoldoffCount = 0;
> +uint32 CritSectionCount = 0;

I wondered if these are used in PG_TRY-CATCH blocks in a way that would 
still require volatile. I couldn't find any such usage by some quick 
grepping, so I think we're good, but I thought I'd mention it.

> +/*
> + * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
> + * ProcessInterrupts() handle.  These perform work that is safe to run whenever
> + * interrupts are not "held".  Other kinds of interrupts are only handled at
> + * more restricted times.
> + */
> +#define INTERRUPT_STANDARD_MASK                               \

Some interrupts are missing from this mask:

- INTERRUPT_PARALLEL_APPLY_MESSAGE
- INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
- INTERRUPT_SINVAL_CATCHUP
- INTERRUPT_NOTIFY

Is that on purpose?

> -/*
> - * Because backends sitting idle will not be reading sinval events, we
> - * need a way to give an idle backend a swift kick in the rear and make
> - * it catch up before the sinval queue overflows and forces it to go
> - * through a cache reset exercise.  This is done by sending
> - * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
> - *
> - * The signal handler will set an interrupt pending flag and will set the
> - * processes latch. Whenever starting to read from the client, or when
> - * interrupted while doing so, ProcessClientReadInterrupt() will call
> - * ProcessCatchupEvent().
> - */
> -volatile sig_atomic_t catchupInterruptPending = false;

Would be nice to move that comment somewhere else rather than remove it 
completely.

> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -444,6 +444,10 @@ InitProcess(void)
>      OwnLatch(&MyProc->procLatch);
>      SwitchToSharedLatch();
>  
> +    /*We're now ready to accept interrupts from other processes. */
> +    pg_atomic_init_u32(&MyProc->pending_interrupts, 0);
> +    SwitchToSharedInterrupts();
> +
>      /* now that we have a proc, report wait events to shared memory */
>      pgstat_set_wait_event_storage(&MyProc->wait_event_info);
>  
> @@ -611,6 +615,9 @@ InitAuxiliaryProcess(void)
>      OwnLatch(&MyProc->procLatch);
>      SwitchToSharedLatch();
>  
> +    /* We're now ready to accept interrupts from other processes. */
> +    SwitchToSharedInterrupts();
> +
>      /* now that we have a proc, report wait events to shared memory */
>      pgstat_set_wait_event_storage(&MyProc->wait_event_info);
>  

Is there a reason for the different initialization between a regular 
backend and aux process?

> +/*
> + * Switch to shared memory interrupts.  Other backends can send interrupts
> + * to this one if they know its ProcNumber.
> + */
> +void
> +SwitchToSharedInterrupts(void)
> +{
> +    pg_atomic_fetch_or_u32(&MyProc->pending_interrupts, pg_atomic_read_u32(MyPendingInterrupts));
> +    MyPendingInterrupts = &MyProc->pending_interrupts;
> +}

Hmm, I think there's a race condition here (and similarly in 
SwitchToLocalInterrupts), if the signal handler runs between the 
pg_atomic_fetch_or_u32, and changing MyPendingInterrupts. Maybe 
something like this instead:

MyPendingInterrupts = &MyProc->pending_interrupts;
pg_memory_barrier();
pg_atomic_fetch_or_u32(&MyProc->pending_interrupts, 
pg_atomic_read_u32(LocalPendingInterrupts));

And perhaps this should also clear LocalPendingInterrupts, just to be tidy.

> @@ -138,7 +139,8 @@
>  typedef struct ProcState
>  {
>      /* procPid is zero in an inactive ProcState array entry. */
> -    pid_t        procPid;        /* PID of backend, for signaling */
> +    pid_t        procPid;        /* pid of backend */
> +    ProcNumber    pgprocno;        /* for sending interrupts */
>      /* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
>      int            nextMsgNum;        /* next message number to read */
>      bool        resetState;        /* backend needs to reset its state */

We can easily remove procPid altogether now that we have pgprocno here. 
Similarly with the pid/pgprocno in ReplicationSlot and WalSndState.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Interrupts vs signals

From
Robert Haas
Date:
On Mon, Jul 8, 2024 at 5:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Another approach would be to move the responsibility of background
> worker state notifications out of postmaster completely. When a new
> background worker is launched, the worker process itself could send the
> notification that it has started. And similarly, when a worker exits, it
> could send the notification just before exiting. There's a little race
> condition with exiting: if a process is waiting for the bgworker to
> exit, and launches a new worker immediately when the old one exits,
> there will be a brief period when the old and new process are alive at
> the same time. The old worker wouldn't be doing anything interesting
> anymore since it's exiting, but it still counts towards
> max_worker_processes, so launching the new process might fail because of
> hitting the limit. Maybe we should just bump up max_worker_processes. Or
> postmaster could check PMChildFlags and not count processes that have
> already deregistered from PMChildFlags towards the limit.

I can testify that the current system is the result of a lot of trial
and error. I'm not saying it can't be made better, but my initial
attempts at getting this to work (back in the 9.4 era) resembled what
you proposed here, were consequently a lot simpler than what we have
now, and also did not work. Race conditions like you mention here were
part of that. Another consideration is that fork() can fail, and in
that case, the process that tried to register the new background
worker needs to find out that the background worker won't ever be
starting. Yet another problem is that, even if fork() succeeds, the
new process might fail before it executes any of our code e.g. because
it seg faults very early, a case that actually happened to me -
inadvertently - while I was testing these facilities. I ended up
deciding that we can't rely on the new process to do anything until
it's given us some signal that it is alive and able to carry out its
duties. If it dies before telling us that, or never starts in the
first place, we have to have some other way of finding that out, and
it's difficult to see how that can happen without postmaster
involvement.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Interrupts vs signals

From
Thomas Munro
Date:
On Mon, Jul 8, 2024 at 9:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> The patch actually does both: it still does kill(SIGUSR1) and also sets
> the latch.

Yeah, I had some ideas about supporting old extension code that really
wanted a SIGUSR1, but on reflection, the only reason anyone ever wants
that is so that sigusr1_handler can SetLatch(), which pairs with
WaitLatch() in WaitForBackgroundWorker*().  Let's go all the way and
assume that.

> I think it would be nice if RegisterDynamicBackgroundWorker() had a
> "bool notify_me" argument, instead of requiring the caller to set
> "bgw_notify_pid = MyProcPid" before the call. That's a
> backwards-compatibility break, but maybe we should bite the bullet and
> do it. Or we could do this in RegisterDynamicBackgroundWorker():
>
> if (worker->bgw_notify_pid == MyProcPid)
>      worker->bgw_notify_pgprocno = MyProcNumber;
>
> I think we can forbid setting bgw_notify_pid to anything other than 0 or
> MyProcPid.

Another idea: we could keep the bgw_notify_pid field around for a
while, documented as unused and due to be removed in future.  We could
automatically capture the caller's proc number.  So you get latch
wakeups by default, which I expect many people want, and most people
could cope with even if they don't want them.  If you really really
don't want them, you could set a new flag BGW_NO_NOTIFY.

I have now done this part of the change in a new first patch.  This
particular use of kill(SIGUSR1) is separate from the ProcSignal
removal, it's just that it relies on ProcSignal's handler's default
action of calling SetLatch().  It's needed so the ProcSignal-ectomy
can fully delete sigusr1_handler(), but it's not directly the same
thing, so it seemed good to split the patch.

> A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it
> can do any damage if you called it on a pointer to garbage, except that if
> the pointer itself is bogus, just dereferencing it can cause a
> segfault. So it would be nice to have a version specifically designed
> with that in mind. For example, it could assume that the Latch's pid is
> never legally equal to MyProcPid, because postmaster cannot own any latches.

Yeah I'm starting to think that all we need to do here is range-check
the proc number.  Here's a version that adds:
ProcSetLatch(proc_number).  Another idea would be for SetLatch(latch)
to sanitise the address of a latch, ie its offset and range.
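
Roughly like this (a sketch only; see the attachment for the real
thing):

    /*
     * Set another process's latch, given only a proc number.  Meant to
     * be callable from the postmaster: no pointers from shared memory
     * are trusted, just a range-checked index into the PGPROC array.
     */
    void
    ProcSetLatch(ProcNumber proc_number)
    {
        if (proc_number >= 0 && proc_number < ProcGlobal->allProcCount)
            SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
    }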

What the user really wants to be able to do with this bgworker API, I
think, is to wait on a given handle, which could find a condition
variable + generation in the slot, so that we don't have to register
any proc numbers anywhere until we're actually waiting.  But *clearly*
the postmaster can't use the condition variable API without risking
following corrupted pointers in shared memory.  Whereas AFAICT
ProcSetLatch() from the patched postmaster can't really be corrupted
in any new way that it couldn't already be corrupted in master (where
it runs in the target process), if we're just a bit paranoid about how
we find our way to the latch.

Receiving latch wakeups in the postmaster might be another question,
but I don't think we need to confront that question just yet.

> > -volatile uint32 InterruptHoldoffCount = 0;
> > -volatile uint32 QueryCancelHoldoffCount = 0;
> > -volatile uint32 CritSectionCount = 0;
> > +uint32 InterruptHoldoffCount = 0;
> > +uint32 QueryCancelHoldoffCount = 0;
> > +uint32 CritSectionCount = 0;
>
> I wondered if these are used in PG_TRY-CATCH blocks in a way that would
> still require volatile. I couldn't find any such usage by some quick
> grepping, so I think we're good, but I thought I'd mention it.

Hmm.  Still thinking about this.

> > +/*
> > + * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
> > + * ProcessInterrupts() handle.  These perform work that is safe to run whenever
> > + * interrupts are not "held".  Other kinds of interrupts are only handled at
> > + * more restricted times.
> > + */
> > +#define INTERRUPT_STANDARD_MASK                                                         \
>
> Some interrupts are missing from this mask:
>
> - INTERRUPT_PARALLEL_APPLY_MESSAGE

Oops, that one ^ is a rebasing mistake.  I wrote the ancestor of this
patch in 2021, and that new procsignal arrived in 2023, and I put the
code in to handle it, but I forgot to add it to the mask, which gives
me an idea (see below)...

> - INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
> - INTERRUPT_SINVAL_CATCHUP
> - INTERRUPT_NOTIFY
>
> Is that on purpose?

INTERRUPT_SINVAL_CATCHUP and INTERRUPT_NOTIFY are indeed handled
differently on purpose.  In master, they don't set InterruptPending,
and they are not handled by regular CHECK_FOR_INTERRUPTS() sites, but
in the patch they still need a bit in pending_interrupts, and that is
what that mask hides from CHECK_FOR_INTERRUPTS().  They are checked
explicitly in ProcessClientReadInterrupt().  I think the idea is that
we can't handle sinval at random places because that might create
dangling pointers to cached objects where we don't expect them, and we
can't emit NOTIFY-related protocol messages at random times either.

There is something a little funky about _IDLE_STATS_UPDATE_TIMEOUT,
though.  It has a different scheme for running only when idle, where
if it opts not to do anything, it doesn't consume the interrupt (a
later CFI() will, but the latch is not set so we might sleep).  I was
confused by that.  I think I have changed it to be more faithful to
master's behaviour now.

Hmm, a better terminology for the interrupts that CFI handles would be
s/standard/regular/, so I have changed that.

New idea: it would be less error-prone if we instead had a mask of
these special cases, of which there are now only two.  Tried that way!
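
Ie something like this (sketch):

    /* The only two interrupts not handled by regular CFI(). */
    #define INTERRUPT_SPECIAL_MASK  (INTERRUPT_SINVAL_CATCHUP | INTERRUPT_NOTIFY)
    #define INTERRUPT_REGULAR_MASK  (~INTERRUPT_SPECIAL_MASK)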

> > -/*
> > - * Because backends sitting idle will not be reading sinval events, we
> > - * need a way to give an idle backend a swift kick in the rear and make
> > - * it catch up before the sinval queue overflows and forces it to go
> > - * through a cache reset exercise.  This is done by sending
> > - * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
> > - *
> > - * The signal handler will set an interrupt pending flag and will set the
> > - * processes latch. Whenever starting to read from the client, or when
> > - * interrupted while doing so, ProcessClientReadInterrupt() will call
> > - * ProcessCatchupEvent().
> > - */
> > -volatile sig_atomic_t catchupInterruptPending = false;
>
> Would be nice to move that comment somewhere else rather than remove it
> completely.

OK, I moved it to the top of ProcessCatchupInterrupt().

> > --- a/src/backend/storage/lmgr/proc.c
> > +++ b/src/backend/storage/lmgr/proc.c
> > @@ -444,6 +444,10 @@ InitProcess(void)
> >       OwnLatch(&MyProc->procLatch);
> >       SwitchToSharedLatch();
> >
> > +     /*We're now ready to accept interrupts from other processes. */
> > +     pg_atomic_init_u32(&MyProc->pending_interrupts, 0);
> > +     SwitchToSharedInterrupts();
> > +
> >       /* now that we have a proc, report wait events to shared memory */
> >       pgstat_set_wait_event_storage(&MyProc->wait_event_info);
> >
> > @@ -611,6 +615,9 @@ InitAuxiliaryProcess(void)
> >       OwnLatch(&MyProc->procLatch);
> >       SwitchToSharedLatch();
> >
> > +     /* We're now ready to accept interrupts from other processes. */
> > +     SwitchToSharedInterrupts();
> > +
> >       /* now that we have a proc, report wait events to shared memory */
> >       pgstat_set_wait_event_storage(&MyProc->wait_event_info);
> >
>
> Is there a reason for the different initialization between a regular
> backend and aux process?

No.  But I thought about something else to fix here.  Really we don't
want to switch to shared interrupts until we are ready for CFI() to do
stuff.  I think that should probably be at the places where master
unblocks signals.  Otherwise, for example, if someone sends you an
interrupt while you're starting up, something as innocent as
elog(DEBUG, ...), which reaches CFI(), might try to do things for
which the infrastructure is not yet fully set up, for example
INTERRUPT_BARRIER.

Not done yet, but wanted to share this new version.

> > +/*
> > + * Switch to shared memory interrupts.  Other backends can send interrupts
> > + * to this one if they know its ProcNumber.
> > + */
> > +void
> > +SwitchToSharedInterrupts(void)
> > +{
> > +     pg_atomic_fetch_or_u32(&MyProc->pending_interrupts, pg_atomic_read_u32(MyPendingInterrupts));
> > +     MyPendingInterrupts = &MyProc->pending_interrupts;
> > +}
>
> Hmm, I think there's a race condition here (and similarly in
> SwitchToLocalInterrupts), if the signal handler runs between the
> pg_atomic_fetch_or_u32, and changing MyPendingInterrupts. Maybe
> something like this instead:
>
> MyPendingInterrupts = &MyProc->pending_interrupts;
> pg_memory_barrier();
> pg_atomic_fetch_or_u32(&MyProc->pending_interrupts,
> pg_atomic_read_u32(LocalPendingInterrupts));

Yeah, right, done.

> And perhaps this should also clear LocalPendingInterrupts, just to be tidy.

I used atomic_exchange() to read and clear the bits in one go.
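
So it now looks something like this (sketch of the result, not the
exact patch text):

    void
    SwitchToSharedInterrupts(void)
    {
        /* Redirect new interrupts to shared memory first... */
        MyPendingInterrupts = &MyProc->pending_interrupts;
        pg_memory_barrier();

        /*
         * ... then transfer any bits that arrived in the local word,
         * clearing it in the same atomic operation so nothing can be
         * lost or double-counted.
         */
        pg_atomic_fetch_or_u32(&MyProc->pending_interrupts,
                               pg_atomic_exchange_u32(&LocalPendingInterrupts, 0));
    }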

> > @@ -138,7 +139,8 @@
> >  typedef struct ProcState
> >  {
> >       /* procPid is zero in an inactive ProcState array entry. */
> > -     pid_t           procPid;                /* PID of backend, for signaling */
> > +     pid_t           procPid;                /* pid of backend */
> > +     ProcNumber      pgprocno;               /* for sending interrupts */
> >       /* nextMsgNum is meaningless if procPid == 0 or resetState is true. */
> >       int                     nextMsgNum;             /* next message number to read */
> >       bool            resetState;             /* backend needs to reset its state */
>
> We can easily remove procPid altogether now that we have pgprocno here.

Since other things access those values, I propose to remove them in
separate patches.

> Similarly with the pid/pgprocno in ReplicationSlot and WalSndState.

Same.  Those pids show up in user interfaces, so I think they should
be handled in separate patches.

Note to self: I need to change some pgprocno to proc_number...

The next problems to remove are, I think, the various SIGUSR2, SIGINT,
SIGTERM signals sent by the postmaster.  These should clearly become
SendInterrupt() or ProcSetLatch().  The problem here is that the
postmaster doesn't have the proc numbers yet.  One idea is to teach
the postmaster to assign them!  Not explored yet.

This version is passing on Windows.  I'll create a CF entry.  Still
work in progress!

Attachments

Re: Interrupts vs signals

From
Heikki Linnakangas
Date:
On 10/07/2024 09:48, Thomas Munro wrote:
> On Mon, Jul 8, 2024 at 9:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> The patch actually does both: it still does kill(SIGUSR1) and also sets
>> the latch.
> 
> Yeah, I had some ideas about supporting old extension code that really
> wanted a SIGUSR1, but on reflection, the only reason anyone ever wants
> that is so that sigusr1_handler can SetLatch(), which pairs with
> WaitLatch() in WaitForBackgroundWorker*().  Let's go all the way and
> assume that.

+1

>> I think it would be nice if RegisterDynamicBackgroundWorker() had a
>> "bool notify_me" argument, instead of requiring the caller to set
>> "bgw_notify_pid = MyProcPid" before the call. That's a
>> backwards-compatibility break, but maybe we should bite the bullet and
>> do it. Or we could do this in RegisterDynamicBackgroundWorker():
>>
>> if (worker->bgw_notify_pid == MyProcPid)
>>       worker->bgw_notify_pgprocno = MyProcNumber;
>>
>> I think we can forbid setting bgw_notify_pid to anything other than 0 or
>> MyProcPid.
> 
> Another idea: we could keep the bgw_notify_pid field around for a
> while, documented as unused and due to be removed in future.  We could
> automatically capture the caller's proc number.  So you get latch
> wakeups by default, which I expect many people want, and most people
> could cope with even if they don't want them.  If you really really
> don't want them, you could set a new flag BGW_NO_NOTIFY.

Ok. I was going to say that it feels excessive to change the default 
like that. However, searching for RegisterDynamicBackgroundWorker() on 
github, I can't actually find any callers that don't set bgw_notify_pid. 
So yeah, makes sense.

> I have now done this part of the change in a new first patch.  This
> particular use of kill(SIGUSR1) is separate from the ProcSignal
> removal, it's just that it relies on ProcSignal's handler's default
> action of calling SetLatch().  It's needed so the ProcSignal-ectomy
> can fully delete sigusr1_handler(), but it's not directly the same
> thing, so it seemed good to split the patch.

PostmasterMarkPIDForWorkerNotify() is now unused, which means that 
bgworker_notify is never set and BackgroundWorkerStopNotifications() is 
never called either.

>> A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it
>> can do any damage if you called it on a pointer to garbage, except that if
>> the pointer itself is bogus, just dereferencing it can cause a
>> segfault. So it would be nice to have a version specifically designed
>> with that in mind. For example, it could assume that the Latch's pid is
>> never legally equal to MyProcPid, because postmaster cannot own any latches.
> 
> Yeah I'm starting to think that all we need to do here is range-check
> the proc number.  Here's a version that adds:
> ProcSetLatch(proc_number).  Another idea would be for SetLatch(latch)
> to sanitise the address of a latch, ie its offset and range.

Hmm, I don't think postmaster should trust ProcGlobal->allProcCount either.

> The next problems to remove are, I think, the various SIGUSR2, SIGINT,
> SIGTERM signals sent by the postmaster.  These should clearly become
> SendInterrupt() or ProcSetLatch().

+1

> The problem here is that the
> postmaster doesn't have the proc numbers yet.  One idea is to teach
> the postmaster to assign them!  Not explored yet.

I've been thinking that we should:

a) assign every child process a PGPROC entry, and make postmaster 
responsible for assigning them like you suggest. We'll need more PGPROC 
entries, because currently a process doesn't reserve one until 
authentication has happened. Or we change that behavior.

or

b) Use the pmsignal.c slot numbers for this, instead of ProcNumbers. 
Postmaster already assigns those.

I'm kind of leaning towards b) for now, because it feels like a much 
smaller patch. In the long run, it would be nice if every child process 
had a ProcNumber, though. It was a nice simplification in v17 that we 
don't have separate BackendId and ProcNumber anymore; similarly it would 
be nice to not have separate PMChildSlot and ProcNumber concepts.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Interrupts vs signals

From
Robert Haas
Date:
On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> a) assign every child process a PGPROC entry, and make postmaster
> responsible for assigning them like you suggest. We'll need more PGPROC
> entries, because currently a process doesn't reserve one until
> authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Interrupts vs signals

From
Heikki Linnakangas
Date:
On 29/07/2024 19:56, Robert Haas wrote:
> On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> a) assign every child process a PGPROC entry, and make postmaster
>> responsible for assigning them like you suggest. We'll need more PGPROC
>> entries, because currently a process doesn't reserve one until
>> authentication has happened. Or we change that behavior.
> 
> I wonder how this works right now. Is there something that limits the
> number of authentication requests that can be in flight concurrently,
> or is it completely uncapped (except by machine resources)?

> My first thought when I read this was that it would be bad to have to
> put a limit on something that's currently unlimited. But then I
> started thinking that, even if it is currently unlimited, that might
> be a bug rather than a feature. If you have hundreds of pending
> authentication requests, that just means you're using a lot of machine
> resources on something that doesn't really help anybody. A machine
> with hundreds of authentication-pending connections is possibly
> getting DDOS'd and probably getting buried. You'd be better off
> focusing the machine's limited resources on the already-established
> connections and a more limited number of new connection attempts. If
> you accept so many connection attempts that you don't actually have
> enough memory/CPU/kernel scheduling firepower to complete the
> authentication process with any of them, it does nobody any good.
> 
> I'm not sure what's best to do here; just thinking out loud.

Yes, there's a limit, roughly 2x max_connections; see 
MaxLivePostmasterChildren().

There's another issue with that, which I was about to post in another 
thread, but here goes: the MaxLivePostmasterChildren() limit is shared 
by all regular backends, bgworkers and autovacuum workers. If you open a 
lot of TCP connections to postmaster and don't send anything to the 
server, you exhaust those slots, and the server won't be able to start 
any autovacuum workers or background workers either. That's not great. I 
started to work on approach b), with separate pools of slots for 
different kinds of child processes, which fixes that. Stay tuned for a 
patch.

In addition to that, you can have an unlimited number of "dead-end" 
backends, which are doomed to just respond with "sorry, too many 
clients" error. The only limit on those is the amount of resources 
needed for all the processes and a little memory to track them.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Interrupts vs signals

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I wonder how this works right now. Is there something that limits the
> number of authentication requests that can be in flight concurrently,
> or is it completely uncapped (except by machine resources)?

The former.  IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

            regards, tom lane



Re: Interrupts vs signals

From
Robert Haas
Date:
On Mon, Jul 29, 2024 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I wonder how this works right now. Is there something that limits the
> > number of authentication requests that can be in flight concurrently,
> > or is it completely uncapped (except by machine resources)?
>
> The former.  IIRC, the postmaster won't spawn more than 2X max_connections
> subprocesses (don't recall the exact limit, but it's around there).

Hmm. Not to sidetrack this thread too much, but multiplying by two
doesn't really sound like the right idea to me. The basic idea
articulated in the comment for canAcceptConnections() makes sense:
some backends might fail authentication, or might be about to exit, so
it makes sense to allow for some slop. But 2X is a lot of slop even on
a machine with the default max_connections=100, and people with
connection management problems are likely to be running with
max_connections=500 or max_connections=900 or even (insanely)
max_connections=2000. Continuing with a connection attempt because we
think that hundreds or thousands of connections that are ahead of us
in the queue might clear out of the way before we need a PGPROC is not
a good bet.

I wonder if we ought to restrict this to a small, flat value, like say
50, or by a new GUC that defaults to such a value if a constant seems
problematic. Maybe it doesn't really matter. I'm not sure how much
work we'd save by booting out the doomed connection attempt earlier.

The unlimited number of dead-end backends doesn't sound too great
either. I don't have another idea, but presumably resisting DDOS
attacks and/or preserving resources for things that still have a
chance of working ought to take priority over printing a nicer error
message from a connection that's doomed to fail anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Interrupts vs signals

From
Heikki Linnakangas
Date:
On 10/07/2024 09:48, Thomas Munro wrote:
> The next problems to remove are, I think, the various SIGUSR2, SIGINT,
> SIGTERM signals sent by the postmaster.  These should clearly become
> SendInterrupt() or ProcSetLatch().  The problem here is that the
> postmaster doesn't have the proc numbers yet.  One idea is to teach
> the postmaster to assign them!  Not explored yet.

With my latest patches on the "Refactoring postmaster's code to cleanup 
after child exit" thread [1], every postmaster child process is assigned 
a slot in the pmsignal.c array, including all the aux processes. If we 
moved 'pending_interrupts' and the process Latch to the pmsignal.c 
array, then you could send an interrupt also to a process that doesn't 
have a PGPROC entry. That includes dead-end backends, backends that are 
still in authentication, and the syslogger.

That would also make it so that the postmaster would never need to poke 
into the procarray. pmsignal.c is already designated as the limited 
piece of shared memory that is accessed by the postmaster 
(BackgroundWorkerSlots is the other exception), so it would be kind of 
nice if all the information that the postmaster needs to send an 
interrupt was there. That would mean that where you currently use a 
ProcNumber to identify a process, you'd use an index into the 
PMSignalState array instead.

I don't insist on changing that right now, I think this patch is OK as 
it is, but that might be a good next step later.

[1] 
https://www.postgresql.org/message-id/8f2118b9-79e3-4af7-b2c9-bd5818193ca4%40iki.fi

I'm also wondering about the relationship between interrupts and 
latches. Currently, SendInterrupt sets a latch to wake up the target 
process. I wonder if it should be the other way 'round? Move all the 
wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in 
SetLatch, call SendInterrupt to wake up the target process? Somehow that 
feels more natural to me, I think.
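
Something like this, I mean (hypothetical sketch; Latch has no owner
proc number field in this form today):

    void
    SetLatch(Latch *latch)
    {
        /* A latch wakeup becomes just another interrupt bit. */
        SendInterrupt(INTERRUPT_LATCH_SET, latch->owner_procno);
    }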

> This version is passing on Windows.  I'll create a CF entry.  Still
> work in progress!

Some comments on the patch details:

>          ereport(WARNING,
>                  (errmsg("NOTIFY queue is %.0f%% full", fillDegree * 100),
> -                 (minPid != InvalidPid ?
> -                  errdetail("The server process with PID %d is among those with the oldest transactions.", minPid)
> +                 (minPgprocno != INVALID_PROC_NUMBER ?
> +                  errdetail("The server process with pgprocno %d is among those with the oldest transactions.", minPgprocno)
>                    : 0),
> -                 (minPid != InvalidPid ?
> +                 (minPgprocno != INVALID_PROC_NUMBER ?
>                    errhint("The NOTIFY queue cannot be emptied until that process ends its current transaction.")
>                    : 0)));

This makes the message less useful to the user, as the ProcNumber isn't 
exposed to users. With the PID, you can do "pg_terminate_backend(pid)".

> diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
> index c42742d2c7b..bfb89049020 100644
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -18,6 +18,7 @@
>
>  #include "foreign/fdwapi.h"
>  #include "miscadmin.h"
> +#include "postmaster/interrupt.h"
>  #include "nodes/extensible.h"
>  #include "optimizer/appendinfo.h"
>  #include "optimizer/clauses.h"

misordered

> +     * duplicated interrupts later if we switch backx.

typo: backx -> back

> -    if (IdleInTransactionSessionTimeoutPending)
> +    if (ConsumeInterrupt(INTERRUPT_IDLE_TRANSACTION_TIMEOUT))
>      {
>          /*
>           * If the GUC has been reset to zero, ignore the signal.  This is
> @@ -3412,7 +3361,6 @@ ProcessInterrupts(void)
>           * interrupt.  We need to unset the flag before the injection point,
>           * otherwise we could loop in interrupts checking.
>           */
> -        IdleInTransactionSessionTimeoutPending = false;
>          if (IdleInTransactionSessionTimeout > 0)
>          {
>              INJECTION_POINT("idle-in-transaction-session-timeout");

The "We need to unset the flag.." comment is a bit out of place now, 
since the flag was already cleared by ConsumeInterrupt(). Same in the 
INTERRUPT_TRANSACTION_TIMEOUT and INTERRUPT_IDLE_SESSION_TIMEOUT 
handling after this.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Interrupts vs signals

From
Thomas Munro
Date:
On Sun, Aug 25, 2024 at 5:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 07/08/2024 17:59, Heikki Linnakangas wrote:
> > I'm also wondering about the relationship between interrupts and
> > latches. Currently, SendInterrupt sets a latch to wake up the target
> > process. I wonder if it should be the other way 'round? Move all the
> > wakeup code, with the signalfd, the self-pipe etc to interrupt.c, and in
> > SetLatch, call SendInterrupt to wake up the target process? Somehow that
> > feels more natural to me, I think.
>
> I explored that a little, see attached patch set. It's going towards the
> same end state as your patches, I think, but it starts from different
> angle. In a nutshell:
>
> Remove Latch as an abstraction, and replace all use of Latches with
> Interrupts. When I originally created the Latch abstraction, I imagined
> that we would have different latches for different purposes, but in
> reality, almost all code just used the general-purpose "process latch".
> This patch accepts that reality and replaces the Latch struct directly
> with the interrupt mask in PGPROC.

Some very initial reactions:

* I like it!

* This direction seems to fit quite nicely with future ideas about
asynchronous network I/O.  That may sound unrelated, but imagine that
a future version of WaitEventSet is built on Linux io_uring (or
Windows iorings, or Windows IOCP, or kqueue), and waits for the kernel
to tell you that network data has been transferred directly into a
user space buffer.  You could wait for the interrupt word to change at
the same time by treating it as a futex[1].  Then all that other stuff
-- signalfd, is_set, maybe_sleeping -- just goes away, and all we have
left is one single word in memory.  (That it is possible to do that is
not really a coincidence, as our own Mr Freund asked Mr Axboe to add
it[2].  The existing latch implementation techniques could be used as
fallbacks, but when looked at from the right angle, once you squish
all the wakeup reasons into a single word, it's all just an
implementation of a multiplexable futex with extra steps.)
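
To make that concrete, the fallback wait could be as simple as this
Linux-only sketch; FUTEX_WAIT returns immediately if *word != observed,
so there is no lost-wakeup race as long as senders do a FUTEX_WAKE
after changing the word:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void
    WaitForInterruptWord(uint32 *word, uint32 observed)
    {
        /* Sleep until *word is no longer 'observed' (or spurious wakeup). */
        syscall(SYS_futex, word, FUTEX_WAIT, observed, NULL, NULL, 0);
    }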

* Speaking of other problems in other threads that might be solved by
this redesign, I think I see the outline of some solutions to the
problem of different classes of wakeup which you can handle at
different times, using masks.  There is a tension in a few places
where we want to handle some kind of interrupts but not others in
localised wait points, which we sort of try to address by holding
interrupts or holding cancel interrupts, but it's not satisfying and
there are some places where it doesn't work well.  Needs a lot more
thought, but a basic step would be: after old_interrupt_vector =
pg_atomic_fetch_or_u32(interrupt_vector, new_bits), if
(old_interrupt_vector & new_bits) == new_bits, then you didn't
actually change any bits, so you probably don't really need to wake
the other backend.  If someone is currently unable to handle that type
of interrupt (has ignored, ie not cleared, those bits) or is already
in the process of handling it (is currently being rescheduled but
hasn't cleared those bits yet), then you don't bother to wake it up.
Concretely, it could mean that we avoid some of the useless wakeup
storm problems we see in vacuum delays or while executing a query and
not in a good place to handle sinval wakeups, etc.  These are just
some raw thoughts, I am not sure about the bigger picture of that
topic yet.
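
In code form, the test I mean is just (sketch):

    uint32 old_bits = pg_atomic_fetch_or_u32(interrupt_vector, new_bits);

    if ((old_bits & new_bits) != new_bits)
        WakeupRecipient();      /* hypothetical; skipped if no bit changed */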

* Archeological note on terminology: the reason almost every relational
database and all the literature uses the term "latch" for something
like our LWLocks seems to be that latches were/are one of the kinds of
system-provided mutex on IBM System/370 and modern descendants, ie
z/OS.  Oracle and other systems that started as knock-offs of the IBM
System R (the original SQL system, of which DB2 is the modern heir)
continued that terminology, even though they ran on VMS or Unix or
whatever.  I would not be sad if we removed our unusual use of the
term latch.

[1] https://man7.org/linux/man-pages/man3/io_uring_prep_futex_wait.3.html
[2] https://lore.kernel.org/lkml/20230720221858.135240-1-axboe@kernel.dk/



Re: Interrupts vs signals

From
Thomas Munro
Date:
On Sat, Aug 31, 2024 at 10:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > * This direction seems to fit quite nicely with future ideas about
> > asynchronous network I/O.  That may sound unrelated, but imagine that
> > a future version of WaitEventSet is built on Linux io_uring (or
> > Windows iorings, or Windows IOCP, or kqueue), and waits for the kernel
> > to tell you that network data has been transferred directly into a
> > user space buffer.  You could wait for the interrupt word to change at
> > the same time by treating it as a futex[1].  Then all that other stuff
> > -- signalfd, is_set, maybe_sleeping -- just goes away, and all we have
> > left is one single word in memory.  (That it is possible to do that is
> > not really a coincidence, as our own Mr Freund asked Mr Axboe to add
> > it[2].  The existing latch implementation techniques could be used as
> > fallbacks, but when looked at from the right angle, once you squish
> > all the wakeup reasons into a single word, it's all just an
> > implementation of a multiplexable futex with extra steps.)
>
> Cool

Just by the way, speaking of future tricks and the connections between
this code and other problems in other threads, I wanted to mention
that the above thought is also connected to CF #3998.  When I started
working on this, in parallel I had an experimental patch set using
futexes[1] (back then, to try out futexes, I had to patch my OS[2]
because Linux couldn't multiplex them yet, and macOS/*BSD had
something sort of vaguely similar but effectively only usable between
threads in one process).  I planned to switch to waiting directly on
the interrupt vector as a futex when bringing that idea together with
the one in this thread, but I guess I assumed we had to keep latches
too since they seemed like such a central concept in PostgreSQL.  Your
idea seems much better, the more I think about it, but maybe only the
inventor of latches could have the idea of blowing them up :-)
Anyway, in that same experiment I realised I could wake multiple
backends in one system call, which led to more discoveries about the
negative interactions between latches and locks, and begat CF #3998
(SetLatches()).   By way of excuse, unfortunately I got blocked in my
progress on interrupt vectors for a couple of release cycles by the
recovery conflict system, a set of procsignals that were not like the
others, and turned out to be broken more or less as a result.  That
was tricky to fix (CF #3615), leading to journeys into all kinds of
strange places like the regex code...

[1] https://github.com/macdice/postgres/commits/kqueue-usermem/
[2] https://reviews.freebsd.org/D37102