Re: [BUGS] 9.6rc1 Background worker starting multiple times

From Keith Fiske
Subject Re: [BUGS] 9.6rc1 Background worker starting multiple times
Date
Msg-id CAG1_KcAw=Gacd0JxKOunXnj4yNFaNs6XHX_RtUk0Vsh1t1uB6g@mail.gmail.com
Whole thread Raw
In response to Re: [BUGS] 9.6rc1 Background worker starting multiple times  (Keith Fiske <keith@omniti.com>)
Responses Re: [BUGS] 9.6rc1 Background worker starting multiple times  (Keith Fiske <keith@omniti.com>)
List pgsql-bugs

On Fri, Dec 23, 2016 at 4:18 PM, Keith Fiske <keith@omniti.com> wrote:

On Tue, Dec 20, 2016 at 12:04 PM, Keith Fiske <keith@omniti.com> wrote:


On Tue, Dec 20, 2016 at 10:26 AM, Keith Fiske <keith@omniti.com> wrote:


On Tue, Dec 20, 2016 at 1:42 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Dec 20, 2016 at 6:47 AM, Keith Fiske <keith@omniti.com> wrote:
> I'm not quite sure exactly how to debug this further to see what is actually
> setting the latch in 9.6.

The commit you are complaining about here is likely this one:
commit: db0f6cad4884bd4c835156d3a720d9a79dbd63a9
author: Robert Haas <rhaas@postgresql.org>
date: Fri, 9 Oct 2015 14:31:04 -0400
Remove set_latch_on_sigusr1 flag.

This flag has proven to be a recipe for bugs, and it doesn't seem like
it can really buy anything in terms of performance.  So let's just
*always* set the process latch when we receive SIGUSR1 instead of
trying to do it only when needed.

So your process gets a SIGUSR1 and its latch is set, causing the loop
to repeat... I would not treat that as a backend bug, because the
removal of set_latch_on_sigusr1 was a good move.
--
Michael
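
For reference, the loop Michael is referring to follows the usual bgworker pattern. A simplified sketch (the got_* flags and naptime_secs are placeholder names, not the exact pg_partman code):

    for (;;)
    {
        int rc;

        /* Clear any pending wake-up before doing this cycle's work. */
        ResetLatch(MyLatch);

        if (got_sigterm)
            break;
        if (got_sighup)
        {
            got_sighup = false;
            ProcessConfigFile(PGC_SIGHUP);
        }

        /* ... register the dynamic worker, wait for it, run maintenance ... */

        /*
         * In 9.6, any SIGUSR1 delivered since the ResetLatch() above has
         * already set the latch again, so this returns immediately instead
         * of sleeping for the configured naptime.
         */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       naptime_secs * 1000L);
        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);
    }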

Anything I can do in the meantime to fix this, or do I have to wait for a bugfix patch?

Keith

So, attached is my naive attempt to handle SIGUSR1. I can at least confirm this is the source of the problem, but I'm unable to figure out how to work around it. Handling the signal does in fact stop the background worker from going nuts, but it also completely stops the maintenance loop from working properly: it runs once, then never again. It seems to get stuck after the call to WaitForBackgroundWorkerStartup() and never even gets to my WaitLatch() call. The logs show it successfully unregisters the dynamic BGW, but it never seems to go further than that. I tried adding latch_sigusr1_handler() similar to how the handler in procsignal.c does it, but that didn't seem to fix it. With some additional debugging added, it doesn't even appear to be calling my custom signal handler function at all; got_sigusr1 is never set to true.

The simple act of registering a handler for SIGUSR1 seems to stop the original bug. If I comment out just this line in pg_partman_bgw_main() (the one in pg_partman_bgw_run_maint() seems to make no difference):

    pqsignal(SIGUSR1, pg_partman_bgw_sigusr1);

it goes back to doing what it was doing before.

I'm guessing there's a lot more to SIGUSR1 than I can see right now, but I figured I'd at least try to provide some additional investigation. I noticed the BGW code mentions using SIGUSR1 on start/stop:

https://doxygen.postgresql.org/bgworker_8h_source.html#l00099

Perhaps that's part of the issue when I try to handle it myself?
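
Roughly, the handler I was attempting looks like this (a sketch, not the attached patch verbatim; the latch_sigusr1_handler() call is the variation mentioned above, and it still wasn't enough here):

    static volatile sig_atomic_t got_sigusr1 = false;

    /*
     * SIGUSR1 handler sketch: set a flag for the main loop and chain to
     * latch_sigusr1_handler(), since the default bgworker SIGUSR1 handler
     * being replaced here is, as far as I can tell, also what wakes up
     * WaitLatch() and WaitForBackgroundWorkerStartup().
     */
    static void
    pg_partman_bgw_sigusr1(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        got_sigusr1 = true;
        latch_sigusr1_handler();

        errno = save_errno;
    }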

The partman maintenance timer is set to 10 seconds, so the timestamps on the log entries show it's not running on time.

2016-12-20 11:56:46 EST [] [14486]: [7-1] user=,db=,e=00000 LOG:  database system is ready to accept connections
2016-12-20 11:56:46 EST [] [14495]: [1-1] user=,db=,e=00000 LOG:  pg_partman master background worker master process initialized with role keith
2016-12-20 11:56:46 EST [] [14495]: [2-1] user=,db=,e=00000 DEBUG:  Latch status just before for loop: 1
2016-12-20 11:56:46 EST [] [14495]: [3-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-20 11:56:46 EST [] [14495]: [4-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [5-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-20 11:56:46 EST [] [14495]: [6-1] user=,db=,e=00000 DEBUG:  After sighup check: 0
2016-12-20 11:56:46 EST [] [14495]: [7-1] user=,db=,e=00000 DEBUG:  After sigterm check: 0
2016-12-20 11:56:46 EST [] [14495]: [8-1] user=,db=,e=00000 DEBUG:  Before sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [9-1] user=,db=,e=00000 DEBUG:  After sigusr1 check: 0
2016-12-20 11:56:46 EST [] [14495]: [10-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-20 11:56:46 EST [] [14495]: [11-1] user=,db=,e=00000 DEBUG:  Registering dynamic background worker...
2016-12-20 11:56:46 EST [] [14495]: [12-1] user=,db=,e=00000 DEBUG:  Waiting for BGW startup...
2016-12-20 11:56:46 EST [] [14486]: [8-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14486]: [9-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:56:46 EST [] [14493]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started
2016-12-20 11:56:46 EST [] [14496]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-20 11:56:46 EST [] [14496]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-20 11:56:46 EST [] [14496]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-20 11:56:46 EST [] [14496]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-20 11:56:46 EST [] [14496]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-20 11:56:46 EST [] [14496]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-20 11:56:46 EST [] [14496]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-20 11:56:46 EST [] [14496]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-20 11:56:46 EST [] [14486]: [10-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 14496) exited with exit code 0
2016-12-20 11:56:46 EST [] [14486]: [11-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-20 11:57:16 EST [] [14506]: [1-1] user=,db=,e=00000 DEBUG:  autovacuum: processing database "keith"
2016-12-20 11:57:46 EST [] [14515]: [1-1] user=,db=,e=00000 DEBUG:  autovacuum: processing database "postgres"


I found this pull request in pg_cron where they suggested using WaitLatch, similar to what I was doing in pg_partman:

https://github.com/citusdata/pg_cron/pull/15

I decided to apply that patch and see if it worked. After fixing the missing WL_TIMEOUT flag, everything was working fine and that extension didn't seem to have the same problem pg_partman did. So I went back to adding some more debugging output, and my suspicion about the use of SIGUSR1 on background worker start/stop seems justified (thanks to whoever added that comment).
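
For anyone following along, the missing WL_TIMEOUT flag matters because WaitLatch() only honors its timeout argument when that flag is included in wakeEvents. The wait needs to look something like this (timeout is in milliseconds; naptime_secs is a placeholder):

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   naptime_secs * 1000L);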

I noticed the timing of things in the log (see below): the dynamic background worker is unregistered just before WaitLatch() is called. So perhaps it's the unregistering of the dynamic BGW that the main BGW process had launched that's setting the latch...

Sure enough, I use bgw_notify_pid so my main BGW process can know when my dynamic worker finishes running. And it seems the postmaster sends SIGUSR1 to do this notification, if I'm reading this right:

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/postmaster/bgworker.c#L276

So now, because SIGUSR1 always sets the latch in 9.6, it doesn't seem like WaitLatch() can be used as a timer when a BGW itself starts other dynamic background workers and relies on being notified when they're done running.

Can anyone confirm/deny what I think I've found here? If this is true, any idea how this can be handled now in 9.6, or is this a bug that needs to be fixed?

2016-12-23 15:31:36 EST [] [30766]: [1-1] user=,db=,e=00000 LOG:  pg_partman master background worker master process initialized with role keith
2016-12-23 15:31:36 EST [] [30766]: [3-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [4-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
[..]
2016-12-23 15:31:36 EST [] [30766]: [14-1] user=,db=,e=00000 DEBUG:  Latch status just after BGW status check: 0
2016-12-23 15:31:36 EST [] [30766]: [15-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
[...]
2016-12-23 15:31:36 EST [] [30758]: [11-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-23 15:31:36 EST [] [30766]: [16-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 1
2016-12-23 15:31:36 EST [] [30766]: [18-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [19-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
[...]
2016-12-23 15:31:36 EST [] [30766]: [29-1] user=,db=,e=00000 DEBUG:  Latch status just after BGW status check: 0
2016-12-23 15:31:36 EST [] [30766]: [30-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
[...]
2016-12-23 15:31:36 EST [] [30758]: [15-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-23 15:31:36 EST [] [30766]: [31-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 1
2016-12-23 15:31:36 EST [] [30766]: [33-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-23 15:31:36 EST [] [30766]: [34-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
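
If this is just how 9.6 behaves now, one workaround would be to stop treating the latch as the timer and track the last run time explicitly, so that an early return from WaitLatch() caused by the bgw_notify_pid SIGUSR1 doesn't kick off another maintenance run. A rough sketch (placeholder variable names, not tested against the real extension):

    TimestampTz last_run = GetCurrentTimestamp();

    for (;;)
    {
        ResetLatch(MyLatch);

        /*
         * Only run maintenance once the full interval has elapsed,
         * regardless of how many times the latch woke us up in between
         * (e.g. from the postmaster's bgw_notify_pid SIGUSR1).
         */
        if (TimestampDifferenceExceeded(last_run, GetCurrentTimestamp(),
                                        naptime_secs * 1000))
        {
            /* ... register the dynamic worker, wait for it, etc. ... */
            last_run = GetCurrentTimestamp();
        }

        /* ... WaitLatch() with WL_TIMEOUT as before, then loop ... */
    }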


Just to help confirm my above assumption, commenting out this single line of code in my main background worker allowed the WaitLatch timeout to work again:

worker.bgw_notify_pid = MyProcPid;

Thankfully, the only thing my background worker does is call another plpgsql function on an interval, and that function takes an advisory lock to prevent concurrent runs. I don't think this bgw_notify_pid was actually preventing concurrent runs before, but if someone actually needed that feedback and used WaitLatch, this could cause it to break. The only reason I had it in there was because I used the worker_spi code as my initial example of how to create a dynamic background worker.
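
For reference, the registration in question is the standard dynamic-worker setup copied from worker_spi, roughly like this (a simplified sketch, not the exact pg_partman code; the library/function names and dbname variable are just what I believe they are here):

    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    pid_t       pid;

    MemSet(&worker, 0, sizeof(BackgroundWorker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_partman_bgw");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "pg_partman_bgw_run_maint");
    snprintf(worker.bgw_name, BGW_MAXLEN,
             "pg_partman dynamic background worker (dbname=%s)", dbname);
    worker.bgw_main_arg = Int32GetDatum(0);

    /*
     * The line in question: this tells the postmaster to send this
     * process SIGUSR1 when the dynamic worker starts and stops.
     */
    worker.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(LOG,
                (errmsg("could not register dynamic background worker")));
    else
        (void) WaitForBackgroundWorkerStartup(handle, &pid);

As far as I can tell, dropping the bgw_notify_pid line just means the postmaster no longer signals this process when the dynamic worker starts or stops, which in my case is fine since the advisory lock already prevents concurrent runs.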

Keith
