Re: [BUGS] 9.6rc1 Background worker starting multiple times

Поиск
Список
Период
Сортировка
От Keith Fiske
Тема Re: [BUGS] 9.6rc1 Background worker starting multiple times
Дата
Msg-id CAG1_KcA==oju1cYnN36_ZWZ4CbmkeM485qRVDPzogPYoA=HfjQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: 9.6rc1 Background worker starting multiple times  (Keith Fiske <keith@omniti.com>)
Ответы Re: [BUGS] 9.6rc1 Background worker starting multiple times  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-bugs

On Tue, Sep 27, 2016 at 11:31 AM, Keith Fiske <keith@omniti.com> wrote:



On Thu, Sep 22, 2016 at 4:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 15, 2016 at 6:17 PM, Keith Fiske <keith@omniti.com> wrote:
> Running into an issue with pg_partman's (version 2.6.0) background worker
> process with 9.6rc1.
>
> https://github.com/keithf4/pg_partman/tree/master/test/test_bgw
>
> Noticed the issue when a user ran the pgtap tests against it and I'm seeing
> the same thing as well. It's working fine with 9.4 and 9.5. Looking in the
> postgresql logs it seems to be trying to start the initial background worker
> process multiple times and then one of them fails to start properly which
> then causes issues when maintenance tries to run. I've attached the
> postgresql log file from when 9.5 starts up
> (95_postgresql-2016-09-15_180306.log) and things are working normally and
> another log file from 9.6 (96_postgresql-2016-09-15_180131.log) that shows
> the odd startup behavior

I don't know what's happening here exactly, but I think your statement
that it is trying to start the same background worker multiple times
is not accurate.  There are a series of log entries like this:

2016-09-15 18:03:06 EDT [] [7688]: [8-1] user=,db=,e=00000 DEBUG:
registering background worker "pg_partman dynamic background worker
(dbname=keith)"
2016-09-15 18:03:06 EDT [] [7688]: [9-1] user=,db=,e=00000 DEBUG:
starting background worker process "pg_partman dynamic background
worker (dbname=keith)"
2016-09-15 18:03:06 EDT [] [7694]: [1-1] user=,db=,e=00000 LOG:
autovacuum launcher started
2016-09-15 18:03:06 EDT [] [7697]: [1-1] user=,db=,e=00000 DEBUG:
Before parsing dbname GUC in dynamic main func: keith
2016-09-15 18:03:06 EDT [] [7697]: [2-1] user=,db=,e=00000 DEBUG:  GUC
rawstring copy: keith
2016-09-15 18:03:06 EDT [] [7697]: [3-1] user=,db=,e=00000 DEBUG:
Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-09-15 18:03:06 EDT [] [7697]: [4-1] user=,db=,e=00000 DEBUG:
Parsing list: keith (0)
2016-09-15 18:03:06 EDT [] [7697]: [5-1] user=,db=,e=00000 DEBUG:
Before run_maint initialize connection for db keith
2016-09-15 18:03:06 EDT [] [7697]: [6-1] user=,db=,e=00000 DEBUG:
After run_maint initialize connection for db keith
2016-09-15 18:03:06 EDT [] [7697]: [7-1] user=,db=,e=00000 DEBUG:
Checking if pg_partman extension is installed in database: keith
2016-09-15 18:03:06 EDT [] [7697]: [8-1] user=,db=,e=00000 LOG:
pg_partman dynamic background worker (dbname=keith) dynamic background
worker initialized with role keith on database keith
2016-09-15 18:03:06 EDT [] [7697]: [9-1] user=,db=,e=00000 LOG:
pg_partman dynamic background worker (dbname=keith): SELECT
partman.run_maintenance(p_analyze := true, p_jobmon := true) called by
role keith on database keith
2016-09-15 18:03:06 EDT [] [7697]: [10-1] user=,db=,e=00000 DEBUG:
pg_partman dynamic BGW shutting down gracefully for database keith.
2016-09-15 18:03:06 EDT [] [7688]: [10-1] user=,db=,e=00000 DEBUG:
worker process: pg_partman dynamic background worker (dbname=keith)
(PID 7697) exited with exit code 0
2016-09-15 18:03:06 EDT [] [7688]: [11-1] user=,db=,e=00000 DEBUG:
unregistering background worker "pg_partman dynamic background worker
(dbname=keith)"

Note that the worker is registered (by something) at the beginning of
that sequence and then unregistered (because it exited with code 0) at
the end of that sequence.  So the trouble isn't that the bgworker
system itself is restarting the worker, but that your code keeps
reregistering it.  I can't guess why your code behaves differently on
9.6 than on 9.5 and prior off the top of my head, but you might want
to try to figure that out.  I don't rule out the possibility that
there's some bug in 9.6 that is tickling your code to behave
differently, but I don't think I want to try to go analyze your code
and figure out what the issue is either...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

I was mistaken about it starting multiple times being the actual problem. I thought the master process was trying to start more than once, but that's working normally. It's the dynamic portion that's starting multiple times, and that's normal, along with the unregistering of it. The master background worker process starts up a new dynamic background worker for each database configured to run pg_partman.

So, someone forked pg_partman and submitted a patch that I thought may be relevant here. Didn't fix things, but may have made the issue clearer.

https://github.com/keithf4/pg_partman/pull/140/

The unclosed if statement was causing a "continue" to always be run which seemed to be causing the eventual conflict of advisory locks at some point. Removing the "continue" caused the conflict to no longer occur, but what now happens is that background worker loop executes extremely fast with no "sleep" state. It seems the latch process to cause the main background worker to wait the given number of seconds between creating the dynamic one(s) is not working. I didn't write this latch wait process myself but took it from the worker_spi example that comes with the pg source. I tested the 9.4 worker_spi I based it on in 9.6, though, and it seems to work fine. The worker_spi has changed a small bit of code in 9.5, but the change it incorporates doesn't seem to have been backpatched to 9.4, which I'm trying to stay compatible with.

Right now, it seems the latch wait is no longer working. I'll continue looking myself and I've also asked the person that submitted that patch if he may be able to look as well. Appreciate your time!

Keith


So trying to dig into this more myself, I added some debugging lines in the attached copy of the bgw file to check the status of whether the latch is getting set/reset properly. You don't even have to configure any partition sets to see this happen, just load the library in postgresql.conf and set a time, database & user with the partman GUCs. Below is from the log file when running 9.6. You can see that something is setting the Latch when the Wait is called causing it to immediately continue to the next loop with no timeout.

2016-12-13 16:16:31 EST [] [6750]: [2-1] user=,db=,e=00000 DEBUG:  Latch status just before for loop: 1
2016-12-13 16:16:31 EST [] [6750]: [3-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-13 16:16:31 EST [] [6750]: [4-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [5-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [6-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-13 16:16:31 EST [] [6741]: [8-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6741]: [9-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [7-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
2016-12-13 16:16:31 EST [] [6747]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started
2016-12-13 16:16:31 EST [] [6751]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-13 16:16:31 EST [] [6751]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-13 16:16:31 EST [] [6751]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:16:31 EST [] [6751]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-13 16:16:31 EST [] [6751]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6751]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6751]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-13 16:16:31 EST [] [6751]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-13 16:16:31 EST [] [6751]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-13 16:16:31 EST [] [6751]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:16:31 EST [] [6741]: [10-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 6751) exited with exit code 0
2016-12-13 16:16:31 EST [] [6741]: [11-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [8-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 1
2016-12-13 16:16:31 EST [] [6750]: [9-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-13 16:16:31 EST [] [6750]: [10-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [11-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-13 16:16:31 EST [] [6750]: [12-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-13 16:16:31 EST [] [6741]: [12-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6741]: [13-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6752]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-13 16:16:31 EST [] [6752]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-13 16:16:31 EST [] [6752]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:16:31 EST [] [6752]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-13 16:16:31 EST [] [6752]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6752]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-13 16:16:31 EST [] [6752]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-13 16:16:31 EST [] [6752]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-13 16:16:31 EST [] [6750]: [13-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
2016-12-13 16:16:31 EST [] [6752]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-13 16:16:31 EST [] [6752]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:16:31 EST [] [6741]: [14-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 6752) exited with exit code 0
2016-12-13 16:16:31 EST [] [6741]: [15-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:16:31 EST [] [6750]: [14-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 1
2016-12-13 16:16:31 EST [] [6750]: [15-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1


However, when running this in 9.5, you can see the latch is never set during the Wait (except when the bgw is first started) and the WaitLatch() is able to use the timeout properly (currently set for 10 seconds)

2016-12-13 16:22:58 EST [] [7743]: [2-1] user=,db=,e=00000 DEBUG:  Latch status just before for loop: 1
2016-12-13 16:22:58 EST [] [7743]: [3-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 1
2016-12-13 16:22:58 EST [] [7743]: [4-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-13 16:22:58 EST [] [7743]: [5-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-13 16:22:58 EST [] [7743]: [6-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-13 16:22:58 EST [] [7735]: [8-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:22:58 EST [] [7735]: [9-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:22:58 EST [] [7743]: [7-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
2016-12-13 16:22:58 EST [] [7741]: [1-1] user=,db=,e=00000 LOG:  autovacuum launcher started
2016-12-13 16:22:58 EST [] [7744]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-13 16:22:58 EST [] [7744]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-13 16:22:58 EST [] [7744]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:22:58 EST [] [7744]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-13 16:22:58 EST [] [7744]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-13 16:22:58 EST [] [7744]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-13 16:22:58 EST [] [7744]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-13 16:22:58 EST [] [7744]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-13 16:22:58 EST [] [7744]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-13 16:22:58 EST [] [7744]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:22:58 EST [] [7735]: [10-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 7744) exited with exit code 0
2016-12-13 16:22:58 EST [] [7735]: [11-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:23:08 EST [] [7743]: [8-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 0
2016-12-13 16:23:08 EST [] [7743]: [9-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 0
2016-12-13 16:23:08 EST [] [7743]: [10-1] user=,db=,e=00000 DEBUG:  Latch status just after reset before interupt check: 0
2016-12-13 16:23:08 EST [] [7743]: [11-1] user=,db=,e=00000 DEBUG:  Latch status just after reset after interupt check: 0
2016-12-13 16:23:08 EST [] [7743]: [12-1] user=,db=,e=00000 DEBUG:  Dynamic bgw launch begun for keith (0)
2016-12-13 16:23:08 EST [] [7735]: [12-1] user=,db=,e=00000 DEBUG:  registering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:23:08 EST [] [7735]: [13-1] user=,db=,e=00000 DEBUG:  starting background worker process "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:23:08 EST [] [7754]: [1-1] user=,db=,e=00000 DEBUG:  Before parsing dbname GUC in dynamic main func: keith
2016-12-13 16:23:08 EST [] [7754]: [2-1] user=,db=,e=00000 DEBUG:  GUC rawstring copy: keith
2016-12-13 16:23:08 EST [] [7754]: [3-1] user=,db=,e=00000 DEBUG:  Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
2016-12-13 16:23:08 EST [] [7754]: [4-1] user=,db=,e=00000 DEBUG:  Parsing list: keith (0)
2016-12-13 16:23:08 EST [] [7754]: [5-1] user=,db=,e=00000 DEBUG:  Before run_maint initialize connection for db keith
2016-12-13 16:23:08 EST [] [7754]: [6-1] user=,db=,e=00000 DEBUG:  After run_maint initialize connection for db keith
2016-12-13 16:23:08 EST [] [7754]: [7-1] user=,db=,e=00000 DEBUG:  Checking if pg_partman extension is installed in database: keith
2016-12-13 16:23:08 EST [] [7743]: [13-1] user=,db=,e=00000 DEBUG:  Latch status just before wait call: 0
2016-12-13 16:23:08 EST [] [7754]: [8-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith) dynamic background worker initialized with role keith on database keith
2016-12-13 16:23:08 EST [] [7754]: [9-1] user=,db=,e=00000 LOG:  pg_partman dynamic background worker (dbname=keith): SELECT "partman".run_maintenance(p_analyze := true, p_jobmon := true) called by role keith on database keith
2016-12-13 16:23:08 EST [] [7754]: [10-1] user=,db=,e=00000 DEBUG:  pg_partman dynamic BGW shutting down gracefully for database keith.
2016-12-13 16:23:08 EST [] [7735]: [14-1] user=,db=,e=00000 DEBUG:  worker process: pg_partman dynamic background worker (dbname=keith) (PID 7754) exited with exit code 0
2016-12-13 16:23:08 EST [] [7735]: [15-1] user=,db=,e=00000 DEBUG:  unregistering background worker "pg_partman dynamic background worker (dbname=keith)"
2016-12-13 16:23:18 EST [] [7743]: [14-1] user=,db=,e=00000 DEBUG:  Latch status after wait call: 0
2016-12-13 16:23:18 EST [] [7743]: [15-1] user=,db=,e=00000 DEBUG:  Latch status inside for loop before reset: 0


I'm not quite sure exactly how to debug this further to see what is actually setting the latch in 9.6.

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [BUGS] Crash with a CUBE query on 9.6
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: [BUGS] 9.6rc1 Background worker starting multiple times