Re: 9.6rc1 Background worker starting multiple times

Поиск
Список
Период
Сортировка
От Keith Fiske
Тема Re: 9.6rc1 Background worker starting multiple times
Дата
Msg-id CAG1_KcCCp9W6A-LynxxH8aawwK1yPJ1as=919KyeTZGP4t6UZg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: 9.6rc1 Background worker starting multiple times  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [BUGS] 9.6rc1 Background worker starting multiple times
Список pgsql-bugs
On Thu, Sep 22, 2016 at 4:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Sep 15, 2016 at 6:17 PM, Keith Fiske <keith@omniti.com> wrote:
> > Running into an issue with pg_partman's (version 2.6.0) background worker
> > process with 9.6rc1.
> >
> > https://github.com/keithf4/pg_partman/tree/master/test/test_bgw
> >
> > Noticed the issue when a user ran the pgtap tests against it and I'm
> seeing
> > the same thing as well. It's working fine with 9.4 and 9.5. Looking in
> the
> > postgresql logs it seems to be trying to start the initial background
> worker
> > process multiple times and then one of them fails to start properly which
> > then causes issues when maintenance tries to run. I've attached the
> > postgresql log file from when 9.5 starts up
> > (95_postgresql-2016-09-15_180306.log) and things are working normally
> and
> > another log file from 9.6 (96_postgresql-2016-09-15_180131.log) that
> shows
> > the odd startup behavior
>
> I don't know what's happening here exactly, but I think your statement
> that it is trying to start the same background worker multiple times
> is not accurate.  There are a series of log entries like this:
>
> 2016-09-15 18:03:06 EDT [] [7688]: [8-1] user=,db=,e=00000 DEBUG:
> registering background worker "pg_partman dynamic background worker
> (dbname=keith)"
> 2016-09-15 18:03:06 EDT [] [7688]: [9-1] user=,db=,e=00000 DEBUG:
> starting background worker process "pg_partman dynamic background
> worker (dbname=keith)"
> 2016-09-15 18:03:06 EDT [] [7694]: [1-1] user=,db=,e=00000 LOG:
> autovacuum launcher started
> 2016-09-15 18:03:06 EDT [] [7697]: [1-1] user=,db=,e=00000 DEBUG:
> Before parsing dbname GUC in dynamic main func: keith
> 2016-09-15 18:03:06 EDT [] [7697]: [2-1] user=,db=,e=00000 DEBUG:  GUC
> rawstring copy: keith
> 2016-09-15 18:03:06 EDT [] [7697]: [3-1] user=,db=,e=00000 DEBUG:
> Entered foreach loop: name (keith), db_main_counter (0), dbcounter (0)
> 2016-09-15 18:03:06 EDT [] [7697]: [4-1] user=,db=,e=00000 DEBUG:
> Parsing list: keith (0)
> 2016-09-15 18:03:06 EDT [] [7697]: [5-1] user=,db=,e=00000 DEBUG:
> Before run_maint initialize connection for db keith
> 2016-09-15 18:03:06 EDT [] [7697]: [6-1] user=,db=,e=00000 DEBUG:
> After run_maint initialize connection for db keith
> 2016-09-15 18:03:06 EDT [] [7697]: [7-1] user=,db=,e=00000 DEBUG:
> Checking if pg_partman extension is installed in database: keith
> 2016-09-15 18:03:06 EDT [] [7697]: [8-1] user=,db=,e=00000 LOG:
> pg_partman dynamic background worker (dbname=keith) dynamic background
> worker initialized with role keith on database keith
> 2016-09-15 18:03:06 EDT [] [7697]: [9-1] user=,db=,e=00000 LOG:
> pg_partman dynamic background worker (dbname=keith): SELECT
> partman.run_maintenance(p_analyze := true, p_jobmon := true) called by
> role keith on database keith
> 2016-09-15 18:03:06 EDT [] [7697]: [10-1] user=,db=,e=00000 DEBUG:
> pg_partman dynamic BGW shutting down gracefully for database keith.
> 2016-09-15 18:03:06 EDT [] [7688]: [10-1] user=,db=,e=00000 DEBUG:
> worker process: pg_partman dynamic background worker (dbname=keith)
> (PID 7697) exited with exit code 0
> 2016-09-15 18:03:06 EDT [] [7688]: [11-1] user=,db=,e=00000 DEBUG:
> unregistering background worker "pg_partman dynamic background worker
> (dbname=keith)"
>
> Note that the worker is registered (by something) at the beginning of
> that sequence and then unregistered (because it exited with code 0) at
> the end of that sequence.  So the trouble isn't that the bgworker
> system itself is restarting the worker, but that your code keeps
> reregistering it.  I can't guess why your code behaves differently on
> 9.6 than on 9.5 and prior off the top of my head, but you might want
> to try to figure that out.  I don't rule out the possibility that
> there's some bug in 9.6 that is tickling your code to behave
> differently, but I don't think I want to try to go analyze your code
> and figure out what the issue is either...
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

I was mistaken about it starting multiple times being the actual problem. I
thought the master process was trying to start more than once, but that's
working normally. It's the dynamic portion that's starting multiple times,
and that's normal, along with the unregistering of it. The master
background worker process starts up a new dynamic background worker for
each database configured to run pg_partman.

So, someone forked pg_partman and submitted a patch that I thought may be
relevant here. Didn't fix things, but may have made the issue clearer.

https://github.com/keithf4/pg_partman/pull/140/

The unclosed if statement was causing a "continue" to always be run which
seemed to be causing the eventual conflict of advisory locks at some point.
Removing the "continue" caused the conflict to no longer occur, but what
now happens is that background worker loop executes extremely fast with no
"sleep" state. It seems the latch process to cause the main background
worker to wait the given number of seconds between creating the dynamic
one(s) is not working. I didn't write this latch wait process myself but
took it from the worker_spi example that comes with the pg source. I tested
the 9.4 worker_spi I based it on in 9.6, though, and it seems to work fine.
The worker_spi has changed a small bit of code in 9.5, but the change it
incorporates doesn't seem to have been backpatched to 9.4, which I'm trying
to stay compatible with.

Right now, it seems the latch wait is no longer working. I'll continue
looking myself and I've also asked the person that submitted that patch if
he may be able to look as well. Appreciate your time!

Keith

В списке pgsql-bugs по дате отправления:

Предыдущее
От: amitraj@zebra.com
Дата:
Сообщение: BUG #14342: Please restore pgdg-redhat92-9.2-7.noarch.rpm
Следующее
От: Keith Fiske
Дата:
Сообщение: Re: BUG #14342: Please restore pgdg-redhat92-9.2-7.noarch.rpm