Обсуждение: proposal: add 'waiting for replication' to pg_stat_activity.state
Hello Hackers, I recently analyzed an incident where a major lag in synchronous replication blocked a number of synchronous backends. I found myself looking at backends that, according to pg_stat_activity, were neither waiting nor idle but yet they didn't finish their work. As it turns out, the major waiting loop for syncrep updates the processtitle, but is silent within postgres and stat_activity. It seems misleading that commited but waiting backends are 'active' although there is little done apart from waiting. > # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ; > pid | waiting | state | substr > -------+---------+--------+-------- > 26294 | f | active | END; > 26318 | f | active | create > 26323 | f | active | insert > 26336 | f | active | insert (output of waiting statements [vanilla]) While 'active' is technically correct for a backend that is commited but waiting for replication in terms of 'not beeing available for new tasks', it also implies that a backend is dealing with the issue at hand. The remote host however is out of our clusters control, hence all signs should be pointing to the standby-host. I suggest adding a new state to pg_stat_activity.state for backends that are waiting for their synchronous commit to be flushed on the remote host. I chose 'waiting for synchronous replication' for now. One should refrain from the waiting flag at this point as there is no waiting done on internal processes. Instead the backend waits for factors beyond our clusters control to change. > # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ; > pid | waiting | state | substr > ------+---------+-------------------------------------+-------- > 3360 | f | waiting for synchronous replication | END; > 3465 | f | waiting for synchronous replication | create > 3477 | f | waiting for synchronous replication | insert > 3489 | f | waiting for synchronous replication | insert (output of waiting statements [patched]) patch attached regards, Julian Schauder
Вложения
On 3 December 2015 at 04:22, Julian Schauder <julian.schauder@credativ.de> wrote:
I suggest adding a new state to pg_stat_activity.state for backends that are
waiting for their synchronous commit to be flushed on the remote host.
Excellent idea. Anything that improves management and visibility into what the system is doing like this is really valuable.
I notice that you don't set the 'waiting' flag. 'waiting' is presently documented as:
<entry>True if this backend is currently waiting on a lock</entry>
... but I'm inclined to just widen its definition and set it here, since we most certainly are waiting, and the column isn't named 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring queries people have since they generally do an inner join on pg_locks anyway.
There are no test changes in this patch, but that's reasonable because we don't currently have a way to automate tests of sync rep.
I've attached a patch with minor wording/formatting changes, but I think I'd like to change 'waiting' to true as well. Opinions?
Вложения
On 3 December 2015 at 04:22, Julian Schauder <julian.schauder@credativ.de> wrote:
I suggest adding a new state to pg_stat_activity.state for backends that are
waiting for their synchronous commit to be flushed on the remote host.
I chose 'waiting for synchronous replication' for now.
I've added this to the next CF:
On 12/2/15 7:00 PM, Craig Ringer wrote: > I notice that you don't set the 'waiting' flag. 'waiting' is presently > documented as: > > <entry>True if this backend is currently waiting on a lock</entry> > > ... but I'm inclined to just widen its definition and set it here, since > we most certainly are waiting, and the column isn't named > 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring > queries people have since they generally do an inner join on pg_locks > anyway. I'm not so sure about that assumption.
On 3 December 2015 at 09:32, Peter Eisentraut <peter_e@gmx.net> wrote:
On 12/2/15 7:00 PM, Craig Ringer wrote:
> I notice that you don't set the 'waiting' flag. 'waiting' is presently
> documented as:
>
> <entry>True if this backend is currently waiting on a lock</entry>
>
> ... but I'm inclined to just widen its definition and set it here, since
> we most certainly are waiting, and the column isn't named
> 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring
> queries people have since they generally do an inner join on pg_locks
> anyway.
I'm not so sure about that assumption.
Even if it's an outer join, the worst that'll happen is that they'll get entries with nulls in pg_locks. I don't think it's worth worrying about too much.
We could always mitigate it by adding a pg_lock_status view to the system catalogs with a decent canned query over pg_stat_activity and pg_locks, so people can stop copying & pasting from the wiki or using buggy homebrew queries ;)
On Thu, Dec 3, 2015 at 9:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 3 December 2015 at 09:32, Peter Eisentraut <peter_e@gmx.net> wrote:
>>
>> On 12/2/15 7:00 PM, Craig Ringer wrote:
>> > I notice that you don't set the 'waiting' flag. 'waiting' is presently
>> > documented as:
>> >
>> > <entry>True if this backend is currently waiting on a lock</entry>
>> >
>> > ... but I'm inclined to just widen its definition and set it here, since
>> > we most certainly are waiting, and the column isn't named
>> > 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring
>> > queries people have since they generally do an inner join on pg_locks
>> > anyway.
>>
>> I'm not so sure about that assumption.
>
>
> Even if it's an outer join, the worst that'll happen is that they'll get entries with nulls in pg_locks. I don't think it's worth worrying about too much.
>
That can be one way of dealing with it and another is that we
>
> On 3 December 2015 at 09:32, Peter Eisentraut <peter_e@gmx.net> wrote:
>>
>> On 12/2/15 7:00 PM, Craig Ringer wrote:
>> > I notice that you don't set the 'waiting' flag. 'waiting' is presently
>> > documented as:
>> >
>> > <entry>True if this backend is currently waiting on a lock</entry>
>> >
>> > ... but I'm inclined to just widen its definition and set it here, since
>> > we most certainly are waiting, and the column isn't named
>> > 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring
>> > queries people have since they generally do an inner join on pg_locks
>> > anyway.
>>
>> I'm not so sure about that assumption.
>
>
> Even if it's an outer join, the worst that'll happen is that they'll get entries with nulls in pg_locks. I don't think it's worth worrying about too much.
>
That can be one way of dealing with it and another is that we
keep the current column as it is and add another view related
wait stats, anyway we need something like that for other purposes
like lwlock waits etc. Basically something on lines what we
is being discussed in thread [1]
On 3 December 2015 at 22:58, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Dec 3, 2015 at 9:02 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 3 December 2015 at 09:32, Peter Eisentraut <peter_e@gmx.net> wrote:
>>
>> On 12/2/15 7:00 PM, Craig Ringer wrote:
>> > I notice that you don't set the 'waiting' flag. 'waiting' is presently
>> > documented as:
>> >
>> > <entry>True if this backend is currently waiting on a lock</entry>
>> >
>> > ... but I'm inclined to just widen its definition and set it here, since
>> > we most certainly are waiting, and the column isn't named
>> > 'waiting_on_a_lock'. It shouldn't upset various canned lock monitoring
>> > queries people have since they generally do an inner join on pg_locks
>> > anyway.
>>
>> I'm not so sure about that assumption.
>
>
> Even if it's an outer join, the worst that'll happen is that they'll get entries with nulls in pg_locks. I don't think it's worth worrying about too much.
>
That can be one way of dealing with it and another is that wekeep the current column as it is and add another view relatedwait stats, anyway we need something like that for other purposeslike lwlock waits etc. Basically something on lines what weis being discussed in thread [1]
On December 4, 2015 9:48:55 AM GMT+01:00, Craig Ringer <craig@2ndquadrant.com> wrote: >On 3 December 2015 at 22:58, Amit Kapila <amit.kapila16@gmail.com> >wrote: > >> On Thu, Dec 3, 2015 at 9:02 AM, Craig Ringer <craig@2ndquadrant.com> >> wrote: >http://www.postgresql.org/message-id/CAA4eK1+=5Ex8-5NRr3u94=_t2p65v0kcjZ5rXddVmkx=LwAJ+w@mail.gmail.com >> >Good point. I'm not sure this shouldn't set 'waiting' anyway, though. No. Waiting corresponds to pg locks -setting it to true for other things would be even more confusing... Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund wrote: > On December 4, 2015 9:48:55 AM GMT+01:00, Craig Ringer <craig@2ndquadrant.com> wrote: > >On 3 December 2015 at 22:58, Amit Kapila <amit.kapila16@gmail.com> > >wrote: > > > >> On Thu, Dec 3, 2015 at 9:02 AM, Craig Ringer <craig@2ndquadrant.com> > >> wrote: > > >http://www.postgresql.org/message-id/CAA4eK1+=5Ex8-5NRr3u94=_t2p65v0kcjZ5rXddVmkx=LwAJ+w@mail.gmail.com > >> > >Good point. I'm not sure this shouldn't set 'waiting' anyway, though. > > No. Waiting corresponds to pg locks -setting it to true for other things would be even more confusing... Robert's opinion elsewhere is relevant for this patch, and contradicts Andres' opinion here: http://www.postgresql.org/message-id/CA+Tgmoaj+EPoO9qobvFH18F5kJg2tAEXhQ8_-VayWHUr-ZATMw@mail.gmail.com -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 3, 2015 at 1:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 3 December 2015 at 04:22, Julian Schauder <julian.schauder@credativ.de> > wrote: > >> >> I suggest adding a new state to pg_stat_activity.state for backends that >> are >> >> waiting for their synchronous commit to be flushed on the remote host. >> > > Excellent idea. Anything that improves management and visibility into what > the system is doing like this is really valuable. > > I notice that you don't set the 'waiting' flag. 'waiting' is presently > documented as: > > <entry>True if this backend is currently waiting on a lock</entry> > > ... but I'm inclined to just widen its definition and set it here, since we > most certainly are waiting, and the column isn't named 'waiting_on_a_lock'. > It shouldn't upset various canned lock monitoring queries people have since > they generally do an inner join on pg_locks anyway. > > There are no test changes in this patch, but that's reasonable because we > don't currently have a way to automate tests of sync rep. > > I've attached a patch with minor wording/formatting changes, but I think I'd > like to change 'waiting' to true as well. Opinions? A couple of minor things I noticed: + <literal>waiting for synchronous replication</>: The backend is waiting for its transaction to be flushed on a synchronous standby. I wouldn't say "flushed" here. If you set synchronous_commit = remote_write, then it's not actually waiting for it to be flushed (and there may eventually be other levels too). Maybe just "The backend is waiting for a response from a synchronous standby"? +#include <pgstat.h> I think this should use "" instead of <> and should come after #include "miscadmin.h". Thinking of other patches in flight, I think I'd want the proposed N-sync standbys feature to be able to explain in more detail what it's waiting for (and likewise my causal reads proposal which does that via the process title), though I realise that the details of how/where to do that are changing in the "replace pg_stat_activity.waiting" thread which this patch is waiting on. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro wrote: > Thinking of other patches in flight, I think I'd want the proposed > N-sync standbys feature to be able to explain in more detail what it's > waiting for (and likewise my causal reads proposal which does that via > the process title), though I realise that the details of how/where to > do that are changing in the "replace pg_stat_activity.waiting" thread > which this patch is waiting on. Right, interesting thought. I think that for the time being we should close this one as returned with feedback, since there's no point in keeping it open in commitfest. I encourage the author (and everyone else) to look at the other patch and help there, which is going to ultimately achieve the outcome desired with this patch. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services