Discussion: walsender vs. XLogBackgroundFlush during shutdown

walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra
Date:
Hi,

I see there's an ongoing discussion about race conditions in walreceiver
blocking shutdown, so let me start a new thread about a race condition in
walsender during shutdown.

The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
waiting indefinitely for the walsender processes to catch up and terminate.

The reason for that is pretty simple - the walsenders are doing logical
decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
But per XLogSendLogical() this only happens if

    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
    {
        WalSndCaughtUp = true;
        ...
    }

That is, we need to get beyond GetFlushRecPtr(). But that may never
happen, because there may be an incomplete (and unflushed) page in the WAL
buffers, not forced out by any transaction. So if there's a WAL record
overflowing to the unflushed page, the walsender will never catch up.
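
To make the failure mode concrete, here is an illustration with made-up
LSNs (the numbers are hypothetical, only their relationships matter):

    GetFlushRecPtr()         = 0/01570000   (page boundary, already flushed)
    pending WAL record       = 0/0156FFB0 .. 0/01570058
    last page in WAL buffers = 0/01570000 .. 0/01570058 (incomplete, unflushed)

The record ends on the incomplete page, so the walsender can't finish
reading it until the flush pointer advances past 0/01570058. Nothing ever
flushes that partial page (see below), so reader->EndRecPtr stays below
GetFlushRecPtr() and WalSndCaughtUp is never set.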

Now, this situation is apparently expected, because WalSndWaitForWal()
does this:

    /*
     * If we're shutting down, trigger pending WAL to be written out,
     * otherwise we'd possibly end up waiting for WAL that never gets
     * written, because walwriter has shut down already.
     */
    if (got_STOPPING)
        XLogBackgroundFlush();

but unfortunately that does not actually do anything, because right at
the very beginning XLogBackgroundFlush() does this:

    /* back off to last completed page boundary */
    WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

That is, it intentionally ignores the incomplete page, which means the
walsender can't read the record and reach GetFlushRecPtr().

XLogBackgroundFlush() does this since (at least) 2007, so how come we
never had issues with this before?

Firstly, walsenders used for physical replication don't have this issue,
because they don't need to decode the WAL record overflowing to the
incomplete/unflushed page. So it seems only walsenders used for logical
decoding are vulnerable to this.

Secondly, this seems to happen only when a bit of WAL is generated just
at the right moment during shutdown. I initially ran into this issue
with the failover slots patch (i.e. the patch that was committed and then
reverted in the 9.6 cycle, IIRC), which does exactly this - it writes the
slot positions into WAL during shutdown. That made it pretty trivial to
trigger this issue.

But I don't think we're safe without the failover slots patch, because
any output plugin can do something similar - say, LogLogicalMessage() or
something like that. I'm not aware of a plugin doing such things, but I
don't think it's illegal / prohibited either. (Of course, plugins that
generate WAL won't be useful for decoding on standby in the future.)

So what I think we should do is to tweak XLogBackgroundFlush() so that
during shutdown it skips the back-off to page boundary, and flushes even
the last piece of WAL. There are only two callers anyway, so something
like XLogBackgroundFlush(bool) would be simple enough.
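
A minimal sketch of what that could look like inside XLogBackgroundFlush()
(the parameter name is mine, not from an actual patch):

    if (!flush_incomplete_page)
    {
        /* back off to last completed page boundary */
        WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
    }

The walwriter would keep passing false (today's behavior), and
WalSndWaitForWal() would pass true when got_STOPPING is set.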


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: walsender vs. XLogBackgroundFlush during shutdown

From: Alexander Kukushkin
Date:
 Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> I see there's an ongoing discussion about race conditions in walreceiver
> blocking shutdown, so let me start a new thread about a race condition in
> walsender during shutdown.
>
> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
> waiting indefinitely for the walsender processes to catch up and terminate.

I can confirm, during the past couple of years we observed such a
problem a few times and this is really annoying.

> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>     {
>         WalSndCaughtUp = true;
>         ...
>     }

After a couple of days of investigating and debugging I came to a
slightly different conclusion: WalSndCaughtUp is set to true in my
case.
Since I am mostly a python person, I decided to use psycopg2 for my
investigation. I took an example from
http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
as a starting point, created a slot and started the script.
I wasn't running any transactions, therefore DemoConsumer.__call__
was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
was never called either. Basically, the only thing the psycopg2
internals were doing was periodically sending keepalive messages or
replying to keepalives sent by postgres. In the postgres debug log
they are visible as:

2019-05-01 12:58:32.785 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0

If you try to do a fast shutdown of postgres while the script is
running, it will never finish, and in the postgres log you will get an
indefinite stream of messages:
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0

Actually, the same problem can happen even when the consumer script
does receive some messages, just not very frequently; it is only a bit
harder to reproduce.

If you attach to the walsender with gdb, you'll see the following picture:
(gdb) bt
#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x000055cb93dac30b in send_message_to_frontend
(edata=0x55cb942b4380 <errordata>) at elog.c:3317
#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
"sending replication keepalive") at elog.c:1376
#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
walsender.c:3358
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29
<XLogSendLogical>) at walsender.c:2872
#11 0x000055cb93bc9155 in WalSndLoop (send_data=0x55cb93bc9e29
<XLogSendLogical>) at walsender.c:2194
#12 0x000055cb93bc7b11 in StartLogicalReplication (cmd=0x55cb95931cc0)
at walsender.c:1109
#13 0x000055cb93bc83d6 in exec_replication_command
(cmd_string=0x55cb958b2360 "START_REPLICATION SLOT \"test\" LOGICAL
0/00000000") at walsender.c:1541
#14 0x000055cb93c31653 in PostgresMain (argc=1, argv=0x55cb958deb68,
dbname=0x55cb958deb48 "postgres", username=0x55cb958deb28
"akukushkin") at postgres.c:4178
#15 0x000055cb93b95185 in BackendRun (port=0x55cb958d71e0) at postmaster.c:4361
#16 0x000055cb93b94824 in BackendStartup (port=0x55cb958d71e0) at
postmaster.c:4033
#17 0x000055cb93b90ccd in ServerLoop () at postmaster.c:1706
#18 0x000055cb93b90463 in PostmasterMain (argc=3, argv=0x55cb958ac710)
at postmaster.c:1379
#19 0x000055cb93abb08e in main (argc=3, argv=0x55cb958ac710) at main.c:228
(gdb) f 10
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29
<XLogSendLogical>) at walsender.c:2872
2872                    WalSndKeepalive(true);
(gdb) p WalSndCaughtUp
$1 = true
(gdb) p *MyWalSnd
$2 = {pid = 21845, state = WALSNDSTATE_STREAMING, sentPtr = 23586168,
needreload = false, write = 0, flush = 23586112, apply = 0, writeLag =
-1, flushLag = -1, applyLag = -1, mutex = 0 '\000', latch =
0x7fd66096b594, sync_standby_priority = 0}

As you can see, the value of WalSndCaughtUp is set to true! The
shutdown never finishes because the value of sentPtr is higher than the
values of MyWalSnd->flush and MyWalSnd->write:
(gdb) l 2858
2853            /*
2854             * To figure out whether all WAL has successfully been
replicated, check
2855             * flush location if valid, write otherwise. Tools
like pg_receivewal will
2856             * usually (unless in synchronous mode) return an
invalid flush location.
2857             */
2858            replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
2859                    MyWalSnd->write : MyWalSnd->flush;
2860
2861            if (WalSndCaughtUp && sentPtr == replicatedPtr &&
2862                    !pq_is_send_pending())
2863            {
2864                    /* Inform the standby that XLOG streaming is done */
2865                    EndCommand("COPY 0", DestRemote);
2866                    pq_flush();
2867
2868                    proc_exit(0);
2869            }

What is more interesting, the issue cannot be reproduced with
pg_recvlogical. That is because pg_recvlogical responds to keepalive
messages by reporting a flush location equal to the walEnd it received
in the keepalive.

As the next logical step I tried to do the same in Python with psycopg2.
Unfortunately, the keepalive functionality is hidden in the C code and
it is not possible to change it without recompiling psycopg2, but
there is an asynchronous interface available:
http://initd.org/psycopg/docs/extras.html#psycopg2.extras.ReplicationCursor.wal_end.
I only had to make one minor adjustment:

        try:
            sel = select([cur], [], [], max(0, timeout))
            if not any(sel):
-               cur.send_feedback()  # timed out, send keepalive message
+               cur.send_feedback(flush_lsn=cur.wal_end)  # timed out, send keepalive message
        except InterruptedError:
            pass  # recalculate timeout and continue

wal_end is a property of the cursor object and it is updated not
only for every message received, but also from keepalive messages.
Basically, such a little change made the python example behave very
similarly to pg_recvlogical.

All of the above probably looks like a brain dump, but I don't think
it conflicts with Tomas's findings; it rather complements them.
I am very glad that now I know how to mitigate the problem on the
client side, but IMHO it is also very important to fix the server
behavior if at all possible.

Regards,
--
Alexander Kukushkin



Re: walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra
Date:
On Wed, May 01, 2019 at 02:13:10PM +0200, Alexander Kukushkin wrote:
> Hi Tomas,
>
>On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> I see there's an ongoing discussion about race conditions in walreceiver
>> blocking shutdown, so let me start a new thread about a race condition in
>> walsender during shutdown.
>>
>> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
>> waiting indefinitely for the walsender processes to catch up and terminate.
>
>I can confirm, during the past couple of years we observed such a
>problem a few times and this is really annoying.
>
>> The reason for that is pretty simple - the walsenders are doing logical
>> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
>> But per XLogSendLogical() this only happens if
>>
>>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>>     {
>>         WalSndCaughtUp = true;
>>         ...
>>     }
>
>After a couple of days investigating and debugging I came to a
>slightly different conclusion, WalSndCaughtUp is set to true in my
>case.
>Since I am mostly a python person, I decided to use psycopg2 for my
>investigation. I took an example from
>http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
>as a starting point, created a slot and started the script.
>I wasn't running any transactions, therefore DemoConsumer.__call__
>was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
>was never called either. Basically, the only thing the psycopg2
>internals were doing was periodically sending keepalive messages or
>replying to keepalives sent by postgres.

OK, so that seems like a separate issue, somewhat unrelated to the issue I
reported. And I'm not sure it's a walsender issue - it seems it might be a
psycopg2 issue in not reporting the flush properly, no?

>Actually, the same problem can happen even when the consumer script
>does receive some messages, just not very frequently; it is only a bit
>harder to reproduce.
>
>If you attach to the walsender with gdb, you'll see the following picture:
>(gdb) bt
>#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
>len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
>#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
>ptr=0x55cb958dca08, len=94) at be-secure.c:318
>#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>ptr=0x55cb958dca08, len=94) at be-secure.c:265
>#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>#5  0x000055cb93dac30b in send_message_to_frontend
>(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>"sending replication keepalive") at elog.c:1376
>#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>walsender.c:3358

Is it stuck in the send() call forever, or did you just happen to grab
this backtrace?

>
>All above text probably looks like a brain dump, but I don't think
>that it conflicts with Tomas's findings it rather compliments them.
>I am very glad that now I know how to mitigate the problem on the
>client side, but IMHO it is also very important to fix the server
>behavior if it is ever possible.
>

I think having a report of an issue, with a way to reproduce it is a first
(and quite important) step. So thanks for doing that.

That being said, I think those are two separate issues, with different
causes and likely different fixes. I don't think fixing the xlog flush
will resolve your issue, and vice versa.

FWIW attached is a patch that I used to reliably trigger the xlog flush
issue - it simply adds a LogLogicalMessage() call to the commit handler in
the built-in output plugin. So all you need to do is create a subscription,
start generating commits, and trigger a restart. The message is 8kB, so it's
definitely long enough to overflow to the next xlog page.
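
For reference, this is roughly what such a patch looks like (my
reconstruction, assuming the PG 12-era pgoutput commit callback; the actual
attachment may differ in details, including the message prefix):

    static void
    pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                        XLogRecPtr commit_lsn)
    {
        /* ~8kB payload, enough to spill onto the next (still unflushed) page */
        static const char padding[8192] = {0};

        /* decoding this commit now generates a bit of WAL of its own */
        LogLogicalMessage("shutdown-test", padding, sizeof(padding), false);

        OutputPluginUpdateProgress(ctx);

        OutputPluginPrepareWrite(ctx, true);
        logicalrep_write_commit(ctx->out, txn, commit_lsn);
        OutputPluginWrite(ctx, true);
    }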

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachments

Re: walsender vs. XLogBackgroundFlush during shutdown

From: Andres Freund
Date:
Hi,

On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:
> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
> 
>    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>    {
>        WalSndCaughtUp = true;
>        ...
>    }
> 
> That is, we need to get beyond GetFlushRecPtr(). But that may never
> happen, because there may be an incomplete (and unflushed) page in the WAL
> buffers, not forced out by any transaction. So if there's a WAL record
> overflowing to the unflushed page, the walsender will never catch up.
> 
> Now, this situation is apparently expected, because WalSndWaitForWal()
> does this:
> 
>    /*
>     * If we're shutting down, trigger pending WAL to be written out,
>     * otherwise we'd possibly end up waiting for WAL that never gets
>     * written, because walwriter has shut down already.
>     */
>    if (got_STOPPING)
>        XLogBackgroundFlush();
> 
> but unfortunately that does not actually do anything, because right at
> the very beginning XLogBackgroundFlush() does this:
> 
>    /* back off to last completed page boundary */
>    WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

> That is, it intentionally ignores the incomplete page, which means the
> walsender can't read the record and reach GetFlushRecPtr().
> 
> XLogBackgroundFlush() does this since (at least) 2007, so how come we
> never had issues with this before?

I assume that's because of the following logic:
    /* if we have already flushed that far, consider async commit records */
    if (WriteRqst.Write <= LogwrtResult.Flush)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        WriteRqst.Write = XLogCtl->asyncXactLSN;
        SpinLockRelease(&XLogCtl->info_lck);
        flexible = false;        /* ensure it all gets written */
    }

and various pieces of the code doing XLogSetAsyncXactLSN() to force
flushing.  I wonder if the issue is that we're better at avoiding
writing unnecessary WAL due to
6ef2eba3f57f17960b7cd4958e18aa79e357de2f


> But I don't think we're safe without the failover slots patch, because
> any output plugin can do something similar - say, LogLogicalMessage() or
> something like that. I'm not aware of a plugin doing such things, but I
> don't think it's illegal / prohibited either. (Of course, plugins that
> generate WAL won't be useful for decoding on standby in the future.)

FWIW, I'd consider such an output plugin outright broken.


> So what I think we should do is to tweak XLogBackgroundFlush() so that
> during shutdown it skips the back-off to page boundary, and flushes even
> the last piece of WAL. There are only two callers anyway, so something
> like XLogBackgroundFlush(bool) would be simple enough.

I think it then just ought to be a normal XLogFlush(). I.e. something
along the lines of:

        /*
         * If we're shutting down, trigger pending WAL to be written out,
         * otherwise we'd possibly end up waiting for WAL that never gets
         * written, because walwriter has shut down already.
         */
        if (got_STOPPING && !RecoveryInProgress())
            XLogFlush(GetXLogInsertRecPtr());

        /* Update our idea of the currently flushed position. */
        if (!RecoveryInProgress())
            RecentFlushPtr = GetFlushRecPtr();
        else
            RecentFlushPtr = GetXLogReplayRecPtr(NULL);


Greetings,

Andres Freund



Re: walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra
Date:
On Wed, May 01, 2019 at 08:53:15AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:
>> The reason for that is pretty simple - the walsenders are doing logical
>> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
>> But per XLogSendLogical() this only happens if
>>
>>    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>>    {
>>        WalSndCaughtUp = true;
>>        ...
>>    }
>>
>> That is, we need to get beyond GetFlushRecPtr(). But that may never
>> happen, because there may be an incomplete (and unflushed) page in the WAL
>> buffers, not forced out by any transaction. So if there's a WAL record
>> overflowing to the unflushed page, the walsender will never catch up.
>>
>> Now, this situation is apparently expected, because WalSndWaitForWal()
>> does this:
>>
>>    /*
>>     * If we're shutting down, trigger pending WAL to be written out,
>>     * otherwise we'd possibly end up waiting for WAL that never gets
>>     * written, because walwriter has shut down already.
>>     */
>>    if (got_STOPPING)
>>        XLogBackgroundFlush();
>>
>> but unfortunately that does not actually do anything, because right at
>> the very beginning XLogBackgroundFlush() does this:
>>
>>    /* back off to last completed page boundary */
>>    WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
>> That is, it intentionally ignores the incomplete page, which means the
>> walsender can't read the record and reach GetFlushRecPtr().
>>
>> XLogBackgroundFlush() does this since (at least) 2007, so how come we
>> never had issues with this before?
>
>I assume that's because of the following logic:
>    /* if we have already flushed that far, consider async commit records */
>    if (WriteRqst.Write <= LogwrtResult.Flush)
>    {
>        SpinLockAcquire(&XLogCtl->info_lck);
>        WriteRqst.Write = XLogCtl->asyncXactLSN;
>        SpinLockRelease(&XLogCtl->info_lck);
>        flexible = false;        /* ensure it all gets written */
>    }
>
>and various pieces of the code doing XLogSetAsyncXactLSN() to force
>flushing.  I wonder if the issue is that we're better at avoiding
>writing unnecessary WAL due to
>6ef2eba3f57f17960b7cd4958e18aa79e357de2f
>

I don't think so, because (a) there are no async commits involved, and (b)
we originally ran into the issue on 9.6 and 6ef2eba3f57f1 is only in 10+.

>
>> But I don't think we're safe without the failover slots patch, because
>> any output plugin can do something similar - say, LogLogicalMessage() or
>> something like that. I'm not aware of a plugin doing such things, but I
>> don't think it's illegal / prohibited either. (Of course, plugins that
>> generate WAL won't be useful for decoding on standby in the future.)
>
>FWIW, I'd consider such an output plugin outright broken.
>

Why? Is that prohibited somewhere, either explicitly or implicitly? I do
see obvious issues with generating WAL from a plugin (infinite cycles and so
on), but I suppose that's more of a regular coding issue than something that
would make every plugin doing that broken.

FWIW I don't see any particular need to generate WAL from output plugins,
I mentioned it mostly just as a convenient way to trigger the issue. I
suppose there are other ways to generate a bit of WAL during shutdown.

>
>> So what I think we should do is to tweak XLogBackgroundFlush() so that
>> during shutdown it skips the back-off to page boundary, and flushes even
>> the last piece of WAL. There are only two callers anyway, so something
>> like XLogBackgroundFlush(bool) would be simple enough.
>
>I think it then just ought to be a normal XLogFlush(). I.e. something
>along the lines of:
>
>        /*
>         * If we're shutting down, trigger pending WAL to be written out,
>         * otherwise we'd possibly end up waiting for WAL that never gets
>         * written, because walwriter has shut down already.
>         */
>        if (got_STOPPING && !RecoveryInProgress())
>            XLogFlush(GetXLogInsertRecPtr());
>
>        /* Update our idea of the currently flushed position. */
>        if (!RecoveryInProgress())
>            RecentFlushPtr = GetFlushRecPtr();
>        else
>            RecentFlushPtr = GetXLogReplayRecPtr(NULL);
>

Perhaps. That would work too, I guess.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: walsender vs. XLogBackgroundFlush during shutdown

From: Alexander Kukushkin
Date:
Hi,

On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> OK, so that seems like a separate issue, somewhat unrelated to the issue I
> reported. And I'm not sure it's a walsender issue - it seems it might be a
> psycopg2 issue in not reporting the flush properly, no?

Agreed, it is a different issue, but I am unsure which to blame,
postgres or psycopg2.
Right now psycopg2 confirms more or less every XLogData message, but at
the same time the LSN on the primary keeps moving forward and we learn
about that from keepalive messages.
I perfectly understand the need to periodically send updates of flush
= walEnd (which comes from the keepalive): there might be no transaction
activity while WAL is still being generated, and otherwise the
replication slot would prevent that WAL from being cleaned up.
From the client's perspective it has confirmed everything it should,
but from the postgres side this is not enough to shut down cleanly.
Maybe it is possible to change the check (sentPtr == replicatedPtr) to
something like (lastMsgSentPtr <= replicatedPtr), or would that be
unsafe?
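
To make it concrete, the check in WalSndDone() would change roughly like
this (just a sketch; lastMsgSentPtr is a hypothetical variable the
walsender would need to track when sending XLogData, and I am not claiming
this is actually safe):

    if (WalSndCaughtUp && lastMsgSentPtr <= replicatedPtr &&
        !pq_is_send_pending())
    {
        /* Inform the standby that XLOG streaming is done */
        EndCommand("COPY 0", DestRemote);
        pq_flush();

        proc_exit(0);
    }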

> >Actually, the same problem can happen even when the consumer script
> >does receive some messages, just not very frequently; it is only a bit
> >harder to reproduce.
> >
> >If you attach to the walsender with gdb, you'll see the following picture:
> >(gdb) bt
> >#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
> >len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
> >#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
> >#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
> >#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
> >#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
> >#5  0x000055cb93dac30b in send_message_to_frontend
> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317
> >#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
> >#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
> >#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
> >"sending replication keepalive") at elog.c:1376
> >#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
> >walsender.c:3358
>
> Is it stuck in the send() call forever, or did you just happen to grab
> this backtrace?

No, it wasn't stuck there. During the shutdown postgres starts sending
a few thousand keepalive messages per second and receives just as many
feedback messages back, so the chances of catching it somewhere in
send() are quite high.
The loop never breaks because psycopg2 keeps replying with the same
flush value as last time, which was set while processing an XLogData
message.

> I think having a report of an issue, with a way to reproduce it is a first
> (and quite important) step. So thanks for doing that.
>
> That being said, I think those are two separate issues, with different
> causes and likely different fixes. I don't think fixing the xlog flush
> will resolve your issue, and vice versa.

Agree, these are different issues.

Regards,
--
Alexander Kukushkin



Re: walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra
Date:
On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote:
>Hi,
>
>On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> OK, so that seems like a separate issue, somewhat unrelated to the issue I
>> reported. And I'm not sure it's a walsender issue - it seems it might be a
>> psycopg2 issue in not reporting the flush properly, no?
>
>Agreed, it is a different issue, but I am unsure which to blame,
>postgres or psycopg2.
>Right now psycopg2 confirms more or less every XLogData message, but at
>the same time the LSN on the primary keeps moving forward and we learn
>about that from keepalive messages.
>I perfectly understand the need to periodically send updates of flush
>= walEnd (which comes from the keepalive): there might be no transaction
>activity while WAL is still being generated, and otherwise the
>replication slot would prevent that WAL from being cleaned up.
>From the client's perspective it has confirmed everything it should,
>but from the postgres side this is not enough to shut down cleanly.
>Maybe it is possible to change the check (sentPtr == replicatedPtr) to
>something like (lastMsgSentPtr <= replicatedPtr), or would that be
>unsafe?
>

I don't know.

In general I think it's a bit strange that we're waiting for walsender
processes to catch up even in fast shutdown mode, instead of just aborting
them like other backends. But I assume there are reasons for that. OTOH it
makes us vulnerable to issues like this, when a (presumably) misbehaving
downstream prevents a shutdown.

>> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
>> >#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
>> >#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>> >#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>> >#5  0x000055cb93dac30b in send_message_to_frontend
>> >(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>> >#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>> >#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>> >#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>> >"sending replication keepalive") at elog.c:1376
>> >#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>> >walsender.c:3358
>>
>> Is it stuck in the send() call forever, or did you just happen to grab
>> this backtrace?
>
>No, it wasn't stuck there. During the shutdown postgres starts sending
>a few thousand keepalive messages per second and receives just as many
>feedback messages back, so the chances of catching it somewhere in
>send() are quite high.

Uh, that seems a bit broken, perhaps?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: walsender vs. XLogBackgroundFlush during shutdown

From: Simon Riggs
Date:
On Wed, 1 May 2019 at 01:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
 
> Now, this situation is apparently expected, because WalSndWaitForWal()
> does this:
>
>     /*
>      * If we're shutting down, trigger pending WAL to be written out,
>      * otherwise we'd possibly end up waiting for WAL that never gets
>      * written, because walwriter has shut down already.
>      */
>     if (got_STOPPING)
>         XLogBackgroundFlush();
>
> but unfortunately that does not actually do anything, because right at
> the very beginning XLogBackgroundFlush() does this:
>
>     /* back off to last completed page boundary */
>     WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
>
> That is, it intentionally ignores the incomplete page, which means the
> walsender can't read the record and reach GetFlushRecPtr().
>
> XLogBackgroundFlush() does this since (at least) 2007, so how come we
> never had issues with this before?

Yeah, not quite what I originally wrote for that.

I think the confusion is that XLogBackgroundFlush() doesn't do quite what
it says.

XLogWrite() kinda does, with its "flexible" parameter. So I suggest we do
the same for XLogBackgroundFlush(), so callers can indicate whether they
want it to be flexible or not.
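
Something along these lines, borrowing the convention from
XLogWrite(WriteRqst, flexible) - purely a sketch, the names and call sites
are only for illustration:

    extern bool XLogBackgroundFlush(bool flexible);

    /* walwriter: flexible, may stop at the last completed page boundary */
    XLogBackgroundFlush(true);

    /* walsender shutdown path: not flexible, write out the partial last page too */
    if (got_STOPPING)
        XLogBackgroundFlush(false);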

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: walsender vs. XLogBackgroundFlush during shutdown

From: Alexander Kukushkin
Date:
On Thu, 2 May 2019 at 14:35, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >From the client's perspective it has confirmed everything it should,
> >but from the postgres side this is not enough to shut down cleanly.
> >Maybe it is possible to change the check (sentPtr == replicatedPtr) to
> >something like (lastMsgSentPtr <= replicatedPtr), or would that be
> >unsafe?
>
> I don't know.
>
> In general I think it's a bit strange that we're waiting for walsender
> processes to catch up even in fast shutdown mode, instead of just aborting
> them like other backends. But I assume there are reasons for that. OTOH it
> makes us vulnerable to issues like this, when a (presumably) misbehaving
> downstream prevents a shutdown.

IMHO waiting until the remote side has received and flushed all changes is
the right strategy, but physical and logical replication should be handled
slightly differently.
For physical replication we want to make sure the remote side has
received and flushed all changes, otherwise after a switchover we
won't be able to rejoin the former primary as a new standby.
The logical replication case is a bit different. I think we can safely
shut down the walsender once the client has confirmed the last XLogData
message, while right now we are waiting until the client confirms the
wal_end it received in the keepalive message. If we shut the walsender
down too early and do a switchover, the client might miss some events,
because logical slots are not replicated :(


> >No, it wasn't stuck there. During the shutdown postgres starts sending
> >a few thousand keepalive messages per second and receives just as many
> >feedback messages back, so the chances of catching it somewhere in
> >send() are quite high.
>
> Uh, that seems a bit broken, perhaps?

Indeed, this is broken psycopg2 behavior :(
I am thinking about submitting a patch to fix it.

Actually, I quickly skimmed through the pgjdbc logical replication
source code and the example at
https://jdbc.postgresql.org/documentation/head/replication.html and I
think it will also cause problems with shutdown.

Regards,
--
Alexander Kukushkin