Detecting skipped data from logical slots (data silently skipped)

From: Craig Ringer
Hi all

I think we have a bit of a problem with the behaviour specified for logical slots, one that makes it hard for an outdated snapshot or backup of a logical-slot-using downstream to detect that it's missing a chunk of data that's been consumed from the slot. That's not great, since slots are supposed to ensure a continuous, gapless data stream.

If the downstream requests that logical decoding restarts at an LSN older than the slot's confirmed_flush_lsn, we silently ignore the client's request and start replay at the confirmed_flush_lsn. That's by design and fine normally, since we know the gap LSNs contained no transactions of interest to the downstream.
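For reference, the behaviour comes from logic along these lines in CreateDecodingContext() - paraphrased from memory, not an exact copy of the source:

    else if (start_lsn < slot->data.confirmed_flush)
    {
        /*
         * We don't error out here: it's normal for a client to acknowledge
         * an LSN it never had to do anything for, because the WAL in
         * between produced nothing relevant for this slot, so the client
         * never stored that position persistently.
         */
        elog(DEBUG1, "cannot stream from %X/%X, minimum is %X/%X, forwarding",
             (uint32) (start_lsn >> 32), (uint32) start_lsn,
             (uint32) (slot->data.confirmed_flush >> 32),
             (uint32) slot->data.confirmed_flush);

        start_lsn = slot->data.confirmed_flush;
    }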

But it's *bad* if the downstream is actually a copy of the original downstream that's been started from a restored backup/snapshot. In that case the downstream won't know that some other client, probably a newer instance of itself, consumed rows it should've seen. It'll merrily continue replaying and never know it isn't consistent.

The cause is an optimisation intended to allow the downstream to avoid having to do local writes and flushes when the upstream's activity isn't of interest to it and doesn't result in replicated rows. When the upstream does a bunch of writes to another database or otherwise produces WAL not of interest to the downstream we send the downstream keepalive messages that include the upstream's current xlog position and the client replies to acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on the downstream, we permit it to skip advancing its replication origin in response to those keepalives. We continue to advance the confirmed_flush_lsn and restart_lsn in the replication slot on the upstream so we can free WAL that's not needed and move the catalog_xmin up. The replication origin on the downstream falls behind the confirmed_flush_lsn on the upstream.
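To make the optimisation concrete, the downstream's handling of one of those keepalives looks roughly like this - a sketch only, with made-up names rather than actual pglogical/BDR code:

    /* Illustrative downstream apply-worker keepalive handling. */
    static void
    handle_keepalive(XLogRecPtr upstream_lsn)
    {
        /*
         * Tell the upstream we're "flushed" up to its current position so it
         * can advance confirmed_flush_lsn, restart_lsn and catalog_xmin and
         * free old WAL ...
         */
        send_feedback(upstream_lsn /* written */,
                      upstream_lsn /* flushed */,
                      upstream_lsn /* applied */);

        /*
         * ... but deliberately do NOT advance and flush the local replication
         * origin here. Nothing of interest was replayed, so we skip the disk
         * flush, and the origin falls behind the slot's confirmed_flush_lsn.
         */
    }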

This means that if the downstream exits/crashes before receiving some new row, its replication origin will tell it that it last replayed some LSN older than what it really confirmed, and older than what the server retains. Logical decoding doesn't allow the client to "jump backwards" and replay anything older than the confirmed_flush_lsn. Since we "know" the gap cannot contain rows of interest - otherwise we'd have updated the replication origin - we just skip ahead and start replay at the confirmed_flush_lsn.

That means that if the downstream is restored from a backup it has no way of knowing it can't rejoin and become consistent because it can't tell the difference between "everything's fine, replication origin intentionally behind confirmed_flush_lsn due to activity not of interest" and "we've missed data consumed from this slot by some other peer and should refuse to continue replay".

The simplest fix would be to require downstreams to flush their replication origin when they get a hot standby feedback message, before they send a reply with confirmation. That could be somewhat painful for performance, but can be alleviated somewhat by waiting for the downstream postgres to get around to doing a flush anyway and only forcing it if we're getting close to the walsender timeout. That's pretty much what BDR and pglogical do when applying transactions to avoid having to do a disk flush for each and every applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if you ask for a too-old LSN rather than silently ignoring it.
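A sketch of what that could look like on the downstream - illustrative only; origin_flushed_upto, last_reply_time, origin_id, flush_origin_to_disk() and send_feedback() are hypothetical:

    /*
     * Only confirm a keepalive LSN once the local replication origin is
     * durably at or past it, forcing a flush only when we'd otherwise risk
     * the walsender timing us out.
     */
    static void
    maybe_confirm_keepalive(XLogRecPtr upstream_lsn)
    {
        TimestampTz now = GetCurrentTimestamp();

        if (origin_flushed_upto >= upstream_lsn)
        {
            /* Already durable locally; safe to confirm immediately. */
            send_feedback(upstream_lsn);
            return;
        }

        if (TimestampDifferenceExceeds(last_reply_time, now,
                                       wal_sender_timeout / 2))
        {
            /* Getting close to the timeout: force the flush, then confirm. */
            replorigin_advance(origin_id, upstream_lsn, InvalidXLogRecPtr,
                               false /* go_backward */, true /* wal_log */);
            flush_origin_to_disk();
            origin_flushed_upto = upstream_lsn;
            last_reply_time = now;
            send_feedback(upstream_lsn);
        }
        /* Otherwise defer; a later lazy flush cycle will confirm this LSN. */
    }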

This problem can also bite you if you restore a copy of a downstream (say, to look at since-deleted data) while the original happens to be disconnected for some reason. The copy connects to the upstream and consumes some data from the slot. Then when the original comes back on line it has no idea there's a gap in its time stream.

This came up when investigating issues with people using snapshot-based BDR and pglogical backup/restore. It's a real-world problem that can result in silent data inconsistency.

Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd still allow a downstream to find out if it's missed data?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Detecting skipped data from logical slots (data silently skipped)

From: Greg Stark

I didn't follow all of that but I wonder if it isn't just that when you restore from backup you should be creating a new slot?



Re: Detecting skipped data from logical slots (data silently skipped)

From: Craig Ringer
On 3 August 2016 at 21:55, Greg Stark <stark@mit.edu> wrote:

> I didn't follow all of that but I wonder if it isn't just that when you restore from backup you should be creating a new slot?


Yes, you should. 

But when you restore a Pg instance from a disk snapshot or similar, it can't tell that it isn't the original instance. It has no way to know "I shouldn't connect to this slot and consume data from it". Right now the upstream has no way to tell it "that data's gone, sorry" - it just ignores the downstream's requested start position and starts from wherever it thinks the downstream is up to.

With physical replication we'll detect such a problem. With logical replication we'll silently continue with an invisible gap in the history.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Detecting skipped data from logical slots (data silently skipped)

From: Robert Haas
On Wed, Aug 3, 2016 at 9:39 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I think we have a bit of a problem with the behaviour specified for logical
> slots, one that makes it hard for an outdated snapshot or backup of a
> logical-slot-using downstream to detect that it's missing a chunk of data
> that's been consumed from the slot. That's not great, since slots are
> supposed to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's request
> and start replay at the confirmed_flush_lsn. That's by design and fine
> normally, since we know the gap LSNs contained no transactions of interest
> to the downstream.

Wow, that sucks.

> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream does
> a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the confirmed_flush_lsn
> and restart_lsn in the replication slot on the upstream so we can free WAL
> that's not needed and move the catalog_xmin up. The replication origin on
> the downstream falls behind the confirmed_flush_lsn on the upstream.

This seems entirely too clever.  The upstream could safely remember
that if the downstream asks for WAL position X it's safe to begin
streaming from WAL position Y because nothing in the middle is
interesting, but it can hardly decide to unilaterally ignore the
request position.

> The simplest fix would be to require downstreams to flush their replication
> origin when they get a hot standby feedback message, before they send a
> reply with confirmation. That could be somewhat painful for performance, but
> can be alleviated somewhat by waiting for the downstream postgres to get
> around to doing a flush anyway and only forcing it if we're getting close to
> the walsender timeout. That's pretty much what BDR and pglogical do when
> applying transactions to avoid having to do a disk flush for each and every
> applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> you ask for a too-old LSN rather than silently ignoring it.

That's basically just proposing to revert this broken optimization,
IIUC, and instead just try not to flush too often on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Detecting skipped data from logical slots (data silently skipped)

From: Andres Freund
On 2016-08-05 00:12:41 -0400, Robert Haas wrote:
> > The cause is an optimisation intended to allow the downstream to avoid
> > having to do local writes and flushes when the upstream's activity isn't of
> > interest to it and doesn't result in replicated rows. When the upstream does
> > a bunch of writes to another database or otherwise produces WAL not of
> > interest to the downstream we send the downstream keepalive messages that
> > include the upstream's current xlog position and the client replies to
> > acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> > the downstream, we permit it to skip advancing its replication origin in
> > response to those keepalives. We continue to advance the confirmed_flush_lsn
> > and restart_lsn in the replication slot on the upstream so we can free WAL
> > that's not needed and move the catalog_xmin up. The replication origin on
> > the downstream falls behind the confirmed_flush_lsn on the upstream.
> 
> This seems entirely too clever.  The upstream could safely remember
> that if the downstream asks for WAL position X it's safe to begin
> streaming from WAL position Y because nothing in the middle is
> interesting, but it can hardly decide to unilaterally ignore the
> request position.
> 
> > The simplest fix would be to require downstreams to flush their replication
> > origin when they get a hot standby feedback message, before they send a
> > reply with confirmation. That could be somewhat painful for performance, but
> > can be alleviated somewhat by waiting for the downstream postgres to get
> > around to doing a flush anyway and only forcing it if we're getting close to
> > the walsender timeout. That's pretty much what BDR and pglogical do when
> > applying transactions to avoid having to do a disk flush for each and every
> > applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> > you ask for a too-old LSN rather than silently ignoring it.
> 
> That's basically just proposing to revert this broken optimization,
> IIUC, and instead just try not to flush too often on the standby.

The effect of the optimization is *massive* if you are replicating a
less active database, or a less active subset of a database, in a
cluster with lots of other activity. I don't think that can just be
disregarded to protect against something with plenty of other failure
scenarios.

Andres



Re: Detecting skipped data from logical slots (data silently skipped)

From: Craig Ringer
On 5 August 2016 at 14:07, Andres Freund <andres@anarazel.de> wrote:

> > The simplest fix would be to require downstreams to flush their replication
> > origin when they get a hot standby feedback message, before they send a
> > reply with confirmation. That could be somewhat painful for performance, but
> > can be alleviated somewhat by waiting for the downstream postgres to get
> > around to doing a flush anyway and only forcing it if we're getting close to
> > the walsender timeout. That's pretty much what BDR and pglogical do when
> > applying transactions to avoid having to do a disk flush for each and every
> > applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> > you ask for a too-old LSN rather than silently ignoring it.
>
> That's basically just proposing to revert this broken optimization,
> IIUC, and instead just try not to flush too often on the standby.

> The effect of the optimization is *massive* if you are replicating a
> less active database, or a less active subset of a database, in a
> cluster with lots of other activity. I don't think that can just be
> disregarded to protect against something with plenty of other failure
> scenarios.


Right. Though if we flush lazily I'm surprised the effect is that big; you're the one who did the work and knows its significance.

All I'm trying to say is that I think the current behaviour is too dangerous. It doesn't just lead to failure, but to easy, undetectable, silent failure when users perform common and simple tasks like starting a copy of a DB from a snapshot or a filesystem-level pg_start_backup() backup. The only reason it can't happen with pg_basebackup too is that we omit slots during pg_basebackup. That inconsistency between snapshot/fs-level copies and pg_basebackup is unfortunate but understandable.

So I'm not saying "this whole idea must go". I'm saying I think it's too permissive and needs to be stricter about what it's allowed to skip, so we can differentiate between "nothing interesting here" and "um, I think someone else consumed data I needed, I'd better bail out now". I've never been comfortable with the skipping behaviour and found it confusing right from the start, but now I have definite cases where it can cause silent inconsistencies and really think it needs to be modified.

Robert's point that we could keep track of the skippable range is IMO a good one. An extra slot attribute with the last LSN that resulted in the output plugin doing a write to the client would be sufficient, at least at this point. To anticipate future needs where we might want to allow output plugins to ignore some things, control could be handed to the output plugin by allowing it to also make a function call for the position to be explicitly advanced even if it performs no writes.

That way we can safely skip ahead if the client asks us for an LSN equal to or after the last real data we sent them, but refuse to skip if we sent them data after the LSN they're asking for.
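Roughly, the tracking side could hook into the walsender's write-to-client path; candidate_last_write_lsn is a hypothetical new slot field here, and this is a sketch rather than a worked patch:

    static void
    WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                    TransactionId xid, bool last_write)
    {
        /* ... existing code that ships the decoded change to the client ... */

        /*
         * Remember that real data up to 'lsn' was sent to this client, so a
         * future START_REPLICATION must not silently fast-forward past it.
         */
        SpinLockAcquire(&MyReplicationSlot->mutex);
        if (lsn > MyReplicationSlot->candidate_last_write_lsn)
            MyReplicationSlot->candidate_last_write_lsn = lsn;
        SpinLockRelease(&MyReplicationSlot->mutex);
    }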

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Detecting skipped data from logical slots (data silently skipped)

From: Andres Freund
On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> Right. Though if we flush lazily I'm surprised the effect is that big,
> you're the one who did the work and knows the significance of it.

It will be. Either you're increasing bloat (by not increasing the
slot's wal position / catalog xmin), or you're adding frequent syncs on
an idle connection.

Greetings,

Andres Freund



Re: Detecting skipped data from logical slots (data silently skipped)

From: Craig Ringer
On 17 August 2016 at 05:18, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> > Right. Though if we flush lazily I'm surprised the effect is that big;
> > you're the one who did the work and knows its significance.
>
> It will be. Either you're increasing bloat (by not increasing the
> slot's wal position / catalog xmin), or you're adding frequent syncs on
> an idle connection.


My thinking is that we should be able to do it lazily, like we do already with feedback during apply of changes.  The problem is that right now we can't tell the difference between confirmed_flush_lsn advances in response to keepalives when there's no interesting upstream activity, and advances when the client replays and confirms real activity of interest. So we can add a new field in logical slots that tracks the last confirmed_flush_lsn update that occurred as a result of an actual write to the client rather than keepalive responses. No new resource retention is required, no new client messages, no new protocol fields. Just one new field in a logical slot.



* Add a new field, say last_write_lsn, in slots. A logical slot updates this whenever an output plugin sends something to the client in response to a callback. last_write_lsn is not advanced along with confirmed_flush_lsn when we merely skip over data that's not of interest, like writes to other DBs or changes filtered out by the output plugin; it advances only when the output plugin actually sends something to the client.

* A candidate_last_write_lsn type mechanism is needed to ensure we don't flush out advances of last_write_lsn before we've got client feedback confirming it flushed the changes resulting from the output plugin writes (see the sketch just after this list). The same sort of logic as used for candidate_restart_lsn & restart_lsn will work fine, but we don't have to make sure it's flushed like we do with restart_lsn; we can just dirty the slot and wait for the next slot checkpoint. It's pretty harmless if candidate_last_write_lsn is older than reality - it just adds a small window where we won't detect lost changes.

* Clients like BDR and pglogical already send feedback lazily. They track the server's flush position, sending feedback for an upstream lsn only when we know the corresponding downstream writes and associated replication origin advances have been flushed to disk. (As you know, having written it.) Behaviour during normal apply doesn't need to change. Neither does behaviour during idle; clients don't have to advance their replication origin in response to server keepalives, though they may do so lazily.

*  When a client starts a new decoding session we check last_write_lsn against the client-requested LSN from the client's replication origin. We ERROR if last_write_lsn is newer than the LSN requested by the client, indicating that the client is trying to replay changes it or someone else using the same slot has already seen and confirmed.

* catalog_xmin advances and WAL removal are NOT limited by last_write_lsn; we can freely remove WAL after last_write_lsn and vacuum catalogs. On reconnect we continue to skip to confirmed_flush_lsn if asked for an older LSN, just like we currently do. The difference is that now we know we're skipping data that wasn't of interest to the client, so it didn't result in eager client-side replication origin advances.
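For the candidate promotion mentioned above, something along these lines in LogicalConfirmReceivedLocation() should do - heavily simplified, the real function does more bookkeeping, and the new fields are hypothetical:

    void
    LogicalConfirmReceivedLocation(XLogRecPtr lsn)
    {
        ReplicationSlot *slot = MyReplicationSlot;

        SpinLockAcquire(&slot->mutex);
        slot->data.confirmed_flush = lsn;       /* existing behaviour */

        /* Promote the candidate once the client has confirmed past it. */
        if (slot->candidate_last_write_lsn != InvalidXLogRecPtr &&
            lsn >= slot->candidate_last_write_lsn)
        {
            slot->data.last_write_lsn = slot->candidate_last_write_lsn;
            slot->candidate_last_write_lsn = InvalidXLogRecPtr;
        }
        SpinLockRelease(&slot->mutex);

        /*
         * No forced flush; just dirty the slot and let the next slot
         * checkpoint persist it.
         */
        ReplicationSlotMarkDirty();
    }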


Think of last_write_lsn as "the value of confirmed_flush_lsn the last time the client actually flushed something interesting". We can safely skip from any value >= last_write_lsn to the current slot confirmed_flush_lsn if asked to start replay at any LSN within that range. We CANNOT safely skip from < last_write_lsn to confirmed_flush_lsn, since we know the client would miss data it already received and confirmed but seems to have forgotten due to a lying fsync(), restore from a snapshot backup, etc.
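In code, the startup-time check would become something like this in CreateDecodingContext(), with the field name and error wording purely illustrative:

    if (start_lsn < slot->data.confirmed_flush)
    {
        if (start_lsn >= slot->data.last_write_lsn)
        {
            /* The gap contains nothing we ever sent; fast-forward as today. */
            start_lsn = slot->data.confirmed_flush;
        }
        else
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("cannot start logical decoding at %X/%X: data up to %X/%X has already been streamed from slot \"%s\" and confirmed",
                            (uint32) (start_lsn >> 32), (uint32) start_lsn,
                            (uint32) (slot->data.last_write_lsn >> 32),
                            (uint32) slot->data.last_write_lsn,
                            NameStr(slot->data.name))));
    }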

We'd need more flushes on the upstream only if we were going to try to guarantee that we'd detect all lost changes from a client, since last_write_lsn would need flushing in response to every client feedback message during apply (but not idle). Even then the client could've flushed more changes we haven't got feedback for yet, so it's not really possible to totally prevent the problem. I don't think total prevention is too interesting though. A window since the last slot checkpoint where we don't detect problems *if* the server has also crashed and restarted isn't too bad and is a lot better than the current situation.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services