Re: Timeline following for logical slots

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: Timeline following for logical slots
Дата
Msg-id CAMsr+YHtG_4RpJyWYbg-aSTcwHrBnOQLCt03a1Q9ipZ_HqwwNw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Timeline following for logical slots  (Andres Freund <andres@anarazel.de>)
Ответы Re: Timeline following for logical slots  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On 15 March 2016 at 17:12, Andres Freund <andres@anarazel.de> wrote:
Hi

Thanks very much for the review.

This patch was split out from failover slots, which its self underwent quite a few revisions, so I'm really happy to have fresh eyes on it. Especially more experienced ones.
 
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index fcb0872..7b60b8f 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -10,9 +10,11 @@
>   *
>   * NOTES
>   *           See xlogreader.h for more notes on this facility.
> + *
> + *           This file is compiled as both front-end and backend code, so it
> + *           may not use ereport, server-defined static variables, etc.
>   *-------------------------------------------------------------------------
>   */
> -

Huh?

I'm not sure what the concern here is. xlogreader *is* compiled as frontend code - the file gets linked into the tree for pg_xlogdump and pg_rewind, at least.

I found that really confusing when working on it and thought it merited a comment.
  
> @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
>       pfree(state->errormsg_buf);
>       if (state->readRecordBuf)
>               pfree(state->readRecordBuf);
> +#ifndef FRONTEND
> +     if (state->timelineHistory)
> +             list_free_deep(state->timelineHistory);
> +#endif

Hm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.

Yes, it would. I don't want to address that in the same patch though. It'd require making timeline.c frontend-clean, dealing with the absence of List on the frontend, etc, and I don't want to complicate this patch with that.

I've intentionally written the timeline logic so it can pretty easily be moved into xlogreader.c as a self-contained unit and used for those utilities once timeline.c can be compiled for frontend too.

>       pfree(state->readBuf);
>       pfree(state);
>  }
> @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
>
>       if (RecPtr == InvalidXLogRecPtr)
>       {
> +             /* No explicit start point; read the record after the one we just read */
>               RecPtr = state->EndRecPtr;
>
>               if (state->ReadRecPtr == InvalidXLogRecPtr)
> -                     randAccess = true;
> +                     randAccess = true;      /* allow readPageTLI to go backwards */

randAccess is doing more than that, so I'm doubtful that comment is an
improvment.

Yeah, I have no idea what I was on about there, per response to Álvaro's post. 
 
 
> @@ -466,9 +482,7 @@ err:
>        * Invalidate the xlog page we've cached. We might read from a different
>        * source after failure.
>        */
> -     state->readSegNo = 0;
> -     state->readOff = 0;
> -     state->readLen = 0;
> +     XLogReaderInvalCache(state);

I don't think that "cache" is the right way to describe this.

Isn't that what it is? It reads a page, caches it, and reuses it for subsequent requests on the same page. The pre-existing comment even calls it a cache above.

I don't mind changing it, but don't have any better ideas.
 
>  #include <unistd.h>
>
> -#include "miscadmin.h"
> -

spurious change imo.

Added in Álvaro's rev; it puts the header in the correct sort order, but I'm not sure it should be bundled with this patch.
  
> -             if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
> +             /* Do we need to open a new xlog segment? */
> +             if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
> +                     sendTLI != tli)
>               {

s/open a new/open a different/?  New imo has connotations that we don't
really want here.

In my original patch this was:

/* Do we need to switch to a new xlog segment? */

but yeah, "open a different" is better than either.

 
>               /* Need to seek in the file? */
>               if (sendOff != startoff)
>               {
>                       if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> -                     {
> -                             char            path[MAXPGPATH];
> -
> -                             XLogFilePath(path, tli, sendSegNo);
> -
>                               ereport(ERROR,
>                                               (errcode_for_file_access(),
>                                 errmsg("could not seek in log segment %s to offset %u: %m",
> -                                              path, startoff)));
> -                     }
> +                                              XLogFileNameP(tli, sendSegNo), startoff)));
>                       sendOff = startoff;
>               }

Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.

I was getting pretty confused, since I was sure I didn't write that. My memory's bad enough that sometimes I go "huh, ok, guess I did"... but in this case it wasn't in my patch so I think Álvaro's added it.

Agree that it's unrelated and probably better how it was.
  
> +static void
> +XLogReadDetermineTimeline(XLogReaderState *state)
> +{
> +     /* Read the history on first time through */
> +     if (state->timelineHistory == NIL)
> +             state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
> +
> +     /*
> +      * Are we reading the record immediately following the one we read last
> +      * time?  If not, then don't use the cached timeline info.
> +      */
> +     if (state->currRecPtr != state->EndRecPtr)
> +     {
> +             state->currTLI = 0;
> +             state->currTLIValidUntil = InvalidXLogRecPtr;
> +     }


Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().

Yeah, and in an earlier version of this patch that's where it lived.

I landed up moving it into its own self-contained function mainly because the xlog read callback needs to be able to re-read the timeline info and re-check the timeline end if the current timeline becomes historical while it's waiting for new WAL, which could happen if it's a cascading standby and its parent got promoted. Hence the call to XLogReadDetermineTimeline from within read_local_xlog_page(...). That can't actually happen right now since logical decoding can't be done on a standby yet, but I didn't want to introduce new problems to fix for that when adding timeline following support.

XLogReadRecord(...) could clear this info instead of doing it in XLogReadDetermineTimeline(...), but I thought it made more sense to keep use of that state local to XLogReadDetermineTimeline(...) rather than scatter it.
 
> +     if (state->currTLI == 0)
> +     {
> +             /*
> +              * Something changed; work out what timeline this record is on. We
> +              * might read it from the segment on this TLI or, if the segment is
> +              * also contained by newer timelines, the copy from a newer TLI.
> +              */
> +             state->currTLI = tliOfPointInHistory(state->currRecPtr,
> +                                                                                      state->timelineHistory);
> +
> +             /*
> +              * Look for the most recent timeline that's on the same xlog segment
> +              * as this record, since that's the only one we can assume is still
> +              * readable.
> +              */
> +             while (state->currTLI != ThisTimeLineID &&
> +                        state->currTLIValidUntil == InvalidXLogRecPtr)
> +             {
> +                     XLogRecPtr      tliSwitch;
> +                     TimeLineID      nextTLI;
> +
> +                     tliSwitch = tliSwitchPoint(state->currTLI, state->timelineHistory,
> +                                                                        &nextTLI);
> +
> +                     /* round ValidUntil down to start of seg containing the switch */
> +                     state->currTLIValidUntil =
> +                             ((tliSwitch / XLogSegSize) * XLogSegSize);
> +
> +                     if (state->currRecPtr >= state->currTLIValidUntil)
> +                     {
> +                             /*
> +                              * The new currTLI ends on this WAL segment so check the next
> +                              * TLI to see if it's the last one on the segment.
> +                              *
> +                              * If that's the current TLI we'll stop searching.

I don't really understand how we're stopping searching here?

What I'm doing here is looking for the newest timeline that exists in this WAL segment. Each time through we advance currTLI if we find that there's a newer timeline on this segment then look again. The loop ends if we find that the newest timeline on the segment is the current timeline being written/replayed by the server (in which case we know that segment must still exist and not have been renamed to .partial) or until we find that currTLI's validity extends past this segment.

There is some explanation for this in the comments at the start of the function since it affects the overall logic.

On reading it again I think that testing against state->currRecPtr is confusing here. It relies on the fact that currTLIValidUntil has been rounded down to the LSN of the start of the segment, so if the current record pointer is greater than it we know the timeline ends somewhere on this segment.

I guess it could be clearer (but verbose) to define an XLogSegNoFromLSN macro then 

if ( XLogSegNoFromLSN(state->currRecPtr) >= XLogSegNoFromLSN(state->currTLIValidUntil))

but ... eh.

The alternative seems to be to search the timeline history info directly to find the most recent timeline in the segment, starting from the current timeline being read.


> +                              */
> +                             state->currTLI = nextTLI;
> +                             state->currTLIValidUntil = InvalidXLogRecPtr;
> +                     }
> +             }
> +}


XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.

If there is I'd be quite happy. It took me some time to figure out the wrinkles here. 

You can't do this where you'd expect, in XLogReadPage. Partly because it gets used by frontend code too, as discussed above. Partly because the xlogreader callback is responsible for waiting for new WAL rather than XLogReadPage, and as noted above that would cause issues when a cascading standby gets promoted while we're waiting for WAL. There's no way to pass the wanted timeline to the callback anyway, but if that were the only issue I'd have just added it (and did, in earlier versions of this patch). 

Additionally, xlogreader and XLogReadPage is used by the physical replication walsender which neither needs nor wants timeline following - it expects to return failure when it runs out of data on the timeline instead, so the client can manage the timeline switch after the CopyBoth data stream ends.

It's not desirable to do the timeline switch for logical decoding at a higher level, before calling into the walsender, because then it has to be done separately in the SQL interface and walsender interface for logical decoding, similarly to how the client does it for physical replication. I actually did it this way in the proof of concept version and it works fine, it's just more intrusive and ugly and duplicates logic in multiple places.

There's more to it than that but I'm tired after a long day. I'll try to write that timelines readme after I review my notes so I can explain better.

As for the actual mechanism by which XLogReadDetermineTimeline operates, the main thing I wonder is whether it can be usefully simplified by having it directly scan the loaded timeline history and determine the last timeline for a segment. I'm not convinced there's much of a way around  needing the rest of the logic.

> +/*
> + * XLogPageReadCB callback for reading local xlog files
>   *
>   * Public because it would likely be very helpful for someone writing another
>   * output method outside walsender, e.g. in a bgworker.
>   *
> - * TODO: The walsender has it's own version of this, but it relies on the
> + * TODO: The walsender has its own version of this, but it relies on the
>   * walsender's latch being set whenever WAL is flushed. No such infrastructure
>   * exists for normal backends, so we have to do a check/sleep/repeat style of
>   * loop for now.
> @@ -754,46 +897,88 @@ int
>  read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
>       int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
>  {
> -     XLogRecPtr      flushptr,
> +     XLogRecPtr      read_upto,
>                               loc;
>       int                     count;
>
>       loc = targetPagePtr + reqLen;
> +
> +     /* Make sure enough xlog is available... */
>       while (1)
>       {
>               /*
> -              * TODO: we're going to have to do something more intelligent about
> -              * timelines on standbys. Use readTimeLineHistory() and
> -              * tliOfPointInHistory() to get the proper LSN? For now we'll catch
> -              * that case earlier, but the code and TODO is left in here for when
> -              * that changes.
> +              * Check which timeline to get the record from.
> +              *
> +              * We have to do it each time through the loop because if we're in
> +              * recovery as a cascading standby, the current timeline might've
> +              * become historical.
>                */
> -             if (!RecoveryInProgress())
> +             XLogReadDetermineTimeline(state);
> +
> +             if (state->currTLI == ThisTimeLineID)
>               {
> -                     *pageTLI = ThisTimeLineID;
> -                     flushptr = GetFlushRecPtr();
> +                     /*
> +                      * We're reading from the current timeline so we might have to
> +                      * wait for the desired record to be generated (or, for a standby,
> +                      * received & replayed)
> +                      */
> +                     if (!RecoveryInProgress())
> +                     {
> +                             *pageTLI = ThisTimeLineID;
> +                             read_upto = GetFlushRecPtr();
> +                     }
> +                     else
> +                             read_upto = GetXLogReplayRecPtr(pageTLI);
> +
> +                     if (loc <= read_upto)
> +                             break;
> +
> +                     CHECK_FOR_INTERRUPTS();
> +                     pg_usleep(1000L);
>               }
>               else
> -                     flushptr = GetXLogReplayRecPtr(pageTLI);
> +             {
> +                     /*
> +                      * We're on a historical timeline, so limit reading to the switch
> +                      * point where we moved to the next timeline.
> +                      */
> +                     read_upto = state->currTLIValidUntil;

Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
so, how come?

We're reading from a segment where the newest timeline on that segment is historical, i.e. the server has since replayed WAL from a newer timeline on a more recent segment. We therefore know that there must be a complete segment and we can't ever need to wait for new WAL.

An assertion that  loc <= GetFlushRecPtr()  wouldn't hurt.
 
> -     /* more than one block available */
> -     if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
> +     if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
> +     {
> +             /*
> +              * more than one block available; read only that block, have caller
> +              * come back if they need more.
> +              */
>               count = XLOG_BLCKSZ;
> -     /* not enough data there */
> -     else if (targetPagePtr + reqLen > flushptr)
> +     }
> +     else if (targetPagePtr + reqLen > read_upto)
> +     {
> +             /* not enough data there */
>               return -1;
> -     /* part of the page available */
> +     }
>       else
> -             count = flushptr - targetPagePtr;
> +     {
> +             /* enough bytes available to satisfy the request */
> +             count = read_upto - targetPagePtr;
> +     }
>
> -     XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
> +     XLogRead(cur_page, *pageTLI, targetPagePtr, count);

When are we reading less than a page? That should afaik never be required.

Because pages are pre-allocated and zeroed it's safe to read from the last page past the point we've actually written WAL to. But in that case why do we bother determining 'count' in the first place, only to ignore it?

This is largely cosmetic TBH.

> +             /*
> +              * We start reading xlog from the restart lsn, even though in
> +              * CreateDecodingContext we set the snapshot builder up using the
> +              * slot's candidate_restart_lsn. This means we might read xlog we
> +              * don't actually decode rows from, but the snapshot builder might
> +              * need it to get to a consistent point. The point we start returning
> +              * data to *users* at is the candidate restart lsn from the decoding
> +              * context.
> +              */

Uh? Where are we using candidate_restart_lsn that way?

That's a major brain-fart - it should've been confirmed_flush. I'm glad you caught that, since it took me some time to figure out why logical decoding was trying to read WAL I hadn't asked for, but adding an explanatory comment that's *wrong* sure doesn't help the next person.

I'm quite impressed and disturbed that I managed to write out as "candidate restart lsn" instead of "confirmed lsn" in full as well. (Now, where were my glasses, I swear I had them a minute ago... oh, they're on my face!)

The important bit is that where we start reading xlog is the restart_lsn of the slot, even though we won't return anything we decoded to the user until we reach confirmed_flush, which is looked up by CreateDecodingContext completely independently of what pg_logical_slot_get_changes_guts does, because we passed InvalidXLogRecPtr to CreateDecodingContext to tell it to automagically look up the confirmed_lsn. Maybe it'd be clearer to just pass the confirmed_lsn to CreateDecodingContext from pg_logical_slot_get_changes_guts. When I was debugging this stuff I found it pretty confusing that the LSN we started reading xlog at was different to the LSN the user wants decoded changes from, hence the comment.

> @@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>                       CHECK_FOR_INTERRUPTS();
>               }
>
> +             /* Make sure timeline lookups use the start of the next record */
> +             startptr = ctx->reader->EndRecPtr;

Huh? startptr isn't used after this, so I'm not sure what this even
means?

That snuck in from a prior revision where timeline following was done in pg_logical_slot_get_changes_guts, before I moved it down into the xlogreader callback so it could be shared between the walsender and SQL interfaces for logical decoding. It's bogus.
 
> +             /*
> +              * The XLogReader will read a page past the valid end of WAL because
> +              * it doesn't know about timelines. When we switch timelines and ask
> +              * it for the first page on the new timeline it will think it has it
> +              * cached, but it'll have the old partial page and say it can't find
> +              * the next record. So flush the cache.
> +              */
> +             XLogReaderInvalCache(ctx->reader);
> +

dito.

Ditto ;)

I really should've caught that. The problem is I've spent way, way too long staring at this code over too many revisions as I figured out what T.F. was going on with timeline following and slots. That's exactly why I really appreciate the fresh eyes.
 
 
> + * Create a new logical slot, with invalid LSN and xid, directly. This does not
> + * use the snapshot builder or logical decoding machinery. It's only intended
> + * for creating a slot on a replica that mirrors the state of a slot on an
> + * upstream master.
> + *
> + * You should immediately decoding_failover_advance_logical_slot(...) it
> + * after creation.
> + */

Uh. I doubt we want this, even if it's formally located in
src/test/modules.  These comments make it appear not to be only intended
for that, and I have serious doubts about the validity of the concept as
is.

As I expressed to Álvaro, I don't really mind if this test module and the associated round of t/ tests that rely on it get cut, so long as the first round of t/ tests that use a filesystem level copy to clone the slot are kept so there's some test coverage for timeline following in logical slots.

I wrote it as a PoC to show that it worked, and the best way to do that was to write it as a new test suite component. Since it seemed useful to test logical decoding timeline following for a slot created after a base backup is made or cloned after pg_basebackup discarded the slots, I included it.

The approach isn't ideal. It's what I plan to do for pglogical if failover slots doesn't make the cut for 9.6, though. The main problem is that because the slot updates don't come through WAL they can lag behind the catalog_xmin the master's using to make decisions about vacuuming the catalogs. If the master advances catalog_xmin on a slot then starts writing the resulting vacuum activity to WAL, and if we replay that WAL *before* we see the slot advance and sync it to the replica, the replica's slot will have an incorrect catalog_xmin that does not reflect the actual on-disk state. Not great. However, if the client is keeping track of its confirmed_lsn (as it must, for correctness) then we know it'll never ask to replay anything older than what it already sent confirmation to the old master for, before failover. Since that's how the slot got advanced and got a new catalog_xmin. That means we won't be attempting to decode anything in the range where the recorded catalog_xmin on the promoted standby after failover would be a problem. The slot will advance to a sensible position when the client specifies the new start lsn.

At least, that's my reading of things, and that's what my tests have shown so far. We do start decoding from restart_lsn, so we'll be decoding WAL in the range where catalog_xmin is lying about what's really in the heap, but I don't see anywhere where we're looking at it. It's just collecting up transaction information at that point, right?

This same issue will occur if we attempt to do failover slots v2 for 9.7 using non-WAL transport to allow decoding from a replica with failover to cascading standbys, as you mentioned wanting earlier. We'd have to have a way to make sure the slot state on the replica was updated before we replayed past the point in WAL where that slot was updated to the same state on the master.

To that end I've thought about proposing a hook to let plugins intercept slot write-out. That way they can take note of the current server LSN and slot state, then make sure they sync that over to the replica before it replays WAL past that LSN. I was really hoping to get failover slots in place so this wouldn't be necessary but it's not looking too promising, and this would provide a somewhat safer way to capture slot advances than just peeking at pg_replication_slots but without having to get the full failover slots stuff in. Having this would at least eliminate the possibility of catalog_xmin being wrong on the replica.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Julien Rouhaud
Дата:
Сообщение: Re: Minor bug affecting ON CONFLICT lock wait log messages
Следующее
От: Thomas Reiss
Дата:
Сообщение: Re: RFC: replace pg_stat_activity.waiting with something more descriptive