Re: [HACKERS] Causal reads take II
От | Thomas Munro |
---|---|
Тема | Re: [HACKERS] Causal reads take II |
Дата | |
Msg-id | CAEepm=2K+px_KvfNmUA_c52R5W0_YK48wwemTNgDRG1DooZvUA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] Causal reads take II (Dmitry Dolgov <9erthalion6@gmail.com>) |
Ответы |
Re: [HACKERS] Causal reads take II
(Thomas Munro <thomas.munro@enterprisedb.com>)
|
Список | pgsql-hackers |
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > I looked through the code of `synchronous-replay-v1.patch` a bit and ran a > few > tests. I didn't manage to break anything, except one mysterious error that > I've > got only once on one of my replicas, but I couldn't reproduce it yet. > Interesting thing is that this error did not affect another replica or > primary. > Just in case here is the log for this error (maybe you can see something > obvious, that I've not noticed): > > ``` > LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": > Directory not empty > CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 > LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": > Directory not empty > CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 > LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": > Directory not empty > CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 > LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": > Directory not empty > CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 > LOG: directories for tablespace 47733 could not be removed > HINT: You can remove the directories manually if necessary. > CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 > FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732": > File exists > CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE: > pg_tblspc/47734/PG_10_201707211/47732/47736 > LOG: startup process (PID 8034) exited with exit code 1 > LOG: terminating any other active server processes > LOG: database system is shut down > ``` Hmm. The first error ("could not remove directory") could perhaps be explained by temporary files from concurrent backends, leaked files from earlier crashes or copying a pgdata directory over the top of an existing one as a way to set it up, leaving behind some files from an earlier test? The second error ("could not create directory") is a bit stranger though... I think this must come from TablespaceCreateDbspace(): it must have stat()'d the file and got ENOENT, decided to create the directory, acquired TablespaceCreateLock, stat()'d the file again and found it still absent, then run mkdir() on the parents and got EEXIST and finally on the directory to be created, and surprisingly got EEXIST. That means that someone must have concurrently created the directory. Perhaps in your testing you accidentally copied a pgdata directory over the top of it while it was running? In any case I'm struggling to see how anything in this patch would affect anything at the REDO level. > And speaking about the code, so far I have just a few notes (some of them > merely questions): > > * In general the idea behind this patch sounds interesting for me, but it > relies heavily on time synchronization. As mentioned in the documentation: > "Current hardware clocks, NTP implementations and public time servers are > unlikely to allow the system clocks to differ more than tens or hundreds > of > milliseconds, and systems synchronized with dedicated local time servers > may > be considerably more accurate." But as far as I remember from my own > experience sometimes it maybe not that trivial on something like AWS > because > of virtualization. Maybe it's an unreasonable fear, but is it possible to > address this problem somehow? Oops, I had managed to lose an important hunk that deals with detecting excessive clock drift (ie badly configured servers) while rebasing a couple of versions back. Here is a version to put it back. With that change, if you disable NTP and manually set your standby's clock to be more than 1.25s (assuming synchronous_replay_lease_time is set to the default of 5s) behind the primary, the synchronous_replay should be unavailable and you should see this error in the standby's log: ereport(LOG, (errmsg("the primary server's clock time is too far ahead for synchronous_replay"), errhint("Check your servers' NTP configuration or equivalent."))); One way to test this without messing with your NTP setting or involving two different computers is to modify this code temporarily in WalSndKeepalive: now = GetCurrentTimestamp() + 1250100; This is a best effort intended to detect a system not running ntpd at all or talking to an insane time server. Fundamentally this proposal is based on the assumption that you can get your system clocks into sync within a tolerance that we feel confident estimating an upper bound for. It does appear that some Amazon OS images come with NTP disabled; that's a problem if you want to use this feature, but if you're running a virtual server without an ntpd you'll pretty soon drift seconds to minutes off UTC time and get "unavailable for synchronous replay" errors from this patch (and possibly the LOG message above, depending on the direction of drift). > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25% > time shift when we're dealing with leases). Does it make sense to move > them > out and make them configurable? These numbers are interrelated, and I think they're best fixed in that ratio. You could make it more adjustable, but I think it's better to keep it simple with just a single knob. Let me restate that logic to explain how I came up with those ratios. There are two goals: 1. The primary needs to be able to wait for a lease to expire if connectivity is lost, even if the the standby's clock is behind the primary's clock by max_clock_skew. (Failure to do so could allow stale query results from a zombie standby that is somehow still handling queries after connectivity with the primary is lost.) 2. The primary needs to be able to replace leases often enough so that there are no gaps between them, even if the standby's clock is ahead of the primary's clock by max_clock_skew. (Failure to replace leases fast enough could cause spurious "unavailable" errors, but not incorrect query results. Technically it's max_clock_skew - network latency since it takes time for new leases to arrive but that's a minor detail). A solution that maximises tolerable clock skew and as an added bonus doesn't require the standby to have access to the primary's GUCs is to tell the standby that the expiry time is 25% earlier that the time the primary will really wait until, and replace leases when they still have 50% of their time to go. To illustrate using fixed-width ASCII-art, here's how the primary perceives the stream of leases it sends. In this diagram, '+' marks halfway, '!' marks the time the primary will send to the standby as the expiry time, and the final '|' marks the time the primary will really wait until if it has to, just in case the standby's clock is 'slow' (behind). The '!' is 25% earlier, and represents the maximum tolerable clock skew. |---+-!-| |---+-!-| |---+-!-| Here's how a standby with a clock that is 'slow' (behind) by max_clock_skew = 25% perceives this stream of leases: |-------! |-------! |-------! You can see that the primary server is able to wait just long enough for the lease to expire and the error to begin to be raised on this standby server, if it needs to. Here's how a standby with a clock that is 'fast' (ahead) by max_clock_skew = 25% perceives this stream of leases: |---! |---! |---! If it's ahead by more than that, we'll get gaps where the error may be raised spuriously in between leases. > * Judging from the `SyncReplayPotentialStandby` function, it's possible to > have > `synchronous_replay_standby_names = "*, server_name"`, which is basically > an > equivalent for just `*`, but it looks confusing. Is it worth it to prevent > this behaviour? Hmm. Seems harmless to me! > * In the same function `SyncReplayPotentialStandby` there is this code: > > ``` > if (!SplitIdentifierString(rawstring, ',', &elemlist)) > { > /* syntax error in list */ > pfree(rawstring); > list_free(elemlist); > /* GUC machinery will have already complained - no need to do again */ > return false; > } > ``` > > Am I right that ideally this (a situation when at this point in the code > `synchronous_replay_standby_names` has incorrect value) should not happen, > because GUC will prevent us from that? If yes, then it looks for me that > it > still makes sense to put here a log message, just to give more information > in > a potentially weird situation. Yes. That's exactly the coding that was used for synchronous_commit, before it was upgraded to support a new fancy syntax. I was trying to do things the established way. > * In the function `SyncRepReleaseWaiters` there is a commentary: > > ``` > /* > * If the number of sync standbys is less than requested or we aren't > * managing a sync standby or a standby in synchronous replay state that > * blocks then just leave. > * / > if ((!got_recptr || !am_sync) && !walsender_sr_blocker) > ``` > > Is this commentary correct? If I understand everything right `!got_recptr` > - > the number of sync standbys is less than requested (a), `!am_sync` - we > aren't > managing a sync standby (b), `walsender_sr_blocker` - a standby in > synchronous > replay state that blocks (c). Looks like condition is `(a or b) and not > c`. This code is trying to decide whether to leave early, rather than potentially blocking. The change in my patch is: - if (!got_recptr || !am_sync) + if ((!got_recptr || !am_sync) && !walsender_sr_blocker) The old coding said "if there aren't enough sync commit standbys, or I'm not a sync standby, then I can leave now". The coding with my patch is the same, except that in any case it won't leave early this walsender is managing a standby that potentially blocks commit. That said, it's a terribly named and documented function argument, so I have fixed that in the attached version; I hope that's better! To put it another way, with this patch there are two different reasons a transaction might need to wait: because of synchronous_commit and because of synchronous_replay. They're both forms of 'synchronous replication' I suppose. That if statement is saying 'if I don't need to wait for synchronous_commit, and I don't need to wait for synchronous_replay, then we can return early'. > * In the function `ProcessStandbyReplyMessage` there is a code that > implements > this: > > ``` > * If the standby reports that it has fully replayed the WAL for at > least > * 10 seconds, then let's clear the lag times that were measured when it > * last wrote/flushed/applied a WAL record. This way we avoid displaying > * stale lag data until more WAL traffic arrives. > ``` > but I never found any mention of this 10 seconds in the documentation. Is > it > not that important? Also, question 2 is related to this one. Hmm. Yeah that does seem a bit arbitrary. The documentation in master does already say that it's cleared without being saying exactly when: [...] If the standby server has entirely caught up with the sending server and there is no more WAL activity, the most recently measured lag times will continue to be displayed for a short time and then show NULL. The v1 patch changeed it from being based on wal_receiver_status_interval (sort of implicitly) to being 10 seconds, and yeah that is not a good change. In this v2 it's using wal_receiver_status_interval (though it has to do it explicitly, since this patch increases the amount of chit chat between the servers due to lease replacement). > * In the function `WalSndGetSyncReplayStateString` all the states are in > lower > case except `UNKNOWN`, is there any particular reason for that? This state should never exist, so no user will ever see it; I used "UNKNOWN" following the convention established by the function WalSndGetStateString() immediately above. > There are also few more not that important notes (mostly about some typos > and > few confusing names), but I'm going to do another round of review and > testing > anyway so I'll just send them all next time. Thanks for the review! -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Ashutosh BapatДата:
Сообщение: Re: [HACKERS] Adding support for Default partition in partitioning