Discussion: [HACKERS] Causal reads take II


[HACKERS] Causal reads take II

From:
Thomas Munro
Date:
Hi hackers,

Here is a new version of my "causal reads" patch (see the earlier
thread from the 9.6 development cycle[1]), which provides a way to
avoid stale reads when load balancing with streaming replication.

To try it out:

Set up a primary and some standbys, and put "causal_reads_timeout =
4s" in the primary's postgresql.conf.  Then SET causal_reads = on and
experiment with various workloads, watching your master's log and
looking at pg_stat_replication.  For example you could try out
test-causal-reads.c with --causal-reads --check (from the earlier
thread) or write something similar, and verify the behaviour while
killing, pausing, overwhelming servers etc.
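
If it helps, here is a minimal sketch in Python (psycopg2) in the spirit of
test-causal-reads.c: write a row on the primary with causal_reads = on, then
immediately look for it on a standby.  The ports, table name and the blanket
error handling are placeholders/assumptions, not part of the patch.

```
# Minimal causal reads check; ports and table name are placeholders.
import psycopg2

primary = psycopg2.connect("dbname=postgres port=5432")
standby = psycopg2.connect("dbname=postgres port=5441")
primary.autocommit = True
standby.autocommit = True

with primary.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS causal_test (i int)")
    cur.execute("SET causal_reads = on")
with standby.cursor() as cur:
    cur.execute("SET causal_reads = on")

for i in range(1000):
    # tx1: COMMIT doesn't return until every "available" standby has applied
    # it (or has been kicked out of the available set).
    with primary.cursor() as cur:
        cur.execute("INSERT INTO causal_test VALUES (%s)", (i,))
    # tx2: must either see the row or raise the "unavailable for causal
    # reads" error -- a stale read would mean the guarantee is broken.
    try:
        with standby.cursor() as cur:
            cur.execute("SELECT count(*) FROM causal_test WHERE i = %s", (i,))
            assert cur.fetchone()[0] == 1, "stale read!"
    except psycopg2.Error as e:
        print("standby unavailable for causal reads:", e)
```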

Here's a brief restatement of the problem I'm trying to address and
how this patch works:

In 9.6 we got a new synchronous_commit level "remote_apply", which
causes committing transactions to block until the commit record has
been applied on the current synchronous standby server.  In 10devel
can now be servers plural.  That's useful because it means that a
client can run tx1 on the primary and then run tx2 on an appropriate
standby, or cause some other client to do so, and be sure that tx2 can
see tx1.  Tx2 can be said to be "causally dependent" on tx1 because
clients expect tx2 to see tx1, because they know that tx1 happened
before tx2.

In practice there are complications relating to failure and
transitions.  How should you find an appropriate standby?  Suppose you
have a primary and N standbys, you set synchronous_standby_names to
wait for all N standbys, and you set synchronous_commit to
remote_apply.  Then the above guarantee of visibility of tx1 by tx2
works, no matter which server you run tx2 on.  Unfortunately, if one
of your standby servers fails or there is a network partition, all
commits will block until you fix that.  So you probably want to set
synchronous_standby_names to wait for a subset of your set of
standbys.  Now you can lose some number of standby servers without
holding up commits on the primary, but the visibility guarantee for
causal dependencies is lost!  How can a client know for certain
whether tx2 run on any given standby can see a transaction tx1 that it
has heard about?  If you're using the new "ANY n" mode then the subset
of standbys that have definitely applied tx1 is not known to any
client; if you're using the traditional FIRST mode it's complicated
during transitions (you might be talking to a standby that has
recently lost its link to the primary and the primary could have
decided to wait for the next highest priority standby instead and then
returned from COMMIT successfully).

This patch provides the following guarantee:  if causal_reads is on
for both tx1 and tx2, then after tx1 returns, tx2 will either see tx1
or fail with an error indicating that the server is currently
unavailable for causal reads.  This guarantee is upheld even if there
is a network partition and the standby running tx2 is unable to
communicate with the primary server, but requires the system clocks of
all standbys to differ from the primary's by less than a certain
amount of allowable skew that is accounted for in the algorithm
(causal_reads_timeout / 4, see README.causal_reads for gory details).

It works by sending a stream of "leases" to standbys that are applying
fast enough.  These leases grant the standby the right to assume that
all transactions that were run with causal_reads = on and have
returned control have already been applied locally, without doing any
communication or waiting, for a limited time.  Leases are promises
made by the primary that it will wait for all such transactions to be
applied on each 'available' standby or for available standbys' leases
to be revoked because they're lagging too much, and for any revoked
leases to expire.
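
To illustrate the clock-skew allowance in a few lines (this restates the
reasoning from README.causal_reads in Python, it is not the patch's actual C
code):

```
# Illustration of the clock-skew allowance only -- not the patch's code.
import time

causal_reads_timeout = 4.0                 # seconds, as set on the primary
max_clock_skew = causal_reads_timeout / 4  # the allowance mentioned above

def lease_still_usable(lease_expiry_time):
    # A standby trusts its lease only while it would still be valid even if
    # the standby's clock were ahead of the primary's by max_clock_skew.
    return time.time() + max_clock_skew < lease_expiry_time
```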

As discussed in the earlier thread, there are other ways that tx2 run
on a standby could get a useful guarantee about the visibility of an
early transaction tx1 that the client knows about.  (1)  User-managed
"causality tokens":  Clients could somehow obtain the LSN of commit
tx1 (or later), and then tx2 could explicitly wait for that LSN to be
applied, as proposed by Ivan Kartyshov[2] and others; if you aren't
also using sync rep for data loss avoidance, then tx1 will return from
committing without waiting for standbys, and by the time tx2 starts on
a standby it may find that the LSN has already been applied and not
have to wait at all.  That is definitely good.  Unfortunately it also
transfers the problem of tracking causal dependencies between
transactions to client code, which is a burden on the application
developer and difficult to retrofit.  (2)  Middleware-managed
causality tokens:  Something like pgpool or pgbouncer or some other
proxy could sit in front of all of your PostgreSQL servers and watch
all transactions and do the LSN tracking for you, inserting waits
where appropriate so that no standby query ever sees a snapshot that
doesn't include any commit that any client has heard about; that
requires tx2 to wait for transactions that may be later than tx1 to be
applied, potentially slowing down every read query, and requires
pushing all transactions through a single central process, thereby
introducing its own failover problem with an associated transition
failure mode that could break our guarantee if somehow two of these
proxies are ever active at once.
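
For concreteness, here is what the do-it-yourself version of (1) can look
like today with psycopg2 and the 10devel function names; the "accounts" table
and connection details are of course placeholders:

```
# Capture the primary's WAL position after tx1 commits, then poll the
# standby until that LSN has been replayed before running tx2.
import time
import psycopg2

primary = psycopg2.connect("dbname=postgres port=5432")
standby = psycopg2.connect("dbname=postgres port=5441")
primary.autocommit = True
standby.autocommit = True

# tx1 on the primary, then grab the causality token.
with primary.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("SELECT pg_current_wal_lsn()")
    token = cur.fetchone()[0]

# tx2 on a standby: wait until the token LSN has been applied there.
with standby.cursor() as cur:
    while True:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
            (token,))
        if cur.fetchone()[0]:
            break
        time.sleep(0.01)   # a real implementation would wait more cleverly
    cur.execute("SELECT balance FROM accounts WHERE id = 1")
    print(cur.fetchone()[0])
```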

Don't get me wrong, I think those are good ideas: let's do those too.
I guess that people working on logical multi-master replication might
eventually want a general concept of causality tokens which could
include some kind of vector clock.  But I don't see this proposal as
conflicting with any of that.  It's a set of trade-offs that provides
a simple solution for users who want to be able to talk directly to
any PostgreSQL standby server out of the box without pushing
everything through a central observer, and who want to be able to
enable this for existing applications without having to rewrite them
to insert complicated code to track and communicate LSNs.

Some assorted thoughts and things I'd love to hear your ideas on:

I admit that it has a potentially confusing relationship with
synchronous replication.  It is presented as a separate feature, and
you can use both features together or use them independently:
synchronous_standby_names and synchronous_commit are for controlling
your data loss risk, and causal_reads_standby_names and causal_reads
are for controlling distributed read consistency.  Perhaps the
causal_reads GUC should support different levels rather than using
on/off; the mode described above could be enabled with something =
'causal_read_lease', leaving room for other modes.  Maybe the whole
feature needs a better name: I borrowed "causal reads" from Galera's
wsrep_causal_reads/wsrep_sync_wait.  That system makes readers (think
standbys) wait for the global end of WAL to be applied locally at the
start of every transaction, which could also be a potential future
mode for us, but I thought it was much more interesting to have
wait-free reads on standbys, especially if you already happen to be
waiting on the primary because you want to avoid data loss with
syncrep.  To achieve that I added system-clock-based leases.  I
suspect some people will dislike that part: the guarantee includes the
caveat about the maximum difference between system clocks, and the
patch doesn't do anything as clever as Google's Spanner/Truetime
system or come with a free built-in atomic clock, so it relies on
setting the max clock skew conservatively and making sure you have NTP
set up correctly (for example, things reportedly got a bit messy for a
short time after the recent leap second if you happened to have only
one server from pool.ntp.org in your ntpd.conf and were unlucky).  I
considered ways to make causal reads an extension, but it'd need
fairly invasive hooks including the ability to change replication wire
protocol messages.

Long term, I think it would be pretty cool if we could develop a set
of features that give you distributed sequential consistency on top of
streaming replication.  Something like (this | causality-tokens) +
SERIALIZABLE-DEFERRABLE-on-standbys[3] +
distributed-dirty-read-prevention[4].

The patch:

The replay lag tracking patch this depends on is in the current
commitfest[1] and is presented as an independent useful feature.
Please find two patches to implement causal reads for the open CF
attached.  First apply replay-lag-v16.patch, then
refactor-syncrep-exit-v16.patch, then causal-reads-v16.patch.

Thanks for reading!

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru#0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru
[3]
https://www.postgresql.org/message-id/flat/CAEepm%3D2b9TV%2BvJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAEepm%3D1GNCriNvWhPkWCqrsbXWGtWEEpvA-KnovMbht5ryzbmg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From:
Ants Aasma
Date:
On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of
people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set
> of features that give you distributed sequential consistency on top of
> streaming replication.  Something like (this | causality-tokens) +
> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
> distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making
the transaction visible on the master? I'm asking because the per tx
variable wait time between logging commit record and making
transaction visible makes it really hard to provide matching
visibility order on master and standby. In CSN based snapshot
discussions we came to the conclusion that to make standby visibility
order match master while still allowing for async transactions to
become visible before they are durable we need to make the commit
sequence a vector clock and transmit extra visibility ordering
information to standbys. Having one more level of delay between wal
logging of commit and making it visible would make the problem even
worse.

One other thing that might be an issue for some users is that this
patch only ensures that clients observe forwards progress of database
state after a writing transaction. With two consecutive read only
transactions that go to different servers a client could still observe
database state going backwards. It seems that fixing that would
require either keeping some per client state or a global agreement on
what snapshots are safe to provide, both of which you tried to avoid
for this feature.

Regards,
Ants Aasma



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Long term, I think it would be pretty cool if we could develop a set
>> of features that give you distributed sequential consistency on top of
>> streaming replication.  Something like (this | causality-tokens) +
>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>> distributed-dirty-read-prevention[4].
>
> Is it necessary that causal writes wait for replication before making
> the transaction visible on the master? I'm asking because the per tx
> variable wait time between logging commit record and making
> transaction visible makes it really hard to provide matching
> visibility order on master and standby.

Yeah, that does seem problematic.  Even with async replication or no
replication, isn't there already a race in CommitTransaction() where
two backends could reach RecordTransactionCommit() in one order but
ProcArrayEndTransaction() in the other order?  AFAICS using
synchronous replication in one of the transactions just makes it more
likely you'll experience such a visibility difference between the DO
and REDO histories (!), by making RecordTransactionCommit() wait.
Nothing prevents you getting a snapshot that can see t2 but not t1 in
the DO history, while someone doing PITR or querying an asynchronous
standby gets a snapshot that can see t1 but not t2 because those
replay the REDO history.

> In CSN based snapshot
> discussions we came to the conclusion that to make standby visibility
> order match master while still allowing for async transactions to
> become visible before they are durable we need to make the commit
> sequence a vector clock and transmit extra visibility ordering
> information to standby's. Having one more level of delay between wal
> logging of commit and making it visible would make the problem even
> worse.

I'd like to read that... could you please point me at the right bit of
that discussion?

> One other thing that might be an issue for some users is that this
> patch only ensures that clients observe forwards progress of database
> state after a writing transaction. With two consecutive read only
> transactions that go to different servers a client could still observe
> database state going backwards.

True.  This patch is about "read your writes", not, erm, "read your
reads".  That may indeed be problematic for some users.  It's not a
very satisfying answer but I guess you could run a dummy write query
on the primary every time you switch between standbys, or before
telling any other client to run read-only queries after you have done
so, in order to convert your "r r" sequence into a "r w r" sequence...
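
For example (a hedged sketch only, with a placeholder "dummy" table), the
switch could look like this:

```
# Sketch of the "r w r" trick: before moving reads to a different standby,
# perform a throwaway write on the primary with causal_reads = on, so the
# next causal-reads query is ordered after everything already observed.
def switch_standby(primary_conn, new_standby_conn):
    with primary_conn.cursor() as cur:
        cur.execute("SET causal_reads = on")
        cur.execute("UPDATE dummy SET n = n + 1")   # any small write will do
    primary_conn.commit()
    return new_standby_conn                         # later reads go here
```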

> It seems that fixing that would
> require either keeping some per client state or a global agreement on
> what snapshots are safe to provide, both of which you tried to avoid
> for this feature.

Agreed.  You briefly mentioned this problem in the context of pairs of
read-only transactions a while ago[1].  As you said then, it does seem
plausible to do that with a token system that gives clients the last
commit LSN from the snapshot used by a read only query, so that you
can ask another standby to make sure that LSN has been replayed before
running another read-only transaction.  This could be handled
explicitly by a motivated client that is talking to multiple nodes.  A
more general problem is client A telling client B to go and run
queries and expecting B to see all transactions that A has seen; it
now has to pass the LSN along with that communication, or rely on some
kind of magic proxy that sees all transactions, or a radically
different system with a GTM.

[1] https://www.postgresql.org/message-id/CA%2BCSw_u4Vy5FSbjVc7qms6PuZL7QV90%2BonBEtK9PFqOsNj0Uhw@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Ants Aasma
Date:
On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> Long term, I think it would be pretty cool if we could develop a set
>>> of features that give you distributed sequential consistency on top of
>>> streaming replication.  Something like (this | causality-tokens) +
>>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>>> distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making
>> the transaction visible on the master? I'm asking because the per tx
>> variable wait time between logging commit record and making
>> transaction visible makes it really hard to provide matching
>> visibility order on master and standby.
>
> Yeah, that does seem problematic.  Even with async replication or no
> replication, isn't there already a race in CommitTransaction() where
> two backends could reach RecordTransactionCommit() in one order but
> ProcArrayEndTransaction() in the other order?  AFAICS using
> synchronous replication in one of the transactions just makes it more
> likely you'll experience such a visibility difference between the DO
> and REDO histories (!), by making RecordTransactionCommit() wait.
> Nothing prevents you getting a snapshot that can see t2 but not t1 in
> the DO history, while someone doing PITR or querying an asynchronous
> standby gets a snapshot that can see t1 but not t2 because those
> replay the REDO history.

Yes there is a race even with all transactions having the same
synchronization level. But nobody will mind if we some day fix that
race. :) With different synchronization levels it is much trickier to
fix, as either async commits must wait behind sync commits before
becoming visible, return without becoming visible, or visibility order
must differ from commit record LSN order. The first makes the async
commit feature useless, the second seems a no-go for semantic reasons,
and the third requires extra information sent to standbys so they know
the actual commit order.

>> In CSN based snapshot
>> discussions we came to the conclusion that to make standby visibility
>> order match master while still allowing for async transactions to
>> become visible before they are durable we need to make the commit
>> sequence a vector clock and transmit extra visibility ordering
>> information to standby's. Having one more level of delay between wal
>> logging of commit and making it visible would make the problem even
>> worse.
>
> I'd like to read that... could you please point me at the right bit of
> that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is
here: https://www.postgresql.org/message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would
>> require either keeping some per client state or a global agreement on
>> what snapshots are safe to provide, both of which you tried to avoid
>> for this feature.
>
> Agreed.  You briefly mentioned this problem in the context of pairs of
> read-only transactions a while ago[1].  As you said then, it does seem
> plausible to do that with a token system that gives clients the last
> commit LSN from the snapshot used by a read only query, so that you
> can ask another standby to make sure that LSN has been replayed before
> running another read-only transaction.  This could be handled
> explicitly by a motivated client that is talking to multiple nodes.  A
> more general problem is client A telling client B to go and run
> queries and expecting B to see all transactions that A has seen; it
> now has to pass the LSN along with that communication, or rely on some
> kind of magic proxy that sees all transactions, or a radically
> different system with a GTM.

If/when we do CSN based snapshots, adding a GTM could be relatively
straightforward. It's basically not all that far from what Spanner is
doing by using a timestamp as the snapshot. But this is all relatively
independent of this patch.

Regards,
Ants Aasma



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Fri, Jan 20, 2017 at 3:01 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> Yes there is a race even with all transactions having the same
> synchronization level. But nobody will mind if we some day fix that
> race. :)

We really should fix that!

> With different synchronization levels it is much trickier to
> fix as either async commits must wait behind sync commits before
> becoming visible, return without becoming visible or visibility order
> must differ from commit record LSN order. The first makes the async
> commit feature useless, second seems a no-go for semantic reasons,
> third requires extra information sent to standby's so they know the
> actual commit order.

Thought experiment:

1.  Log commit and make visible atomically (so DO and REDO agree on
visibility order).
2.  Introduce flag 'not yet visible' to commit record for sync rep commits.
3.  Introduce a new log record 'make all invisible commits up to LSN X
visible', which is inserted when enough sync rep standbys reply.  Log
this + make visible on primary atomically (again, so DO and REDO agree
on visibility order).
4.  Teach GetSnapshotData to deal with this using <insert magic here>.

Now standby and primary agree on visibility order of async and sync
transactions, and no standby will allow you to see transactions that
the primary doesn't yet consider to be durable (ie flushed on a quorum
of standbys etc).  But... sync rep has to flush xlog twice on primary,
and standby has to wait to make things visible, and remote_apply would
either need to be changed or supplemented with a new level
remote_apply_and_visible, and it's not obvious how to actually do
atomic visibility + logging (I heard ProcArrayLock is kinda hot...).
Hmm.  Doesn't sound too popular...

>>> In CSN based snapshot
>>> discussions we came to the conclusion that to make standby visibility
>>> order match master while still allowing for async transactions to
>>> become visible before they are durable we need to make the commit
>>> sequence a vector clock and transmit extra visibility ordering
>>> information to standby's. Having one more level of delay between wal
>>> logging of commit and making it visible would make the problem even
>>> worse.
>>
>> I'd like to read that... could you please point me at the right bit of
>> that discussion?
>
> Some of that discussion was face to face at pgconf.eu, some of it is
> here: https://www.postgresql.org/message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com
>
> Let me know if you have any questions.

Thanks!  That may take me some time...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Dmitry Dolgov
Date:
Hi

I'm wondering about the status of this patch and how I can try it out?

> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The replay lag tracking patch this depends on is in the current commitfest

I assume you're talking about this patch [1] (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?

> First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch, then
> causal-reads-v16.patch.

It would be nice to have all three of them attached (for some reason I see only
the last two of them in this thread). But anyway there are a lot of failed hunks
when I try to apply `replay-lag-v16.patch` and then `refactor-syncrep-exit-v16.patch`
and `causal-reads-v16.patch` (or the last two of them separately).

Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I'm wondering about status of this patch and how can I try it out?

Hi Dmitry, thanks for your interest.

>> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com>
>> wrote:
>> The replay lag tracking patch this depends on is in the current commitfest
>
> I assume you're talking about this patch [1] (at least it's the only thread
> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
> returned with feedback, so what's the status of this one (`causal reads`)?

Right, replay lag tracking was committed.  I'll post a rebased causal
reads patch later today.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Mon, May 22, 2017 at 6:32 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I'm wondering about status of this patch and how can I try it out?
>
> Hi Dmitry, thanks for your interest.
>
>>> On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com>
>>> wrote:
>>> The replay lag tracking patch this depends on is in the current commitfest
>>
>> I assume you're talking about this patch [1] (at least it's the only thread
>> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
>> returned with feedback, so what's the status of this one (`causal reads`)?
>
> Right, replay lag tracking was committed.  I'll post a rebased causal
> reads patch later today.

I ran into a problem while doing this, and it may take a couple more
days to fix it since I am at pgcon this week.  More soon.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> I'm wondering about status of this patch and how can I try it out?
>
> I ran into a problem while doing this, and it may take a couple more
> days to fix it since I am at pgcon this week.  More soon.

Apologies for the extended delay.  Here is the rebased patch, now with
a couple of improvements (see below).  To recap, this is the third
part of the original patch series[1], which had these components:

1.  synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2.  replication lag tracking, committed in PostgreSQL 10
3.  causal_reads, the remaining part, hereby proposed for PostgreSQL 11

The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they
can see all preceding write transactions or get an error.  It's
something like regular synchronous replication with synchronous_commit
= remote_apply, except that it limits the impact on the primary and
handles failure transitions with defined semantics.

The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2[2][3], whose
designer encouraged me to try to extend Postgres's streaming
replication to do something similar.  Read leases can also be found in
some consensus systems like Google Megastore, albeit in more ambitious
form IIUC.  The name is inspired by a MySQL Galera feature
(approximately the same feature but the approach is completely
different; Galera adds read latency, whereas this patch does not).
Maybe it needs a better name.

Is this a feature that people want to see in PostgreSQL?

IMPROVEMENTS IN V17

The GUC to enable the feature is now called
"causal_reads_max_replay_lag".  Standbys listed in
causal_reads_standby_names whose pg_stat_replication.replay_lag
doesn't exceed that time are "available" for causal reads and will be
waited for by the primary when committing.  When they exceed that
threshold they are briefly in "revoking" state and then "unavailable",
and when they return to an acceptable level they are briefly in
"joining" state before reaching "available".  CR states appear in
pg_stat_replication and transitions are logged at LOG level.
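
For example, something like this little loop (psycopg2 again; replay_lag is
the existing pg_stat_replication column, causal_reads_state is the one this
patch adds) lets you watch standbys change state while you generate load:

```
# Watch standbys move between causal reads states.
import time
import psycopg2

conn = psycopg2.connect("dbname=postgres port=5432")
conn.autocommit = True
while True:
    with conn.cursor() as cur:
        cur.execute("""SELECT application_name, replay_lag, causal_reads_state
                         FROM pg_stat_replication
                        ORDER BY application_name""")
        for name, lag, state in cur.fetchall():
            print(name, lag, state)
    print("----")
    time.sleep(1)
```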

A new GUC called "causal_reads_lease_time" controls the lifetime of
read leases sent from the primary to the standby.  This affects the
frequency of lease replacement messages, and more importantly affects
the worst case of commit stall that can be introduced if connectivity
to a standby is lost and we have to wait for the last sent lease to
expire.  In the previous version, one single GUC controlled both
maximum tolerated replay lag and lease lifetime, which was good from
the point of view that fewer GUCs are better, but bad because it had
to be set fairly high when doing both jobs to be conservative about
clock skew.  The lease lifetime must be at least 4 x maximum tolerable
clock skew.  After the recent botching of a leap-second transition on
a popular public NTP network (TL;DR OpenNTP is not a good choice of
implementation to add to a public time server pool) I came to the
conclusion that I wouldn't want to recommend a default max clock skew
under 1.25s, to allow for some servers to be confused about leap
seconds for a while or to be running different smearing algorithms.  A
reasonable causal_reads_lease_time recommendation for people who don't
know much about the quality of their time source might therefore be
5s.  I think it's reasonable to want to set the maximum tolerable
replay lag to a lower value than that, or in fact as low as you like,
depending on your workload and hardware.  Therefore I decided to split
the old "causal_reads_timeout" GUC into "causal_reads_max_replay_lag"
and "causal_reads_lease_time".

This new version introduces fast lease revocation.  Whenever the
primary decides that a standby is not keeping up, it kicks it out of
the set of CR-available standbys and revokes its lease, so that anyone
trying to run causal reads transactions there will start receiving a
new error.  In the previous version, it always did that by blocking
commits while waiting for the most recently sent lease to expire,
which I now call "slow revocation" because it could take several
seconds.  Now it blocks commits only until the standby acknowledges
that it is no longer available for causal reads OR the lease expires:
ideally that takes the time of a network round trip.  Slow
revocation is still needed in various failure cases such as lost
connectivity.

TESTING

Apply the patch after first applying a small bug fix for replication
lag tracking[4].  Then:

1.  Set up some streaming replicas.
2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
the primary's postgresql.conf.
3.  Set causal_reads = on in some transactions on various nodes.
4.  Try to break it!

As long as your system clocks don't disagree by more than 1.25s
(causal_reads_lease_time / 4), the causal reads guarantee will be
upheld: standbys will either see transactions that have completed on
the primary or raise an error to indicate that they are not available
for causal reads transactions.  You should not be able to break this
guarantee, no matter what you do: unplug the network, kill arbitrary
processes, etc.

If you mess with your system clocks so they differ by more than
causal_reads_lease_time / 4, you should see that a reasonable effort
is made to detect that so it's still very unlikely you can break it
(you'd need clocks to differ by more than causal_reads_lease_time / 4
but less than causal_reads_lease_time / 4 + network latency so that
the excessive skew is not detected, and then you'd need a very well
timed pair of transactions and loss of connectivity).

[1] https://www.postgresql.org/message-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2] https://github.com/bloomberg/comdb2
[3] http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf
[4] https://www.postgresql.org/message-id/CAEepm%3D3tJX_0kSeDi8OYTMp8NogrqPxgP1%2B2uzsdePz9i0-V0Q%40mail.gmail.com


Attachments

Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Apply the patch after first applying a small bug fix for replication
> lag tracking[4].  Then:

That bug fix was committed, so now causal-reads-v17.patch can be
applied directly on top of master.

> 1.  Set up some streaming replicas.
> 2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
> the primary's postgresql.conf.
> 3.  Set causal_reads = on in some transactions on various nodes.
> 4.  Try to break it!

Someone asked me off-list how to set this up quickly and easily for
testing.  Here is a shell script that will start up a primary server
(port 5432) and 3 replicas (ports 5441 to 5443).  Set the two paths at
the top of the file before running it.  Log in with psql postgres [-p
<port>], then SET causal_reads = on to test its effect.
causal_reads_max_replay_lag is set to 2s and depending on your
hardware you might find that stuff like CREATE TABLE big_table AS
SELECT generate_series(1, 10000000) or a large COPY data load causes
replicas to be kicked out of the set after a while; you can also pause
replay on the replicas with SELECT pg_wal_replay_pause() and
pg_wal_replay_resume(), kill -STOP/-CONT or -9 the walreceiver
processes to similar various failure modes, or run the replicas
remotely and unplug the network.  SELECT application_name, replay_lag,
causal_reads_state FROM pg_state_replication to see the current
situation, and also monitor the primary's LOG messages about
transitions.  You should find that the
"read-your-writes-or-fail-explicitly" guarantee is upheld, no matter
what you do, and furthermore than failing or lagging replicas don't
hold hold the primary up very long: in the worst case
causal_reads_lease_time for lost contact, and in the best case the
time to exchange a couple of messages with the standby to tell it its
lease is revoked and it should start raising an error.  You might find
test-causal-reads.c[1] useful for testing.

> Maybe it needs a better name.

Ok, how about this: the feature could be called "synchronous replay".
The new column in pg_stat_replication could be called sync_replay
(like the other sync_XXX columns).  The GUCs could be called
synchronous replay, synchronous_replay_max_lag and
synchronous_replay_lease_time.  The language in log messages could
refer to standbys "joining the synchronous replay set".

Restating the purpose of the feature with that terminology:  If
synchronous_replay is set to on, then you see the effects of all
synchronous_replay = on transactions that committed before your
transaction began, or an error is raised if that is not possible on
the current node.  This allows applications to direct read-only
queries to read-only replicas for load balancing without seeing stale
data.  Is that clearer?

Restating the relationship with synchronous replication with that
terminology: while synchronous_commit and synchronous_standby_names
are concerned with distributed durability, synchronous_replay is
concerned with distributed visibility.  While the former prevents
commits from returning if the configured level of durability isn't met
(for example "must be flushed on master + any 2 standbys"), the latter
will simply drop any standbys from the synchronous replay set if they
fail or lag more than synchronous_replay_max_lag.  It is reasonable to
want to use both features at once:  my policy on distributed
durability might be that I want all transactions to be flushed to disk
on master + any of three servers before I report information to users,
and my policy on distributed visibility might be that I want to be
able to run read-only queries on any of my six read-only replicas, but
don't want to wait for any that lag by more than 1 second.
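
To make that concrete, here is roughly what such a combined policy could look
like as ALTER SYSTEM commands.  synchronous_standby_names and
synchronous_commit already exist; the synchronous_replay_* settings are the
names proposed in this thread and s1/s2/s3 are placeholders, so treat this as
a sketch:

```
import psycopg2

conn = psycopg2.connect("dbname=postgres port=5432")
conn.autocommit = True            # ALTER SYSTEM can't run inside a transaction
with conn.cursor() as cur:
    # Durability: commits must be flushed on the primary plus any 1 of the
    # 3 named standbys before COMMIT returns.
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)'")
    cur.execute("ALTER SYSTEM SET synchronous_commit = 'on'")
    # Visibility: any standby may serve synchronous_replay transactions, but
    # is dropped from the set if it lags by more than 1 second.
    cur.execute("ALTER SYSTEM SET synchronous_replay_standby_names = '*'")
    cur.execute("ALTER SYSTEM SET synchronous_replay_max_lag = '1s'")
    cur.execute("SELECT pg_reload_conf()")
```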

Thoughts?

[1] https://www.postgresql.org/message-id/CAEepm%3D3NF%3D7eLkVR2fefVF9bg6RxpZXoQFmOP3RWE4r4iuO7vg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From:
Simon Riggs
Date:
On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

I'm very happy that you are addressing this topic.

I noticed you didn't put in links to my earlier doubts about this
specific scheme, though I can see doubts from myself and Heikki at
least in the URLs. I maintain those doubts as to whether this is the
right way forwards.

This patch presumes we will load balance writes to a master and reads
to a pool of standbys. How will we achieve that?

1. We decorate the application with additional info to indicate
routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing

The explicit premise of the patch is that neither of the above options
are practical, so I'm unclear how this makes sense. Is there some use
case that you have in mind that has not been fully described? If so,
lets get it on the table.

What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.

I'm especially uncomfortable with any approaches that treat all
sessions as one pool. For me, a server should support multiple pools.
Causality seems to be a property of a particular set of pools. e.g.
PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.

One technical problem was raised in the earlier thread by Ants Aasma
that I concede may be fatal to this design (see the note about
read-follows-read below), but I'm not sure.  All the other discussion
seemed to be about trade-offs between writer-waits and reader-waits
schemes, both of which I still view as reasonable options for an end
user to have in the toolbox.  Your initial reaction was:

> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.

I agree with you 100%, up to the comma.  The difficulty is identifying
which transactions are part of a write-read sequence.  An
application-managed LSN tracking system allows for waits to occur
strictly in reads that are part of a write-read sequence because the
application links them explicitly, and nobody is arguing that we
shouldn't support that for hard working expert users.  But to support
applications that don't deal with LSNs (or some kind of "causality
tokens") explicitly I think we'll finish up having to make readers
wait for incidental later transactions too, not just the write that
your read is dependent on, as I'll show below.  When you can't
identify write-read sequences perfectly, it comes down to a choice
between taxing writers or taxing readers, and I'm not sure that one
approach is universally better.  Let me summarise my understanding of
that trade-off.

I'm going to use this terminology:

synchronous replay = my proposal: ask the primary server to wait until
standbys have applied tx1, a bit like 9.6 synchronous_commit =
remote_apply, but with a system of leases for graceful failure.

causality tokens = the approach Heikki mentioned: provide a way for
tx1 to report its commit LSN to the client, then provide a way for a
client to wait for the LSN to be replayed before taking a snapshot for
tx2.

tx1, tx2 = a pair of transactions with a causal dependency; we want
tx2 to see tx1 because tx1 caused tx2 in some sense so must be seen to
precede it.

A poor man's causality token system can be cobbled together today with
pg_current_wal_lsn() and a polling loop that checks
pg_last_wal_replay_lsn().  It's a fairly obvious thing to want to do.
Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma,
Ivan Kartyshov and probably many others have discussed better ways to
do that[1], and a patch for the wait-for-LSN piece appeared in a
recent commitfest[2].  I reviewed Ivan's patch and voted -1 only
because it didn't work for higher isolation levels.  If he continues
to develop that I will be happy to review and help get it into
committable shape, and if he doesn't I may try to develop it myself.
In short, I think this is a good tool to have in the toolbox and
PostgreSQL 11 should have it!  But I don't think it necessarily
invalidates my synchronous replay proposal: they represent different
sets of trade-offs and might appeal to different users.  Here's why:

To actually use a causality token system you either need a carefully
designed application that keeps track of causal dependencies and
tokens, in which case the developer works harder but can benefit
from an asynchronous pipelining effect (by the time tx2 runs we hope
that tx1 has been applied, so neither transaction had to wait).  Let's
call that "application-managed causality tokens".  That's great for
those users -- let's make that possible -- but most users don't want
to write code like that.  So I see approximately three choices for
transparent middleware (or support built into standbys), which I'll
name and describe as follows:

1.  "Panoptic" middleware: Sees all queries so that it can observe
commit LSNs and inject wait directives into all following read-only
transactions.  Since all queries now need to pass through a single
process, you have a new bottleneck, an extra network hop, and a single
point of failure so you'll probably want a failover system with
split-brain defences.

2.  "Hybrid" middleware: The middleware (or standby itself) listens to
the replication stream so that it knows about commit LSNs (rather than
spying on commits).  The primary server waits until all connected
middleware instances acknowledge commit LSNs before it releases
commits, and then the middleware inserts wait-for-LSN directives into
read-only transactions.  Now there is no single point problem, but
writers are impacted too.  I mention this design because I believe
this is conceptually similar to how Galera wsrep_sync_wait (AKA
wsrep_causal_reads) works.  (I call this "hybrid" because it splits
the waiting between tx1 and tx2.  Since it's synchronous, dealing with
failure gracefully is tricky, probably needing a lease system like SR.
I acknowledge that comparisons between our streaming replication and
Galera are slightly bogus because Galera is a synchronous multi-master
system.)

3.  "Forward-only" middleware (insert better name): The middleware (or
standby itself) asks the primary server for the latest committed LSN
at the start of every transaction, and then tells the standby to wait
for that LSN to be applied.

There are probably some other schemes involving communicating
middleware instances, but I don't think they'll be better in ways that
matter -- am I wrong?

Here's a trade-off table:
                                         SR      AT     PT     HT     FT
tx1 commit waits?                        yes     no     no     yes    no
tx2 snapshot waits?                      no      yes    yes    yes    yes
tx2 waits for incidental transactions?   no      no     yes    yes    yes
tx2 has round-trip to primary?           no      no     no     no     yes
can tx2's wait be pipelined?                     yes    no*    no*    no*

SR = synchronous replay
AT = application-managed causality tokens
PT = panoptic middleware-managed causality tokens
HT = hybrid middleware-managed or standby-managed causality tokens
FT = forward-only middleware-managed causality tokens


*Note that only synchronous replay and application-managed causality
tokens track the actual causal dependency tx2->tx1.  SR does it by
making tx1 wait for replay so that tx2 doesn't have to wait at all and
AT does it by making tx2 wait specifically for tx1 to be applied.  PT,
HT and FT don't actually know anything about tx1, so they make every
read query wait until *all known transactions* are applied
("incidental transactions" above), throwing away the pipelining
benefits of causality token systems (hence "no*" above).  I haven't
used it myself but I have heard that that is why read latency is a
problem on Galera with causal reads mode enabled: due to lack of
better information you have to wait for the replication system to
drain its current apply queue before every query is processed, even if
tx1 in your causally dependent transaction pair was already visible on
the current node.

So far I think that SR and AT are sweet spots.  AT for people who are
prepared to juggle causality tokens in their applications and SR for
people who want to remain oblivious to all this stuff and who can
tolerate a reduction in single-client write TPS.  I also think AT's
pipelining advantage over SR and SR's single-client TPS impact are
diminished if you also choose to enable syncrep for durability, which
isn't a crazy thing to want to do if you're doing anything important
with your data.  The other models where all readers wait for
incidental transactions don't seem terribly attractive to me,
especially if the motivating premise of load balancing with read-only
replicas is (to steal a line) "ye [readers] are many, they are few".

One significant blow my proposal received in the last thread was a
comment from Ants about read-follows-read[3].  What do you think?  I
suspect the same problem applies to causality token based systems as
discussed so far (except perhaps FT, the slowest and probably least
acceptable to anyone).  On the other hand, I think it's at least
possible to fix that problem with causality tokens.  You'd have to
expose and capture the last-commit-LSN for every snapshot used in
every single read query, and wait for it at the moment every following
read query takes a new snapshot.  This would make AT even harder work
for developers, make PT even slower, and make HT unworkable (it only
knows about commits, not reads).  I also suspect that a sizeable class
of applications cares a lot less about read-follows-read than
read-follows-write, but I could be wrong about that.

> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> lets get it on the table.

I don't think that pgpool routing is impractical, just that it's not a
great place to put transparent causality token tracking for the
reasons I've explained above -- you'll introduce a ton of latency for
all readers because you can't tell which earlier transactions they
might be causally dependent on.  I think it's also nice to be able to
support the in-process connection pooling and routing that many
application developers use to avoid extra hops, so it'd be nice to
avoid making pooling/routing/proxy servers strictly necessary.

> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.

Here are some ways you could set a system up:

1.  Use middleware like pgpool or pgbouncer-rr to route queries
automatically; this is probably limited to single-statement queries,
since multi-statement queries can't be judged by their first statement
alone.  (Those types of systems could be taught to understand a
request for a connection with causal reads enabled, and look at the
current set of usable standbys by looking at the pg_stat_replication
table.)

2.  Use the connection pooling inside your application server or
application framework/library: for example Hibernate[4], Django[5] and
many other libraries offer ways to configure multiple database
connection pools and route queries appropriately at a fairly high
level.  Such systems could probably be improved to handle 'synchronous
replay not available' errors by throwing away the connection and
retrying automatically on another connection, much as they do for
serialization failures and deadlocks.

3.  Modify your application to deal with separate connection pools
directly wherever it runs database transactions.

Perhaps I'm not thinking big enough: I tried to come up with an
incremental improvement to PostgreSQL that would fix a problem that I
know people have with their current hot standby deployment.  I
deliberately avoided proposing radical architectural projects such as
moving cluster management, discovery, proxying, pooling and routing
responsibilities into PostgreSQL.  Perhaps those working on GTM type
systems which effectively present a seamless single system find this
whole discussion to be aiming too low and dealing with the wrong
problems.

> I'm especially uncomfortable with any approaches that treat all
> sessions as one pool. For me, a server should support multiple pools.
> Causality seems to be a property of a particular set of pools. e.g.
> PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
> yet PoolS2 does not provide causal reads against PoolM1 orPoolM2.

Interesting, but I don't immediately see any fundamental difficulty
for any of the designs discussed.  For example, maybe tx1 should be
able to set synchronous_replay = <group name>, rather than just 'on',
to refer to a group of standbys defined in some GUC.

Just by the way, while looking for references I found
PinningMasterSlaveRouter which provides a cute example of demand for
causal reads (however implemented) in the Django community:

https://github.com/jbalogh/django-multidb-router

It usually sends read-only transactions to standbys, but keeps your
web session temporarily pinned to the primary database to give you 15
seconds' worth of read-your-writes after each write transaction.

[1] https://www.postgresql.org/message-id/flat/53E2D346.9030806%402ndquadrant.com
[2] https://www.postgresql.org/message-id/flat/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
[3] https://www.postgresql.org/message-id/CAEepm%3D15WC7A9Zdj2Qbw3CUDXWHe69d%3DnBpf%2BjXui7OYXXq11w%40mail.gmail.com
[4] https://stackoverflow.com/questions/25911359/read-write-splitting-hibernate
[5] https://github.com/yandex/django_replicated (for example; several
similar extensions exist)

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Sat, Jun 24, 2017 at 4:05 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Maybe it needs a better name.
>
> Ok, how about this: the feature could be called "synchronous replay".
> The new column in pg_stat_replication could be called sync_replay
> (like the other sync_XXX columns).  The GUCs could be called
> synchronous replay, synchronous_replay_max_lag and
> synchronous_replay_lease_time.  The language in log messages could
> refer to standbys "joining the synchronous replay set".

Feature hereby renamed that way.  It seems a lot more
self-explanatory.  Please see attached.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Tue, Jun 27, 2017 at 12:20 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> What I think we need is a joined up plan for load balancing, so that
>> we can understand how it will work. i.e. explain the whole use case
>> and how the solution works.

Here's a proof-of-concept hack of the sort of routing and retry logic
that I think should be feasible with various modern application stacks
(given the right extensions):

https://github.com/macdice/py-pgsync/blob/master/DemoSyncPool.py
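
For anyone who doesn't want to click through, here is an even more compressed
sketch of the same idea in Python/psycopg2.  The DSNs are placeholders, and
since the SQLSTATE of the "not available" error is whatever the patch defines,
this just treats any database error as a reason to try another node:

```
# Run read-only work on a standby with synchronous_replay = on; if the
# standby can't currently honour that (or fails outright), avoid it for a
# while and retry the whole transaction on another node.
import random
import time
import psycopg2

STANDBYS = ["port=5441", "port=5442", "port=5443"]
penalty_box = {}                       # dsn -> time of last failure

def run_readonly(tx_body, retries=3):
    for _ in range(retries):
        candidates = [d for d in STANDBYS
                      if time.time() - penalty_box.get(d, 0) > 5.0]
        dsn = random.choice(candidates or STANDBYS)
        try:
            conn = psycopg2.connect("dbname=postgres " + dsn)
            try:
                with conn.cursor() as cur:
                    cur.execute("SET synchronous_replay = on")
                    result = tx_body(cur)
                conn.commit()
                return result
            finally:
                conn.close()
        except psycopg2.Error:
            penalty_box[dsn] = time.time()    # avoid this node for a while
    raise RuntimeError("no standby available for synchronous replay")

def count_tables(cur):
    cur.execute("SELECT count(*) FROM pg_class")
    return cur.fetchone()[0]

print(run_readonly(count_tables))
```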

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Dmitry Dolgov
Date:
> On 23 June 2017 at 13:48, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> Apologies for the extended delay.  Here is the rebased patch, now with a
> couple of improvements (see below).

Thank you. I started to play with it a little bit, since I think it's an
interesting idea. And there are already a few notes:

* I don't see a CF item for that, where is it?

* Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:

```
+#causal_reads_standy_names = '*' # standby servers that can potentially become
+ # available for causal reads; '*' = all
+
```

it should be `causal_reads_standby_names`. Also I hope that in the near future I
can provide a full review.

Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Thu, Jul 13, 2017 at 2:51 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Thank you. I started to play with it a little bit, since I think it's an
> interesting idea. And there are already few notes:

Thanks Dmitry.

> * I don't see a CF item for that, where is it?

https://commitfest.postgresql.org/14/951/

The latest version of the patch is here:

https://www.postgresql.org/message-id/CAEepm%3D0YigNQczAF-%3Dx_SxT6cJv77Yb0EO%2BcAFnqRyVu4%2BbKFw%40mail.gmail.com

I renamed it to "synchronous replay", because "causal reads" seemed a
bit too arcane.

> * Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
>
> ```
> +#causal_reads_standy_names = '*' # standby servers that can potentially
> become
> + # available for causal reads; '*' = all
> +
> ```
>
> it should be `causal_reads_standby_names`.

Fixed in latest version (while renaming).

> Also I hope in the nearest future
> I
> can provide a full review.

Great news, thanks!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Thomas Munro
Date:
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> Here is a new version of my "causal reads" patch (see the earlier
>> thread from the 9.6 development cycle[1]), which provides a way to
>> avoid stale reads when load balancing with streaming replication.
>
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.
>
> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> lets get it on the table.
>
> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.

Simon,

Here's a simple proof-of-concept Java web service using Spring Boot
that demonstrates how load balancing could be done with this patch.
It show two different techniques for routing: an "adaptive" one that
learns which transactional methods need to run on the primary server
by intercepting errors, and a "declarative" one that respects Spring's
@Transactional(readOnly=true) annotations (inspired by the way people
use MySQL Connector/J with Spring to do load balancing).  Whole
transactions are automatically retried at the service request level on
transient failures using existing techniques (Spring Retry, as used
for handling deadlocks and serialisation failures etc), and the
"TransactionRouter" avoids servers that have recently raised the
"synchronous_replay not available" error.  Aside from the optional
annotations, the application code in KeyValueController.java is
unaware of any of this.

https://github.com/macdice/syncreplay-spring-demo

I suspect you could find ways to do similar things with basically any
application development stack that supports some kind of container
managed transactions.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From:
Dmitry Dolgov
Date:
> On 12 July 2017 at 23:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I renamed it to "synchronous replay", because "causal reads" seemed a bit too
> arcane.

Hi

I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
tests. I didn't manage to break anything, except for one mysterious error that I
got only once on one of my replicas and haven't been able to reproduce yet.
The interesting thing is that this error did not affect the other replica or the
primary. Just in case, here is the log for this error (maybe you can see
something obvious that I've not noticed):

```
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  directories for tablespace 47733 could not be removed
HINT:  You can remove the directories manually if necessary.
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
FATAL:  could not create directory "pg_tblspc/47734/PG_10_201707211/47732": File exists
CONTEXT:  WAL redo at 0/125F5768 for Storage/CREATE: pg_tblspc/47734/PG_10_201707211/47732/47736
LOG:  startup process (PID 8034) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down
```

And speaking about the code, so far I have just a few notes (some of them
merely questions):

* In general the idea behind this patch sounds interesting to me, but it
  relies heavily on time synchronization. As mentioned in the documentation:
  "Current hardware clocks, NTP implementations and public time servers are
  unlikely to allow the system clocks to differ more than tens or hundreds of
  milliseconds, and systems synchronized with dedicated local time servers may
  be considerably more accurate." But as far as I remember from my own
  experience, it may not be that trivial on something like AWS because of
  virtualization. Maybe it's an unreasonable fear, but is it possible to
  address this problem somehow?

* Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
  time shift when we're dealing with leases). Does it make sense to move them
  out and make them configurable?

* Judging from the `SyncReplayPotentialStandby` function, it's possible to have
  `synchronous_replay_standby_names = "*, server_name"`, which is basically
  equivalent to just `*`, but it looks confusing. Is it worth preventing
  this behaviour?

* In the same function `SyncReplayPotentialStandby` there is this code:

```
if (!SplitIdentifierString(rawstring, ',', &elemlist))
{
	/* syntax error in list */
	pfree(rawstring);
	list_free(elemlist);
	/* GUC machinery will have already complained - no need to do again */
	return false;
}
```

  Am I right that ideally this (a situation where `synchronous_replay_standby_names`
  has an incorrect value at this point in the code) should not happen, because the
  GUC machinery will prevent it? If so, it still looks to me like it makes sense to
  put a log message here, just to give more information in a potentially weird
  situation.

* In the function `SyncRepReleaseWaiters` there is a comment:

```
	/*
	 * If the number of sync standbys is less than requested or we aren't
	 * managing a sync standby or a standby in synchronous replay state that
	 * blocks then just leave.
	 */
	if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
```

  Is this comment correct? If I understand everything right, `!got_recptr` means
  the number of sync standbys is less than requested (a), `!am_sync` means we
  aren't managing a sync standby (b), and `walsender_sr_blocker` means a standby
  in synchronous replay state that blocks (c). It looks like the condition is
  `(a or b) and not c`.

* In the function `ProcessStandbyReplyMessage` there is code that implements
  this:

```
	 * If the standby reports that it has fully replayed the WAL for at least
	 * 10 seconds, then let's clear the lag times that were measured when it
	 * last wrote/flushed/applied a WAL record.  This way we avoid displaying
	 * stale lag data until more WAL traffic arrives.
```
  but I never found any mention of these 10 seconds in the documentation. Is it
  not that important? Also, my second question above is related to this one.

* In the function `WalSndGetSyncReplayStateString` all the states are in lower
  case except `UNKNOWN`; is there any particular reason for that?

There are also a few more, less important notes (mostly about some typos and a
few confusing names), but I'm going to do another round of review and testing
anyway, so I'll just send them all next time.

Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
> tests. I didn't manage to break anything, except one mysterious error that I've
> got only once on one of my replicas, but I couldn't reproduce it yet.
> Interesting thing is that this error did not affect another replica or primary.
> Just in case here is the log for this error (maybe you can see something
> obvious, that I've not noticed):
>
> ```
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  directories for tablespace 47733 could not be removed
> HINT:  You can remove the directories manually if necessary.
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> FATAL:  could not create directory "pg_tblspc/47734/PG_10_201707211/47732": File exists
> CONTEXT:  WAL redo at 0/125F5768 for Storage/CREATE: pg_tblspc/47734/PG_10_201707211/47732/47736
> LOG:  startup process (PID 8034) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> ```

Hmm.  The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends, leaked files
from earlier crashes or copying a pgdata directory over the top of an
existing one as a way to set it up, leaving behind some files from an
earlier test?  The second error ("could not create directory") is a
bit stranger though... I think this must come from
TablespaceCreateDbspace(): it must have stat()'d the file and got
ENOENT, decided to create the directory, acquired
TablespaceCreateLock, stat()'d the file again and found it still
absent, then ran mkdir() on the parents and got EEXIST, and finally on
the directory to be created, and surprisingly got EEXIST.  That means
that someone must have concurrently created the directory.  Perhaps in
your testing you accidentally copied a pgdata directory over the top
of it while it was running?  In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.

> And speaking about the code, so far I have just a few notes (some of them
> merely questions):
>
> * In general the idea behind this patch sounds interesting for me, but it
>   relies heavily on time synchronization. As mentioned in the documentation:
>   "Current hardware clocks, NTP implementations and public time servers are
>   unlikely to allow the system clocks to differ more than tens or hundreds of
>   milliseconds, and systems synchronized with dedicated local time servers may
>   be considerably more accurate." But as far as I remember from my own
>   experience sometimes it maybe not that trivial on something like AWS because
>   of virtualization. Maybe it's an unreasonable fear, but is it possible to
>   address this problem somehow?

Oops, I had managed to lose an important hunk that deals with
detecting excessive clock drift (i.e. badly configured servers) while
rebasing a couple of versions back.  Here is a version to put it back.

With that change, if you disable NTP and manually set your standby's
clock to be more than 1.25s (assuming synchronous_replay_lease_time is
set to the default of 5s) behind the primary's, synchronous_replay
should become unavailable and you should see this error in the
standby's log:

            ereport(LOG,
                    (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
                     errhint("Check your servers' NTP configuration or equivalent.")));

One way to test this without messing with your NTP setting or
involving two different computers is to modify this code temporarily
in WalSndKeepalive:

    now = GetCurrentTimestamp() + 1250100;

This is a best effort intended to detect a system not running ntpd at
all or talking to an insane time server.  Fundamentally this proposal
is based on the assumption that you can get your system clocks into
sync within a tolerance that we feel confident estimating an upper
bound for.

It does appear that some Amazon OS images come with NTP disabled;
that's a problem if you want to use this feature, but if you're
running a virtual server without an ntpd you'll pretty soon drift
seconds to minutes off UTC time and get "unavailable for synchronous
replay" errors from this patch (and possibly the LOG message above,
depending on the direction of drift).

> * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
>   time shift when we're dealing with leases). Does it make sense to move them
>   out and make them configurable?

These numbers are interrelated, and I think they're best fixed in that
ratio.  You could make it more adjustable, but I think it's better to
keep it simple with just a single knob.  Let me restate that logic to
explain how I came up with those ratios.  There are two goals:

1.  The primary needs to be able to wait for a lease to expire if
connectivity is lost, even if the standby's clock is behind the
primary's clock by max_clock_skew.  (Failure to do so could allow
stale query results from a zombie standby that is somehow still
handling queries after connectivity with the primary is lost.)

2.  The primary needs to be able to replace leases often enough so
that there are no gaps between them, even if the standby's clock is
ahead of the primary's clock by max_clock_skew.  (Failure to replace
leases fast enough could cause spurious "unavailable" errors, but not
incorrect query results.  Technically it's max_clock_skew minus network
latency, since it takes time for new leases to arrive, but that's a
minor detail.)

A solution that maximises tolerable clock skew and as an added bonus
doesn't require the standby to have access to the primary's GUCs is to
tell the standby that the expiry time is 25% earlier than the time the
primary will really wait until, and replace leases when they still
have 50% of their time to go.

To illustrate using fixed-width ASCII-art, here's how the primary
perceives the stream of leases it sends.  In this diagram, '+' marks
halfway, '!' marks the time the primary will send to the standby as
the expiry time, and the final '|' marks the time the primary will
really wait until if it has to, just in case the standby's clock is
'slow' (behind).  The '!' is 25% earlier, and represents the maximum
tolerable clock skew.

  |---+-!-|
      |---+-!-|
          |---+-!-|

Here's how a standby with a clock that is 'slow' (behind) by
max_clock_skew = 25% perceives this stream of leases:

  |-------!
      |-------!
          |-------!

You can see that the primary server is able to wait just long enough
for the lease to expire and the error to begin to be raised on this
standby server, if it needs to.

Here's how a standby with a clock that is 'fast' (ahead) by
max_clock_skew = 25% perceives this stream of leases:

  |---!
      |---!
          |---!

If it's ahead by more than that, we'll get gaps where the error may be
raised spuriously in between leases.
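
To restate the same thing numerically, here is a tiny standalone sketch
(illustration only, not code from the patch; the variable names are
invented) that derives the advertised expiry, skew tolerance and
replacement interval from the single lease-time knob, using the default
of 5s:

```
#include <stdio.h>

int
main(void)
{
	/* synchronous_replay_lease_time, in seconds (default assumed here) */
	double		lease_time = 5.0;

	/* 25% of the lease is the tolerable clock skew */
	double		max_clock_skew = lease_time * 0.25;

	/* the expiry advertised to the standby is 25% earlier than the real one */
	double		advertised_expiry = lease_time - max_clock_skew;

	/* leases are replaced while they still have 50% of their time to go */
	double		replace_interval = lease_time * 0.50;

	printf("primary really waits up to:    %.2fs\n", lease_time);
	printf("expiry advertised to standby:  %.2fs\n", advertised_expiry);
	printf("tolerable clock skew:          %.2fs\n", max_clock_skew);
	printf("lease replacement interval:    %.2fs\n", replace_interval);
	return 0;
}
```

With the default 5s lease that gives 1.25s of tolerable skew, an
advertised expiry 3.75s after issue and a new lease every 2.5s, which is
where the 1.25s figure above comes from; turning the one knob scales all
of these together.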


> * Judging from the `SyncReplayPotentialStandby` function, it's possible to have
>   `synchronous_replay_standby_names = "*, server_name"`, which is basically an
>   equivalent for just `*`, but it looks confusing. Is it worth it to prevent
>   this behaviour?

Hmm.  Seems harmless to me!

> * In the same function `SyncReplayPotentialStandby` there is this code:
>
> ```
> if (!SplitIdentifierString(rawstring, ',', &elemlist))
> {
> 	/* syntax error in list */
> 	pfree(rawstring);
> 	list_free(elemlist);
> 	/* GUC machinery will have already complained - no need to do again */
> 	return false;
> }
> ```
>
>   Am I right that ideally this (a situation when at this point in the code
>   `synchronous_replay_standby_names` has incorrect value) should not happen,
>   because GUC will prevent us from that? If yes, then it looks for me that it
>   still makes sense to put here a log message, just to give more information in
>   a potentially weird situation.

Yes.  That's exactly the coding that was used for synchronous_commit,
before it was upgraded to support a new fancy syntax.  I was trying to
do things the established way.

> * In the function `SyncRepReleaseWaiters` there is a commentary:
>
> ```
>   /*
>    * If the number of sync standbys is less than requested or we aren't
>    * managing a sync standby or a standby in synchronous replay state that
>    * blocks then just leave.
>    */
>   if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
> ```
>
>   Is this commentary correct? If I understand everything right `!got_recptr` -
>   the number of sync standbys is less than requested (a), `!am_sync` - we aren't
>   managing a sync standby (b), `walsender_sr_blocker` - a standby in synchronous
>   replay state that blocks (c). Looks like condition is `(a or b) and not c`.

This code is trying to decide whether to leave early, rather than
potentially blocking.  The change in my patch is:

-       if (!got_recptr || !am_sync)
+       if ((!got_recptr || !am_sync) && !walsender_sr_blocker)

The old coding said "if there aren't enough sync commit standbys, or
I'm not a sync standby, then I can leave now".  The coding with my
patch is the same, except that in any case it won't leave early if this
walsender is managing a standby that potentially blocks commit.  That
said, it's a terribly named and documented function argument, so I
have fixed that in the attached version; I hope that's better!

To put it another way, with this patch there are two different reasons
a transaction might need to wait: because of synchronous_commit and
because of synchronous_replay.  They're both forms of 'synchronous
replication' I suppose.  That if statement is saying 'if I don't need
to wait for synchronous_commit, and I don't need to wait for
synchronous_replay, then we can return early'.
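
Just to spell that out, here is how that guard could be commented.  This
is only a re-annotated sketch of the hunk quoted above, reusing its
variable names; the renamed argument in the attached version may read
differently:

```
	/*
	 * Leave early only if: (a) we have no safe LSN for the requested number
	 * of synchronous standbys (!got_recptr), or (b) this walsender is not
	 * managing a synchronous-commit standby (!am_sync) -- and, in addition,
	 * this walsender is not managing a standby whose synchronous replay
	 * lease could be blocking committers (!walsender_sr_blocker).  With
	 * c = walsender_sr_blocker, that is (a || b) && !c, matching the
	 * reading above.
	 */
	if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
		return;
```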

> * In the function `ProcessStandbyReplyMessage` there is a code that implements
>   this:
>
> ```
>      * If the standby reports that it has fully replayed the WAL for at least
>      * 10 seconds, then let's clear the lag times that were measured when it
>      * last wrote/flushed/applied a WAL record.  This way we avoid displaying
>      * stale lag data until more WAL traffic arrives.
> ```
>   but I never found any mention of this 10 seconds in the documentation. Is it
>   not that important? Also, question 2 is related to this one.

Hmm.  Yeah that does seem a bit arbitrary.  The documentation in
master does already say that it's cleared, without saying exactly
when:

   [...]  If the standby
   server has entirely caught up with the sending server and there is no more
   WAL activity, the most recently measured lag times will continue to be
   displayed for a short time and then show NULL.

The v1 patch changed it from being based on
wal_receiver_status_interval (sort of implicitly) to being 10 seconds,
and yeah, that is not a good change.  In this v2 it's using
wal_receiver_status_interval (though it has to do it explicitly, since
this patch increases the amount of chit-chat between the servers due
to lease replacement).

> * In the function `WalSndGetSyncReplayStateString` all the states are in
> lower
>   case except `UNKNOWN`, is there any particular reason for that?

This state should never exist, so no user will ever see it; I used
"UNKNOWN" following the convention established by the function
WalSndGetStateString() immediately above.

> There are also few more not that important notes (mostly about some typos and
> few confusing names), but I'm going to do another round of review and testing
> anyway so I'll just send them all next time.

Thanks for the review!

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Mon, Jul 31, 2017 at 5:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>  Here is a version to put it back.

Rebased after conflicting commit 030273b7.  Now using format-patch
with a commit message to keep track of review/discussion history.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Thu, Aug 10, 2017 at 2:02 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Rebased after conflicting commit 030273b7.  Now using format-patch
> with a commit message to keep track of review/discussion history.

TAP test 006_logical_decoding.pl failed with that version.  I had
missed some places that know how to decode the wire protocol messages I
modified.  Fixed in the attached version.

It might be a good idea to consolidate the message encoding/decoding
logic into reusable routines, independently of this work.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachments

Re: [HACKERS] Causal reads take II

From
Dmitry Dolgov
Date:
> On 31 July 2017 at 07:49, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
>> tests. I didn't manage to break anything, except one mysterious error that I've
>> got only once on one of my replicas, but I couldn't reproduce it yet.
>> Interesting thing is that this error did not affect another replica or primary.
>> Just in case here is the log for this error (maybe you can see something
>> obvious, that I've not noticed):
>>
>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>> Directory not empty
>> ...
>
> Hmm.  The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running?  In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.

Hmm... no, I don't think so. Basically what I was doing was just running
`installcheck` against a primary instance (I assume there is nothing wrong with
this approach, am I right?). This particular error was caused by the
`tablespace` test, which failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
```

I tried a few more times, and I got it twice in four attempts on a fresh
installation (when all instances were on the same machine). But anyway I'll try
to investigate; maybe it has something to do with my environment.

> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> >   time shift when we're dealing with leases). Does it make sense to move
> >   them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio.  You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.

Ok, but what do you think about converting them to constants to make them more
self-explanatory? Something like:

```
/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew.  We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```

Also I have another question. I tried to test this patch a little bit more, and
I got some strange behaviour after running pgbench (here is the full output [1]):

```
# primary

$ ./bin/pgbench -s 100 -i test

NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG:  revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG:  standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG:  revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG:  standby "walreceiver" is no longer available for synchronous replay
```

```
# replica

2017-09-30 15:47:51.802 CEST [6034] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG:  started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```

Is it something well known or unrelated to the patch itself?

Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Sun, Oct 1, 2017 at 9:05 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>>> Directory not empty
>>> ...
>>
>> Hmm.  The first error ("could not remove directory") could perhaps be
>> explained by temporary files from concurrent backends.
>> ...
>> Perhaps in your testing you accidentally copied a pgdata directory over
>> the
>> top of it while it was running?  In any case I'm struggling to see how
>> anything in this patch would affect anything at the REDO level.
>
> Hmm...no, I don't think so. Basically what I was doing is just running
> `installcheck` against a primary instance (I assume there is nothing wrong
> with
> this approach, am I right?). This particular error was caused by
> `tablespace`
> test which was failed in this case:
>
> ```
> INSERT INTO testschema.foo VALUES(1);
> ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390":
> No such file or directory
> ```
>
> I tried few more times, and I've got it two times from four attempts on a
> fresh
> installation (when all instances were on the same machine). But anyway I'll
> try
> to investigate, maybe it has something to do with my environment.
>
> ...
>
> 2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number 0000 in log
> segment 000000010000000000000020, offset 10092544

Hi Dmitry,

Thanks for testing.  Yeah, it looks like the patch may be corrupting
the WAL stream in some case that I didn't hit in my own testing
procedure.  I will try to reproduce these failures.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> I tried few more times, and I've got it two times from four attempts on a
>> fresh
>> installation (when all instances were on the same machine). But anyway I'll
>> try
>> to investigate, maybe it has something to do with my environment.
>>
>> ...
>>
>> 2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number 0000 in log
>> segment 000000010000000000000020, offset 10092544
>
> Hi Dmitry,
>
> Thanks for testing.  Yeah, it looks like the patch may be corrupting
> the WAL stream in some case that I didn't hit in my own testing
> procedure.  I will try to reproduce these failures.

Hi Dmitry,

I managed to reproduce something like this on one of my home lab
machines running a different OS.  Not sure why yet, and it doesn't
happen on my primary development box, which is why I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Causal reads take II

From
Michael Paquier
Date:
On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I managed to reproduce something like this on one of my home lab
> machines running a different OS.  Not sure why yet and it doesn't
> happen on my primary development box which is how I hadn't noticed it.
> I will investigate and aim to get a fix posted in time for the
> Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
> couple of weeks to discuss this proposal...

This leads me to think that "returned with feedback" is appropriate for now,
so I have marked it that way. Feel free to correct this if you think that's
not appropriate, of course.
-- 
Michael


Re: [HACKERS] Causal reads take II

From
Thomas Munro
Date:
On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I managed to reproduce something like this on one of my home lab
>> machines running a different OS.  Not sure why yet and it doesn't
>> happen on my primary development box which is how I hadn't noticed it.
>> I will investigate and aim to get a fix posted in time for the
>> Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
>> couple of weeks to discuss this proposal...
>
> This leads me to think that returned with feedback is adapted for now.
> So done this way. Feel free to correct things if you think this is not
> adapted of course.

Thanks.  I'll be back.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Causal reads take II

From
Michael Paquier
Date:
On Wed, Nov 29, 2017 at 11:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> I managed to reproduce something like this on one of my home lab
>>> machines running a different OS.  Not sure why yet and it doesn't
>>> happen on my primary development box which is how I hadn't noticed it.
>>> I will investigate and aim to get a fix posted in time for the
>>> Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
>>> couple of weeks to discuss this proposal...
>>
>> This leads me to think that returned with feedback is adapted for now.
>> So done this way. Feel free to correct things if you think this is not
>> adapted of course.
>
> Thanks.  I'll be back.

Please do not target Sarah Connor then.
(Sorry, too many patches.)
-- 
Michael