Thread: [HACKERS] Slow synchronous logical replication


[HACKERS] Slow synchronous logical replication

From:
konstantin knizhnik
Date:
In our sharded cluster project we are trying to use logical replication to provide HA (maintaining redundant shard copies).
Using asynchronous logical replication does not make much sense in the context of HA. This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy level 1 (just one copy), the cluster is 20 times slower than without logical replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the synchronous replication mechanism was originally developed for streaming replication, where all replicas have the same content and LSNs. When it is used for logical replication, it behaves very inefficiently. Commit has to wait for confirmations from all receivers mentioned in the "synchronous_standby_names" list. So we are waiting not only for our own single logical replication standby, but for all other standbys as well. The number of synchronous standbys is equal to the number of shards divided by the number of nodes. To provide uniform distribution, the number of shards should be much greater than the number of nodes; for example, for 10 nodes we usually create 100 shards. As a result we get awful performance, and blocking of any replication channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical replication cannot be used efficiently in such a manner.
If so, the next question is how difficult it would be to make the synchronous replication mechanism more efficient for logical replication, and whether there are plans to work in this direction.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slow synchronous logical replication

From:
Andres Freund
Date:
Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
> In our sharded cluster project we are trying to use logical replication to provide
> HA (maintaining redundant shard copies).
> Using asynchronous logical replication does not make much sense in the context of
> HA. This is why we try to use synchronous logical replication.
>
> Unfortunately it shows very bad performance. With 50 shards and redundancy level 1
> (just one copy), the cluster is 20 times slower than without logical replication.
>
> With asynchronous replication it is "only" two times slower.
>
> As far as I understand, the reason for such bad performance is that the synchronous
> replication mechanism was originally developed for streaming replication, where all
> replicas have the same content and LSNs. When it is used for logical replication,
> it behaves very inefficiently. Commit has to wait for confirmations from all
> receivers mentioned in the "synchronous_standby_names" list. So we are waiting not
> only for our own single logical replication standby, but for all other standbys as
> well. The number of synchronous standbys is equal to the number of shards divided
> by the number of nodes. To provide uniform distribution, the number of shards
> should be much greater than the number of nodes; for example, for 10 nodes we
> usually create 100 shards. As a result we get awful performance, and blocking of
> any replication channel blocks all backends.
>
> So my question is whether my understanding is correct and synchronous logical
> replication cannot be used efficiently in such a manner.
>
> If so, the next question is how difficult it would be to make the synchronous
> replication mechanism more efficient for logical replication, and whether there
> are plans to work in this direction.

This seems to be a question that is a) about a commercial project we
don't know much about, and b) hasn't received a lot of investigation.

Greetings,

Andres Freund



Re: [HACKERS] Slow synchronous logical replication

From:
Konstantin Knizhnik
Date:
On 10/07/2017 10:42 PM, Andres Freund wrote:
> Hi,
>
> On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
>> In our sharded cluster project we are trying to use logical replication to provide
>> HA (maintaining redundant shard copies).
>> Using asynchronous logical replication does not make much sense in the context of
>> HA. This is why we try to use synchronous logical replication.
>>
>> Unfortunately it shows very bad performance. With 50 shards and redundancy level 1
>> (just one copy), the cluster is 20 times slower than without logical replication.
>>
>> With asynchronous replication it is "only" two times slower.
>>
>> As far as I understand, the reason for such bad performance is that the synchronous
>> replication mechanism was originally developed for streaming replication, where all
>> replicas have the same content and LSNs. When it is used for logical replication,
>> it behaves very inefficiently. Commit has to wait for confirmations from all
>> receivers mentioned in the "synchronous_standby_names" list. So we are waiting not
>> only for our own single logical replication standby, but for all other standbys as
>> well. The number of synchronous standbys is equal to the number of shards divided
>> by the number of nodes. To provide uniform distribution, the number of shards
>> should be much greater than the number of nodes; for example, for 10 nodes we
>> usually create 100 shards. As a result we get awful performance, and blocking of
>> any replication channel blocks all backends.
>>
>> So my question is whether my understanding is correct and synchronous logical
>> replication cannot be used efficiently in such a manner.
>>
>> If so, the next question is how difficult it would be to make the synchronous
>> replication mechanism more efficient for logical replication, and whether there
>> are plans to work in this direction.
>
> This seems to be a question that is a) about a commercial project we
> don't know much about b) hasn't received a lot of investigation.
>
Sorry if I was not clear.
The question was about the logical replication mechanism in the mainstream version of Postgres.
I think that most people are using asynchronous logical replication, and synchronous LR is something exotic that is not well tested and investigated.
It will be great if I am wrong :)

Concerning our sharded cluster (pg_shardman): it is not a commercial product yet, it is in the development phase.
We are going to open its sources when it is more or less stable.
But unlike multimaster, this sharded cluster is mostly built from existing components: pg_pathman + postgres_fdw + logical replication.
So we are just trying to combine them all into an integrated system.
But currently the most obscure point is logical replication.

And the main goal of my e-mail was to learn the opinion of the authors and users of LR on whether it is a good idea to use LR to provide fault tolerance in a sharded cluster,
or whether some other approach, for example sharding with redundancy or streaming replication, is preferable.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Slow synchronous logical replication

From:
Craig Ringer
Date:
On 8 October 2017 at 03:58, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

> The question was about logical replication mechanism in mainstream version
> of Postgres.

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

> I think that most people are using asynchronous logical replication, and
> synchronous LR is something exotic that is not well tested and investigated.
> It will be great if I am wrong :)

I doubt it's widely used. That said, a lot of people use synchronous
replication with BDR and pglogical, which are ancestors of the core
logical rep code and design.

I think you actually need to collect some proper timings and
diagnostics here, rather than hand-waving about it being "slow". A
good starting point might be setting some custom 'perf' tracepoints,
or adding some 'elog()'ing for timestamps. Then scrape the results and
build a latency graph.

That said, if I had to guess why it's slow, I'd say that you're facing
a number of factors:

* By default, logical replication in PostgreSQL does not do an
immediate flush to disk after downstream commit. In the interests of
faster apply performance it instead delays sending flush confirmations
until the next time WAL is flushed out. See the docs for CREATE
SUBSCRIPTION, notably the synchronous_commit option. This will
obviously greatly increase latencies on sync commit.

* Logical decoding doesn't *start* streaming a transaction until the
origin node finishes the xact and writes a COMMIT, then the xlogreader
picks it up.

* As a consequence of the above, a big xact holds up commit
confirmations of smaller ones by a LOT more than is the case for
streaming physical replication.
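As a sketch of the first point (object names here are illustrative, not from the thread), the subscription-side option from the CREATE SUBSCRIPTION docs can be set like this:

```sql
-- Per the CREATE SUBSCRIPTION docs, the synchronous_commit subscription
-- parameter defaults to 'off', so flush confirmations are delayed until
-- the next WAL flush on the subscriber.  Setting it to 'on' makes the
-- apply worker flush eagerly, trading apply speed for sync-rep latency.
CREATE SUBSCRIPTION shard1_sub
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION shard1_pub
    WITH (copy_data = false, synchronous_commit = 'on');
```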

Hopefully that gives you something to look into, anyway. Maybe you'll
be inspired to work on parallelized logical decoding :)

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Slow synchronous logical replication

From:
Konstantin Knizhnik
Date:
Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:
> I think it'd be helpful if you provided reproduction instructions,
> test programs, etc, making it very clear when things are / aren't
> related to your changes.

It will not be so easy to provide a reproducing scenario, because 
it actually involves many components (postgres_fdw, pg_pathman, 
pg_shardman, LR, ...) and requires a multinode installation.
But let me try to explain what is going on:
We have implemented sharding: splitting data between several remote 
tables using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes inserts 
or updates of some derived partitions, which are forwarded by 
postgres_fdw to the corresponding node.
The number of shards is significantly larger than the number of nodes, 
i.e. for 5 nodes we have 50 shards, which means that at each node we 
have 10 shards.
To provide fault tolerance, each shard is replicated using logical 
replication to one or more nodes. Right now we considered only 
redundancy level 1: each shard has only one replica.
So from each node we establish 10 logical replication channels.

We want commit to wait until data is actually stored at all replicas, 
so we are using synchronous replication: we set the synchronous_commit 
option to "on" and include all ten subscriptions in the 
synchronous_standby_names list.
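As a sketch (the subscription names here are illustrative), the origin-node settings look roughly like:

```
# postgresql.conf on the origin node.  Note that with the plain
# comma-list syntax only the first listed standby is waited for,
# so FIRST 10 is needed to wait for all ten subscriptions.
synchronous_commit = on
synchronous_standby_names = 'FIRST 10 (sub_shard_1, sub_shard_2, sub_shard_3, sub_shard_4, sub_shard_5, sub_shard_6, sub_shard_7, sub_shard_8, sub_shard_9, sub_shard_10)'
```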

In this setup commit latency is very large (about 100 msec, and most of 
the time is actually spent in commit) and performance is very bad: 
pgbench shows about 300 TPS for the optimal number of clients (about 
10; for larger numbers performance is almost the same). Without logical 
replication on the same setup we get about 6000 TPS.

I have checked the syncrepl.c file, particularly the 
SyncRepGetSyncRecPtr function. Each WAL sender independently calculates 
the minimal LSN among all synchronous replicas and wakes up backends 
waiting for this LSN. It means that a transaction performing an update 
of data in one shard will actually wait for confirmation from the 
replication channels for all shards.
If some shard is updated more rarely than others, or is not updated at 
all (for example because the communication channel to this node is 
broken), then all backends will be stuck.
Also, all backends compete for the single SyncRepLock, which can also 
be a contention point.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Slow synchronous logical replication

From:
Masahiko Sawada
Date:
On Mon, Oct 9, 2017 at 4:37 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Thank you for explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>>
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
>
> It will not be so easy to provide a reproducing scenario, because
> it actually involves many components (postgres_fdw, pg_pathman,
> pg_shardman, LR, ...) and requires a multinode installation.
> But let me try to explain what is going on:
> We have implemented sharding: splitting data between several remote tables
> using pg_pathman and postgres_fdw.
> It means that an insert or update of the parent table causes inserts or
> updates of some derived partitions, which are forwarded by postgres_fdw to
> the corresponding node.
> The number of shards is significantly larger than the number of nodes, i.e. for 5
> nodes we have 50 shards, which means that at each node we have 10 shards.
> To provide fault tolerance, each shard is replicated using logical
> replication to one or more nodes. Right now we considered only redundancy
> level 1: each shard has only one replica.
> So from each node we establish 10 logical replication channels.
>
> We want commit to wait until data is actually stored at all replicas, so we
> are using synchronous replication: we set the synchronous_commit option to
> "on" and include all ten subscriptions in the synchronous_standby_names list.
>
> In this setup commit latency is very large (about 100 msec, and most of the
> time is actually spent in commit) and performance is very bad: pgbench
> shows about 300 TPS for the optimal number of clients (about 10; for larger
> numbers performance is almost the same). Without logical replication on the
> same setup we get about 6000 TPS.
>
> I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
> function. Each WAL sender independently calculates the minimal LSN among all
> synchronous replicas and wakes up backends waiting for this LSN. It means
> that a transaction performing an update of data in one shard will actually
> wait for confirmation from the replication channels for all shards.
> If some shard is updated more rarely than others, or is not updated at all
> (for example because the communication channel to this node is broken), then
> all backends will be stuck.
> Also, all backends compete for the single SyncRepLock, which can also be a
> contention point.
>

IIUC, I guess you meant to say that in the current synchronous logical
replication a transaction has to wait for updated table data to be
replicated even to servers that don't subscribe to the table. If we
change it so that a transaction needs to wait only for the servers that
are subscribing to the table, it would be more efficient, at least for
your use case.
We send at least the begin and commit data to all subscriptions and
then wait for the reply from them, but can we skip waiting for them,
for example, when the walsender actually didn't send any data modified
by the transaction?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Slow synchronous logical replication

From:
Craig Ringer
Date:
On 9 October 2017 at 15:37, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Thank you for explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>>
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
>
> It will not be so easy to provide a reproducing scenario, because
> it actually involves many components (postgres_fdw, pg_pathman,
> pg_shardman, LR, ...)

So simplify it to a test case that doesn't.

> I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
> function. Each WAL sender independently calculates the minimal LSN among all
> synchronous replicas and wakes up backends waiting for this LSN. It means
> that a transaction performing an update of data in one shard will actually
> wait for confirmation from the replication channels for all shards.

That's expected for the current sync rep design, yes. Because it's
based on lsn, and was designed for physical rep where there's no
question about whether we're sending some data to some peers and not
others.

So all backends will wait for the slowest-responding peer, including
peers that don't need to actually do anything for this xact. You could
possibly hack around that by having the output plugin advance the slot
position when it sees that it just processed an empty xact.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Slow synchronous logical replication

From:
Andres Freund
Date:
Hi,

On 2017-10-09 10:37:01 +0300, Konstantin Knizhnik wrote:
> We have implemented sharding: splitting data between several remote tables
> using pg_pathman and postgres_fdw.
> It means that an insert or update of the parent table causes inserts or
> updates of some derived partitions, which are forwarded by postgres_fdw to
> the corresponding node.
> The number of shards is significantly larger than the number of nodes, i.e. for 5
> nodes we have 50 shards, which means that at each node we have 10 shards.
> To provide fault tolerance, each shard is replicated using logical
> replication to one or more nodes. Right now we considered only redundancy
> level 1: each shard has only one replica.
> So from each node we establish 10 logical replication channels.

Isn't that part of the pretty fundamental problem? There shouldn't be 10
different replication channels per node. There should be one.

Greetings,

Andres Freund



Re: [HACKERS] Slow synchronous logical replication

From:
Konstantin Knizhnik
Date:

On 11.10.2017 10:07, Craig Ringer wrote:
> On 9 October 2017 at 15:37, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Thank you for explanations.
>>
>> On 08.10.2017 16:00, Craig Ringer wrote:
>>> I think it'd be helpful if you provided reproduction instructions,
>>> test programs, etc, making it very clear when things are / aren't
>>> related to your changes.
>>
>> It will not be so easy to provide a reproducing scenario, because
>> it actually involves many components (postgres_fdw, pg_pathman,
>> pg_shardman, LR, ...)
> So simplify it to a test case that doesn't.
The simplest reproducing scenario is the following:
1. Start two Postgres instances: synchronous_commit=on, fsync=off
2. Initialize the pgbench database at both instances: pgbench -i
3. Create a publication for the pgbench_accounts table at one node
4. Create a corresponding subscription at the other node with the 
copy_data=false parameter
5. Add the subscription to synchronous_standby_names at the first node.
6. Start pgbench -c 8 -N -T 100 -P 1 at the first node. On my systems 
the results are the following:
    standalone postgres:       8600 TPS
    asynchronous replication:  6600 TPS
    synchronous replication:   5600 TPS
    Quite good results.
7. Create some dummy table and perform a bulk insert into it:
    create table dummy(x integer primary key);
    insert into dummy values (generate_series(1,10000000));

pgbench almost gets stuck: until the end of the insert, performance 
drops almost to zero.
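The steps above can be sketched as follows (ports and object names are illustrative; both instances also need wal_level=logical):

```shell
# Node A (publisher, port 5432) and node B (subscriber, port 5433),
# both started with synchronous_commit=on, fsync=off, wal_level=logical.
pgbench -i -p 5432 postgres
pgbench -i -p 5433 postgres

psql -p 5432 -c "CREATE PUBLICATION pub FOR TABLE pgbench_accounts"
psql -p 5433 -c "CREATE SUBSCRIPTION sub
    CONNECTION 'host=localhost port=5432 dbname=postgres'
    PUBLICATION pub WITH (copy_data = false)"
# On node A set synchronous_standby_names = 'sub' and reload.

pgbench -c 8 -N -T 100 -P 1 -p 5432 postgres &

# The unrelated bulk insert that makes pgbench stall:
psql -p 5432 -c "CREATE TABLE dummy(x integer primary key)"
psql -p 5432 -c "INSERT INTO dummy VALUES (generate_series(1,10000000))"
```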

The reason for such behavior is obvious: the WAL sender has to decode 
the huge transaction generated by the insert, although it has no 
relation to this publication.
Filtering of the insert records of this transaction is done only 
inside the output plug-in.
Unfortunately it is not quite clear how to make the WAL sender smarter 
and let it skip transactions not affecting its publication.
One possible solution is to let the backend inform the WAL sender 
about the smallest LSN it should wait for (the backend knows which 
table is affected by the current operation, hence which publications 
are interested in this operation, and so can point the WAL sender to 
the proper LSN without decoding a huge part of the WAL).
But it seems to be not so easy to implement.



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Slow synchronous logical replication

From:
Craig Ringer
Date:
On 12 October 2017 at 00:57, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

> The reason for such behavior is obvious: the WAL sender has to decode the
> huge transaction generated by the insert, although it has no relation to
> this publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction? Because
that's a known pathological case for logical replication, it has to do
a LOT of extra work when it's in a transaction that has done DDL. I'm
sure there's room for optimisation there, but the general
recommendation for now is "don't do that".

> Filtering of the insert records of this transaction is done only inside
> the output plug-in.

Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.
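For reference, the origin filter Craig mentions is the filter_by_origin_cb output-plugin callback; a minimal sketch modeled on test_decoding (the function name and skip policy here are mine, not from the thread) might look like:

```c
/* Sketch of an output-plugin origin filter, modeled on test_decoding.
 * Returning true tells logical decoding to skip the whole transaction.
 * As noted above, this filters by replication origin only -- not by table. */
#include "postgres.h"
#include "replication/logical.h"
#include "replication/origin.h"

static bool
my_filter_by_origin(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
    /* skip changes that were themselves replicated from another node */
    return origin_id != InvalidRepOriginId;
}

/* registered in _PG_output_plugin_init():
 *     cb->filter_by_origin_cb = my_filter_by_origin;
 */
```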

I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.

> Unfortunately it is not quite clear how to make the WAL sender smarter and
> let it skip transactions not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.

It's not so simple by table since we don't know early enough whether
the xact affects tables of interest or not. But you could definitely
do some selective skipping. Making it efficient could be the
challenge.

> One possible solution is to let the backend inform the WAL sender about the
> smallest LSN it should wait for (the backend knows which table is affected
> by the current operation, hence which publications are interested in this
> operation, and so can point the WAL sender to the proper LSN without
> decoding a huge part of the WAL).
> But it seems to be not so easy to implement.

Sounds like confusing layering violations to me.


-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Slow synchronous logical replication

From:
Konstantin Knizhnik
Date:


On 12.10.2017 04:23, Craig Ringer wrote:
> On 12 October 2017 at 00:57, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> The reason for such behavior is obvious: the WAL sender has to decode the
>> huge transaction generated by the insert, although it has no relation to
>> this publication.
> It does. Though I wouldn't expect anywhere near the kind of drop you
> report, and haven't observed it here.
>
> Is the CREATE TABLE and INSERT done in the same transaction?

No. The table was created in a separate transaction.
Moreover, the same effect takes place if the table is created before the start of replication.
The problem in this case seems to be caused by spilling the decoded transaction to a file in ReorderBufferSerializeTXN.
Please look at two profiles:
http://garret.ru/lr1.svg corresponds to normal work of pgbench with synchronous replication to one replica,
http://garret.ru/lr2.svg - the same with concurrent execution of a huge insert statement.

And here is the output of pgbench (the insert is started at the fifth second):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat 1111.578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be the effect of large transactions.
The presence of several channels of synchronous logical replication reduces performance, but not so much.
Below are results on another machine with pgbench scale 10.

Configuration              TPS
standalone                 15k
1 async logical replica    13k
1 sync logical replica     10k
3 async logical replicas   13k
3 sync logical replicas     8k



> Only partly true. The output plugin can register a transaction origin
> filter and use that to say it's entirely uninterested in a
> transaction. But this only works based on filtering by origins. Not
> tables.

Yes, I know about the origin filtering mechanism (and we are using it in multimaster).
But I am speaking about the standard pgoutput.c output plugin; its pgoutput_origin_filter
always returns false.



> I imagine we could call another hook in output plugins, "do you care
> about this table", and use it to skip some more work for tuples that
> particular decoding session isn't interested in. Skip adding them to
> the reorder buffer, etc. No such hook currently exists, but it'd be an
> interesting patch for Pg11 if you feel like working on it.
>
>> Unfortunately it is not quite clear how to make the WAL sender smarter and
>> let it skip transactions not affecting its publication.
> As noted, it already can do so by origin. Mostly. We cannot totally
> skip over WAL, since we need to process various invalidations etc. See
> ReorderBufferSkip.
The problem is that before the end of a transaction we do not know whether it touches this publication or not.
So filtering by origin will not work in this case.

I am really not sure that it is possible to skip over WAL. But the particular problem with invalidation records etc. can be solved by always processing those records in the WAL sender.
I.e. if a backend is inserting an invalidation record, or some other record which should always be processed by the WAL sender, it can always promote the LSN of this record to the WAL sender.
So the WAL sender will skip only those WAL records which are safe to skip (insert/update/delete records not affecting this publication).

I wonder if there can be some other problems with skipping part of a transaction in the WAL sender.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Slow synchronous logical replication

From:
Craig Ringer
Date:
On 12 October 2017 at 16:09, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
>

> Is the CREATE TABLE and INSERT done in the same transaction?
>
> No. The table was created in a separate transaction.
> Moreover, the same effect takes place if the table is created before the
> start of replication.
> The problem in this case seems to be caused by spilling the decoded
> transaction to a file in ReorderBufferSerializeTXN.

Yeah. That's known to perform sub-optimally, and it also uses way more
memory than it should.

Your design compounds that by spilling transactions it will then
discard, and doing so multiple times.

To make your design viable you likely need some kind of cache of
serialized reorder buffer transactions, where you don't rebuild one if
it's already been generated. And likely a fair bit of optimisation on
the serialisation.

Or you might want a table- and even a row-filter that can be run
during decoding, before appending to the ReorderBuffer, to let you
skip changes early. Right now this can only be done at the transaction
level, based on replication origin. Of course, if you do this you
can't do the caching thing.

> Unfortunately it is not quite clear how to make wal-sender smarter and let
> him skip transaction not affecting its publication.

You'd need more hooks to be implemented by the output plugin.

> I am really not sure that it is possible to skip over WAL. But the
> particular problem with invalidation records etc. can be solved by always
> processing those records in the WAL sender.
> I.e. if a backend is inserting an invalidation record, or some other record
> which should always be processed by the WAL sender, it can always promote
> the LSN of this record to the WAL sender.
> So the WAL sender will skip only those WAL records which are safe to skip
> (insert/update/delete records not affecting this publication).

That sounds like a giant layering violation too.

I suggest focusing on reducing the amount of work done when reading
WAL, not trying to jump over whole ranges of WAL.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

