On 2017-02-25 00:40, Petr Jelinek wrote:
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

Here are some results. There is improvement although it's not an
unqualified success.
Several repeat runs of pgbench_derail2.sh, each with a different number
of clients, yielded one output file per client setting.
Those show that logrep is now pretty stable when there is only 1 client
(pgbench -c 1). But it starts making mistakes with 4, 8, 16 clients.
I'll just show a grep of the output files; I think it is
self-explanatory:
Output-files (lines counted with grep | sort | uniq -c):
-- out_20170225_0129.txt
    250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
    250 -- All is well.
-- out_20170225_0654.txt
     25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
     24 -- All is well.
      1 -- Not good, but breakingout of wait (waited more than 60s)
-- out_20170225_0711.txt
     25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
     23 -- All is well.
      2 -- Not good, but breakingout of wait (waited more than 60s)
-- out_20170225_0803.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     11 -- All is well.
     14 -- Not good, but breakingout of wait (waited more than 60s)
So, that says:
1 client: 250x success, zero fail (250 not a typo, ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail
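
For anyone who prefers rates to raw counts, here is a small sketch that
turns the numbers above into failure percentages. The counts are copied
from the summary; the names runs and failure_rate are made up for this
illustration and are not part of pgbench_derail2.sh.

```python
# Counts copied from the summary above: (clients, successes, failures).
runs = [(1, 250, 0), (4, 24, 1), (8, 23, 2), (16, 11, 14)]


def failure_rate(successes, failures):
    """Return the fraction of runs that failed (hypothetical helper)."""
    total = successes + failures
    return failures / total


for clients, ok, bad in runs:
    rate = failure_rate(ok, bad)
    print(f"{clients:>2} clients: {bad}/{ok + bad} failed ({rate:.0%})")
```

The sample sizes differ a lot (250 runs for 1 client, 25 for the others),
so the percentages are only a rough comparison.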
I want to repeat what I said a few emails back: problems seem to
disappear when a short wait state is introduced (directly after the
'alter subscription sub1 enable' line) to give the logrep machinery time
to 'settle'. It makes one think of a timing error somewhere (now don't
ask me where..).
To show that, here is pgbench_derail2.sh output from a run that waited
10 seconds (INIT_WAIT in the script) as such a 'settle' period; it works
faultlessly (with 16 clients):
-- out_20170225_0852.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     25 -- All is well.
QED.
(By the way, no hung sessions so far, so that's good.)
thanks
Erik Rijkers