Обсуждение: Unexpected trouble from pg_basebackup
I recently updated my systems from pg 9.1.8 to 9.5.3. A pg_dumpall was used to migrate the data. Now I'm trying to re-establish replication between master and slave. I'm getting stuck.
--
When I run pg_basebackup (via a script which worked flawlessly on 9.1.8, AND via command line, ala "manual mode"), it creates the PGDATA directory structure, some files, but soon gets stuck. It writes fewer than 40 MB of a 20GB database. Further, using tcpdump to watch packets on the wire, I confirm no data is going across.
However, it's clear a connection has been made and the process was started. On the master, I see the process table:
postgres: wal sender process pgsync A.B.C.D(42821) sending backup
Meanwhile, on the client, I see no output updates. Here's the command:
pg_basebackup -x -P -v -D $PGDATA -w
PGUSER, PGPASSWORD and PGHOST are set accordingly. The User is a user with the REPLICATION attribute. Just in case, I dropped and re-created the user.
So that's question A.
Question B is related. In attempting to verify that the permissions and HBA were set correctly, I attempted to do a pg_dump using the same PGUSER and PGHOST. What I got surprised me:
$ pg_dump -s onyxeditor
pg_dump: [archiver (db)] query failed: ERROR: permission denied for relation licence
pg_dump: [archiver (db)] query was: LOCK TABLE public.licence IN ACCESS SHARE MODE
This is the replication user. Other schemas worked just fine, but not this one. Is this expected behavior?
Kind regards,
Otheus
After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?
On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com> wrote:
After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?
Sounds to me like it's doing a CHECKPOINT with spreading, which make it take time. Try with "-c fast" and see if the problem goes away.
On Wed, Oct 5, 2016 at 3:55 PM, otheus uibk <otheus.uibk@gmail.com> wrote:
pg_basebackup will always send in a checkpoint. PostgreSQL will time that to take approximately checkpoint_timeout * checkpoint_completion_target time. So with checkpoint_timeout set to a very high value, it will by default target something like 30 minutes before it even gets started. The only reason it takes as *little* as it does is that your system is lightly loaded.
-- On Tue, Oct 4, 2016 at 10:49 PM, Magnus Hagander <magnus@hagander.net> wrote:On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com> wrote:After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?
Sounds to me like it's doing a CHECKPOINT with spreading, which make it take time. Try with "-c fast" and see if the problem goes away.Maybe not too far off.2016-10-04 17:43:40.620 GMT 57eb90a0.6e07 402 00000 LOG: checkpoint complete: wrote 12799 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=1389.348 s, sync=0.033 s, total=1389.400 s; sync files=240, longest=0.003 s, average=0.000 s; distance=92915 kB, estimate=129373 kB2016-10-04 18:20:31.714 GMT 57eb90a0.6e07 403 00000 LOG: checkpoint starting: time2016-10-04 18:39:19.870 GMT 57eb90a0.6e07 404 00000 LOG: checkpoint complete: wrote 10265 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=1128.118 s, sync=0.023 s, total=1128.155 s; sync files=190, longest=0.002 s, average=0.000 s; distance=73419 kB, estimate=123778 kB2016-10-04 19:07:12.647 GMT [unknown] pgsync 57f3fde0.52e3 2 00000 LOG: replication connection authorized: user=pgsync2016-10-04 19:07:12.703 GMT 57eb90a0.6e07 405 00000 LOG: checkpoint starting: force wait2016-10-04 19:20:32.879 GMT [unknown] pgsync 57f40100.5891 2 00000 LOG: replication connection authorized: user=pgsync2016-10-04 19:23:05.249 GMT 57eb90a0.6e07 406 00000 LOG: checkpoint complete: wrote 8638 buffers (0.7%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=952.514 s, sync=0.016 s, total=952.546 s; sync files=238, longest=0.002 s, average=0.000 s; distance=68257 kB, estimate=118226 kB2016-10-04 19:23:05.249 GMT 57eb90a0.6e07 407 00000 LOG: checkpoint starting: force wait2016-10-04 19:28:52.232 GMT 57eb90a0.6e07 408 00000 LOG: checkpoint complete: wrote 3102 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=346.957 s, sync=0.018 s, total=346.982 s; sync files=149, longest=0.002 s, average=0.000 s; distance=36016 kB, estimate=110005 kB2016-10-04 19:30:31.922 GMT [unknown] pgsync 57f40357.5c70 2 00000 LOG: replication connection authorized: user=pgsync2016-10-04 19:30:31.932 GMT 57eb90a0.6e07 409 00000 LOG: checkpoint starting: force wait2016-10-04 19:33:40.857 GMT 57eb90a0.6e07 410 00000 LOG: checkpoint complete: wrote 1763 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=188.886 s, sync=0.030 s, total=188.924 s; sync files=125, longest=0.004 s, average=0.000 s; distance=13135 kB, estimate=100318 kBOK, so what was happening is that the checkpoints were taking 5 to 15 minutes minutes, and aborted basebackups were triggering new checkpoints which waited on the previous ones.Is it possible the new environment is not high-performance enough??possibly relevant configuration change:checkpoint_timeout=1h
pg_basebackup will always send in a checkpoint. PostgreSQL will time that to take approximately checkpoint_timeout * checkpoint_completion_target time. So with checkpoint_timeout set to a very high value, it will by default target something like 30 minutes before it even gets started. The only reason it takes as *little* as it does is that your system is lightly loaded.
You can do a fast checkpoint with the parameter I suggested before, or you can tune your checkpoint_timeout to be something that gives you a more reasonable time.