Обсуждение: Unexpected trouble from pg_basebackup

Поиск
Список
Период
Сортировка

Unexpected trouble from pg_basebackup

От
otheus uibk
Дата:
I recently updated my systems from pg 9.1.8 to 9.5.3. A pg_dumpall was used to migrate the data. Now I'm trying to re-establish replication between master and slave. I'm getting stuck.

When I run pg_basebackup (via a script which worked flawlessly on 9.1.8, AND via command line, ala "manual mode"), it creates the PGDATA directory structure, some files, but soon gets stuck. It writes fewer than 40 MB of a 20GB database. Further, using tcpdump to watch packets on the wire, I confirm no data is going across. 

However, it's clear a connection has been made and the process was started. On the master, I see the process table:

    postgres: wal sender process pgsync A.B.C.D(42821) sending backup

Meanwhile, on the client, I see no output updates. Here's the command:

    pg_basebackup -x -P -v -D $PGDATA -w

PGUSER, PGPASSWORD and PGHOST are set accordingly. The User is a user with the REPLICATION attribute. Just in case, I dropped and re-created the user.

So that's question A.

Question B is related. In attempting to verify that the permissions and HBA were set correctly, I attempted to do a pg_dump using the same PGUSER and PGHOST. What I got surprised me:

  $ pg_dump -s onyxeditor
  pg_dump: [archiver (db)] query failed: ERROR:  permission denied for relation licence
  pg_dump: [archiver (db)] query was: LOCK TABLE public.licence IN ACCESS SHARE MODE

This is the replication user. Other schemas worked just fine, but not this one. Is this expected behavior?

 Kind regards,
 Otheus

--

Re: Unexpected trouble from pg_basebackup

От
otheus uibk
Дата:
After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?

Re: Unexpected trouble from pg_basebackup

От
Magnus Hagander
Дата:


On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com> wrote:
After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?


Sounds to me like it's doing a CHECKPOINT with spreading, which make it take time. Try with "-c fast" and see if the problem goes away.

--

Re: Unexpected trouble from pg_basebackup

От
Magnus Hagander
Дата:
On Wed, Oct 5, 2016 at 3:55 PM, otheus uibk <otheus.uibk@gmail.com> wrote:


On Tue, Oct 4, 2016 at 10:49 PM, Magnus Hagander <magnus@hagander.net> wrote:


On Tue, Oct 4, 2016 at 10:42 PM, otheus uibk <otheus.uibk@gmail.com> wrote:
After a 3 to 4 minute delay, pg_basebackup started doing it's thing and finished within a few minutes. So now the question is: why the startup delay?


Sounds to me like it's doing a CHECKPOINT with spreading, which make it take time. Try with "-c fast" and see if the problem goes away.

Maybe not too far off. 


2016-10-04 17:43:40.620 GMT                             57eb90a0.6e07   402     00000   LOG:  checkpoint complete: wrote 12799 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=1389.348 s, sync=0.033 s, total=1389.400 s; sync files=240, longest=0.003 s, average=0.000 s; distance=92915 kB, estimate=129373 kB
2016-10-04 18:20:31.714 GMT                             57eb90a0.6e07   403     00000   LOG:  checkpoint starting: time
2016-10-04 18:39:19.870 GMT                             57eb90a0.6e07   404     00000   LOG:  checkpoint complete: wrote 10265 buffers (0.8%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=1128.118 s, sync=0.023 s, total=1128.155 s; sync files=190, longest=0.002 s, average=0.000 s; distance=73419 kB, estimate=123778 kB
2016-10-04 19:07:12.647 GMT     [unknown]       pgsync  57f3fde0.52e3   2       00000   LOG:  replication connection authorized: user=pgsync
2016-10-04 19:07:12.703 GMT                             57eb90a0.6e07   405     00000   LOG:  checkpoint starting: force wait
2016-10-04 19:20:32.879 GMT     [unknown]       pgsync  57f40100.5891   2       00000   LOG:  replication connection authorized: user=pgsync
2016-10-04 19:23:05.249 GMT                             57eb90a0.6e07   406     00000   LOG:  checkpoint complete: wrote 8638 buffers (0.7%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=952.514 s, sync=0.016 s, total=952.546 s; sync files=238, longest=0.002 s, average=0.000 s; distance=68257 kB, estimate=118226 kB
2016-10-04 19:23:05.249 GMT                             57eb90a0.6e07   407     00000   LOG:  checkpoint starting: force wait
2016-10-04 19:28:52.232 GMT                             57eb90a0.6e07   408     00000   LOG:  checkpoint complete: wrote 3102 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=346.957 s, sync=0.018 s, total=346.982 s; sync files=149, longest=0.002 s, average=0.000 s; distance=36016 kB, estimate=110005 kB
2016-10-04 19:30:31.922 GMT     [unknown]       pgsync  57f40357.5c70   2       00000   LOG:  replication connection authorized: user=pgsync
2016-10-04 19:30:31.932 GMT                             57eb90a0.6e07   409     00000   LOG:  checkpoint starting: force wait
2016-10-04 19:33:40.857 GMT                             57eb90a0.6e07   410     00000   LOG:  checkpoint complete: wrote 1763 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=188.886 s, sync=0.030 s, total=188.924 s; sync files=125, longest=0.004 s, average=0.000 s; distance=13135 kB, estimate=100318 kB



OK, so what was happening is that the checkpoints were taking 5 to 15 minutes minutes, and aborted basebackups were triggering new checkpoints which waited on the previous ones.

Is it possible the new environment is not high-performance enough??

possibly relevant configuration change:
   checkpoint_timeout=1h

 


pg_basebackup will always send in a checkpoint. PostgreSQL will time that to take approximately checkpoint_timeout * checkpoint_completion_target time. So with checkpoint_timeout set to a very high value, it will by default target something like 30 minutes before it even gets started. The only reason it takes as *little* as it does is that your system is lightly loaded.

You can do a fast checkpoint with the parameter I suggested before, or you can tune your checkpoint_timeout to be something that gives you a more reasonable time.

--