Обсуждение: pg_start_backup('label',true) why do I need 2nd parameter?
Hi all,
A few days back, I faced a problem where I pg_start_backup('label') was hang in the server forever.
I stopped the process and then used pg_start_backup('label',true) and it worked.
Now I am trying to investigate why I need to use true as second parameter and read the doc
"There is an optional second parameter of type boolean. If true, it specifies executing
pg_start_backup
as quickly as possible. This forces an immediate checkpoint which will cause a spike in I/O operations, slowing any concurrently executing queries." I tried to regenerate the scenario where pg_start_backup('label') hanged and I failed.
Any idea, how can I regenerate that issue to investigate.
I am using Postgresql 9.2.
Thanks.
rummandba wrote > A few days back, I faced a problem where I *pg_start_backup('label') *was > hang in the server forever. so, the server is still hung? can you define what you mean by "hung" - what did and did not work? > I stopped the process and then used *pg_start_backup('label',true) *and > it > worked. Ok, so not forever :) How long then? Any guess you give will be more accurate than "forever". So when you told it to "start now" it did but when you told it "start whenever you are ready" is said "ok, I'll get back to ya." ... > This forces an > immediate checkpoint which will cause a spike in I/O operations, slowing > any concurrently executing queries." From the same documentation (section 24.3.3): > By default, pg_start_backup can take a long time to finish. This is > because it performs a checkpoint, and the I/O required for the checkpoint > will be spread out over a significant period of time, by default half your > inter-checkpoint interval (see the configuration parameter > checkpoint_completion_target). This is usually what you want, because it > minimizes the impact on query processing. If you want to start the backup > as soon as possible, use: [the true parameter] So assuming defaults you would be expected to wait about 2.5 minutes: 5 minutes default checkpoint_timeout times .5 default checkpoint_completion_target [note, I am confused by the wording above if there should be another .50 multiplier] That is significantly less than forever but it may be more time that you gave it before cancelling the process...and that assumes you are using defaults. Without knowing the setting for "checkpoint_timeout" (and completion target, but you are less likely to have changed that) it is impossible to know whether your finite wait was longer than could be expected. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/pg-start-backup-label-true-why-do-I-need-2nd-parameter-tp5777145p5777153.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Tue, 5 Nov 2013 15:30:19 -0800 AI Rumman <rummandba@gmail.com> wrote: > Hi all, > > A few days back, I faced a problem where I *pg_start_backup('label') *was > hang in the server forever. > I stopped the process and then used *pg_start_backup('label',true) *and it > worked. > > Now I am trying to investigate why I need to use true as second parameter > and read the doc > "There is an optional second parameter of type boolean. If true, it > specifies executing pg_start_backup as quickly as possible. This forces an > immediate checkpoint which will cause a spike in I/O operations, slowing > any concurrently executing queries." To add to what David said in his email ... essentially with the second parameter at false (which is the same as omitting it) you are telling the system to not interfere with other concurrently running processes and take as much time as necessary to prep the system without interfering. How long that takes is a factor of other settings (as David mentioned) and also depedent on what other transactions may be running. With the second parameter at true, you're telling Postgres to execute the start backup as fast as possible and if other queries are temporarily slow or unresponsive while it's getting there, that's OK. As far as exactly replicating the scenerio, you would need to know what else was running on the server at the time to exactly replicate it. > > I tried to regenerate the scenario where *pg_start_backup('label') > *hanged and > I failed. > Any idea, how can I regenerate that issue to investigate. > > I am using Postgresql 9.2. > > Thanks. -- Bill Moran <wmoran@potentialtech.com>
Bill Moran wrote > How long that takes is a factor of other settings (as David mentioned) and > also depedent on what other transactions may be running. While I am inclined to believe this is true the documentation is unclear that "other transactions" have any bearing on the delay. All the documentation says is that the checkpoint I/O will be spread out over time. Period. I could see where if there is no pending checkpoint I/O to perform that it will return immediately but does having 100MB of I/O to perform, versus 10MB of I/O to perform, cause the delay to increase 9-fold up to a maximum of whatever timeframe is configured? The wording implies that the delay, say 2.5 minutes by default (if I am reading that right), will be used regardless so the system will incur a rate of 4MB/min of checkpoint I/O in the better case and 40MB/min of checkpoint I/O for the worse case. The other possibility is that there is a floor of 10MB/min of checkpoint I/O so the first example only takes 1 minute to return (not 2.5) while the second uses the entire allotted time and also must increase the I/O rate. I'm not sure the precise algorithm needs to be documented but "can take a long time to finish" seems to be taking to the other extreme. Assuming one of the two examples above is correct including such an example in the documentation (i.e., comparing 0MB, 10MB, and 100MB of pending checkpoint I/O) is a thought. Also, is there a way to query how much checkpoint I/O currently is outstanding? If so, and the value is meaningful to this determination, a cross-reference to said information would be useful. Also, assuming the algorithm is fairly constant having it documented at this level, with maybe an example query, would allow people to calculate roughly the amount of time the "false" call will take to return. And yes, I am getting a little carried away here... David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/pg-start-backup-label-true-why-do-I-need-2nd-parameter-tp5777145p5777173.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Tue, 5 Nov 2013 19:27:52 -0800 (PST) David Johnston <polobo@yahoo.com> wrote: > Bill Moran wrote > > How long that takes is a factor of other settings (as David mentioned) and > > also depedent on what other transactions may be running. > > While I am inclined to believe this is true the documentation is unclear > that "other transactions" have any bearing on the delay. All the > documentation says is that the checkpoint I/O will be spread out over time. > Period. I could see where if there is no pending checkpoint I/O to perform > that it will return immediately but does having 100MB of I/O to perform, > versus 10MB of I/O to perform, cause the delay to increase 9-fold up to a > maximum of whatever timeframe is configured? > > The wording implies that the delay, say 2.5 minutes by default (if I am > reading that right), will be used regardless so the system will incur a rate > of 4MB/min of checkpoint I/O in the better case and 40MB/min of checkpoint > I/O for the worse case. > > The other possibility is that there is a floor of 10MB/min of checkpoint I/O > so the first example only takes 1 minute to return (not 2.5) while the > second uses the entire allotted time and also must increase the I/O rate. > > I'm not sure the precise algorithm needs to be documented but "can take a > long time to finish" seems to be taking to the other extreme. Assuming one > of the two examples above is correct including such an example in the > documentation (i.e., comparing 0MB, 10MB, and 100MB of pending checkpoint > I/O) is a thought. Also, is there a way to query how much checkpoint I/O > currently is outstanding? If so, and the value is meaningful to this > determination, a cross-reference to said information would be useful. > > Also, assuming the algorithm is fairly constant having it documented at this > level, with maybe an example query, would allow people to calculate roughly > the amount of time the "false" call will take to return. Obviously, opinions may differ, but ... I don't understand, in the slightest, your focus on this. In my experience, if you're running backups, you're willing to wait a little while. Especially if you're handling the extra administrative overhead of doing wal-logged backups, you're probably waiting hours for them to complete, so what's an extra minute or two? I have trouble believing that the second parameter is ever necessary at all. That being said, details of the algorithm alone is not going to tell anyone how long it's going to take. On a quiet server with fast disks, the wait is (in my experience) not noticable. Of course, at any time a transaction running could cause it to be very noticable, and a busy server could cause it to always take a while. But the question you post sounds frighteningly like questions I get from developers on a regular basis: why does my query run fast when I test it but slow in production. The answer is complicated and is actually different every time it's asked, but the high-level answer that nobody seems to accept is this: on a computer system that is doing many tasks, the different tasks impact each other, and it's frequently difficult to understand the nature of those interactions and their impact. I mean, who makes a backup and sits and watches it? I "make" dozens of backups each day, and I never bother to watch any of them -- I get an email if one of them fails and once a quarter we run through a recovery drill to make sure everything is working for really-reals. These are the kinds of things that (in my experience) are important and deserve my attention ... how long it takes for the backup process to windup is not, unless it's such an incredibly long time (on the order of hours) that it interferes with scheduling. I guess what I'm saying, is that from my standpoint I can't imagine a way of improving the documentation that wouldn't either become horribly wordy without actually helping, or even more confusing. Perhaps someone smarter than me can come up with something, though. -- Bill Moran <wmoran@potentialtech.com>