Обсуждение: [RFC] Should we fix postmaster to avoid slow shutdown?
Hello, Please let me ask you about possible causes of a certain problem, slow shutdown of postmaster when a backend crashes, andwhether to fix PostgreSQL. Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4. Yes, the PostgreSQL version is rather old but there's no relevantbug fix in later 9.2.x releases. PROBLEM ============================== One backend process (postgres) for an application session crashed due to a segmentation fault and dumped a core file. Thecause is a bug of pg_dbms_stats. Another note is that restart_after_crash is off to make failover happen. The problem here is that postmaster took as long as 15 seconds to terminate after it had detected a crashed backend. Themessages were output as follows: 20:12:35.004に LOG: server process (PID 31894) was terminated by signal 11: Segmentation fault DETAIL: Failed process was running: DELETE...(snip) LOG: terminating any other active server processes From 20:12:35.013 to 20:12:39.074, the following message was output 80 times. FATAL: the database system is in recovery mode 20:12:50 The custom monitoring system detected the death of postmaster as a result of running "pg_ctl status". That's it. You may say the following message should also have been emitted, but there's not. This is because we commentedout the ereport() call in quickdie() in tcop.c. That ereport() call can hang depending on the timing, which isfixed in 9.4. WARNING: terminating connection because of crash of another server process The customer insists that PostgreSQL takes longer to shut down than expected, which risks exceeding their allowed failovertime. CAUSE ============================== There's no apparent evidence to indicate the cause, but I could guess a few reasons. What do you think these are correctand should fix PostgreSQL? (I think so) 1) postmaster should close the listening ports earlier As cited above, for 4 seconds, postmaster created 80 dead-end child processes which just output "FATAL: the database systemis in recovery mode". This indicates that postmaster is busy handling re-connection requests from disconnected applications,preventing postmaster from reaping dead children as fast as possible. This is a waste because postmaster willonly shut down. I think the listening ports should be closed in HandleChildCrash() when the condition "(RecoveryError || !restart_after_crash)"is true. 2) make stats collector terminate immediately stats collector seems to write the permanent stats file even when it receives SIGQUIT. But it's useless because the statfile is reset during recovery. And Tom claimed that writing stats file can take long: https://www.postgresql.org/message-id/11800.1455135203@sss.pgh.pa.us 3) Anything else? While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports open but doesn't call select()/accept(). Asa result, incoming connection requests are accumulated in the listen queue of the sockets. Does the OS have any bug toslow the process termination when the listen queue is not empty? Regards Takayuki Tsunakawa
On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > There's no apparent evidence to indicate the cause, but I could guess a few reasons. What do you think these are correctand should fix PostgreSQL? (I think so) I think that we shouldn't start changing things based on guesses about what the problem is, even if they're fairly smart guesses. The thing to do would be to construct a test rig, crash the server repeatedly, and add debugging instrumentation to figure out where the time is actually going. I do think your theory about the stats collector might be worth pursuing. It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM. Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems possibly worthwhile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas > On Tue, Sep 20, 2016 at 2:20 AM, Tsunakawa, Takayuki > <tsunakawa.takay@jp.fujitsu.com> wrote: > > There's no apparent evidence to indicate the cause, but I could guess > > a few reasons. What do you think these are correct and should fix > > PostgreSQL? (I think so) > > I think that we shouldn't start changing things based on guesses about what > the problem is, even if they're fairly smart guesses. The thing to do would > be to construct a test rig, crash the server repeatedly, and add debugging > instrumentation to figure out where the time is actually going. We have tried to reproduce the problem in the past several days with much more stress on our environment than on the customer'sone -- 1,000 tables aiming for a dozens of times larger stats file and repeated reconnection requests from hundredsof clients -- but we could not succeed. > I do think your theory about the stats collector might be worth pursuing. > It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM. > Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems > possibly worthwhile. Thank you for giving confidence for proceeding. And I also believe that postmaster should close the listening ports earlier.Regardless of whether this problem will be solved not confident these will solve the, I think it'd be better to fixthese two points so that postmaster doesn't longer time than necessary. I think I'll create a patch after giving it abit more thought. Regards Takayuki Tsunakawa
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
>> I think that we shouldn't start changing things based on guesses about what
>> the problem is, even if they're fairly smart guesses.  The thing to do would
>> be to construct a test rig, crash the server repeatedly, and add debugging
>> instrumentation to figure out where the time is actually going.
> We have tried to reproduce the problem in the past several days with much more stress on our environment than on the
customer'sone -- 1,000 tables aiming for a dozens of times larger stats file and repeated reconnection requests from
hundredsof clients -- but we could not succeed.
 
>> I do think your theory about the stats collector might be worth pursuing.
>> It seems that the stats collector only responds to SIGQUIT, ignoring SIGTERM.
>> Making it do a clean shutdown on SIGTERM and a fast exit on SIGQUIT seems
>> possibly worthwhile.
> Thank you for giving confidence for proceeding.  And I also believe that postmaster should close the listening ports
earlier.Regardless of whether this problem will be solved not confident these will solve the, I think it'd be better to
fixthese two points so that postmaster doesn't longer time than necessary.  I think I'll create a patch after giving it
abit more thought.
 
FWIW, I'm pretty much -1 on messing with the timing of the socket close
actions.  I broke that once within recent memory, so maybe I'm gun-shy,
but I think that the odds of unpleasant side effects greatly outweigh
any likely benefit there.
Allowing SIGQUIT to prompt fast shutdown of the stats collector seems
sane, though.  Try to make sure it doesn't leave partly-written stats
files behind.
        regards, tom lane
			
		> From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane, > though. Try to make sure it doesn't leave partly-written stats files > behind. The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses9.2. I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK? 1. The recovery at the next restart will remove the stat files. 2. SIGQUIT processing should be as fast as possible. 3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise. > FWIW, I'm pretty much -1 on messing with the timing of the socket close > actions. I broke that once within recent memory, so maybe I'm gun-shy, > but I think that the odds of unpleasant side effects greatly outweigh any > likely benefit there. Wasn't it related to TouchSocketFiles()? Can I see the discussion on this ML? I don't see any problem looking at the code... Regards Takayuki Tsunakawa
Вложения
From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane FWIW, I'm pretty much -1 on messing with the timing of the socket close > actions. I broke that once within recent memory, so maybe I'm gun-shy, > but I think that the odds of unpleasant side effects greatly outweigh any > likely benefit there. I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this is OK? Regards Takayuki Tsunakawa
Вложения
On Tue, Oct 4, 2016 at 5:05 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > FWIW, I'm pretty much -1 on messing with the timing of the socket close >> actions. I broke that once within recent memory, so maybe I'm gun-shy, >> but I think that the odds of unpleasant side effects greatly outweigh any >> likely benefit there. > > I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this isOK? I have no opinion on this patch, because I haven't reviewed it, but note recent commit 3b90e38c5d592ea8ec8236287dd5c749fc041728, which appears to be semi-related. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Robert Haas [mailto:robertmhaas@gmail.com] > I have no opinion on this patch, because I haven't reviewed it, but note > recent commit 3b90e38c5d592ea8ec8236287dd5c749fc041728, which appears to > be semi-related. Thank you for interesting information. Maybe Tom-san experienced some trouble in creating this patch. Fortunately, thisdoesn't appear to be related to my patch, because the patch changed the timing of closing listen ports in postmasterchildren, whereas my patch explicitly closes listen ports in postmaster. Regards Takayuki Tsunakawa
On Wed, Sep 28, 2016 at 8:37 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: >> From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane >> Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane, >> though. Try to make sure it doesn't leave partly-written stats files >> behind. > > The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses9.2. > > I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK? > > 1. The recovery at the next restart will remove the stat files. > 2. SIGQUIT processing should be as fast as possible. > 3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise. > I agree with the first point. The patch applies and compiles clean. make check-world is clean. In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is missing before PG_SETMASK(). Although there are some SIGQUIT handlers which do not have that call. But I guess, it will be safer to have it. Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset() followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this difference? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Oct 26, 2016 at 7:12 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset() > followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this > difference? Well, for that, you'd need to look at how postmaster.c treats those exit codes. exit(2) from a regular backend or background worker will cause a crash-and-restart cycle; I'm not sure whether the handling for the stats collector is similar or different. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 26, 2016 at 7:12 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Also, many other SIGQUIT handlers like bgworker_quickdie() call on_exit_reset()
>> followed by exit(2) instead of just exit(1) in pgstat_quickdie(). Why is this
>> difference?
> Well, for that, you'd need to look at how postmaster.c treats those
> exit codes.  exit(2) from a regular backend or background worker will
> cause a crash-and-restart cycle; I'm not sure whether the handling for
> the stats collector is similar or different.
I'm fairly sure it's different --- there are different policies for
child processes that aren't connected to shared memory (such as the
stats collector) because we shouldn't force a system-wide restart
when they crash.
        regards, tom lane
			
		From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat > In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is > missing before PG_SETMASK(). Although there are some SIGQUIT handlers which > do not have that call. But I guess, it will be safer to have it. I didn't add it because pgstat_quickdie() just exits, like some other postmaster children. I thought those processes whichare concerned about their termination processing call sigaddset(SIGQUIT), so I went after the processes who aren't. Is this really necessary? > Also, many other SIGQUIT handlers like bgworker_quickdie() call > on_exit_reset() followed by exit(2) instead of just exit(1) in > pgstat_quickdie(). Why is this difference? As Robert and Tom said, either exit(1) or exit(2) is OK because reaper() handles non-zero exit code the same. Regardingon_proc_reset(), stats collector is not attached to the shared memory and does not register on_proc_exit() callbacks. These situations are the same as the archiver process, so I followed it. Regards Takayuki Tsunakawa
On Thu, Oct 27, 2016 at 7:29 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
>> In pgstat_quickdie(), I think a call to sigaddset(&BlockSig, SIGQUIT) is
>> missing before PG_SETMASK(). Although there are some SIGQUIT handlers which
>> do not have that call. But I guess, it will be safer to have it.
>
> I didn't add it because pgstat_quickdie() just exits, like some other postmaster children.  I thought those processes
whichare concerned about their termination processing call sigaddset(SIGQUIT), so I went after the processes who
aren't. Is this really necessary? 
>
Ok. In that case, I think we shouldn't even call PG_SETMASK() similar
to pgarch_exit(). Attached patch removes PG_SETMASK(). Let me know if
it looks good.
>> Also, many other SIGQUIT handlers like bgworker_quickdie() call
>> on_exit_reset() followed by exit(2) instead of just exit(1) in
>> pgstat_quickdie(). Why is this difference?
>
> As Robert and Tom said, either exit(1) or exit(2) is OK because reaper() handles non-zero exit code the same.
Yes, per reaper().
2955         /*
2956          * Was it the statistics collector?  If so, just try to start a new
2957          * one; no need to force reset of the rest of the system.
(If fail,
2958          * we'll try again in future cycles of the main loop.)
2959          */
2960         if (pid == PgStatPID)
2961         {
2962             PgStatPID = 0;
2963             if (!EXIT_STATUS_0(exitstatus))
2964                 LogChildExit(LOG, _("statistics collector process"),
2965                              pid, exitstatus);
2966             if (pmState == PM_RUN)
2967                 PgStatPID = pgstat_start();
2968             continue;
2969         }
> Regarding on_proc_reset(), stats collector is not attached to the shared memory and does not register on_proc_exit()
callbacks. These situations are the same as the archiver process, so I followed it. 
>
Right. I got confused because of on_shmem_exit() call in
pgstat_initialize. But that's per backend code and not executed within
stats collector.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
			
		Вложения
From: Ashutosh Bapat [mailto:ashutosh.bapat@enterprisedb.com] > Ok. In that case, I think we shouldn't even call PG_SETMASK() similar to > pgarch_exit(). Attached patch removes PG_SETMASK(). Let me know if it looks > good. It looks good. Thanks. Regards Takayuki Tsunakawa
On Tue, Oct 4, 2016 at 2:35 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > FWIW, I'm pretty much -1 on messing with the timing of the socket close >> actions. I broke that once within recent memory, so maybe I'm gun-shy, >> but I think that the odds of unpleasant side effects greatly outweigh any >> likely benefit there. > > I couldn't find any relevant mails in pgsql-hackers. I found no problem with the attached patch. Do you think this isOK? Few comments on this patch, I am not sure if following condition is a good idea in ServerLoop() 1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets) There are no sockets to listen on, so select in the else condition is going to sleep for timeout determined based on the sequence expected. Just before we close sockets in pmdie() it sets AbortStartTime, which determines the timeout for the sleep here. So, it doesn't make sense to ignore it. Instead may be we should change the default 60s sleep to 100ms sleep in DetermineSleepTime(). If not, we need to at least update the comments to indicate the other reason for not polling using select(). * If we are in PM_WAIT_DEAD_END state, then we don't want to accept * any new connections, so we don't call select(), and just sleep. */ memcpy((char *) &rmask, (char *) &readmask,sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) + if (pmState == PM_WAIT_DEAD_END || ClosedSockets) While the postmaster is terminating children, a new connection request may arrive. We should probably close listening sockets before terminating children in pmdie(). Otherwise this patch looks good to me. It applies and compiles cleanly. make check-world doesn't show any failures. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat > I am not sure if following condition is a good idea in ServerLoop() > 1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets) > > There are no sockets to listen on, so select in the else condition is going > to sleep for timeout determined based on the sequence expected. > Just before we close sockets in pmdie() it sets AbortStartTime, which > determines the timeout for the sleep here. So, it doesn't make sense to > ignore it. Instead may be we should change the default 60s sleep to 100ms > sleep in DetermineSleepTime(). That sounds better. I modified cleaned ServerLoop() by pushing the existing 100ms logic into DetermineSleepTime(). > While the postmaster is terminating children, a new connection request may > arrive. We should probably close listening sockets before terminating > children in pmdie(). I moved ClosePostmasterSocket() call before terminating children in the immediate shutdown case. I didn't change the behaviorof smart and fast shutdown modes, because they may take a long time to complete due to checkpointing etc. Userswill want to know what's happening during shutdown or after pg_ctl stop times out, by getting the message "FATAL: thedatabase system is shutting down" when they try to connect to the database. The immediate shutdown or crash should betterbe as fast as possible. > Otherwise this patch looks good to me. It applies and compiles cleanly. > make check-world doesn't show any failures. Thank you for reviewing and testing. The revised patch is attached. Regards Takayuki Tsunakawa
Вложения
On Mon, Nov 7, 2016 at 10:14 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat >> I am not sure if following condition is a good idea in ServerLoop() >> 1650 if (pmState == PM_WAIT_DEAD_END || ClosedSockets) >> >> There are no sockets to listen on, so select in the else condition is going >> to sleep for timeout determined based on the sequence expected. >> Just before we close sockets in pmdie() it sets AbortStartTime, which >> determines the timeout for the sleep here. So, it doesn't make sense to >> ignore it. Instead may be we should change the default 60s sleep to 100ms >> sleep in DetermineSleepTime(). > > That sounds better. I modified cleaned ServerLoop() by pushing the existing 100ms logic into DetermineSleepTime(). I have changed some comments around this block. See attached patch. Let me know if that looks good. > > >> While the postmaster is terminating children, a new connection request may >> arrive. We should probably close listening sockets before terminating >> children in pmdie(). > > I moved ClosePostmasterSocket() call before terminating children in the immediate shutdown case. I didn't change the behaviorof smart and fast shutdown modes, because they may take a long time to complete due to checkpointing etc. Userswill want to know what's happening during shutdown or after pg_ctl stop times out, by getting the message "FATAL: thedatabase system is shutting down" when they try to connect to the database. The immediate shutdown or crash should betterbe as fast as possible. OK. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Вложения
From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat > I have changed some comments around this block. See attached patch. > Let me know if that looks good. Thanks, it looks good. Regards Takayuki Tsunakawa
On Mon, Nov 14, 2016 at 5:16 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat >> I have changed some comments around this block. See attached patch. >> Let me know if that looks good. > > Thanks, it looks good. > Thanks. The patch 02_close_listen... closes the listen sockets explicitly when it's known that postmaster is going to stop all the children and then die. I have tried to see, if there's a possibility that it closes the listen sockets but do not actually die, thus causing a server which doesn't accept any connections and doesn't die. But I have not found that possibility. I do not have any further comments about both the patches. None else has volunteered to review the patch till now. So, marking the entry as ready for committer.
On 9/27/16 11:07 PM, Tsunakawa, Takayuki wrote: >> From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane >> Allowing SIGQUIT to prompt fast shutdown of the stats collector seems sane, >> though. Try to make sure it doesn't leave partly-written stats files >> behind. > > The attached patch based on HEAD does this. I'd like this to be back-patched because one of our important customers uses9.2. > > I didn't remove partially written stat files on SIGQUIT for the following reasons. Is this OK? > > 1. The recovery at the next restart will remove the stat files. > 2. SIGQUIT processing should be as fast as possible. > 3. If writing stats files took long due to the OS page cache flushing, removing files might be forced to wait likewise. ISTM that this would change the "immediate shutdown" to not save stats files anymore. So far, all the shutdown modes are equivalent in terms of how they preserve data and system state. They differ only in when the hard work happens. This would be a deviation from that principle. Child processes don't distinguish between a SIGQUIT coming from a user-initiated immediate shutdown request and a crash-induced kill-everyone directive. So there might not be a non-ugly way to address that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> ISTM that this would change the "immediate shutdown" to not save stats
> files anymore.  So far, all the shutdown modes are equivalent in terms
> of how they preserve data and system state.  They differ only in when
> the hard work happens.  This would be a deviation from that principle.
There is that.  Up to now, an immediate shutdown request didn't cause
any actual data loss, but now it would.  Maybe that's a reason to reject
this whole concept.  (Upthread I was thinking that's a behavior we have
anyway, but really we don't: a look through pgstats.c shows that it will
never exit without attempting to write the stats, short of a crash of
the stats collector process itself.)
> Child processes don't distinguish between a SIGQUIT coming from a
> user-initiated immediate shutdown request and a crash-induced
> kill-everyone directive.  So there might not be a non-ugly way to
> address that.
It doesn't seem to me that it's a matter of whether the signaling is
adequate; we could fix that.  It's a matter of whether you're willing to
have "pg_ctl stop -m immediate" lose stats data.
Although the stats collector was originally conceived as optional,
we depend on it enough now for autovacuum that I'm not sure it'd be a
good thing to have a popularly-used shutdown mode that loses the stats.
        regards, tom lane
			
		On 11/14/16 4:38 AM, Ashutosh Bapat wrote: > The patch 02_close_listen... closes the listen sockets > explicitly when it's known that postmaster is going to stop all the > children and then die. I have tried to see, if there's a possibility > that it closes the listen sockets but do not actually die, thus > causing a server which doesn't accept any connections and doesn't die. > But I have not found that possibility. I can see the point of this, but I'm not sure whether this is always a good idea. Some people "monitor" postgres shutdown with PQping() or by parsing the error messages that the client gets. If we just close the sockets as soon as possible, we lose all communication and can't tell what's going on anymore. If your HA system insists that the crashing server be completely shut down before moving on, then maybe that can be refined somehow instead? I haven't seen any analysis in this thread of how much of a problem this really is. For example, could we keep the socket open longer but reject connection attempts faster in this state? This patch only changes the behavior in the case of a crash or an immediate shutdown, but not for the other shutdown modes. I don't think it is good to create different behaviors for different modes. A long time ago, ClosePostmasterPorts() essentially had the job to close all postmaster sockets. Every single call thereof is in fact commented with /* Close the postmaster's sockets */. (Ancient postgres code uses "port" as a more general term for socket or communication port.) Now it has accumulated other tasks so it is more of a ClosePostmasterOnlyResourcesInChild(). So if we were to introduce a ClosePostmasterSockets() somehow, we should adjust the naming and the comments so that we don't have two similar-sounding functions with confusing comments. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> ISTM that this would change the "immediate shutdown" to not save stats
>> files anymore.  So far, all the shutdown modes are equivalent in terms
>> of how they preserve data and system state.  They differ only in when
>> the hard work happens.  This would be a deviation from that principle.
> There is that.  Up to now, an immediate shutdown request didn't cause
> any actual data loss, but now it would.
Oh, no, after rereading the thread I remember the point here: during
crash recovery, we intentionally zap the stats files on the grounds
that they might be broken, or might be out-of-sync with the recovery
stop point.  See pgstat_reset_all().
Now, you could certainly argue that that's an idea we should rethink;
we're throwing away probably-useful data because it *might* be wrong,
which seems rather pointless when it's known to be approximate anyway.
But if that stays, there's no very good reason not to short-circuit
writing of the files during immediate shutdown.
On the other side of the coin: I note that in 9.4 and up, the postmaster
has a 5-second patience limit before SIGKILL'ing child processes during a
crash or immediate shutdown (cf commit 82233ce7e).  That limit applies to
the stats collector too, and it seems to me that it addresses the real
problem here (ie, unbounded shutdown time) much better than the proposed
patch.  Note that the complaint that started this thread was against 9.2.
My feeling is that 82233ce7e has obsoleted all of the proposals made so
far in this thread, and that we should reject them all.  Instead we should
think about whether pgstat_reset_all should go away.
        regards, tom lane
			
		Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 11/14/16 4:38 AM, Ashutosh Bapat wrote:
>> The patch 02_close_listen... closes the listen sockets
>> explicitly when it's known that postmaster is going to stop all the
>> children and then die. I have tried to see, if there's a possibility
>> that it closes the listen sockets but do not actually die, thus
>> causing a server which doesn't accept any connections and doesn't die.
>> But I have not found that possibility.
> I can see the point of this, but I'm not sure whether this is always a
> good idea.
IMO it's not, and closer analysis says that this patch series is an
attempt to solve something we already fixed, better, in 9.4.
        regards, tom lane
			
		On 11/18/16 12:00 PM, Tom Lane wrote: > My feeling is that 82233ce7e has obsoleted all of the proposals made so > far in this thread, and that we should reject them all. Yes, it seems that very similar concerns were already addressed there. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > > On 11/14/16 4:38 AM, Ashutosh Bapat wrote: > >> The patch 02_close_listen... closes the listen sockets > >> explicitly when it's known that postmaster is going to stop all the > >> children and then die. I have tried to see, if there's a possibility > >> that it closes the listen sockets but do not actually die, thus > >> causing a server which doesn't accept any connections and doesn't die. > >> But I have not found that possibility. > > > I can see the point of this, but I'm not sure whether this is always a > > good idea. > > IMO it's not, and closer analysis says that this patch series is an > attempt to solve something we already fixed, better, in 9.4. ... by the same patch submitter. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> IMO it's not, and closer analysis says that this patch series is an
>> attempt to solve something we already fixed, better, in 9.4.
> ... by the same patch submitter.
[ confused ]  The commit log credits 82233ce7e to MauMau and yourself.
        regards, tom lane
			
		On Fri, Nov 18, 2016 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Tom Lane wrote: >>> IMO it's not, and closer analysis says that this patch series is an >>> attempt to solve something we already fixed, better, in 9.4. > >> ... by the same patch submitter. > > [ confused ] The commit log credits 82233ce7e to MauMau and yourself. IIUC, MauMau = Tsunakawa Takayuki. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Fri, Nov 18, 2016 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > >> Tom Lane wrote: > >>> IMO it's not, and closer analysis says that this patch series is an > >>> attempt to solve something we already fixed, better, in 9.4. > > > >> ... by the same patch submitter. > > > > [ confused ] The commit log credits 82233ce7e to MauMau and yourself. > > IIUC, MauMau = Tsunakawa Takayuki. Right, https://www.postgresql.org/message-id/964413269E3A41409EA53EE0D813E48C@tunaPC -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Robert Haas wrote: >> IIUC, MauMau = Tsunakawa Takayuki. > Right, https://www.postgresql.org/message-id/964413269E3A41409EA53EE0D813E48C@tunaPC Ah! I'd forgotten. regards, tom lane
From: Robert Haas [mailto:robertmhaas@gmail.com] > On Fri, Nov 18, 2016 at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > >> Tom Lane wrote: > >>> IMO it's not, and closer analysis says that this patch series is an > >>> attempt to solve something we already fixed, better, in 9.4. > > > >> ... by the same patch submitter. > > > > [ confused ] The commit log credits 82233ce7e to MauMau and yourself. > > IIUC, MauMau = Tsunakawa Takayuki. Yes, it's me. I'm pleased that you remember me! First, I understand that zapping the stats file during recovery can be a problem. In fact, it's me who proposed adding asentence in the manual that the stats file is reset after immediate shutdown. I think addressing this problem is anothertopic in a new thread. The reasons why I proposed this patch are: * It happened in a highly mission-critical production system of a customer who uses 9.2. * 9.4's solution is not perfect, because it wastes 5 seconds anyway, which is unexpected for users. The customer's requirementincludes failover within 30 seconds, so 5 seconds can be seen as a risk. Plus, I'm worried about the possibility that the SIGKILLed process wouldn't disappear if it's writing to a network storagelike NFS. * And first of all, the immediate shutdown should shut the server down immediately without doing anything heavy, as the namemeans. So, I think this patch should also be applied to later releases. The purpose of the patch in 9.4 was to avoid PostgreSQL'sbug, where the ereport() in quickdie() gets stuck waiting for malloc()'s lock to be released. Regards Takayuki Tsunakawa
On Sun, Nov 20, 2016 at 10:20 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > The reasons why I proposed this patch are: > > * It happened in a highly mission-critical production system of a customer who uses 9.2. > > * 9.4's solution is not perfect, because it wastes 5 seconds anyway, which is unexpected for users. The customer's requirementincludes failover within 30 seconds, so 5 seconds can be seen as a risk. > Plus, I'm worried about the possibility that the SIGKILLed process wouldn't disappear if it's writing to a network storagelike NFS. > > * And first of all, the immediate shutdown should shut the server down immediately without doing anything heavy, as thename means. So there are two questions here: 1. Should we try to avoid having the stats collector write a stats file during an immediate shutdown? The file will be removed anyway during crash recovery, so writing it is pointless. I think you are right that 9.4's solution here is not perfect, because of the 5 second delay, and also because if the stats collector is stuck inside the kernel trying to write to the OS, it may be in a non-interruptible wait state where even SIGKILL has no immediate effect. Anyway, it's stupid even from a performance point of view to waste time writing a file that we're just going to nuke. 2. Should we close listen sockets sooner during an immediate shutdown?I agree with Tom and Peter that this isn't a good idea. People expect the sockets not to go away until the end - e.g. they use PQping() to test the server status, or they connect just to see what error they get - and the fact that a client application could hypothetically generate such a relentless stream of connection attempts that the dead-end backends thereby created slow down shutdown is not in my mind a sufficient reason to change the behavior. So I think 001 should proceed and 002 should be rejected. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> 1. Should we try to avoid having the stats collector write a stats
> file during an immediate shutdown?  The file will be removed anyway
> during crash recovery, so writing it is pointless.
The point I was trying to make is that I think the forced-removal behavior
is not desirable, and therefore committing a patch that makes it be graven
in stone is not desirable either.
The larger picture here is that Takayuki-san wants us to commit a patch
based on a customer's objection to 9.2's behavior, without any real
evidence that the 9.4 change isn't a sufficient solution.  I've got
absolutely zero sympathy for that "the stats collector might be stuck in
an unkillable state" argument --- where's the evidence that the stats
collector is any more prone to that than any other postmaster child?
And for that matter, if we are stuck because of a nonresponding NFS
server, how is a quicker postmaster exit going to help anything?
You're not going to be able to start a new postmaster if the data
directory is on a nonresponsive server.
I'd be willing to entertain a proposal to make the 5-second limit
adjustable, but I don't think we need entirely new behavior here.
        regards, tom lane
			
		From: Robert Haas [mailto:robertmhaas@gmail.com] > So there are two questions here: > > 1. Should we try to avoid having the stats collector write a stats file > during an immediate shutdown? The file will be removed anyway during crash > recovery, so writing it is pointless. I think you are right that 9.4's > solution here is not perfect, because of the 5 second delay, and also because > if the stats collector is stuck inside the kernel trying to write to the > OS, it may be in a non-interruptible wait state where even SIGKILL has no > immediate effect. Anyway, it's stupid even from a performance point of > view to waste time writing a file that we're just going to nuke. > > 2. Should we close listen sockets sooner during an immediate shutdown? > I agree with Tom and Peter that this isn't a good idea. People expect > the sockets not to go away until the end - e.g. they use > PQping() to test the server status, or they connect just to see what error > they get - and the fact that a client application could hypothetically > generate such a relentless stream of connection attempts that the dead-end > backends thereby created slow down shutdown is not in my mind a sufficient > reason to change the behavior. > > So I think 001 should proceed and 002 should be rejected. I'm happy with this conclusion, since I think 1 was the cause of slow shutdown, and 2 is just a hypothesis to pursue thecompleteness. And I can understand the concern about PQping(). Regards Takayuki Tsunakawa
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > The point I was trying to make is that I think the forced-removal behavior > is not desirable, and therefore committing a patch that makes it be graven > in stone is not desirable either. I totally agree that we should pursue the direction for escaping from the complete loss of stats files. Personally, I wouldlike to combine that with the idea of persistent performance diagnosis information for long-term analysis (IIRC, someoneproposed it.) However, I don't think my patch will make everyone forget about the problem of stats file loss duringrecovery. The problem exists with or without my patch, and my patch doesn't have the power to delute the importanceof the problem. If you are worried about memory, we can add an entry for the problem in TODO list that Bruce-sanis maintaining. Or, maybe we can just stop removing the stats files during recovery by keeping the files of previous generation and usingit as the current one. I haven't seen how fresh the previous generation is (500ms ago?). A bit older might be betterthan nothing. > The larger picture here is that Takayuki-san wants us to commit a patch > based on a customer's objection to 9.2's behavior, without any real evidence > that the 9.4 change isn't a sufficient solution. I've got absolutely zero > sympathy for that "the stats collector might be stuck in an unkillable state" > argument --- where's the evidence that the stats collector is any more prone > to that than any other postmaster child? 9.4 change may be sufficient. But I don't think I can proudly explain the logic to a really severe customer. I can't answerthe question "Why does PostgreSQL write files that will be deleted, even during 'immediate' shutdown? Why does PostgreSQLuse 5 seconds for nothing?" Other children do nothing and exit immediately. I believe they are behaving correctly. > And for that matter, if we are stuck because of a nonresponding NFS server, > how is a quicker postmaster exit going to help anything? > You're not going to be able to start a new postmaster if the data directory > is on a nonresponsive server. NFS server can also be configured for HA, and the new postmaster can start as soon as the NFS server completes failover. > I'd be willing to entertain a proposal to make the 5-second limit adjustable, > but I don't think we need entirely new behavior here. Then, I'm at a loss what to do for the 9.2 user. Regards Takayuki Tsunakawa
On Tue, Nov 22, 2016 at 12:26 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >> The point I was trying to make is that I think the forced-removal behavior >> is not desirable, and therefore committing a patch that makes it be graven >> in stone is not desirable either. > > I totally agree that we should pursue the direction for escaping from the complete loss of stats files. Personally, Iwould like to combine that with the idea of persistent performance diagnosis information for long-term analysis (IIRC, someoneproposed it.) However, I don't think my patch will make everyone forget about the problem of stats file loss duringrecovery. The problem exists with or without my patch, and my patch doesn't have the power to delute the importanceof the problem. If you are worried about memory, we can add an entry for the problem in TODO list that Bruce-sanis maintaining. I don't think Tom is worried about forgetting this. I think he is worried that if we do the changes as suggested in 01... patch, we will make it more difficult to change stats file behaviour later. Right now we are throwing away statistics files at the time of crash recovery. In case we want to change it tomorrow, we will have to fix the existing problem with the half-written/stale stats files and then fix not to zap those files at the time of crash recovery. In case we go ahead with this patch, we will have more instances of creating half-written/stale stats file, which will need to fixed. > > 9.4 change may be sufficient. But I don't think I can proudly explain the logic to a really severe customer. I can'tanswer the question "Why does PostgreSQL write files that will be deleted, even during 'immediate' shutdown? Why doesPostgreSQL use 5 seconds for nothing?" > > Other children do nothing and exit immediately. I believe they are behaving correctly. I agree, with this though. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> The point I was trying to make is that I think the forced-removal behavior
>> is not desirable, and therefore committing a patch that makes it be graven
>> in stone is not desirable either.
> I totally agree that we should pursue the direction for escaping from the complete loss of stats files.  Personally,
Iwould like to combine that with the idea of persistent performance diagnosis information for long-term analysis (IIRC,
someoneproposed it.)  However, I don't think my patch will make everyone forget about the problem of stats file loss
duringrecovery.  The problem exists with or without my patch, and my patch doesn't have the power to delute the
importanceof the problem.  If you are worried about memory, we can add an entry for the problem in TODO list that
Bruce-sanis maintaining. 
> Or, maybe we can just stop removing the stats files during recovery by keeping the files of previous generation and
usingit as the current one.  I haven't seen how fresh the previous generation is (500ms ago?).  A bit older might be
betterthan nothing. 
Freshness isn't the issue.  The stats file isn't there at all, in the
permanent stats directory, unless the collector takes the time to write
it before exiting.  Without that, we have unrecoverable loss of the stats
data.  Now, that isn't as bad as loss of the SQL data content, but it's
not good either.
It's already the case that the pgstats code writes the stats data under a
temporary file name and then renames it into place atomically.  So the
prospects for corrupt data are not large, and I do not think that the
existing removal behavior was intended to prevent that.  Rather, the
concern was that if you do a point-in-time recovery to someplace much
earlier on the WAL timeline, the stats file will be out of sync with
what's now in your database.  That's a valid point, but deleting the
stats file during *any* recovery seems like an overreaction.
The simplest solution I can think of is to delete the stats file when
doing a PITR operation, but not during simple crash recovery.  I've
not looked to see how hard it would be to do that, but it seems like
it should be a fairly minor logic tweak.  Maybe decide to do the removal
at the point where we intentionally stop following WAL someplace earlier
than its end.
Another angle we might take, independently of that, is to delete the
stats file if the stats collector process itself crashes.  This would
provide a recovery avenue if somehow we did have a stats file that
was corrupt enough to crash the collector.  And it would not matter
for post-startup crashes of the stats collector, because the file
would not be there anyway.
        regards, tom lane
			
		On Tue, Nov 22, 2016 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's already the case that the pgstats code writes the stats data under a > temporary file name and then renames it into place atomically. So the > prospects for corrupt data are not large, and I do not think that the > existing removal behavior was intended to prevent that. Rather, the > concern was that if you do a point-in-time recovery to someplace much > earlier on the WAL timeline, the stats file will be out of sync with > what's now in your database. That's a valid point, but deleting the > stats file during *any* recovery seems like an overreaction. But that's not what is at issue here. The issue is whether, when asked to exit immediately, all processes should exit immediately, or whether it would be better for all processes except one to exit immediately and the last one exit non-immediately. In other words, when the user asks for an immediate shutdown, do they really mean it, or are they OK taking time to do some other stuff first? I think that the charter of an immediate shutdown is to perform a simulated crash. We do our best to let clients know that we are going down and then we go down. We don't checkpoint, we don't do any other operation that might take a lengthy period of time, we just SHUT DOWN. You're arguing that preserving the stats file is more important than timely shutdown, but I don't buy it. We've never tried to do that in the past and we've got no evidence that it's causing a problem for anyone. Yeah, it's not good, but neither are the things that prompt people to perform an immediate shutdown in the first place. On the other hand, there's plenty of evidence that slow shutdowns are a big problem for users. People use fast shutdown because smart shutdown was too slow (never-ending), and we changed the default for that reason. I've repeatedly seen users who resorted to an immediate shut down because a fast shutdown wasn't fast enough. The patch that you claim moots this one was inspired by immediate shutdowns taking too long, and that patch enjoyed pretty broad support IIRC. I think it's fairly perverse to receive yet another complaint about shutdown being slow and, instead of fixing it, go try to make sure that slowness is baked in as a policy decision. Immediate, adj. occurring or done at once; instant. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> But that's not what is at issue here.  The issue is whether, when
> asked to exit immediately, all processes should exit immediately, or
> whether it would be better for all processes except one to exit
> immediately and the last one exit non-immediately.  In other words,
> when the user asks for an immediate shutdown, do they really mean it,
> or are they OK taking time to do some other stuff first?
Peter already gave the response to that, which is that users do not
expect an immediate shutdown to have permanent harmful effects.
It doesn't have such effects so far as the SQL data goes; why is it
okay to blow away statistical data?
> You're arguing that preserving the stats file is more important than
> timely shutdown, but I don't buy it.
Yes, I am, and I disagree with you.  The current decision on this point
was made ages ago, before autovacuum even existed let alone relied on
the stats for proper functioning.  The tradeoff you're saying you're
okay with is "we'll shut down a few seconds faster, but you're going
to have table bloat problems later because autovacuum won't know it
needs to do anything".  I wonder how many of the complaints we get
about table bloat are a consequence of people not realizing that
"pg_ctl stop -m immediate" is going to cost them.
> ...  Yeah, it's not good, but neither are the things that prompt
> people to perform an immediate shutdown in the first place.
Really?  I think many users think an immediate shutdown is just fine.
> The patch that
> you claim moots this one was inspired by immediate shutdowns taking
> too long, and that patch enjoyed pretty broad support IIRC.
I think that's historical revisionism.  The commit message for 82233ce7e
says "(This might happen, for example, if a backend gets tangled trying
to malloc() due to gettext(), as in an example illustrated by MauMau.)".
There is absolutely nothing in it about people being dissatisfied with
the shutdown timing in normal cases; rather, it was done to prevent
cases where shutdown failed to happen at all.
        regards, tom lane
			
		On Tue, Nov 22, 2016 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> But that's not what is at issue here. The issue is whether, when >> asked to exit immediately, all processes should exit immediately, or >> whether it would be better for all processes except one to exit >> immediately and the last one exit non-immediately. In other words, >> when the user asks for an immediate shutdown, do they really mean it, >> or are they OK taking time to do some other stuff first? > > Peter already gave the response to that, which is that users do not > expect an immediate shutdown to have permanent harmful effects. > It doesn't have such effects so far as the SQL data goes; why is it > okay to blow away statistical data? You're doing that anyway. The backends aren't going to send any accumulated but unsent statistics to the stats collector before exiting; they're just going to exit. >> You're arguing that preserving the stats file is more important than >> timely shutdown, but I don't buy it. > > Yes, I am, and I disagree with you. The current decision on this point > was made ages ago, before autovacuum even existed let alone relied on > the stats for proper functioning. The tradeoff you're saying you're > okay with is "we'll shut down a few seconds faster, but you're going > to have table bloat problems later because autovacuum won't know it > needs to do anything". I wonder how many of the complaints we get > about table bloat are a consequence of people not realizing that > "pg_ctl stop -m immediate" is going to cost them. That would be useful information to have, but I bet the answer is "not that many". Most people don't shut down their database very often; they're looking for continuous uptime. It looks to me like autovacuum activity causes the statistics files to get refreshed at least once per autovacuum_naptime, which defaults to once a minute, so on the average we're talking about the loss of perhaps 30 seconds worth of statistics. What percentage of database activity occurs within 30 second of an immediate shutdown? For a hypothetical installation where the DBA does an immediate shutdown at four randomly-chosen times each day, the answer is "less than 0.2%". In real installations, the time between immediate shutdown is likely to be weeks or months, and so the answer is perhaps two or three orders of magnitude less than that. The loss of a few (or even a few dozen) parts-per-million of statistics data shouldn't make any material difference. To make this add up to a significant loss, you have to suppose that there was a particularly large transaction in flight just before the crash. But it can't have been in flight at the moment of the crash, because then the statistics message would not then have been sent. So it has to be the case that some really big transaction ended and then just after that the DBA did an immediate shutdown. I'm not arguing that it couldn't happen, but it's a pretty narrow target to be aiming at. I also think that you're wildly overestimating the likelihood that writing the stats file will be fast, because (1) anything that involves writing to the disk can be very slow, either because there's a lot of other write activity or because the disk is going bad, the latter being actually a pretty common cause of emergency database shutdowns, (2) the stats files can be quite large if the database system contains hundreds of thousands or even millions of objects, which is not all that infrequent, and (3) pgstat wait timeouts are pretty common, which would not be the case if writing the file was invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741). >> ... Yeah, it's not good, but neither are the things that prompt >> people to perform an immediate shutdown in the first place. > > Really? I think many users think an immediate shutdown is just fine. Why would anybody ever perform an immediate shutdown rather than a fast shutdown if a fast shutdown were fast? Now you've incurred the overhead of a crash recovery for no gain. In my experience, people perform immediate shutdowns mostly when they try something else and it doesn't work. (Other reasons: They are Oracle users who think that immediate will do what it does on Oracle, namely what we call a fast shutdown; or they don't want to wait for the shutdown checkpoint.) >> The patch that >> you claim moots this one was inspired by immediate shutdowns taking >> too long, and that patch enjoyed pretty broad support IIRC. > > I think that's historical revisionism. The commit message for 82233ce7e > says "(This might happen, for example, if a backend gets tangled trying > to malloc() due to gettext(), as in an example illustrated by MauMau.)". > There is absolutely nothing in it about people being dissatisfied with > the shutdown timing in normal cases; rather, it was done to prevent > cases where shutdown failed to happen at all. OK, you're probably right about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Tue, Nov 22, 2016 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> But that's not what is at issue here. The issue is whether, when > >> asked to exit immediately, all processes should exit immediately, or > >> whether it would be better for all processes except one to exit > >> immediately and the last one exit non-immediately. In other words, > >> when the user asks for an immediate shutdown, do they really mean it, > >> or are they OK taking time to do some other stuff first? > > > > Peter already gave the response to that, which is that users do not > > expect an immediate shutdown to have permanent harmful effects. > > It doesn't have such effects so far as the SQL data goes; why is it > > okay to blow away statistical data? > > You're doing that anyway. The backends aren't going to send any > accumulated but unsent statistics to the stats collector before > exiting; they're just going to exit. Sure. That loses a few counts, but as you argue below, it's just "a few parts per million". > > Yes, I am, and I disagree with you. The current decision on this point > > was made ages ago, before autovacuum even existed let alone relied on > > the stats for proper functioning. The tradeoff you're saying you're > > okay with is "we'll shut down a few seconds faster, but you're going > > to have table bloat problems later because autovacuum won't know it > > needs to do anything". I wonder how many of the complaints we get > > about table bloat are a consequence of people not realizing that > > "pg_ctl stop -m immediate" is going to cost them. > > That would be useful information to have, but I bet the answer is "not > that many". Most people don't shut down their database very often; > they're looking for continuous uptime. It looks to me like autovacuum > activity causes the statistics files to get refreshed at least once > per autovacuum_naptime, which defaults to once a minute, so on the > average we're talking about the loss of perhaps 30 seconds worth of > statistics. I think you're misunderstanding how this works. Losing that file doesn't lose just the final 30 seconds worth of data -- it loses *everything*, and every counter goes back to zero. So it's not a few parts-per-million, it loses however many millions there were. > I also think that you're wildly overestimating the likelihood that > writing the stats file will be fast, because (1) anything that > involves writing to the disk can be very slow, either because there's > a lot of other write activity or because the disk is going bad, the > latter being actually a pretty common cause of emergency database > shutdowns, (2) the stats files can be quite large if the database > system contains hundreds of thousands or even millions of objects, > which is not all that infrequent, and (3) pgstat wait timeouts are > pretty common, which would not be the case if writing the file was > invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741). Those writes are slow because of the concurrent activity. If all backends just throw their hands in the air, no more writes come from them, so the OS is going to finish the writes pretty quickly (or at least empty enough of the caches so that the pgstat data fits); so neither (1) nor (3) should be terribly serious. I agree that (2) is a problem, but it's not a problem for everyone. > >> ... Yeah, it's not good, but neither are the things that prompt > >> people to perform an immediate shutdown in the first place. > > > > Really? I think many users think an immediate shutdown is just fine. > > Why would anybody ever perform an immediate shutdown rather than a > fast shutdown if a fast shutdown were fast? A fast shutdown is not all that fast -- it needs to write the whole contents of shared buffers down to disk, which may be enormous. Millions of times bigger than pgstat data. So a fast shutdown is actually very slow in a large machine. An immediate shutdown, even if it writes pgstat data, is still going to be much smaller in terms of what is written. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> > Yes, I am, and I disagree with you. The current decision on this point >> > was made ages ago, before autovacuum even existed let alone relied on >> > the stats for proper functioning. The tradeoff you're saying you're >> > okay with is "we'll shut down a few seconds faster, but you're going >> > to have table bloat problems later because autovacuum won't know it >> > needs to do anything". I wonder how many of the complaints we get >> > about table bloat are a consequence of people not realizing that >> > "pg_ctl stop -m immediate" is going to cost them. >> >> That would be useful information to have, but I bet the answer is "not >> that many". Most people don't shut down their database very often; >> they're looking for continuous uptime. It looks to me like autovacuum >> activity causes the statistics files to get refreshed at least once >> per autovacuum_naptime, which defaults to once a minute, so on the >> average we're talking about the loss of perhaps 30 seconds worth of >> statistics. > > I think you're misunderstanding how this works. Losing that file > doesn't lose just the final 30 seconds worth of data -- it loses > *everything*, and every counter goes back to zero. So it's not a few > parts-per-million, it loses however many millions there were. OK, that's possible, but I'm not sure. I think there are two separate issues here. One is whether we should nuke the stats file on recovery, and the other is whether we should force a final write of the stats file before agreeing to an immediate shutdown. It seems to me that the first one affects whether all of the counters go to zero, and the second affects whether we lose a small amount of data from just prior to the shutdown. Right now, we are doing the first, so the second is a waste. If we decide to start doing the first, we can independently decide whether to also do the second. >> I also think that you're wildly overestimating the likelihood that >> writing the stats file will be fast, because (1) anything that >> involves writing to the disk can be very slow, either because there's >> a lot of other write activity or because the disk is going bad, the >> latter being actually a pretty common cause of emergency database >> shutdowns, (2) the stats files can be quite large if the database >> system contains hundreds of thousands or even millions of objects, >> which is not all that infrequent, and (3) pgstat wait timeouts are >> pretty common, which would not be the case if writing the file was >> invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741). > > Those writes are slow because of the concurrent activity. If all > backends just throw their hands in the air, no more writes come from > them, so the OS is going to finish the writes pretty quickly (or at > least empty enough of the caches so that the pgstat data fits); so > neither (1) nor (3) should be terribly serious. I agree that (2) is a > problem, but it's not a problem for everyone. If the operating system buffer cache doesn't contain much dirty data, then I agree. But there is a large backlog of dirty data there, then it might be quite slow. >> >> ... Yeah, it's not good, but neither are the things that prompt >> >> people to perform an immediate shutdown in the first place. >> > >> > Really? I think many users think an immediate shutdown is just fine. >> >> Why would anybody ever perform an immediate shutdown rather than a >> fast shutdown if a fast shutdown were fast? > > A fast shutdown is not all that fast -- it needs to write the whole > contents of shared buffers down to disk, which may be enormous. > Millions of times bigger than pgstat data. So a fast shutdown is > actually very slow in a large machine. An immediate shutdown, even if > it writes pgstat data, is still going to be much smaller in terms of > what is written. I agree. However, in many cases, the major cost of a fast shutdown is getting the dirty data already in the operating system buffers down to disk, not in writing out shared_buffers itself. The latter is probably a single-digit number of gigabytes, or maybe double-digit. The former might be a lot more, and the write of the pgstat file may back up behind it. I've seen cases where an 8kB buffered write from Postgres takes tens of seconds to complete because the OS buffer cache is already saturated with dirty data, and the stats files could easily be a lot more than that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > >> > Yes, I am, and I disagree with you. The current decision on this point > >> > was made ages ago, before autovacuum even existed let alone relied on > >> > the stats for proper functioning. The tradeoff you're saying you're > >> > okay with is "we'll shut down a few seconds faster, but you're going > >> > to have table bloat problems later because autovacuum won't know it > >> > needs to do anything". I wonder how many of the complaints we get > >> > about table bloat are a consequence of people not realizing that > >> > "pg_ctl stop -m immediate" is going to cost them. > >> > >> That would be useful information to have, but I bet the answer is "not > >> that many". Most people don't shut down their database very often; > >> they're looking for continuous uptime. It looks to me like autovacuum > >> activity causes the statistics files to get refreshed at least once > >> per autovacuum_naptime, which defaults to once a minute, so on the > >> average we're talking about the loss of perhaps 30 seconds worth of > >> statistics. > > > > I think you're misunderstanding how this works. Losing that file > > doesn't lose just the final 30 seconds worth of data -- it loses > > *everything*, and every counter goes back to zero. So it's not a few > > parts-per-million, it loses however many millions there were. > > OK, that's possible, but I'm not sure. I think there are two separate > issues here. One is whether we should nuke the stats file on > recovery, and the other is whether we should force a final write of > the stats file before agreeing to an immediate shutdown. It seems to > me that the first one affects whether all of the counters go to zero, > and the second affects whether we lose a small amount of data from > just prior to the shutdown. Right now, we are doing the first, so the > second is a waste. If we decide to start doing the first, we can > independently decide whether to also do the second. Well, the problem is that the stats data is not on disk while the system is in operation, as far as I recall -- it's only in the collector's local memory. On shutdown we tell it to write it down to a file, and on startup we tell it to read it from the file and then delete it. I think the rationale for this is to avoid leaving a file with stale data on disk while the system is running. > > Those writes are slow because of the concurrent activity. If all > > backends just throw their hands in the air, no more writes come from > > them, so the OS is going to finish the writes pretty quickly (or at > > least empty enough of the caches so that the pgstat data fits); so > > neither (1) nor (3) should be terribly serious. I agree that (2) is a > > problem, but it's not a problem for everyone. > > If the operating system buffer cache doesn't contain much dirty data, > then I agree. But there is a large backlog of dirty data there, then > it might be quite slow. That's true, but if the system isn't crashing, then writing a bunch of pages would make room for the pgstat data to be written to the OS, which is enough (we request only a write, not a flush, as I recall). So we don't need to wait for a very long period. > > A fast shutdown is not all that fast -- it needs to write the whole > > contents of shared buffers down to disk, which may be enormous. > > Millions of times bigger than pgstat data. So a fast shutdown is > > actually very slow in a large machine. An immediate shutdown, even if > > it writes pgstat data, is still going to be much smaller in terms of > > what is written. > > I agree. However, in many cases, the major cost of a fast shutdown is > getting the dirty data already in the operating system buffers down to > disk, not in writing out shared_buffers itself. The latter is > probably a single-digit number of gigabytes, or maybe double-digit. > The former might be a lot more, and the write of the pgstat file may > back up behind it. I've seen cases where an 8kB buffered write from > Postgres takes tens of seconds to complete because the OS buffer cache > is already saturated with dirty data, and the stats files could easily > be a lot more than that. In the default config, background flushing is invoked when memory is 10% dirty (dirty_background_ratio); foreground flushing is forced when memory is 40% dirty (dirty_ratio). That means the pgstat process can dirty 30% additional memory before being forced to perform flushing. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 22, 2016 at 3:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> OK, that's possible, but I'm not sure. I think there are two separate >> issues here. One is whether we should nuke the stats file on >> recovery, and the other is whether we should force a final write of >> the stats file before agreeing to an immediate shutdown. It seems to >> me that the first one affects whether all of the counters go to zero, >> and the second affects whether we lose a small amount of data from >> just prior to the shutdown. Right now, we are doing the first, so the >> second is a waste. If we decide to start doing the first, we can >> independently decide whether to also do the second. > > Well, the problem is that the stats data is not on disk while the system > is in operation, as far as I recall -- it's only in the collector's > local memory. On shutdown we tell it to write it down to a file, and on > startup we tell it to read it from the file and then delete it. I think > the rationale for this is to avoid leaving a file with stale data on > disk while the system is running. /me tests. I think you are almost right. When the server is running, there are files in pg_stat_tmp but not pg_stat; when it is shut down, there are files in pg_stat but not pg_stat_tmp. Of course the data can never be ONLY in the collector's backend-local memory because then nobody else could read it. I don't actually really understand the reason for this distinction. If it's important not to lose the data, then the current system is the worst of all possible worlds, and it would be better to have only only one file and atomically rename() a new one in over top of it from time to time. If we did that and also committed the proposed patch, it would be only slightly worse than if we did only that. Wouldn't it? >> > Those writes are slow because of the concurrent activity. If all >> > backends just throw their hands in the air, no more writes come from >> > them, so the OS is going to finish the writes pretty quickly (or at >> > least empty enough of the caches so that the pgstat data fits); so >> > neither (1) nor (3) should be terribly serious. I agree that (2) is a >> > problem, but it's not a problem for everyone. >> >> If the operating system buffer cache doesn't contain much dirty data, >> then I agree. But there is a large backlog of dirty data there, then >> it might be quite slow. > > That's true, but if the system isn't crashing, then writing a bunch of > pages would make room for the pgstat data to be written to the OS, which > is enough (we request only a write, not a flush, as I recall). So we > don't need to wait for a very long period. I'm not sure what you are saying here. Of course, if the OS writes pages to disk then there will be room in the buffer cache for more dirty pages. The issue is whether this will unduly delay shutdown. >> > A fast shutdown is not all that fast -- it needs to write the whole >> > contents of shared buffers down to disk, which may be enormous. >> > Millions of times bigger than pgstat data. So a fast shutdown is >> > actually very slow in a large machine. An immediate shutdown, even if >> > it writes pgstat data, is still going to be much smaller in terms of >> > what is written. >> >> I agree. However, in many cases, the major cost of a fast shutdown is >> getting the dirty data already in the operating system buffers down to >> disk, not in writing out shared_buffers itself. The latter is >> probably a single-digit number of gigabytes, or maybe double-digit. >> The former might be a lot more, and the write of the pgstat file may >> back up behind it. I've seen cases where an 8kB buffered write from >> Postgres takes tens of seconds to complete because the OS buffer cache >> is already saturated with dirty data, and the stats files could easily >> be a lot more than that. > > In the default config, background flushing is invoked when memory is 10% > dirty (dirty_background_ratio); foreground flushing is forced when > memory is 40% dirty (dirty_ratio). That means the pgstat process can > dirty 30% additional memory before being forced to perform flushing. There's absolutely no guarantee that we aren't already hard up against that 40% threshold - or whatever it is - already. Backends write data all the time, and flushes only happen at checkpoint time. Indeed, I'd argue that if somebody is performing an immediate shutdown, the most likely reason is that a fast shutdown is too slow. And the most likely reason for that is that the operating system buffer cache is full of dirty data. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I agree.  However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself.  The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it.  I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.
I think this is mostly FUD, because we don't fsync the stats files.  Maybe
we should, but we don't today.  So even if we have managed to get the
system into a state where physical writes are heavily backlogged, that's
not a reason to assume that the stats collector will be unable to do its
thing promptly.  All it has to do is push a relatively small amount of
data into kernel buffers.
        regards, tom lane
			
		Hi,
On 2016-11-22 15:49:27 -0500, Robert Haas wrote:
> I think you are almost right.  When the server is running, there are
> files in pg_stat_tmp but not pg_stat; when it is shut down, there are
> files in pg_stat but not pg_stat_tmp.  Of course the data can never be
> ONLY in the collector's backend-local memory because then nobody else
> could read it.
> I don't actually really understand the reason for this distinction.
pg_stat_tmp commonly is placed on tmpfs/a ramdisk.
But I'm a bit confused too - does this make any sort of difference?
Because the startup path for crash recovery is like this:
void
StartupXLOG(void)
{
.../* REDO */if (InRecovery){
...    /*     * Reset pgstat data, because it may be invalid after recovery.     */    pgstat_reset_all();
so it seems quite inconsequential whether we write out pgstat, because
we're going to nuke it either way after an immediate shutdown?
Greetings,
Andres Freund
			
		Andres Freund <andres@anarazel.de> writes:
> But I'm a bit confused too - does this make any sort of difference?
> Because the startup path for crash recovery is like this:
>         pgstat_reset_all();
> so it seems quite inconsequential whether we write out pgstat, because
> we're going to nuke it either way after an immediate shutdown?
The discussion is exactly about whether we shouldn't get rid of that,
rather than doing what's proposed in this patch.
        regards, tom lane
			
		On Tue, Nov 22, 2016 at 3:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I agree. However, in many cases, the major cost of a fast shutdown is >> getting the dirty data already in the operating system buffers down to >> disk, not in writing out shared_buffers itself. The latter is >> probably a single-digit number of gigabytes, or maybe double-digit. >> The former might be a lot more, and the write of the pgstat file may >> back up behind it. I've seen cases where an 8kB buffered write from >> Postgres takes tens of seconds to complete because the OS buffer cache >> is already saturated with dirty data, and the stats files could easily >> be a lot more than that. > > I think this is mostly FUD, because we don't fsync the stats files. Maybe > we should, but we don't today. So even if we have managed to get the > system into a state where physical writes are heavily backlogged, that's > not a reason to assume that the stats collector will be unable to do its > thing promptly. All it has to do is push a relatively small amount of > data into kernel buffers. I don't believe that's automatically fast, if we're bumping up against dirty_ratio. However, suppose you're right. Then what prompted the original complaint? The OP said "The problem here is that postmaster took as long as 15 seconds to terminate after it had detected a crashed backend." It clearly WASN'T an indefinite hang as might have occurred with the malloc-lock problem for which we implemented the SIGKILL stuff. So something during shutdown took a long time, but not forever. There's no convincing evidence I've seen that it has to have been this particular thing, but I find it plausible, because normal backends bail out without doing much of anything, and here we have a process that is trying to continue doing work after having received SIGQUIT. If not this, then what? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Well, the problem is that the stats data is not on disk while the system
> is in operation, as far as I recall -- it's only in the collector's
> local memory.  On shutdown we tell it to write it down to a file, and on
> startup we tell it to read it from the file and then delete it.  I think
> the rationale for this is to avoid leaving a file with stale data on
> disk while the system is running.
Maybe a workable compromise would be to leave the file present, and have
the stats collector re-write it every (say) five minutes.  Then I'd be
okay with having an immediate shutdown skip writing the file; you'd be
losing up to five minutes' worth of activity, but not going completely
nuts.  So the stats collector's normal activities would include writing
the temp file on-demand and the permanent file on a timed cycle.
The other components of the fix (deleting on PITR rewind or stats
collector crash) would remain the same.
        regards, tom lane
			
		On Tue, Nov 22, 2016 at 4:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Well, the problem is that the stats data is not on disk while the system >> is in operation, as far as I recall -- it's only in the collector's >> local memory. On shutdown we tell it to write it down to a file, and on >> startup we tell it to read it from the file and then delete it. I think >> the rationale for this is to avoid leaving a file with stale data on >> disk while the system is running. > > Maybe a workable compromise would be to leave the file present, and have > the stats collector re-write it every (say) five minutes. Then I'd be > okay with having an immediate shutdown skip writing the file; you'd be > losing up to five minutes' worth of activity, but not going completely > nuts. So the stats collector's normal activities would include writing > the temp file on-demand and the permanent file on a timed cycle. > > The other components of the fix (deleting on PITR rewind or stats > collector crash) would remain the same. Sounds great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-11-22 16:15:58 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Well, the problem is that the stats data is not on disk while the system > > is in operation, as far as I recall -- it's only in the collector's > > local memory. On shutdown we tell it to write it down to a file, and on > > startup we tell it to read it from the file and then delete it. I think > > the rationale for this is to avoid leaving a file with stale data on > > disk while the system is running. > > Maybe a workable compromise would be to leave the file present, and have > the stats collector re-write it every (say) five minutes. Then I'd be > okay with having an immediate shutdown skip writing the file; you'd be > losing up to five minutes' worth of activity, but not going completely > nuts. So the stats collector's normal activities would include writing > the temp file on-demand and the permanent file on a timed cycle. > > The other components of the fix (deleting on PITR rewind or stats > collector crash) would remain the same. It could even make sense to WAL log the contents of the stats file at checkpoint (or similar) time. Then it'd be sane after crash recovery, including starting from an old base backup. Loosing the information about what to vacuum / analyze after a failover is quite problematic... Andres
Andres Freund wrote: > On 2016-11-22 16:15:58 -0500, Tom Lane wrote: > > Maybe a workable compromise would be to leave the file present, and have > > the stats collector re-write it every (say) five minutes. Then I'd be > > okay with having an immediate shutdown skip writing the file; you'd be > > losing up to five minutes' worth of activity, but not going completely > > nuts. So the stats collector's normal activities would include writing > > the temp file on-demand and the permanent file on a timed cycle. > > > > The other components of the fix (deleting on PITR rewind or stats > > collector crash) would remain the same. +1 > It could even make sense to WAL log the contents of the stats file at > checkpoint (or similar) time. Then it'd be sane after crash recovery, > including starting from an old base backup. Loosing the information > about what to vacuum / analyze after a failover is quite problematic... An idea worth considering. This would also mean the file is valid in a standby -- the lack of the file is currently a problem if you promote the standby, as I recall. But the file is perhaps too large to write to WAL on every checkpoint. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 22, 2016 at 10:26 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Andres Freund wrote:
> On 2016-11-22 16:15:58 -0500, Tom Lane wrote:
> > Maybe a workable compromise would be to leave the file present, and have
> > the stats collector re-write it every (say) five minutes. Then I'd be
> > okay with having an immediate shutdown skip writing the file; you'd be
> > losing up to five minutes' worth of activity, but not going completely
> > nuts. So the stats collector's normal activities would include writing
> > the temp file on-demand and the permanent file on a timed cycle.
> >
> > The other components of the fix (deleting on PITR rewind or stats
> > collector crash) would remain the same.
+1
> It could even make sense to WAL log the contents of the stats file at
> checkpoint (or similar) time. Then it'd be sane after crash recovery,
> including starting from an old base backup. Loosing the information
> about what to vacuum / analyze after a failover is quite problematic...
An idea worth considering. This would also mean the file is valid in a
standby -- the lack of the file is currently a problem if you promote
the standby, as I recall. But the file is perhaps too large to write to
WAL on every checkpoint.
There's also the consideration of what to do with stats *on the standby*. If we WAL log the stats file, then when it replays onthe standby, the stats there will be overwritten. And stats like number of index vs seq scans on the standby are still interesting and would be lost.
That's not to say that I don't agree with the issues of failover vs autovacuum. But let's try to find a way that doesn't hurt other things to get there. But that could just be part of how we deal with the replay of it.
On Wed, Nov 23, 2016 at 6:15 PM, Magnus Hagander <magnus@hagander.net> wrote: > There's also the consideration of what to do with stats *on the standby*. If > we WAL log the stats file, then when it replays onthe standby, the stats > there will be overwritten. And stats like number of index vs seq scans on > the standby are still interesting and would be lost. Perhaps it would make sense to separate the stat files by type then? The action taken for each file depends on its type. -- Michael
On 11/23/2016 12:24 PM, Michael Paquier wrote: > On Wed, Nov 23, 2016 at 6:15 PM, Magnus Hagander <magnus@hagander.net> wrote: >> There's also the consideration of what to do with stats *on the standby*. If >> we WAL log the stats file, then when it replays onthe standby, the stats >> there will be overwritten. And stats like number of index vs seq scans on >> the standby are still interesting and would be lost. > > Perhaps it would make sense to separate the stat files by type then? > The action taken for each file depends on its type. > That seems reasonable. There are two types of stats - usage statistics (number of index scans etc.), which we probably don't need on standby, and statistics that we use to drive autovacuum. This would also reduce the amount of data that we need to write to WAL, although I'm not sure the amount is actually a problem. I've seen instances with ~500MB stat files, but those were instances with hundreds of databases, each with many thousands of objects. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> This would also reduce the amount of data that we need to write to WAL, 
> although I'm not sure the amount is actually a problem. I've seen 
> instances with ~500MB stat files, but those were instances with hundreds 
> of databases, each with many thousands of objects.
Hm, was this before we split up the stats file per-database?
        regards, tom lane
			
		From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > Robert Haas <robertmhaas@gmail.com> writes: > > I agree. However, in many cases, the major cost of a fast shutdown is > > getting the dirty data already in the operating system buffers down to > > disk, not in writing out shared_buffers itself. The latter is > > probably a single-digit number of gigabytes, or maybe double-digit. > > The former might be a lot more, and the write of the pgstat file may > > back up behind it. I've seen cases where an 8kB buffered write from > > Postgres takes tens of seconds to complete because the OS buffer cache > > is already saturated with dirty data, and the stats files could easily > > be a lot more than that. > > I think this is mostly FUD, because we don't fsync the stats files. Maybe > we should, but we don't today. So even if we have managed to get the system > into a state where physical writes are heavily backlogged, that's not a > reason to assume that the stats collector will be unable to do its thing > promptly. All it has to do is push a relatively small amount of data into > kernel buffers. I'm sorry for my late reply, yesterday was a national holiday in Japan. It's not FUD. I understand you hit the slow stats file write problem during some regression test. You said it took 57 secondsto write the stats file during the postmaster shutdown. That caused pg_ctl stop to fail due to its 60 second timeout. Even the regression test environment suffered from the trouble. Regards Takayuki Tsunakawa
From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane > Maybe a workable compromise would be to leave the file present, and have > the stats collector re-write it every (say) five minutes. Then I'd be okay > with having an immediate shutdown skip writing the file; you'd be losing > up to five minutes' worth of activity, but not going completely nuts. So > the stats collector's normal activities would include writing the temp file > on-demand and the permanent file on a timed cycle. > > The other components of the fix (deleting on PITR rewind or stats collector > crash) would remain the same. The manual says: "Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 ms unless alteredwhile building the server). So the displayed information lags behind actual activity." Doesn't this mean that the stats collector writes files in pg_stat_tmp/ every 500ms? If true, how about just moving thosefiles into appropriate locations during recovery, instead of removing the files? I also find others's ideas woth considering -- WAL-logging the stats files, type-specific stats files, etc. -- but I'm afraidthose ideas would only be employed in a new major release, not in released versions. I'm asking for a remedy for auser (and potential users) who use older releases. And, I don't yet understand why patch would make the situation for existingusers, and why stop writing files during immediate/abnormal shutdown requires other efforts. Regards Takayuki Tsunakawa
On Thu, Nov 24, 2016 at 12:41 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: pgsql-hackers-owner@postgresql.org >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane >> Robert Haas <robertmhaas@gmail.com> writes: >> > I agree. However, in many cases, the major cost of a fast shutdown is >> > getting the dirty data already in the operating system buffers down to >> > disk, not in writing out shared_buffers itself. The latter is >> > probably a single-digit number of gigabytes, or maybe double-digit. >> > The former might be a lot more, and the write of the pgstat file may >> > back up behind it. I've seen cases where an 8kB buffered write from >> > Postgres takes tens of seconds to complete because the OS buffer cache >> > is already saturated with dirty data, and the stats files could easily >> > be a lot more than that. >> >> I think this is mostly FUD, because we don't fsync the stats files. Maybe >> we should, but we don't today. So even if we have managed to get the system >> into a state where physical writes are heavily backlogged, that's not a >> reason to assume that the stats collector will be unable to do its thing >> promptly. All it has to do is push a relatively small amount of data into >> kernel buffers. > > I'm sorry for my late reply, yesterday was a national holiday in Japan. > > It's not FUD. I understand you hit the slow stats file write problem during some regression test. You said it took 57seconds to write the stats file during the postmaster shutdown. That caused pg_ctl stop to fail due to its 60 second timeout. Even the regression test environment suffered from the trouble. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> > I also find others's ideas woth considering -- WAL-logging the stats files, type-specific stats files, etc. -- but I'mafraid those ideas would only be employed in a new major release, not in released versions. I'm asking for a remedy fora user (and potential users) who use older releases. And, I don't yet understand why patch would make the situation forexisting users, and why stop writing files during immediate/abnormal shutdown requires other efforts. > I agree with this. I don't think we are going to change stats file management in older versions. Hence after every crash recovery they will be thrown away. In that case, why should stats collector in the older versions wait for those files to be flushed to the disk? I think, we should apply patch to shut down stats collector immediately to the older versions and implement the stats file management solution to the head. Why isn't that a feasible option? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company