Thread: could not stat promote trigger file leads to shutdown
I have seen the error
could not stat promote trigger file "...": Permission denied
because of a misconfiguration (for example, setting promote_trigger_file
to point into a directory to which you don't have appropriate read or
execute access).
The problem is that because this happens in the startup process, the
ERROR is turned into a FATAL and the whole instance shuts down. That
seems like a harsh penalty. Would it be better to turn this ERROR into
a WARNING?
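
For reference, the relevant bit of CheckForStandbyTrigger() in xlog.c
looks roughly like this (paraphrased from memory, so details may differ
slightly):

    if (stat(PromoteTriggerFile, &stat_buf) == 0)
    {
        ereport(LOG,
                (errmsg("promote trigger file found: %s", PromoteTriggerFile)));
        unlink(PromoteTriggerFile);
        triggered = true;
        fast_promote = true;
        return true;
    }
    else if (errno != ENOENT)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not stat promote trigger file \"%s\": %m",
                        PromoteTriggerFile)));

Anything other than a successful stat() or a plain ENOENT goes down the
ERROR path, and in the startup process that ERROR becomes a FATAL.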
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 14, 2019 at 10:58 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> I have seen the error
>
> could not stat promote trigger file "...": Permission denied
>
> because of a misconfiguration (for example, setting promote_trigger_file
> to point into a directory to which you don't have appropriate read or
> execute access).
>
> The problem is that because this happens in the startup process, the
> ERROR is turned into a FATAL and the whole instance shuts down. That
> seems like a harsh penalty. Would it be better to turn this ERROR into
> a WARNING?

+1

Regards,

--
Fujii Masao
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> I have seen the error
> could not stat promote trigger file "...": Permission denied
> because of a misconfiguration (for example, setting promote_trigger_file
> to point into a directory to which you don't have appropriate read or
> execute access).
> The problem is that because this happens in the startup process, the
> ERROR is turned into a FATAL and the whole instance shuts down. That
> seems like a harsh penalty. Would it be better to turn this ERROR into
> a WARNING?
It is harsh, but I suspect if we just logged the complaint, we'd get
bug reports about "Postgres isn't reacting to my trigger file",
because people don't read the postmaster log unless forced to.
Is there some more-visible way to report the problem, short of
shutting down?
(BTW, from this perspective, WARNING is especially bad because it
might not get logged at all. Better to use LOG.)
One thought is to try to detect the misconfiguration at postmaster
start --- better to fail at startup than sometime later. But I'm
not sure how reliably we could do that.
regards, tom lane
On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> It is harsh, but I suspect if we just logged the complaint, we'd get
> bug reports about "Postgres isn't reacting to my trigger file",
> because people don't read the postmaster log unless forced to.
> Is there some more-visible way to report the problem, short of
> shutting down?
>
> (BTW, from this perspective, WARNING is especially bad because it
> might not get logged at all. Better to use LOG.)

Neither am I comfortable with that.

> One thought is to try to detect the misconfiguration at postmaster
> start --- better to fail at startup than sometime later. But I'm
> not sure how reliably we could do that.

I think that we could do something close to the area where
RemovePromoteSignalFiles() gets called. Why not simply checking the
path defined by PromoteTriggerFile() at startup time then? I take it
that the only thing we should not complain about is stat() returning
ENOENT when looking at the promote file defined.
--
Michael
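
A rough sketch of that idea, as a hypothetical helper invoked next to
where RemovePromoteSignalFiles() gets called (names invented for
illustration; only ENOENT is treated as benign):

#include "postgres.h"

#include <sys/stat.h>

#include "access/xlog.h"

/*
 * Hypothetical startup-time sanity check: if promote_trigger_file is set,
 * refuse to continue when the path cannot be stat()ed for any reason
 * other than the file simply not existing yet.
 */
static void
ValidatePromoteTriggerFile(void)
{
    struct stat stat_buf;

    if (PromoteTriggerFile == NULL || PromoteTriggerFile[0] == '\0')
        return;

    if (stat(PromoteTriggerFile, &stat_buf) != 0 && errno != ENOENT)
        ereport(FATAL,
                (errcode_for_file_access(),
                 errmsg("could not stat promote trigger file \"%s\": %m",
                        PromoteTriggerFile)));
}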
On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
> > It is harsh, but I suspect if we just logged the complaint, we'd get
> > bug reports about "Postgres isn't reacting to my trigger file",
> > because people don't read the postmaster log unless forced to.
> > Is there some more-visible way to report the problem, short of
> > shutting down?
> >
> > (BTW, from this perspective, WARNING is especially bad because it
> > might not get logged at all. Better to use LOG.)
>
> Neither am I comfortable with that.

I always wonder why WARNING was defined that way.
I think that users usually pay attention to the word "WARNING"
rather than "LOG".

> > One thought is to try to detect the misconfiguration at postmaster
> > start --- better to fail at startup than sometime later. But I'm
> > not sure how reliably we could do that.
>
> I think that we could do something close to the area where
> RemovePromoteSignalFiles() gets called. Why not simply checking the
> path defined by PromoteTriggerFile() at startup time then? I take it
> that the only thing we should not complain about is stat() returning
> ENOENT when looking at the promote file defined.

promote_trigger_file is declared as PGC_SIGHUP,
so such check would be necessary even while the standby is running.
Which can cause the server to fail after the startup.

Regards,

--
Fujii Masao
On 2019-11-15 03:14, Fujii Masao wrote:
>>> One thought is to try to detect the misconfiguration at postmaster
>>> start --- better to fail at startup than sometime later. But I'm
>>> not sure how reliably we could do that.
>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called. Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then? I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.
> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.
Let me illustrate a scenario in a more lively way:
Say you want to set up promote_trigger_file to point to a file outside
of the data directory, maybe because you want to integrate it with some
external tooling. So you go into your configuration and set
promote_trigger_file = '/srv/foobar/trigger'
and reload the server. Everything is happy. The fact that the
directory /srv/foobar/ does not exist at this point is completely ignored.
Now you become root and run
mkdir /srv/foobar
and, depending on circumstances such as root's umask or the permissions of
/srv, your PostgreSQL server crashes immediately. That can't be good.
Also imagine the above steps being run by a configuration management system.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Nov 15, 2019 at 10:49 AM Michael Paquier <michael@paquier.xyz> wrote:
>> On Thu, Nov 14, 2019 at 10:38:30AM -0500, Tom Lane wrote:
>>> (BTW, from this perspective, WARNING is especially bad because it
>>> might not get logged at all. Better to use LOG.)
>> Neither am I comfortable with that.
> I always wonder why WARNING was defined that way.
> I think that users usually pay attention to the word "WARNING"
> rather than "LOG".
The issue really is "what are we warning about". The way things
are set up basically assumes that WARNING is for complaining about
user-issued commands that might not be doing what the user wants.
Which is a legitimate use-case, but it doesn't necessarily mean
something that's very important to put in the postmaster log.
What we're seeing, in these repeated proposals to use WARNING in
some background process that doesn't run user commands, is that
there is also a use-case for "more-significant-than-usual log
messages". Maybe we need a new elevel category for that.
SYSTEM_WARNING or LOG_WARNING, perhaps?
>> I think that we could do something close to the area where
>> RemovePromoteSignalFiles() gets called. Why not simply checking the
>> path defined by PromoteTriggerFile() at startup time then? I take it
>> that the only thing we should not complain about is stat() returning
>> ENOENT when looking at the promote file defined.
> promote_trigger_file is declared as PGC_SIGHUP,
> so such check would be necessary even while the standby is running.
> Which can cause the server to fail after the startup.
No, it'd be just the same as any other GUC: if we make such a test
in the check hook, then we'd fail for a bad value at startup, but
at SIGHUP we'd just reject the new setting. I think this might be
a workable answer to Peter's concern.
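
A minimal sketch of such a check hook, just to illustrate (hypothetical
code; the real thing would be registered for promote_trigger_file in
guc.c):

#include "postgres.h"

#include <sys/stat.h>

#include "utils/guc.h"

/* Hypothetical check hook for promote_trigger_file (illustration only) */
static bool
check_promote_trigger_file(char **newval, void **extra, GucSource source)
{
    struct stat stat_buf;

    /* An empty setting disables the feature, so there is nothing to check. */
    if (*newval == NULL || (*newval)[0] == '\0')
        return true;

    /* The file being absent is fine; anything else we cannot stat is not. */
    if (stat(*newval, &stat_buf) != 0 && errno != ENOENT)
    {
        GUC_check_errdetail("Could not stat file \"%s\": %s.",
                            *newval, strerror(errno));
        return false;
    }

    return true;
}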
regards, tom lane
Hello

> Maybe we need a new elevel category for that.
> SYSTEM_WARNING or LOG_WARNING, perhaps?

I think separate levels for user warnings and system warnings (and
errors) would be great for log analytics. An error due to a user typo in
a query is not the same as a cache lookup error (for example).

regards, Sergei
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Say you want to set up promote_trigger_file to point to a file outside
> of the data directory, maybe because you want to integrate it with some
> external tooling. So you go into your configuration and set
> promote_trigger_file = '/srv/foobar/trigger'
> and reload the server. Everything is happy. The fact that the
> directory /srv/foobar/ does not exist at this point is completely ignored.
> Now you become root and run
> mkdir /srv/foobar
> and, depending on circumstances such as root's umask or the permissions of
> /srv, your PostgreSQL server crashes immediately. That can't be good.
No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
the problem till later, ie when you try to promote the server nothing
happens. That's not good either. (To be clear: I'm not necessarily
against that change, I just don't think it's a sufficient response.)
If we add a GUC-check-hook test, then the problem of misconfiguration
is reduced to the previously unsolved problem that we have crappy
feedback for erroneous on-the-fly configuration changes. So it's
still unsolved, but at least we've got one unsolved problem not two.
regards, tom lane
On 2019-Nov-15, Tom Lane wrote:

> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes. So it's
> still unsolved, but at least we've got one unsolved problem not two.

I am now against this kind of behavior, because nowadays it is common
to have external orchestrating systems stopping and starting postmaster
on their own volition.

If this kind of misconfiguration causes postmaster refuse to start, it
can effectively become a service-shutdown scenario which requires the
DBA to go temporarily mad.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Nov-15, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes. So it's
>> still unsolved, but at least we've got one unsolved problem not two.
> I am now against this kind of behavior, because nowadays it is common
> to have external orchestrating systems stopping and starting postmaster
> on their own volition.
> If this kind of misconfiguration causes postmaster refuse to start, it
> can effectively become a service-shutdown scenario which requires the
> DBA to go temporarily mad.
By that argument, postgresql.conf could contain complete garbage
and the postmaster should still start. I'm not willing to say
that an "external orchestrating system" doesn't need to take
responsibility for putting valid info into the config file.
regards, tom lane
On 2019-11-15 19:23, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Say you want to set up promote_trigger_file to point to a file outside
>> of the data directory, maybe because you want to integrate it with some
>> external tooling. So you go into your configuration and set
>> promote_trigger_file = '/srv/foobar/trigger'
>> and reload the server. Everything is happy. The fact that the
>> directory /srv/foobar/ does not exist at this point is completely ignored.
>> Now you become root and run
>> mkdir /srv/foobar
>> and, depending on circumstances such as root's umask or the permissions of
>> /srv, your PostgreSQL server crashes immediately. That can't be good.
>
> No, it's not good, but the proposed fix of s/ERROR/LOG/ simply delays
> the problem till later, ie when you try to promote the server nothing
> happens. That's not good either. (To be clear: I'm not necessarily
> against that change, I just don't think it's a sufficient response.)
>
> If we add a GUC-check-hook test, then the problem of misconfiguration
> is reduced to the previously unsolved problem that we have crappy
> feedback for erroneous on-the-fly configuration changes. So it's
> still unsolved, but at least we've got one unsolved problem not two.

AFAICT, a GUC check hook wouldn't actually be able to address the
specific scenario I described. At the time the GUC is set, the
containing directory of the trigger file does not exist yet. This is
currently not an error. The problem only happens if after the GUC is
set the containing directory appears and is not readable.

I notice that we use LOG level if an SSL key or certificate file is not
accessible on reload, so that seems somewhat similar.

We don't have any GUC check hooks on other file system location string
settings that ensure accessibility or presence of the file. Although I
do notice that we use check_canonical_path() in some places and not
others for mysterious and undocumented reasons.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2019-11-15 19:23, Tom Lane wrote:
>> If we add a GUC-check-hook test, then the problem of misconfiguration
>> is reduced to the previously unsolved problem that we have crappy
>> feedback for erroneous on-the-fly configuration changes. So it's
>> still unsolved, but at least we've got one unsolved problem not two.
> AFAICT, a GUC check hook wouldn't actually be able to address the
> specific scenario I described. At the time the GUC is set, the
> containing directory of the trigger file does not exist yet. This
> is currently not an error. The problem only happens if after the GUC is
> set the containing directory appears and is not readable.
True, if the hook just consists of trying fopen() and checking the
errno. Would it be feasible to insist that the containing directory
exist and be readable? We have enough infrastructure that that
should only take a few lines of code, so the question is whether
or not that's a nicer behavior than we have now.
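
Something along these lines, say (a hypothetical few lines using the
existing get_parent_directory() path helper; "readable" here means we
can stat the directory and search it):

#include "postgres.h"

#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical test: does the directory that would contain the trigger
 * file exist, and can we search it? The trigger file itself may be absent.
 */
static bool
promote_trigger_dir_ok(const char *path)
{
    char        parent[MAXPGPATH];
    struct stat st;

    strlcpy(parent, path, sizeof(parent));
    get_parent_directory(parent);   /* strip the last path component */

    /* An empty result means the file would live in the current directory. */
    if (parent[0] == '\0')
        strlcpy(parent, ".", sizeof(parent));

    if (stat(parent, &st) != 0 || !S_ISDIR(st.st_mode))
        return false;

    /* Search (execute) permission is needed to reach files inside it. */
    return access(parent, X_OK) == 0;
}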
If we had this to do over, I'd argue that we misdesigned trigger
files: they should be required to exist always, and triggering
depends on file contents (eg empty vs. not) not existence. That
would make it far easier to check for configuration mistakes
at startup. But I suppose it's too late now.
> We don't have any GUC check hooks on other file system location string
> settings that ensure accessibility or presence of the file.
Right, but I'm suggesting we should add that where appropriate.
Basically the complaint here is that the system lacks checks
that the given configuration settings are workable, and we ought
to add such.
> Although I
> do notice that we use check_canonical_path() in some places and not
> others for mysterious and undocumented reasons.
Probably only that some patch authors didn't know about it :-(
regards, tom lane
On 2019-11-20 16:21, Tom Lane wrote:
>> AFAICT, a GUC check hook wouldn't actually be able to address the
>> specific scenario I described. At the time the GUC is set, the
>> containing directory of the trigger file does not exist yet. This
>> is currently not an error. The problem only happens if after the GUC is
>> set the containing directory appears and is not readable.
> True, if the hook just consists of trying fopen() and checking the
> errno. Would it be feasible to insist that the containing directory
> exist and be readable? We have enough infrastructure that that
> should only take a few lines of code, so the question is whether
> or not that's a nicer behavior than we have now.

Is it possible to do this in a mostly bullet-proof way? Just because
the directory exists and looks pretty good otherwise, doesn't mean we
can read a file created in it later in a way that doesn't fall afoul of
the existing error checks. There could be something like SELinux
lurking, for example.

Maybe some initial checking would be useful, but I think we still need
to downgrade the error check at use time a bit to not crash in the
cases that we miss.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
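
Concretely, the downgrade at use time being discussed would amount to
something like this in the stat()-failure branch of
CheckForStandbyTrigger() shown earlier (sketch only, not an actual
posted patch):

     else if (errno != ENOENT)
-        ereport(ERROR,
+        ereport(LOG,
                 (errcode_for_file_access(),
                  errmsg("could not stat promote trigger file \"%s\": %m",
                         PromoteTriggerFile)));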
At Wed, 4 Dec 2019 11:52:33 +0100, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote in
> On 2019-11-20 16:21, Tom Lane wrote:
> >> AFAICT, a GUC check hook wouldn't actually be able to address the
> >> specific scenario I described. At the time the GUC is set, the
> >> containing directory of the trigger file does not exist yet. This
> >> is currently not an error. The problem only happens if after the GUC
> >> is set the containing directory appears and is not readable.
> > True, if the hook just consists of trying fopen() and checking the
> > errno. Would it be feasible to insist that the containing directory
> > exist and be readable? We have enough infrastructure that that
> > should only take a few lines of code, so the question is whether
> > or not that's a nicer behavior than we have now.
>
> Is it possible to do this in a mostly bullet-proof way? Just because
> the directory exists and looks pretty good otherwise, doesn't mean we
> can read a file created in it later in a way that doesn't fall afoul
> of the existing error checks. There could be something like SELinux
> lurking, for example.
>
> Maybe some initial checking would be useful, but I think we still need
> to downgrade the error check at use time a bit to not crash in the
> cases that we miss.

+1. Any GUC variable that points to external or externally modifiable
resources, including directories, files, and commands, can face that
kind of problem. For example, a bogus value for archive_command doesn't
prevent the server from starting. I understand that the reason is that
we don't have a reliable means to check the command before we actually
execute it, but the server can (or should) continue running even if it
fails. I think this issue falls into that category.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Dec 04, 2019 at 11:52:33AM +0100, Peter Eisentraut wrote:
> Is it possible to do this in a mostly bullet-proof way? Just because the
> directory exists and looks pretty good otherwise, doesn't mean we can read a
> file created in it later in a way that doesn't fall afoul of the existing
> error checks. There could be something like SELinux lurking, for example.
>
> Maybe some initial checking would be useful, but I think we still need to
> downgrade the error check at use time a bit to not crash in the cases that
> we miss.

I have had this thread in my backlog for some time and was not able to
come back to it. Reading the thread again, it seems to me that using a
LOG would make the promote file handling more consistent with what we do
for the SSL context reload. Still, one downside I can see here is that
this causes the backend to create a new LOG entry each time the promote
file is checked, aka each time we check if WAL is available. Couldn't
that bloat be a problem? During the SSL reload, we only generate LOG
entries for each backend on SIGHUP.
--
Michael