Re: max_standby_delay considered harmful

Поиск
Список
Период
Сортировка
От Stefan Kaltenbrunner
Тема Re: max_standby_delay considered harmful
Дата
Msg-id 4BEAFD24.2070707@kaltenbrunner.cc
обсуждение исходный текст
Ответ на Re: max_standby_delay considered harmful  (Simon Riggs <simon@2ndQuadrant.com>)
Ответы Re: max_standby_delay considered harmful  (Simon Riggs <simon@2ndQuadrant.com>)
Re: max_standby_delay considered harmful  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-hackers
On 05/12/2010 05:28 PM, Simon Riggs wrote:
> On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
>> On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
>>> On Wed, May 12, 2010 at 7:26 AM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>>>> On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
>>>>
>>>>> I'm not sure what to make of this.  Sometimes not shutting down
>>>>> doesn't sound like a feature to me.
>>>>
>>>> It acts exactly the same in recovery as in normal running. It is not a
>>>> special feature of recovery at all, bug or otherwise.
>>>
>>> Simon, that doesn't make any sense.  We are talking about a backend
>>> getting stuck forever on an exclusive lock that is held by the startup
>>> process and which will never be released (for example, because the
>>> master has shut down and no more WAL can be obtained for replay).  The
>>> startup process does not hold locks in normal operation.
>>
>> When I test it, startup process holding a lock does not prevent shutdown
>> of a standby.
>>
>> I'd be happy to see your test case showing a bug exists and that the
>> behaviour differs from normal running.
>
> Let me put this differently: I accept that Stefan has reported a
> problem. Neither Tom nor myself can reproduce the problem. I've re-run
> Stefan's test case and restarted the server more than 400 times now
> without any issue.
>
> I re-read your post where you gave what you yourself called "uninformed
> speculation". There's no real polite way to say it, but yes your
> speculation does appear to be uninformed, since it is incorrect. Reasons
> would be not least that Stefan's tests don't actually send any locks to
> the standby anyway (!), but even if they did your speculation as to the
> cause is still all wrong, as explained.
>
> There is no evidence to link this behaviour with HS, as yet, and you
> should be considering the possibility the problem lies elsewhere,
> especially since it could be code you committed that is at fault.

Well I'm not sure why people seem to have that hard a time reproducing 
that issue - it seems that I can provoke it really trivially(in this 
case no loops, no pgbench, no tricks). A few minutes ago I logged into 
my test standby (which is idle except for the odd connect to template1 
caused by nagios - the master is idle as well and has been for days):

postgres@soldata005:~$ psql
psql (9.0beta1)
Type "help" for help.

postgres=# select 1; ?column?
----------        1
(1 row)

postgres=# \q
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut 
down............................................................... failed
pg_ctl: server does not shut down


the server log for that is as follows:

<2010-05-12 20:36:18.166 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:18.167 CEST,,,> FATAL:  terminating walreceiver 
process due to administrator command
<2010-05-12 20:36:18.174 CEST,,,> LOG:  shutting down
<2010-05-12 20:36:18.251 CEST,,,> LOG:  database system is shut down
<2010-05-12 20:36:19.706 CEST,,,> LOG:  database system was interrupted 
while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:19.706 CEST,,,> HINT:  If this has occurred more than 
once some data might be corrupted and you might need to choose an 
earlier recovery target.
<2010-05-12 20:36:19.706 CEST,,,> LOG:  entering standby mode
<2010-05-12 20:36:19.721 CEST,,,> LOG:  consistent recovery state 
reached at 1/12000078
<2010-05-12 20:36:19.721 CEST,,,> LOG:  invalid record length at 1/12000078
<2010-05-12 20:36:19.723 CEST,,,> LOG:  database system is ready to 
accept read only connections
<2010-05-12 20:36:19.737 CEST,,,> LOG:  streaming replication 
successfully connected to primary
<2010-05-12 20:36:19.918 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:19.919 CEST,,,> FATAL:  terminating walreceiver 
process due to administrator command
<2010-05-12 20:36:19.922 CEST,,,> LOG:  shutting down
<2010-05-12 20:36:19.937 CEST,,,> LOG:  database system is shut down
<2010-05-12 20:36:21.433 CEST,,,> LOG:  database system was interrupted 
while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:21.433 CEST,,,> HINT:  If this has occurred more than 
once some data might be corrupted and you might need to choose an 
earlier recovery target.
<2010-05-12 20:36:21.433 CEST,,,> LOG:  entering standby mode
<2010-05-12 20:36:21.482 CEST,,,> LOG:  received smart shutdown request
<2010-05-12 20:36:21.504 CEST,,,> LOG:  consistent recovery state 
reached at 1/12000078
<2010-05-12 20:36:21.504 CEST,,,> LOG:  invalid record length at 1/12000078
<2010-05-12 20:36:21.505 CEST,,,> LOG:  database system is ready to 
accept read only connections
<2010-05-12 20:36:21.516 CEST,,,> LOG:  streaming replication 
successfully connected to primary

so it restarted two times successfully - however if one looks at the 
third time one can see that it received the smart shutdown request 
BEFORE it reached a consistent recovery state - yet it continued to 
enable HS and reenabled SR as well.

The database is now sitting there doing nothing and it more or less 
broken because you cannot connect to it in the current state:

~$ psql
psql: FATAL:  the database system is shutting down

the startup process has the following backtrace:

(gdb) bt
#0  0x00007fbe24cb2c83 in select () from /lib/libc.so.6
#1  0x00000000006e811a in pg_usleep ()
#2  0x000000000048c333 in XLogPageRead ()
#3  0x000000000048c967 in ReadRecord ()
#4  0x0000000000493ab6 in StartupXLOG ()
#5  0x0000000000495a88 in StartupProcessMain ()
#6  0x00000000004ab25e in AuxiliaryProcessMain ()
#7  0x00000000005d4a7d in StartChildProcess ()
#8  0x00000000005d70c2 in PostmasterMain ()
#9  0x000000000057d898 in main ()


Stefan


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: multibyte charater set in levenshtein function
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: pg_upgrade versus MSVC build scripts