Discussion: aborting startup due to startup process failure
I posted to this group before with the same topic but nobody replied. Please, provide some feedback if you can…
I am running a warm standby server, which executes the following command in recovery mode:
triggered=false
while (test ! -f /var/ipsc/WAL/$1 && ! $triggered)
do
echo waiting for file: $1
sleep 30
if test -f /var/ipsc/pgsql/trigger
then
echo --- trigger found ---
echo --- exiting recovery mode ---
triggered=true
fi
done
if ( ! $triggered)
then
cp /var/ipsc/WAL/$1 $2
else
exit 133
fi
The recovery command works just fine, restoring data from the WAL files scp’d from the primary server. While in recovery mode, when I create the trigger file that breaks the while loop in the recovery command, postgres does not go gently into active database mode. Here is the output:
waiting for file: 00000001000000000000003A
--- trigger found ---
--- exiting recovery mode ---
FATAL: could not restore file "00000001000000000000003A" from archive: return code 34048
LOG: startup process (PID 13994) exited with exit code 1
LOG: aborting startup due to startup process failure
After finding the trigger file, my recovery_cmd returns a non-zero code. Why am I still getting "FATAL: could not restore file"?
Both my primary and standby servers run on Solaris 10 under SMF. When the standby server is attempting to change mode from recovery to regular database mode, there might be a race condition there between SMF trying to restart the server and the server trying to restart itself… or am I just hallucinating…
Thanks in advance for your comments.
Cheers,
~george
>>> On Thu, Jun 28, 2007 at 2:55 PM, in message <006101c7b9be$479dfed0$0d7ba8c0@ellacoya.com>, "George Wilk" <gwilk@ellacoya.com> wrote:
> FATAL: could not restore file "00000001000000000000003A" from archive: return code 34048
> LOG: startup process (PID 13994) exited with exit code 1
> LOG: aborting startup due to startup process failure

You're not terminating the warm standby while the database is still logging "the database system is starting up", are you? If so, I would wait for a minute until it is settled in.

-Kevin
I think you may have a race condition in your code: you don't find the new file, so you sleep; while you are sleeping, both the new file and the stop file arrive; you wake up, find the stop file, and never copy the last segment over.
"George Wilk" <gwilk@ellacoya.com> writes:
> I am running a warm standby server, which executes the following command in
> a recovery mode:
> [ when done: ]
> exit 133

Did you pick that number out of a hat? Postgres thinks it means that the recovery command died horribly, and abandons the recovery as a safety measure. (Per Single Unix Spec, this exit code from a shell script would ordinarily mean that some program the shell called died with a signal 5.)

Use "exit 1" or some low number like that to signal ordinary failure to find the requested xlog file. Numbers larger than about 125 mean catastrophic failure of the recovery command.

regards, tom lane
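The arithmetic behind that explanation can be sketched in a few lines of shell. This is only an illustration, assuming that the "return code 34048" in the FATAL message is a raw wait-style status with the command's exit code in the high byte; the function name describe_status is made up for this example, and the 125 threshold is taken from the reply above rather than from the server source.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): interpret a raw status such as
# the 34048 reported in the FATAL message.
# Assumption: the exit code is stored in the high byte, so dividing by 256
# recovers it. A status with a non-zero low byte is not handled here.
describe_status() {
    raw=$1
    code=$((raw / 256))
    if [ "$code" -eq 0 ]; then
        echo "exit $code: success"
    elif [ "$code" -le 125 ]; then
        # Low codes: ordinary "file not available" failure.
        echo "exit $code: ordinary failure"
    elif [ "$code" -gt 128 ]; then
        # Shell convention: 128 + N means a child died from signal N.
        echo "exit $code: fatal, conventionally signal $((code - 128))"
    else
        echo "exit $code: fatal"
    fi
}

describe_status 34048   # prints: exit 133: fatal, conventionally signal 5
describe_status 256     # prints: exit 1: ordinary failure
```

Read this way, 34048 is simply the script's own "exit 133", which falls in the range that is treated as a crash of the recovery command.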
Tom,

You hit the nail right on the head! My bad... My exit code was causing the problem. Once I changed it to 1, the recovery mode exited and the database came up:

waiting for file: 0000000100000000000000F2
--- trigger found ---
--- exiting recovery mode ---
exiting with 1
LOG: could not open file "pg_xlog/0000000100000000000000F2"
LOG: redo done at 0/F1000070
file: 0000000100000000000000F1 path: pg_xlog/RECOVERYXLOG
LOG: restored log file "0000000100000000000000F1" from archive
LOG: archive recovery complete
LOG: database system is ready

Thanks to all who contributed to this thread - much obliged.

Cheers,
~george

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, June 28, 2007 10:16 PM
To: George Wilk
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] aborting startup due to startup process failure
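For readers landing on this thread later, the two suggestions made here (use a low exit code, and look for the segment before acting on the trigger) could be combined roughly as follows. This is a sketch, not the script George ended up running: the function name restore_wal and the WAL_DIR, TRIGGER and POLL_SECONDS variables are introduced for illustration, with defaults copied from the paths and the 30-second sleep in the original post.

```shell
#!/bin/sh
# Sketch of a restore script: restore_wal <wal-file-name> <destination-path>
# The variable names are illustrative; defaults mirror the original post.
WAL_DIR=${WAL_DIR:-/var/ipsc/WAL}
TRIGGER=${TRIGGER:-/var/ipsc/pgsql/trigger}
POLL_SECONDS=${POLL_SECONDS:-30}

restore_wal() {
    wal=$1
    dest=$2
    while :; do
        # Check for the segment first, so a file that arrived during the
        # sleep is still copied even if the trigger appeared meanwhile.
        if [ -f "$WAL_DIR/$wal" ]; then
            cp "$WAL_DIR/$wal" "$dest"
            return $?
        fi
        if [ -f "$TRIGGER" ]; then
            echo "--- trigger found ---"
            echo "--- exiting recovery mode ---"
            # Low exit code: ordinary "file not found", not a crash.
            return 1
        fi
        echo "waiting for file: $wal"
        sleep "$POLL_SECONDS"
    done
}

# Run only when called with the two arguments the server passes in.
if [ $# -eq 2 ]; then
    restore_wal "$1" "$2"
    exit $?
fi
```

The only behavioural changes from the original script are the order of the two checks and the exit code of 1 in place of 133.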