Thread: aborting startup due to startup process failure

aborting startup due to startup process failure

From: "George Wilk"
Date:

I posted to this group before with the same topic but nobody replied.  Please provide some feedback if you can…

I am running a warm standby server, which executes the following command in recovery mode:

triggered=false
while (test ! -f /var/ipsc/WAL/$1 && ! $triggered)
do
    echo waiting for file: $1

    sleep 30

    if test -f /var/ipsc/pgsql/trigger
    then
         echo --- trigger found         ---
         echo --- exiting recovery mode ---
         triggered=true
    fi
done

if ( ! $triggered)
then
  cp /var/ipsc/WAL/$1 $2
else
  exit 133
fi

The recovery command works just fine, restoring data from the WAL files scp’d from the primary server.  While in recovery mode, when I create the trigger file, breaking the while loop in the recovery command, postgres does not go gently into the active database mode.  Here is the output:

waiting for file: 00000001000000000000003A
--- trigger found ---
--- exiting recovery mode ---
FATAL:  could not restore file "00000001000000000000003A" from archive: return code 34048
LOG:  startup process (PID 13994) exited with exit code 1
LOG:  aborting startup due to startup process failure

After finding the trigger file, my recovery_cmd returns a non-zero code.  Why am I still getting "FATAL:  could not restore file"?

Both my primary and standby servers run on Solaris 10 under SMF.  When the standby server is attempting to change mode from recovery to regular database mode, there might be a race condition there between SMF trying to restart the server and the server trying to restart itself… or am I just hallucinating…

 

Thanks in advance for your comments.

Cheers,
~george

Re: aborting startup due to startup process failure

From: "Kevin Grittner"
Date:

>>> On Thu, Jun 28, 2007 at  2:55 PM, in message
<006101c7b9be$479dfed0$0d7ba8c0@ellacoya.com>, "George Wilk"
<gwilk@ellacoya.com> wrote:
>
> FATAL:  could not restore file "00000001000000000000003A" from archive:
> return code 34048
>
> LOG:  startup process (PID 13994) exited with exit code 1
>
> LOG:  aborting startup due to startup process failure

You're not terminating the warm standby while the database is still logging
"the database system is starting up", are you?

If so, I would wait for a minute until it is settled in.

-Kevin




Re: aborting startup due to startup process failure

From: Tom Arthurs
Date:

I think you may have a race condition in your code: you don't find the
new file, so you sleep; while you are sleeping, both the new file and the
stop file come in; you wake up, find the stop file, and never copy the
last segment over.
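
One way to close that window is to test for the segment at the top of every pass and only then look at the trigger file. What follows is a minimal sketch under assumptions, not the poster's script: the function name wait_for_segment, its argument order, and the optional poll interval are made up for illustration, and the low exit status for "triggered, nothing to restore" follows the advice given later in this thread.

```shell
#!/bin/sh
# Hypothetical helper (the name and argument order are assumptions):
#   wait_for_segment ARCHIVE_DIR SEGMENT DEST TRIGGER_FILE [POLL_SECONDS]
wait_for_segment() {
    archive_dir=$1
    segment=$2
    dest=$3
    trigger=$4
    poll=${5:-30}

    while :
    do
        # Look for the segment first on every pass, so a segment that
        # arrived during the sleep is still restored even when the
        # trigger file showed up at the same time.
        if test -f "$archive_dir/$segment"
        then
            cp "$archive_dir/$segment" "$dest"
            return $?
        fi

        # Only give up when the trigger exists and the segment does not.
        if test -f "$trigger"
        then
            echo "--- trigger found, exiting recovery mode ---"
            # Low non-zero status: the file is simply not available.
            return 1
        fi

        echo "waiting for file: $segment"
        sleep "$poll"
    done
}
```

With the paths from the original post, the script body would reduce to something like: wait_for_segment /var/ipsc/WAL "$1" "$2" /var/ipsc/pgsql/trigger, followed by exit $?.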



Re: aborting startup due to startup process failure

From: Tom Lane
Date:

"George Wilk" <gwilk@ellacoya.com> writes:
> I am running a warm standby server, which executes the following command in
> a recovery mode:
> [ when done: ]
>   exit 133

Did you pick that number out of a hat?  Postgres thinks it means that
the recovery command died horribly, and abandons the recovery as a
safety measure.  (Per Single Unix Spec, this exit code from a shell
script would ordinarily mean that some program the shell called
died with a signal 5.)

Use "exit 1" or some low number like that to signal ordinary failure
to find the requested xlog file.  Numbers larger than about 125 mean
catastrophic failure of the recovery command.

            regards, tom lane
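
The arithmetic behind the numbers in the log can be checked in a few lines of shell. This is only a sketch: the helper names decode_wait_status and classify_exit_status are invented here, the shift by 8 assumes the traditional Unix layout of a raw wait() status (exit status in the high byte), and the cut-off at 125 is taken from the message above rather than from the server source.

```shell
#!/bin/sh
# Hypothetical helper: extract the exit status from a raw wait() status,
# assuming the traditional layout with the exit status in bits 8-15.
decode_wait_status() {
    echo $(( ($1 >> 8) & 255 ))
}

# Hypothetical helper: the three cases described in this thread.
# The 125 threshold is an assumption based on "larger than about 125".
classify_exit_status() {
    if [ "$1" -eq 0 ]; then
        echo "restored"
    elif [ "$1" -le 125 ]; then
        echo "file not available"
    else
        echo "restore command failed badly"
    fi
}

code=$(decode_wait_status 34048)
echo "exit status: $code"                 # 34048 = 133 * 256, so 133
echo "signal if read as 128+N: $(( code - 128 ))"   # 133 - 128 = 5
echo "exit 133: $(classify_exit_status 133)"
echo "exit 1:   $(classify_exit_status 1)"
```

So the "return code 34048" in the FATAL line is the script's own "exit 133", and 133 falls in the range a shell uses to report a child killed by signal 5.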

Re: aborting startup due to startup process failure

From: "George Wilk"
Date:

Tom,

You hit the nail right on the head!  My bad...  My exit code was causing the
problem.  Once I changed it to 1, recovery mode exited and the database
came up:

waiting for file: 0000000100000000000000F2
--- trigger found ---
--- exiting recovery mode ---
exiting with 1
LOG:  could not open file "pg_xlog/0000000100000000000000F2"
LOG:  redo done at 0/F1000070
file: 0000000100000000000000F1
path: pg_xlog/RECOVERYXLOG
LOG:  restored log file "0000000100000000000000F1" from archive
LOG:  archive recovery complete
LOG:  database system is ready

Thanks to all who contributed to this thread - much obliged.
Cheers,
~george

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, June 28, 2007 10:16 PM
To: George Wilk
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] aborting startup due to startup process failure

"George Wilk" <gwilk@ellacoya.com> writes:
> I am running a warm standby server, which executes the following command in
> a recovery mode:
> [ when done: ]
>   exit 133

Did you pick that number out of a hat?  Postgres thinks it means that
the recovery command died horribly, and abandons the recovery as a
safety measure.  (Per Single Unix Spec, this exit code from a shell
script would ordinarily mean that some program the shell called
died with a signal 5.)

Use "exit 1" or some low number like that to signal ordinary failure
to find the requested xlog file.  Numbers larger than about 125 mean
catastrophic failure of the recovery command.

            regards, tom lane