Thread: Is PQreset() proper ?


Is PQreset() proper ?

From: "Hiroshi Inoue"
Date:
Hi all,

I've encountered a database freeze and found that it's
due to the reset of the connection after an abort.
The following is a part of the postmaster log.
A new backend (pid=395) started immediately after
a backend (pid=394) aborted. OTOH, the postmaster tries
to kill all backends to clean up shared memory.
However, process 395 ignored the SIGUSR1 signal
and is waiting for some lock which will never be
released.

FATAL 2:  elog: error during error recovery, giving up!
DEBUG:  proc_exit(2)
DEBUG:  shmem_exit(2)
postmaster: ServerLoop:        handling reading 5
postmaster: ServerLoop:        handling reading 5
postmaster: ServerLoop:        handling writing 5
postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
DEBUG:  exit(2)
postmaster: reaping dead processes...
postmaster: CleanupProc: pid 394 exited with status 512
Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
Terminating any active server processes...
postmaster: CleanupProc: sending SIGUSR1 to process 395
postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
FindExec: searching PATH ...
ValidateBinary: can't stat "/bin/postgres"
ValidateBinary: can't stat "/usr/bin/postgres"
ValidateBinary: can't stat "/usr/local/bin/postgres"
ValidateBinary: can't stat "/usr/bin/X11/postgres"
ValidateBinary: can't stat "/usr/lib/jdk1.2/bin/postgres"
ValidateBinary: can't stat "/home/freetools/bin/postgres"
FindExec: found "/home/freetools/reindex/bin/postgres" using PATH
DEBUG:  connection: host=[local] user=reindex database=reindex
DEBUG:  InitPostgres

Regards.
Hiroshi Inoue


Re: Is PQreset() proper ?

From: Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
> DEBUG:  exit(2)
> postmaster: reaping dead processes...
> postmaster: CleanupProc: pid 394 exited with status 512
> Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
> Terminating any active server processes...
> postmaster: CleanupProc: sending SIGUSR1 to process 395
> postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )

This isn't PQreset()'s fault that I can see.  This is a race condition
caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
set up the correct signal handler for same.  The postmaster should have
started the child process with all signals blocked, so SIGUSR1 will be
held off until the child explicitly enables it; but it does so a few
lines too soon.  Will fix.
        regards, tom lane
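
(A minimal sketch of the ordering Tom describes, for illustration only; this
is not the actual PostgreSQL source and the names are invented. The parent
blocks SIGUSR1 before forking so a signal sent early is merely held pending;
the child must install its handler before it unblocks, otherwise the pending
SIGUSR1 is acted on under whatever disposition it inherited.)

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void die_quickly(int signo)
{
    /* A real backend would abort its transaction and exit here. */
    const char msg[] = "backend: got SIGUSR1, exiting\n";
    (void) signo;
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main(void)
{
    sigset_t usr1;
    struct sigaction sa;
    pid_t pid;

    /* "Postmaster": block SIGUSR1 before forking, so delivery in the child
     * is held off instead of happening at an arbitrary moment. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, NULL);

    pid = fork();
    if (pid == 0)
    {
        /* "Backend": the safe order is handler first, unblock second.  If
         * these two steps are swapped (the bug), the SIGUSR1 held pending
         * since birth is delivered before the handler exists and is acted
         * on with the inherited disposition instead. */
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = die_quickly;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);          /* step 1: install handler */
        sigprocmask(SIG_UNBLOCK, &usr1, NULL);  /* step 2: enable delivery */

        pause();    /* not reached: the pending signal fires on unblock */
        return 0;
    }

    /* "Postmaster" tells the new backend to die, as it does when cleaning
     * up shared memory after another backend crashes. */
    kill(pid, SIGUSR1);
    waitpid(pid, NULL, 0);
    return 0;
}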


Re: Is PQreset() proper ?

From: Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > postmaster: BackendStartup: pid 395 user reindex db reindex socket 5
> > DEBUG:  exit(2)
> > postmaster: reaping dead processes...
> > postmaster: CleanupProc: pid 394 exited with status 512
> > Server process (pid 394) exited with status 512 at Tue Dec 19 20:12:41 2000
> > Terminating any active server processes...
> > postmaster: CleanupProc: sending SIGUSR1 to process 395
> > postmaster child[395]: starting with (postgres -d2 -v131072 -p reindex )
> 
> This isn't PQreset()'s fault that I can see.  This is a race condition
> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
> set up the correct signal handler for same.  The postmaster should have
> started the child process with all signals blocked, so SIGUSR1 will be
> held off until the child explicitly enables it; but it does so a few
> lines too soon.  Will fix.
> 

I once observed another case: the hang of the CheckPoint process
while the postmaster was in backend crash recovery. I changed
postmaster.c to not invoke the CheckPoint process while the
postmaster is in backend crash recovery, but that doesn't seem
sufficient. The SIGUSR1 signal seems to stay blocked the whole
time in the CheckPoint process.

Regards.
Hiroshi Inoue
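
(For reference, one way to confirm that SIGUSR1 really is blocked in the
CheckPoint process would be to read back the process's signal mask. The
snippet below is an illustrative diagnostic, not code from the thread.)

#include <signal.h>
#include <stdio.h>

int main(void)
{
    sigset_t current;

    /* With a NULL new set, sigprocmask() leaves the mask unchanged and
     * simply reports the current one. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0)
        printf("SIGUSR1 blocked: %s\n",
               sigismember(&current, SIGUSR1) ? "yes" : "no");
    return 0;
}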


Re: Is PQreset() proper ?

From: Tom Lane
Date:
> Tom Lane wrote:
>> This isn't PQreset()'s fault that I can see.  This is a race condition
>> caused by bogosity in PostgresMain --- it enables SIGUSR1 before it's
>> set up the correct signal handler for same.  The postmaster should have
>> started the child process with all signals blocked, so SIGUSR1 will be
>> held off until the child explicitly enables it; but it does so a few
>> lines too soon.  Will fix.

Actually, it turns out the real problem is that backends were inheriting
a SIG_IGN setting for SIGUSR1 from the postmaster.  So a SIGUSR1
delivered before they got as far as setting up their own signal handling
would get lost.  Fixed now.
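
(A sketch of the loss mode Tom describes, illustrative only and not the fix
that was actually committed: a signal whose disposition is SIG_IGN is simply
discarded on delivery, so a child that inherits SIG_IGN for SIGUSR1 loses any
SIGUSR1 sent before it installs its own handler.)

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_sigusr1(int signo)
{
    (void) signo;
    _exit(1);                   /* a real backend would clean up and exit */
}

int main(void)
{
    struct sigaction sa;
    pid_t pid;
    int status;

    /* Parent ("postmaster") has SIGUSR1 set to SIG_IGN for its own use. */
    signal(SIGUSR1, SIG_IGN);

    pid = fork();
    if (pid == 0)
    {
        /* Child ("backend"): SIGUSR1 is still SIG_IGN here, inherited from
         * the parent.  Simulate startup work done before signal setup; any
         * SIGUSR1 arriving in this window is silently thrown away. */
        sleep(1);

        /* Install the real handler; too late for the signal already lost. */
        sa.sa_handler = on_sigusr1;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);

        alarm(2);               /* give up after 2 seconds instead of hanging */
        pause();
        return 0;
    }

    kill(pid, SIGUSR1);         /* arrives while the child still ignores it */

    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM)
        fprintf(stderr, "child never saw SIGUSR1; it was lost\n");
    return 0;
}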

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I once observed another case: the hang of the CheckPoint process
> while the postmaster was in backend crash recovery. I changed
> postmaster.c to not invoke the CheckPoint process while the
> postmaster is in backend crash recovery, but that doesn't seem
> sufficient. The SIGUSR1 signal seems to stay blocked the whole
> time in the CheckPoint process.

Hm.  Vadim, do you think it's safe to let CheckPoint be killed by
SIGUSR1?  If not, what will we do about this?
        regards, tom lane