Excerpts from Pablo Delgado Díaz-Pache's message of Thu Nov 18 08:57:16 -0300 2010:
> 2) We did a strace to the postmaster pid. However we had 2 postmasters not
> dead
>
> # ps -fea |grep -i postmaster
> postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
> postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again.

Sorry for my previous response -- evidently I failed to scroll down
enough to notice this part.

It seems to me that this process was stuck in an unnatural way.

> Not sure it was a coincidence but that was the way it was.
>
> # strace -p 3889
> Process 3889 attached - interrupt to quit
> select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be
> restarted)
> --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0

This looks like normal postmaster activity: receiving SIGUSR1, then
SIGCHLD, and acting on each accordingly.

Rather than a coincidence, I would think that the act of tracing it made
it come back to life. A kernel bug, maybe? Have you upgraded your
kernel or libc lately?

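
As a starting point for answering the kernel/libc question, here is a
minimal sketch that reports the running kernel release and the libc
version using only Python's standard library. The function name
kernel_and_libc is made up for illustration. It only shows what is
running now, not when it was installed.

```python
import platform


def kernel_and_libc():
    """Return (kernel release, libc name, libc version) as strings.

    platform.libc_ver() inspects the Python executable itself, so it
    reports the libc that Python is linked against. On a typical Linux
    host that is the system libc, but that is an assumption here.
    """
    kernel = platform.release()
    libc_name, libc_version = platform.libc_ver()
    return kernel, libc_name, libc_version


if __name__ == "__main__":
    kernel, libc_name, libc_version = kernel_and_libc()
    print("kernel:", kernel)
    # Empty strings mean the libc could not be determined.
    print("libc:", libc_name or "unknown", libc_version or "unknown")
```

To tell whether either was upgraded "lately", these values would still
have to be compared against the package manager's install history (on
an RPM-based system, which is only a guess from the /var/lib/pgsql
paths, something like "rpm -qa --last" lists packages by install time).
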
> I also straced the other postmaster pid
>
> # strace -p 7601
> Process 7601 attached - interrupt to quit
> recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 181
This one seems like a regular postmaster child that hadn't gotten around
to changing its ps status yet. (Note it had PPID 3889 which is
consistent with this idea.)
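
The PPID reasoning can be sketched as a small script: given "ps -ef"
style lines, treat a process whose parent is not in the list as the
top-level postmaster, and one whose parent is in the list as its child.
The helper names parse_ps_line and classify are hypothetical, and the
input lines are the two from the quoted ps output with the wrapped
command line rejoined.

```python
def parse_ps_line(line):
    """Parse one `ps -ef` line: UID PID PPID C STIME TTY TIME CMD."""
    fields = line.split(None, 7)  # keep the command, spaces and all, as field 8
    return {
        "uid": fields[0],
        "pid": int(fields[1]),
        "ppid": int(fields[2]),
        "cmd": fields[7],
    }


def classify(lines):
    """Split processes into top-level ones and children of listed ones."""
    procs = [parse_ps_line(line) for line in lines if line.strip()]
    pids = {p["pid"] for p in procs}
    # A parent is a process whose own parent is not among the listed PIDs.
    parents = [p["pid"] for p in procs if p["ppid"] not in pids]
    # A child maps its PID to the PID of the listed process that forked it.
    children = {p["pid"]: p["ppid"] for p in procs if p["ppid"] in pids}
    return parents, children


PS_OUTPUT = [
    "postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data",
    "postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data",
]

parents, children = classify(PS_OUTPUT)
print("top-level postmaster:", parents)
print("children (pid -> ppid):", children)
```

With the two lines above, 3889 comes out as the only top-level process
and 7601 as its child, which matches the reading that 7601 is an
ordinary backend rather than a second postmaster.
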
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support