Обсуждение: Streaming Replication Server Crash

Поиск
Список
Период
Сортировка

Streaming Replication Server Crash

От
raghu ram
Дата:
Hi All,

We have configured Streaming Replication b/w Primary and Standby server and Pgpool-II load balancing module diverting SELECT statements to  Standby server. As per our observations, Standby server crashed during peak hours on today and error message as follows:

2012-10-19 12:26:43 IST [11934]: [1-1] user=postgres,db=hmis LOG:  duration: 20345.702 ms  execute <unnamed>:  SELECT  DISTINCT y.ais_insm_name,y.ais_insm_type,ha_dism_name,ha_hudm_name,x2.ais_equip_receipt_details_value_lov, x3.ais_equip_receipt_details_value,x4.ais_equip_receipt_details_value, mast_frm_freq, t_report_code, t_report_date, t_report_month, t_report_year,t_report_createdate,t_report_date2, t_report_month2, t_report_year2,mast_frm_validity_days FROM TNHSPTOOL_FORM_MASTER, TRAN_712_MASTER    full join ais_institution_master as y on(  t_instiution_code = y.ais_insm_code)  full join ha_district_master on(  t_district_code=ha_dism_code )  full join ha_hud_master on(  t_hud_code=ha_hudm_code )full join tran_712_664 x2 on (x2.t_rep_code=t_report_code and x2.t_rep_rowno=3) full join tran_712_664 x3 on (x3.t_rep_code=t_report_code and x3.t_rep_rowno=5) full join tran_712_664 x4 on (x4.t_rep_code=t_report_code and x4.t_rep_rowno=36) WHERE mast_frm_code = T_REPORT_FRMID AND  y.ais_insm_code ='00231' and   T_REPORT_FRMID =712 and T_REPORT_OLDFLAG='O' ORDER BY T_REPORT_CODE DESC

2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG:  server process (PID 15565) was terminated by signal 10

2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG:  terminating any other active server processes

2012-10-19 12:26:46 IST [18450]: [3-1] user=postgres,db=DBHMS WARNING:  terminating connection because of crash of another server process


Standby Server configuration Details as follows:

max_wal_senders = 5             
wal_sender_delay = 200ms       
wal_keep_segments = 128         
vacuum_defer_cleanup_age = 0   
hot_standby = on                    
max_standby_archive_delay = -1
max_standby_streaming_delay = -1

PostgreSQL version: 9.0.4
OS: Solaris 64bit 

Could you please share your inputs,to fix this issue.


Thanks & Regards,

Raghu Ram




Re: Streaming Replication Server Crash

От
Craig Ringer
Дата:
On 10/19/2012 04:40 PM, raghu ram wrote:
> Hi All,
>
> We have configured Streaming Replication b/w Primary and Standby server
> and Pgpool-II load balancing module diverting
> SELECT statements to  Standby server. As per our observations, Standby
> server crashed during peak hours on today and error message as follows

> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG:  server process
> (PID 15565) was terminated by signal 10
>
> 2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG:  terminating any
> other active server processes

That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?

Are you running any procedural languages other than PL/PgSQL, or any
custom C extensions? Anything that might have unwittingly cleared the
signal handler for SIGUSR1?

--
Craig Ringer


Re: [ADMIN] Streaming Replication Server Crash

От
Tom Lane
Дата:
Craig Ringer <ringerc@ringerc.id.au> writes:
> On 10/19/2012 04:40 PM, raghu ram wrote:
>> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG:  server process
>> (PID 15565) was terminated by signal 10

> That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

> Was the server intentionally sent SIGUSR1 by an admin? Do you know what
> triggered the signal?

SIGUSR1 is used for all sorts of internal cross-process signaling
purposes.  There's no need to hypothesize any external force sending
it; if somebody had broken a PG process's signal handling setup for
SIGUSR1, a crash of this sort could be expected in short order.

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix.  (I see SIGUSR1 is 30 on OS X for instance.)

> Are you running any procedural languages other than PL/PgSQL, or any
> custom C extensions? Anything that might have unwittingly cleared the
> signal handler for SIGUSR1?

libperl has a bad habit of thinking it can mess with the process's
signal setup ...

            regards, tom lane


Re: [ADMIN] Streaming Replication Server Crash

От
Craig Ringer
Дата:
On 10/22/2012 08:52 PM, Tom Lane wrote:
> Craig Ringer <ringerc@ringerc.id.au> writes:
>> On 10/19/2012 04:40 PM, raghu ram wrote:
>>> 2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG:  server process
>>> (PID 15565) was terminated by signal 10
>
>> That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.
>
>> Was the server intentionally sent SIGUSR1 by an admin? Do you know what
>> triggered the signal?
>
> SIGUSR1 is used for all sorts of internal cross-process signaling
> purposes.  There's no need to hypothesize any external force sending
> it; if somebody had broken a PG process's signal handling setup for
> SIGUSR1, a crash of this sort could be expected in short order.
>
> But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
> AFAIK, that signal number is not at all compatible across different
> flavors of Unix.  (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly though that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.

A bit of searching suggests that on Solaris/SunOS, signal 10 is SIGBUS:

http://www.s-gms.ms.edus.si/cgi-bin/man-cgi?signal+3HEAD
http://docs.oracle.com/cd/E23824_01/html/821-1464/signal-3head.html

... which tends to suggest an entirely different interpretation than
"someone broke a signal hander":

https://blogs.oracle.com/peteh/entry/sigbus_versus_sigsegv_according_to

such as:

- Bad mmap()ed read
- alignment error
- hardware fault

so it's not immensely different to a segfault in that it can be caused
by errors in hardware, OS, or applications.

Raghu, did PostgreSQL dump a core file? If it didn't, you might want to
enable core dumps in future. If it did dump a core, attaching a debugger
to the core file might tell you where it crashed, possibly offering some
more information to diagnose the issue. I'm not familiar enough with
Solaris to offer detailed advice on that, especially as you haven't
mentioned your Solaris version, how you installed Pg, etc. This may be
of some use:


http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris

--
Craig Ringer


Re: [ADMIN] Streaming Replication Server Crash

От
Tom Lane
Дата:
Craig Ringer <ringerc@ringerc.id.au> writes:
> On 10/22/2012 08:52 PM, Tom Lane wrote:
>> But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
>> AFAIK, that signal number is not at all compatible across different
>> flavors of Unix.  (I see SIGUSR1 is 30 on OS X for instance.)

> Gah. I incorrectly though that POSIX specified signal *numbers*, not
> just names. That does not appear to actually be the case. Thanks.

This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report.  Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number.  That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?

            regards, tom lane


Re: [ADMIN] Streaming Replication Server Crash

От
Craig Ringer
Дата:
On 10/23/2012 01:20 PM, Tom Lane wrote:

This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report.  Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number.  That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?

Here's a typical Windows exception:


2012-10-04 14:29:08 CEST LOG:  server process (PID 1416) was terminated by exception 0xC0000005

2012-10-04 14:29:08 CEST HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.


These codes can be translated with FormatMessage:

  http://msdn.microsoft.com/en-us/library/windows/desktop/ms679351(v=vs.85).aspx
  http://support.microsoft.com/kb/259693

FormatMessage may not be safe to perform in the context of a munged heap or some other failure conditions, so you probably don't want to do it from a crash handler. It is safe for the postmaster to do it based on the exception code it gets from the dying backend, though.

I'd say the best option is for the postmaster to print the FormatMessage( FORMAT_MESSAGE_ALLOCATE_BUFFER|FORMAT_MESSAGE_FROM_SYSTEM|FORMAT_MESSAGE_FROM_HMODULE, ...) output when it sees the exception code from the dying backend.

RtlNtStatusToDosError may also be of interest: http://msdn.microsoft.com/en-us/library/windows/desktop/ms680600(v=vs.85).aspx ... but it's in Winternl.h so it's not guaranteed to exist / be compatible between versions and can only be accessed via runtime dynamic linking. Not ideal.

--
Craig Ringer