Обсуждение: pg crashing
Platform: Windows 2003 Server 32 bit Windows SharedSection set to 1024,3072 based on recommendations from http://www.postgresql.org/docs/faqs.FAQ_windows.html#4.4 and http://support.microsoft.com/kb/184802 Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" Autovacuum: on Virus Scanning: disabled on PostgreSQL home directory Problem: My database keeps on crashing every few days with this type of error message: 2008-07-01 10:46:30 CDT LOG: all server processes terminated; reinitializing 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is still in use 2008-07-01 10:46:31 CDT HINT: Check if there are any old server processes still running, and terminate them. Is this problem fixed in 8.3.3 and/or fixed by moving to a Unix environment like Solaris? Jon
"Roberts, Jon" <Jon.Roberts@asurion.com> writes: > Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" Well, there are plenty of known bugs in 8.3.0 by now. You really should update before complaining, not after. > Problem: My database keeps on crashing every few days with this type of > error message: > 2008-07-01 10:46:30 CDT LOG: all server processes terminated; > reinitializing I think your real problem is with what happened *before* that. But: > 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is > still in use > 2008-07-01 10:46:31 CDT HINT: Check if there are any old server > processes still running, and terminate them. Hmm ... the code in win32_shmem.c that generates this message seems mighty bogus to me --- it's just hoping that one-second delay is enough. Another problem is that postmaster children that do PGSharedMemoryDetach will still have valid inherited handles for the shmem segment --- does that factor into the behavior? It looks to me like the CloseHandle ought to be in PGSharedMemoryDetach. regards, tom lane
Tom Lane wrote: > "Roberts, Jon" <Jon.Roberts@asurion.com> writes: >> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" > > Well, there are plenty of known bugs in 8.3.0 by now. You really > should update before complaining, not after. Yes. And the traditional question should be asked - is there any antivirus or other "personal security" software running on tihs machine? If so, uninstall (not just disable!) it and see if the problem goes away. >> 2008-07-01 10:46:30 CDT LOG: all server processes terminated; >> reinitializing > > I think your real problem is with what happened *before* that. Agreed. There is some reason that they all terminated... > But: >> 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is >> still in use >> 2008-07-01 10:46:31 CDT HINT: Check if there are any old server >> processes still running, and terminate them. > > Hmm ... the code in win32_shmem.c that generates this message seems > mighty bogus to me --- it's just hoping that one-second delay is > enough. Well, we're basically waiting for the kernel cleanup thread to get run. In my tests it never came above 20ms or so, so 1 second is a pretty long time. And you need *some* kind of timeout... It would also be interesting to know if there are actually any other processes running at this time. > Another problem is that postmaster children that do > PGSharedMemoryDetach will still have valid inherited handles for > the shmem segment --- does that factor into the behavior? It looks > to me like the CloseHandle ought to be in PGSharedMemoryDetach. Not as long as the processes die. If they die, their handles go with them, and once the reference count goes to zero, the object goes away. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> Another problem is that postmaster children that do >> PGSharedMemoryDetach will still have valid inherited handles for >> the shmem segment --- does that factor into the behavior? It looks >> to me like the CloseHandle ought to be in PGSharedMemoryDetach. > Not as long as the processes die. If they die, their handles go with > them, and once the reference count goes to zero, the object goes away. But the syslogger process (and maybe others) is *not* supposed to die. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Tom Lane wrote: >>> Another problem is that postmaster children that do >>> PGSharedMemoryDetach will still have valid inherited handles for >>> the shmem segment --- does that factor into the behavior? It looks >>> to me like the CloseHandle ought to be in PGSharedMemoryDetach. > >> Not as long as the processes die. If they die, their handles go with >> them, and once the reference count goes to zero, the object goes away. > > But the syslogger process (and maybe others) is *not* supposed to die. Right. But are you saying we actually want to start up a new backend in a directory where we already have a running syslogger (and maybe others) processes, just no postmaster? I'd assume we might run into such simple things as "sharing violations" on the logfile - if nothing inside the db itself.. //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> But the syslogger process (and maybe others) is *not* supposed to die. > Right. But are you saying we actually want to start up a new backend in > a directory where we already have a running syslogger (and maybe others) > processes, just no postmaster? Not great, maybe, but what it looks to me is that the current system guarantees that a postmaster with a syslogger child will never recover from a backend-child crash. That's not better. regards, tom lane
> "Roberts, Jon" <Jon.Roberts@asurion.com> writes: > > Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" > > Well, there are plenty of known bugs in 8.3.0 by now. You really > should update before complaining, not after. I'm not complaining. I just want to make sure that if I upgrade, it will fix the problem. An upgrade could possible introduce a new problem. I also wonder if this is isolated to Win32 because we are upgrading to Solaris very soon. > > > Problem: My database keeps on crashing every few days with this type of > > error message: > > > 2008-07-01 10:46:30 CDT LOG: all server processes terminated; > > reinitializing > > I think your real problem is with what happened *before* that. I found the first instance of "crash" and then got the rest of the log file. 2008-07-01 10:43:42 CDT LOG: server process (PID 3524) exited with exit code 128 2008-07-01 10:43:42 CDT LOG: terminating any other active server processes 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: SQL statement "SELECT pg_sleep(cast(control.fn_get_variable('sleep_time') as int))" PL/pgSQL function "fn_update_status" line 136 at PERFORM SQL statement "SELECT gp_load.fn_update_status( $1 , $2 , 'Processing', '', 0)" PL/pgSQL function "fn_load" line 56 at PERFORM 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: PL/pgSQL function "fn_get_job_details" line 114 at IF PL/pgSQL function "fn_load" line 465 at FOR over SELECT rows 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: SQL statement "SELECT pg_sleep(cast(control.fn_get_variable('sleep_time') as int))" PL/pgSQL function "fn_update_status" line 136 at PERFORM SQL statement "SELECT gp_load.fn_update_status( $1 , $2 , 'Processing', '', 0)" PL/pgSQL function "fn_load" line 56 at PERFORM 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: SQL statement "SELECT pg_sleep(cast(control.fn_get_variable('sleep_time') as int))" PL/pgSQL function "fn_update_status" line 136 at PERFORM SQL statement "SELECT gp_load.fn_update_status( $1 , $2 , 'Processing', '', 0)" PL/pgSQL function "fn_load" line 56 at PERFORM 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: SQL statement "SELECT pg_sleep(cast(control.fn_get_variable('sleep_time') as int))" PL/pgSQL function "fn_update_status" line 136 at PERFORM SQL statement "SELECT gp_load.fn_update_status( $1 , $2 , 'Processing', '', 0)" PL/pgSQL function "fn_load" line 56 at PERFORM 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT CONTEXT: SQL statement "SELECT pg_sleep(cast(control.fn_get_variable('sleep_time') as int))" PL/pgSQL function "fn_update_status" line 136 at PERFORM SQL statement "SELECT gp_load.fn_update_status( $1 , $2 , 'Processing', '', 0)" PL/pgSQL function "fn_load" line 56 at PERFORM 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:42 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:42 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:42 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. INFO: 2008-07-01 10:44:42.373008-05:3000:Executing Query ..... create temporary table order_iud on commit drop as select s."jid",s."order_id",s."parent_order_id",s."claim_id",s."fulfillment_met hod",s."order_status_id",s."payment_status",s."created_by",s."created_dt m",s."modified_by",s."modified_dtm",s."deleted_flg",s."claimed_item_id", s."order_state_hashcode",s."autopaymentauth_flg",s."salvageexpectation_s tatus",s."snr_fee",s."salvageexpectation_reason",s."salvageexpectation_n ote",s."nocharge_override",s."change_datetime",s."change_type",s."edw_pr ocess_flag",s."edw_process_ts" from stage_intelliset.order s, (select max("order".jid) as jid, "order".order_id from stage_intelliset."order" group by "order".order_id) m where m.jid = s.jid and m.order_id::text = s.order_id::text distributed by (order_id) INFO: 2008-07-01 10:44:43.185777-05:3000:Execution Complete. INFO: 2008-07-01 10:44:46.988228-05:5000:Executing Update ..... update replica_intelliset.invoiceautopayment set "order_id" = y."order_id", "invoice_id" = y."invoice_id", "transaction_id" = y."transaction_id", "paymenttype_code" = y."paymenttype_code", "amount" = y."amount", "status_code" = y."status_code", "deleted_flg" = y."deleted_flg", "created_by" = y."created_by", "created_dtm" = y."created_dtm", "modified_by" = y."modified_by", "modified_dtm" = y."modified_dtm", edw_modified_dt = y.edw_process_ts, edw_modified_id = y.jid, edw_clock_ts = clock_timestamp() from (select a."invoiceautopayement_id", a."order_id", a."invoice_id", a."transaction_id", a."paymenttype_code", a."amount", a."status_code", a."deleted_flg", a."created_by", a."created_dtm", a."modified_by", a."modified_dtm", a.edw_process_ts, a.jid, a.change_type from invoiceautopayment_iud a inner join replica_intelliset.invoiceautopayment b on a.invoiceautopayement_id = b.invoiceautopayement_id) y where replica_intelliset.invoiceautopayment.invoiceautopayement_id = y.invoiceautopayement_id and y.change_type <> 3 INFO: 2008-07-01 10:44:48.785215-05:5000:Update Complete. INFO: 2008-07-01 10:44:48.786175-05:6000:Executing Insert ..... insert into replica_intelliset.invoiceautopayment ("invoiceautopayement_id", "order_id", "invoice_id", "transaction_id", "paymenttype_code", "amount", "status_code", "deleted_flg", "created_by", "created_dtm", "modified_by", "modified_dtm", edw_created_dt, edw_modified_dt, edw_created_id, edw_modified_id, edw_clock_ts) select a."invoiceautopayement_id", a."order_id", a."invoice_id", a."transaction_id", a."paymenttype_code", a."amount", a."status_code", a."deleted_flg", a."created_by", a."created_dtm", a."modified_by", a."modified_dtm", edw_process_ts as edw_created_dt, null as edw_modified_dt, a.jid as edw_created_id, null as edw_modified_id, clock_timestamp() as edw_clock_ts from invoiceautopayment_iud a left outer join replica_intelliset.invoiceautopayment b on a.invoiceautopayement_id = b.invoiceautopayement_id where (b.invoiceautopayement_id is null) and a.change_type <> 3 INFO: 2008-07-01 10:44:50.738058-05:6000:Insert Complete. INFO: 2008-07-01 10:44:50.750247-05:8000:Executing insert into archive table..... insert into stage_intelliset.invoiceautopayment_arch select * from stage_intelliset.invoiceautopayment INFO: 2008-07-01 10:44:50.839105-05:8000:Insert complete 2008-07-01 10:43:55 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:55 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:55 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:55 CDT CONTEXT: SQL statement "select * from dblink('gp', $1 ) as t1 (return_text varchar)" PL/pgSQL function "fn_remote_sql" line 63 at SQL statement SQL statement "SELECT gp_execute.fn_remote_sql( $1 , $2 )" PL/pgSQL function "fn_load" line 393 at PERFORM 2008-07-01 10:43:55 CDT LOG: could not send data to client: No connection could be made because the target machine actively refused it. 2008-07-01 10:43:55 CDT CONTEXT: SQL statement "select * from dblink('gp', $1 ) as t1 (return_text varchar)" PL/pgSQL function "fn_remote_sql" line 63 at SQL statement SQL statement "SELECT gp_execute.fn_remote_sql( $1 , $2 )" PL/pgSQL function "fn_load" line 393 at PERFORM 2008-07-01 10:43:55 CDT STATEMENT: select * from gp_load.fn_load(809) INFO: 2008-07-01 10:44:53.103065-05:5000:Executing Update ..... update replica_intelliset.order set "parent_order_id" = y."parent_order_id", "claim_id" = y."claim_id", "fulfillment_method" = y."fulfillment_method", "order_status_id" = y."order_status_id", "payment_status" = y."payment_status", "created_by" = y."created_by", "created_dtm" = y."created_dtm", "modified_by" = y."modified_by", "modified_dtm" = y."modified_dtm", "deleted_flg" = y."deleted_flg", "claimed_item_id" = y."claimed_item_id", "order_state_hashcode" = y."order_state_hashcode", "autopaymentauth_flg" = y."autopaymentauth_flg", "salvageexpectation_status" = y."salvageexpectation_status", "snr_fee" = y."snr_fee", "salvageexpectation_reason" = y."salvageexpectation_reason", "salvageexpectation_note" = y."salvageexpectation_note", "nocharge_override" = y."nocharge_override", edw_modified_dt = y.edw_process_ts, edw_modified_id = y.jid, edw_clock_ts = clock_timestamp() from (select a."order_id", a."parent_order_id", a."claim_id", a."fulfillment_method", a."order_status_id", a."payment_status", a."created_by", a."created_dtm", a."modified_by", a."modified_dtm", a."deleted_flg", a."claimed_item_id", a."order_state_hashcode", a."autopaymentauth_flg", a."salvageexpectation_status", a."snr_fee", a."salvageexpectation_reason", a."salvageexpectation_note", a."nocharge_override", a.edw_process_ts, a.jid, a.change_type from order_iud a inner join replica_intelliset.order b on a.order_id = b.order_id) y where replica_intelliset.order.order_id = y.order_id and y.change_type <> 3 INFO: 2008-07-01 10:44:54.536366-05:5000:Update Complete. INFO: 2008-07-01 10:44:54.537312-05:6000:Executing Insert ..... insert into replica_intelliset.order ("order_id", "parent_order_id", "claim_id", "fulfillment_method", "order_status_id", "payment_status", "created_by", "created_dtm", "modified_by", "modified_dtm", "deleted_flg", "claimed_item_id", "order_state_hashcode", "autopaymentauth_flg", "salvageexpectation_status", "snr_fee", "salvageexpectation_reason", "salvageexpectation_note", "nocharge_override", edw_created_dt, edw_modified_dt, edw_created_id, edw_modified_id, edw_clock_ts) select a."order_id", a."parent_order_id", a."claim_id", a."fulfillment_method", a."order_status_id", a."payment_status", a."created_by", a."created_dtm", a."modified_by", a."modified_dtm", a."deleted_flg", a."claimed_item_id", a."order_state_hashcode", a."autopaymentauth_flg", a."salvageexpectation_status", a."snr_fee", a."salvageexpectation_reason", a."salvageexpectation_note", a."nocharge_override", edw_process_ts as edw_created_dt, null as edw_modified_dt, a.jid as edw_created_id, null as edw_modified_id, clock_timestamp() as edw_clock_ts from order_iud a left outer join replica_intelliset.order b on a.order_id = b.order_id where (b.order_id is null) and a.change_type <> 3 INFO: 2008-07-01 10:44:54.577749-05:6000:Insert Complete. INFO: 2008-07-01 10:44:54.583708-05:8000:Executing insert into archive table..... insert into stage_intelliset.order_arch select * from stage_intelliset.order INFO: 2008-07-01 10:44:54.715591-05:8000:Insert complete 2008-07-01 10:43:58 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:43:58 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:43:58 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:43:58 CDT CONTEXT: SQL statement "select * from dblink('gp', $1 ) as t1 (return_text varchar)" PL/pgSQL function "fn_remote_sql" line 63 at SQL statement SQL statement "SELECT gp_execute.fn_remote_sql( $1 , $2 )" PL/pgSQL function "fn_load" line 393 at PERFORM 2008-07-01 10:43:58 CDT LOG: could not send data to client: No connection could be made because the target machine actively refused it. 2008-07-01 10:43:58 CDT CONTEXT: SQL statement "select * from dblink('gp', $1 ) as t1 (return_text varchar)" PL/pgSQL function "fn_remote_sql" line 63 at SQL statement SQL statement "SELECT gp_execute.fn_remote_sql( $1 , $2 )" PL/pgSQL function "fn_load" line 393 at PERFORM 2008-07-01 10:43:58 CDT STATEMENT: select * from gp_load.fn_load(813) 2008-07-01 10:46:30 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:46:30 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:46:30 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:46:30 CDT CONTEXT: PL/pgSQL function "fn_load" line 271 at IF 2008-07-01 10:46:30 CDT LOG: could not send data to client: No connection could be made because the target machine actively refused it. 2008-07-01 10:46:30 CDT CONTEXT: PL/pgSQL function "fn_load" line 271 at IF 2008-07-01 10:46:30 CDT STATEMENT: select * from gp_load.fn_load(757) 2008-07-01 10:46:30 CDT WARNING: terminating connection because of crash of another server process 2008-07-01 10:46:30 CDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2008-07-01 10:46:30 CDT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2008-07-01 10:46:30 CDT CONTEXT: SQL statement "select remote_execute.fn_get_max( $1 , $2 , $3 , $4 , $5 , $6 , $7 , $8 , $9 , $10 , $11 )" PL/pgSQL function "fn_load" line 265 at SQL statement 2008-07-01 10:46:30 CDT LOG: could not send data to client: No connection could be made because the target machine actively refused it. 2008-07-01 10:46:30 CDT CONTEXT: SQL statement "select remote_execute.fn_get_max( $1 , $2 , $3 , $4 , $5 , $6 , $7 , $8 , $9 , $10 , $11 )" PL/pgSQL function "fn_load" line 265 at SQL statement 2008-07-01 10:46:30 CDT STATEMENT: select * from gp_load.fn_load(965) 2008-07-01 10:46:30 CDT LOG: all server processes terminated; reinitializing 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is still in use 2008-07-01 10:46:31 CDT HINT: Check if there are any old server processes still running, and terminate them. > But: > > > 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is > > still in use > > 2008-07-01 10:46:31 CDT HINT: Check if there are any old server > > processes still running, and terminate them. > > Hmm ... the code in win32_shmem.c that generates this message seems > mighty bogus to me --- it's just hoping that one-second delay is > enough. Another problem is that postmaster children that do > PGSharedMemoryDetach will still have valid inherited handles for > the shmem segment --- does that factor into the behavior? It looks > to me like the CloseHandle ought to be in PGSharedMemoryDetach. > > regards, tom lane This wouldn't be a problem in Unix? Jon
> Tom Lane wrote: > > "Roberts, Jon" <Jon.Roberts@asurion.com> writes: > >> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" > > > > Well, there are plenty of known bugs in 8.3.0 by now. You really > > should update before complaining, not after. > > Yes. And the traditional question should be asked - is there any > antivirus or other "personal security" software running on tihs machine? > If so, uninstall (not just disable!) it and see if the problem goes away. I am not able to un-install this. However, this problem only started as the database grew in size and usage. It is 232 MB in size now and it has 30 or so active sessions 24x7. > > >> 2008-07-01 10:46:30 CDT LOG: all server processes terminated; > >> reinitializing > > > > I think your real problem is with what happened *before* that. > > Agreed. There is some reason that they all terminated... > I just emailed a larger part of the log file. > > > But: > >> 2008-07-01 10:46:31 CDT FATAL: pre-existing shared memory block is > >> still in use > >> 2008-07-01 10:46:31 CDT HINT: Check if there are any old server > >> processes still running, and terminate them. > > > > Hmm ... the code in win32_shmem.c that generates this message seems > > mighty bogus to me --- it's just hoping that one-second delay is > > enough. > > Well, we're basically waiting for the kernel cleanup thread to get run. > In my tests it never came above 20ms or so, so 1 second is a pretty long > time. > > And you need *some* kind of timeout... > > It would also be interesting to know if there are actually any other > processes running at this time. > Yes, there were about 30 active sessions executing functions. > > Another problem is that postmaster children that do > > PGSharedMemoryDetach will still have valid inherited handles for > > the shmem segment --- does that factor into the behavior? It looks > > to me like the CloseHandle ought to be in PGSharedMemoryDetach. > > Not as long as the processes die. If they die, their handles go with > them, and once the reference count goes to zero, the object goes away. > > //Magnus
> Magnus Hagander <magnus@hagander.net> writes: > > Tom Lane wrote: > >> But the syslogger process (and maybe others) is *not* supposed to die. > > > Right. But are you saying we actually want to start up a new backend in > > a directory where we already have a running syslogger (and maybe others) > > processes, just no postmaster? > > Not great, maybe, but what it looks to me is that the current system > guarantees that a postmaster with a syslogger child will never recover > from a backend-child crash. That's not better. > When you say "current system", do you mean PG on Windows? Jon
Roberts, Jon wrote: >> "Roberts, Jon" <Jon.Roberts@asurion.com> writes: >>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" >> Well, there are plenty of known bugs in 8.3.0 by now. You really >> should update before complaining, not after. > > I'm not complaining. I just want to make sure that if I upgrade, it > will fix the problem. An upgrade could possible introduce a new > problem. I also wonder if this is isolated to Win32 because we are > upgrading to Solaris very soon. Are you able to get your hands on a cheap Linux box (say, convert a spare desktop) and do some testing on that? It may also be helpful to test on a Windows machine without a virus scanner, personal firewall, or other such system-mangling poison. If you can reproduce the fault in either of those configurations I'm sure it'd be very informative. Personally (and my opinion isn't worth much) I'd blame the virus scanner basically by default, because they're always causing subtle problems and introducing weird quirks. Something like PostgreSQL really does rely on the OS doing what it says it will, and virus scanners tend to bend or break the rules. -- Craig Ringer
Roberts, Jon wrote: >> Tom Lane wrote: >>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes: >>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" >>> Well, there are plenty of known bugs in 8.3.0 by now. You really >>> should update before complaining, not after. >> Yes. And the traditional question should be asked - is there any >> antivirus or other "personal security" software running on tihs > machine? >> If so, uninstall (not just disable!) it and see if the problem goes > away. > > I am not able to un-install this. However, this problem only started as > the database grew in size and usage. It is 232 MB in size now and it > has 30 or so active sessions 24x7. But you do run antivirus on the machine? Which antivirus? It's quite possible that this is the reason. As the files grow, the AV may take longer to do whatever crap it's doing to them, thus exceeding timeouts. In general, you can get all sorts of strange things when you run AV on your database server. The general recommendation is never to do that if you want things to work, and it will be the first thing you're told to remove. It would be good if you could at least temporarily remove it and see if it fixes the issue. >> It would also be interesting to know if there are actually any other >> processes running at this time. >> > > Yes, there were about 30 active sessions executing functions. Right, but are the processes still active at the moment whrere it crashes, or has the postmaster managed to kill them off for the restart? //Magnus
> Roberts, Jon wrote: > >> Tom Lane wrote: > >>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes: > >>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" > >>> Well, there are plenty of known bugs in 8.3.0 by now. You really > >>> should update before complaining, not after. > >> Yes. And the traditional question should be asked - is there any > >> antivirus or other "personal security" software running on tihs > > machine? > >> If so, uninstall (not just disable!) it and see if the problem goes > > away. > > > > I am not able to un-install this. However, this problem only started as > > the database grew in size and usage. It is 232 MB in size now and it > > has 30 or so active sessions 24x7. > > But you do run antivirus on the machine? Which antivirus? We are running Computer Associates eTrust Antivirus. The realtime scanner has an exclusion of e:\PostgreSQL and all subdirectories. > > It's quite possible that this is the reason. As the files grow, the AV > may take longer to do whatever crap it's doing to them, thus exceeding > timeouts. > The log files are located here: e:\PostgreSQL\data\pg_log Before I had that exclusion on the directory, the database would crash because data files would get locked. We don't get those errors now at all so I don't think this problem is caused by the antivirus program. We do have lots of sessions running at once. It seems that the extra load on the server could also cause it to exceed the timeout. > In general, you can get all sorts of strange things when you run AV on > your database server. The general recommendation is never to do that if > you want things to work, and it will be the first thing you're told to > remove. > I agree to an extent. I've used Oracle and SQL Server on Windows and when the antivirus program scans any database files, it screws it up. However, I've never had a problem with an antivirus program once the database directories were excluded. > It would be good if you could at least temporarily remove it and see if > it fixes the issue. > I can't do that. In this corporate environment, I would get shot if I did that. > > >> It would also be interesting to know if there are actually any other > >> processes running at this time. > >> > > > > Yes, there were about 30 active sessions executing functions. > > Right, but are the processes still active at the moment whrere it > crashes, or has the postmaster managed to kill them off for the restart? > Yes, there are active sessions. Could this be the problem? We have a queue table in which we only allow x number of concurrent jobs. A job is first inserted into the Queue with a Status of "Queued". Next the job executes a function to update the Status to "Processing". Function Details: 1. Creates a db_link 2. "lock table gp_load.queue in access exclusive mode" 3. executes this in the linked session: select count(*) from gp_load.queue where status = 'Processing' 4. if current jobs less than max jobs allowed then Update gp_load.queue set status = 'Processing' Commit 5. Close connection 6. Returns the status of still Queued or Processing I have to use a dblink because PostgreSQL doesn't have autonomous transactions. We call a single function to do all of this work and the other jobs need to see the current Status so they know if they should wait or run right now. The calling function uses a while loop to execute this function. If the function returns Queued, it means it was unable to change the status so it uses pg_sleep to wait 10 seconds and then it trys again until it is able to set the status to Processing. Sorry for the lengthy answer but yes, we could have active sessions using pg_sleep waiting for a slot in the Queue to process a job. The log file shows that this happened too. Jon
Roberts, Jon wrote: >> Roberts, Jon wrote: >>>> Tom Lane wrote: >>>>> "Roberts, Jon" <Jon.Roberts@asurion.com> writes: >>>>>> Version: "PostgreSQL 8.3.0, compiled by Visual C++ build 1400" >>>>> Well, there are plenty of known bugs in 8.3.0 by now. You really >>>>> should update before complaining, not after. >>>> Yes. And the traditional question should be asked - is there any >>>> antivirus or other "personal security" software running on tihs >>> machine? >>>> If so, uninstall (not just disable!) it and see if the problem goes >>> away. >>> >>> I am not able to un-install this. However, this problem only > started as >>> the database grew in size and usage. It is 232 MB in size now and > it >>> has 30 or so active sessions 24x7. >> But you do run antivirus on the machine? Which antivirus? > > We are running Computer Associates eTrust Antivirus. The realtime > scanner has an exclusion of e:\PostgreSQL and all subdirectories. Yeah. Exclusion alone often isn't enough though. CA isn't one of the ones on my "blacklist", but *all* AV is on my very-dark-graylist. >> It's quite possible that this is the reason. As the files grow, the AV >> may take longer to do whatever crap it's doing to them, thus exceeding >> timeouts. >> > The log files are located here: e:\PostgreSQL\data\pg_log > > Before I had that exclusion on the directory, the database would crash > because data files would get locked. We don't get those errors now at > all so I don't think this problem is caused by the antivirus program. It can still very much be caused by the AV. > We do have lots of sessions running at once. It seems that the extra > load on the server could also cause it to exceed the timeout. Lots of sessions = lots of processes = lots of AV don't like that. >> In general, you can get all sorts of strange things when you run AV on >> your database server. The general recommendation is never to do that > if >> you want things to work, and it will be the first thing you're told to >> remove. >> > I agree to an extent. I've used Oracle and SQL Server on Windows and > when the antivirus program scans any database files, it screws it up. > However, I've never had a problem with an antivirus program once the > database directories were excluded. The difference is that PostgreSQL uses a multi-process architecture with inherited handles. Most AV don't deal well with that, and it can show up as very strange errors. SQL Server and Oracle are both threadead on win32, which is what the AV software is used to seeing. >> It would be good if you could at least temporarily remove it and see > if >> it fixes the issue. >> > > I can't do that. In this corporate environment, I would get shot if I > did that. Oops. We wouldn't want *that* to happen :D >>>> It would also be interesting to know if there are actually any > other >>>> processes running at this time. >>>> >>> Yes, there were about 30 active sessions executing functions. >> Right, but are the processes still active at the moment whrere it >> crashes, or has the postmaster managed to kill them off for the > restart? > > Yes, there are active sessions. > > Could this be the problem? > > We have a queue table in which we only allow x number of concurrent > jobs. A job is first inserted into the Queue with a Status of "Queued". > Next the job executes a function to update the Status to "Processing". > > Function Details: > 1. Creates a db_link > 2. "lock table gp_load.queue in access exclusive mode" > 3. executes this in the linked session: > select count(*) from gp_load.queue where status = 'Processing' > 4. if current jobs less than max jobs allowed then > Update gp_load.queue set status = 'Processing' > Commit > 5. Close connection > 6. Returns the status of still Queued or Processing > > I have to use a dblink because PostgreSQL doesn't have autonomous > transactions. We call a single function to do all of this work and the > other jobs need to see the current Status so they know if they should > wait or run right now. > > The calling function uses a while loop to execute this function. If the > function returns Queued, it means it was unable to change the status so > it uses pg_sleep to wait 10 seconds and then it trys again until it is > able to set the status to Processing. > > Sorry for the lengthy answer but yes, we could have active sessions > using pg_sleep waiting for a slot in the Queue to process a job. The > log file shows that this happened too. Not having looked at the internals of db_link, I'd say it's certainly possible that this is the reason for the failed restart. If db_link is blocking something, the postmaster can't kill it off, and it'll still be sitting there holding a reference to the shared memory segment. That said, it shouldn't be the reason why it's crashing in the first place - just the reason why it won't restart properly. //Magnus
> > Not having looked at the internals of db_link, I'd say it's certainly > possible that this is the reason for the failed restart. If db_link is > blocking something, the postmaster can't kill it off, and it'll still be > sitting there holding a reference to the shared memory segment. > > That said, it shouldn't be the reason why it's crashing in the first > place - just the reason why it won't restart properly. > Is this a problem in Unix? We are about 1 - 2 weeks away from moving this database to Solaris. Jon
Roberts, Jon wrote: >> Not having looked at the internals of db_link, I'd say it's certainly >> possible that this is the reason for the failed restart. If db_link is >> blocking something, the postmaster can't kill it off, and it'll still > be >> sitting there holding a reference to the shared memory segment. >> >> That said, it shouldn't be the reason why it's crashing in the first >> place - just the reason why it won't restart properly. >> > > Is this a problem in Unix? We are about 1 - 2 weeks away from moving > this database to Solaris. Not likely, but I'd test it anyway. If the issue is related to AV, it's certainly fine - you won't be running AV on your Solaris. But more importantly, Unix has actual support for signals and not just the fake stuff we have on Win32, so it's likely that the postmaster will be capable of killing the child processes. //Magnus
"Roberts, Jon" <Jon.Roberts@asurion.com> writes: > I'm not complaining. I just want to make sure that if I upgrade, it > will fix the problem. An upgrade could possible introduce a new > problem. I also wonder if this is isolated to Win32 because we are > upgrading to Solaris very soon. The specific problem you are seeing is clearly isolated to Win32, because the error message you are getting comes out of Win32-only code. This is not to say that there might not be comparable failures on Unixen, but PG has a lot more track record on Unixen ... regards, tom lane
BTW, just looking at win32_shmem.c ... retptr = malloc(bufsize + 1 + 18); /* 1 NULL and 18 for * Global\PostgreSQL: */ if (retptr == NULL) elog(FATAL, "could not allocate memory for shared memory name"); strcpy(retptr, "Global\\PostgreSQL:"); r = GetFullPathName(DataDir, bufsize, retptr + 11, NULL); Surely that "11" ought to be "18". Also, since the loop immediately below this is going to convert \ to /, wouldn't it be clearer to describe the path prefix as Global/PostgreSQL: in the first place? (BTW, as far as I can tell the +1 added to the malloc request is useless: bufsize includes the trailing null, and the code would not work if it did not.) regards, tom lane
Magnus Hagander <magnus@hagander.net> writes: > Not likely, but I'd test it anyway. If the issue is related to AV, it's > certainly fine - you won't be running AV on your Solaris. But more > importantly, Unix has actual support for signals and not just the fake > stuff we have on Win32, so it's likely that the postmaster will be > capable of killing the child processes. I'm not sure what failure mode you're imagining, but the postmaster has already verified that all the children that are supposed to be connected to shared memory are dead before it attempts to recreate shared memory. So the above sounds completely bogus. I'm still suspicious of the syslogger holding onto an inherited handle to the shared-memory file, though that theory would seem to mean that crash recovery would never work at all on Windows if the syslogger were enabled. But maybe there is some additional gating factor needed to cause the problem to manifest. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Not likely, but I'd test it anyway. If the issue is related to AV, it's >> certainly fine - you won't be running AV on your Solaris. But more >> importantly, Unix has actual support for signals and not just the fake >> stuff we have on Win32, so it's likely that the postmaster will be >> capable of killing the child processes. > > I'm not sure what failure mode you're imagining, but the postmaster has > already verified that all the children that are supposed to be connected > to shared memory are dead before it attempts to recreate shared memory. > So the above sounds completely bogus. > > I'm still suspicious of the syslogger holding onto an inherited handle > to the shared-memory file, though that theory would seem to mean that > crash recovery would never work at all on Windows if the syslogger > were enabled. But maybe there is some additional gating factor needed > to cause the problem to manifest. Well, the syslogger is enabled by default on *all* binary installs on windows, so I think we would've seen more if it never works. I'll see if I can repro a case like it to see if the syslogger prevents the shared mem from going away when I get back to a dev box. Should be enough to just stick a sleep preventing it from stopping, right? //Magnus
Magnus Hagander <magnus@hagander.net> writes: > I'll see if I can repro a case like it to see if the syslogger prevents > the shared mem from going away when I get back to a dev box. Should be > enough to just stick a sleep preventing it from stopping, right? The syslogger isn't restarted at all during a crash --- this isn't a race-condition scenario. If there is a race condition here, it must be associated with cleanup for a process continuing to happen after win32_waitpid has already reported it dead. Hmm ... how much do we trust that bit of spaghetti around pgwin32_deadchild_callback? What condition is it really waiting for? regards, tom lane
> Roberts, Jon wrote: > >> Not having looked at the internals of db_link, I'd say it's certainly > >> possible that this is the reason for the failed restart. If db_link is > >> blocking something, the postmaster can't kill it off, and it'll still > > be > >> sitting there holding a reference to the shared memory segment. > >> > >> That said, it shouldn't be the reason why it's crashing in the first > >> place - just the reason why it won't restart properly. > >> > > > > Is this a problem in Unix? We are about 1 - 2 weeks away from moving > > this database to Solaris. > > Not likely, but I'd test it anyway. If the issue is related to AV, it's > certainly fine - you won't be running AV on your Solaris. But more > importantly, Unix has actual support for signals and not just the fake > stuff we have on Win32, so it's likely that the postmaster will be > capable of killing the child processes. > Our AV program has been off for a while now and I haven't had a crash. I think part of the problem is how we have PostgreSQL installed and how eTrust is configured. We have the binaries installed on C:\program files\PostgreSQL\8.3\ and the data is located on E:\PostgreSQL\data\. We have eTrust excluding just E:\PostgreSQL\data\. I'm guessing the activity on the binaries causes some scanning which may have slowed down the cleanup enough to cause the crash to happen. Jon
Roberts, Jon wrote: >> Roberts, Jon wrote: >>>> Not having looked at the internals of db_link, I'd say it's > certainly >>>> possible that this is the reason for the failed restart. If db_link > is >>>> blocking something, the postmaster can't kill it off, and it'll > still >>> be >>>> sitting there holding a reference to the shared memory segment. >>>> >>>> That said, it shouldn't be the reason why it's crashing in the > first >>>> place - just the reason why it won't restart properly. >>>> >>> Is this a problem in Unix? We are about 1 - 2 weeks away from > moving >>> this database to Solaris. >> Not likely, but I'd test it anyway. If the issue is related to AV, > it's >> certainly fine - you won't be running AV on your Solaris. But more >> importantly, Unix has actual support for signals and not just the fake >> stuff we have on Win32, so it's likely that the postmaster will be >> capable of killing the child processes. >> > > Our AV program has been off for a while now and I haven't had a crash. > I think part of the problem is how we have PostgreSQL installed and how > eTrust is configured. We have the binaries installed on C:\program > files\PostgreSQL\8.3\ and the data is located on E:\PostgreSQL\data\. > We have eTrust excluding just E:\PostgreSQL\data\. > > I'm guessing the activity on the binaries causes some scanning which may > have slowed down the cleanup enough to cause the crash to happen. Yeah, that does seem like a reasonable explanation. Yet another reason not to use AV on your database server ;-) And if you absolutely have to, exclude the postgresql stuff. Since we do re-execute postgres.exe for every new connection, it's quite possible that the AV scanned it every single time, and it's a fairly large EXE... //Magnus
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> I'll see if I can repro a case like it to see if the syslogger prevents >> the shared mem from going away when I get back to a dev box. Should be >> enough to just stick a sleep preventing it from stopping, right? > > The syslogger isn't restarted at all during a crash --- this isn't > a race-condition scenario. > > If there is a race condition here, it must be associated with cleanup > for a process continuing to happen after win32_waitpid has already > reported it dead. Hmm ... how much do we trust that bit of spaghetti > around pgwin32_deadchild_callback? What condition is it really waiting > for? I looked that code over a bit again, and it still looks good to me :-) The wait on the handle will fire when a process exits (according to the API). When it does, we post that information to the queue and send SIGCHLD. And the waitpid function pick off the top of the queue. (It's not particularly spaghettified if you know your way around those APIs :-P That's not to say it's impossible there's a bug there, of course) //Magnus