[BUGS] BUG #14820: Standby crash with "could not access status oftransaction" (for track_commit_timestamp)
От | matioli.matheus@gmail.com |
---|---|
Тема | [BUGS] BUG #14820: Standby crash with "could not access status oftransaction" (for track_commit_timestamp) |
Дата | |
Msg-id | 20170920092609.17363.35370@wrigleys.postgresql.org обсуждение исходный текст |
Ответы |
Re: BUG #14820: Standby crash with "could not access status oftransaction" (for track_commit_timestamp)
(Jeremy Schneider <schnjere@amazon.com>)
|
Список | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 14820 Logged by: Matheus de Oliveira Email address: matioli.matheus@gmail.com PostgreSQL version: 9.6.4 Operating system: Linux CentOS 7 Description: Hi dear hackers, I have a situation that I _think_ it may be a bug, although I couldn't find the reason for that in detail. I have 3 servers, 1 primary and 2 standby (using streaming replication and hot standby). All of them with 9.6.4 version. In the primary it was adjust some parameters: wal_level = 'logical' max_worker_processes = 10 max_replication_slots = 10 max_wal_senders = 10 shared_preload_libraries= 'pglogical' track_commit_timestamp = on Since we changed `max_worker_processes` both standby crashed (that is expected), with the messages: 2017-09-20 03:30:51.387 -03 [84754]: [2-1] LOG: replication terminated by primary server 2017-09-20 03:30:51.387 -03 [84754]: [3-1] DETAIL: End of WAL reached on timeline 2 at 23AF/68000098. 2017-09-20 03:30:51.388 -03 [84754]: [4-1] FATAL: could not send end-of-streaming message to primary: no COPY in progress 2017-09-20 03:30:51.391 -03 [84739]: [6-1] LOG: invalid resource manager ID 55 at 23AF/68000098 2017-09-20 03:30:54.525 -03 [50377]: [1-1] FATAL: could not connect to the primary server: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2017-09-20 03:30:56.402 -03 [50378]: [1-1] LOG: startedstreaming WAL from primary at 23AF/68000000 on timeline 2 2017-09-20 03:30:56.411 -03 [84739]: [7-1] FATAL: hot standby is not possible because max_worker_processes = 8 is a lower setting than on the master server (its value was 10) 2017-09-20 03:30:56.411 -03 [84739]: [8-1] CONTEXT: xlog redo at 23AF/68000098 for XLOG/PARAMETER_CHANGE: max_connections=3000 max_worker_processes=10 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off track_commit_timestamp=on 2017-09-20 03:30:57.261 -03 [84736]: [6-1] LOG: startupprocess (PID 84739) exited with exit code 1 2017-09-20 03:30:57.261 -03 [84736]: [7-1] LOG: terminating any other active server processes 2017-09-20 03:30:57.261 -03 [50376]: [3-1] WARNING: terminating connection because of crash of another server process 2017-09-20 03:30:57.261 -03 [50376]: [4-1] DETAIL: The postmasterhas commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2017-09-20 03:30:57.261 -03 [50376]: [5-1] HINT: In a moment you should be able to reconnect to the database and repeat your command. ... => many of the above (for all connections) 2017-09-2003:31:00.509 -03 [84736]: [8-1] LOG: database system is shut down After that, we adjusted all the new parameters on the standby servers the same as the primary, and tried to started them again, with no success: 2017-09-20 03:46:36.041 -03 [50892]: [1-1] LOG: database system was interrupted while in recovery at log time 2017-09-20 03:17:15 -03 2017-09-20 03:46:36.041 -03 [50892]: [2-1] HINT: Ifthis has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2017-09-20 03:46:36.598 -03 [50892]: [3-1] LOG: entering standby mode 2017-09-20 03:46:36.607-03 [50892]: [4-1] LOG: redo starts at 23AF/47E40620 2017-09-20 03:46:36.621 -03 [50892]: [5-1] FATAL: could not access status of transaction 1556437428 2017-09-20 03:46:36.621 -03 [50892]: [6-1] DETAIL: Could not read from file "pg_commit_ts/E7FB" at offset 229376: Success. 2017-09-20 03:46:36.621 -03 [50892]: [7-1] CONTEXT: xlog redo at 23AF/481CE818 for Transaction/COMMIT: 2017-09-20 03:17:18.239579-03 2017-09-20 03:46:36.623 -03 [50889]: [3-1] LOG: startupprocess (PID 50892) exited with exit code 1 2017-09-20 03:46:36.623 -03 [50889]: [4-1] LOG: terminating any other active server processes 2017-09-20 03:46:36.721 -03 [50889]: [5-1] LOG: database system is shut down Now any attempt to start any of the standby servers gives the above errors (the transaction and offset are exact the same). Do you think it may be a bug? Notice that I loaded `pglogical`, which I don't think is the one causing trouble, but it may be an important information. It seems to be something related to `track_commit_timestamp` or using `wal_level='logical'`, but I'm not sure how to debug it. I have checked and the file `pg_commit_ts/E7FB` is on the data directory with 229376, so looks like it is trying to read the file past its end, hence the above error. I have resynchronized one of the standby servers, but I left the other one in this state so I can help with any further debugging if needed. The instance has about 1.2 TB, so trial and error with dd and something like that is not that simple. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Tomas VondraДата:
Сообщение: Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)
Следующее
От: lukas@fittl.comДата:
Сообщение: [BUGS] BUG #14821: idle_in_transaction_session_timeout sometimes getsignored when statement timeout is pending