Discussion: [BUGS] BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)

[BUGS] BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)

From:  matioli.matheus@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      14820
Logged by:          Matheus de Oliveira
Email address:      matioli.matheus@gmail.com
PostgreSQL version: 9.6.4
Operating system:   Linux CentOS 7
Description:

Hi dear hackers,

I have a situation that I _think_ may be a bug, although I couldn't find the
exact reason for it.

I have 3 servers, 1 primary and 2 standbys (using streaming replication and
hot standby), all of them running version 9.6.4. On the primary, the following
parameters were changed:

   wal_level = 'logical'
   max_worker_processes = 10
   max_replication_slots = 10
   max_wal_senders = 10
   shared_preload_libraries = 'pglogical'
   track_commit_timestamp = on
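
Because hot standby refuses to run when certain of these settings are lower on
the standby than on the primary, it can help to compare them on both nodes
before restarting anything. A minimal sketch of such a check (the connection
strings are placeholders, not taken from this setup):

    import psycopg2

    # GUCs that a hot standby requires to be at least as large as on the
    # primary, plus the two settings suspected in this report.
    GUCS = ["max_connections", "max_worker_processes", "max_prepared_transactions",
            "max_locks_per_transaction", "wal_level", "track_commit_timestamp"]

    def settings(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT name, setting FROM pg_settings WHERE name = ANY(%s)",
                        (GUCS,))
            return dict(cur.fetchall())

    primary = settings("host=primary.example dbname=postgres")   # placeholder DSNs
    standby = settings("host=standby.example dbname=postgres")
    for name in GUCS:
        print(name, primary.get(name), standby.get(name))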
 

Since we changed `max_worker_processes`, both standbys crashed (which is
expected), with the following messages:
   2017-09-20 03:30:51.387 -03 [84754]: [2-1] LOG:  replication terminated by primary server
   2017-09-20 03:30:51.387 -03 [84754]: [3-1] DETAIL:  End of WAL reached on timeline 2 at 23AF/68000098.
   2017-09-20 03:30:51.388 -03 [84754]: [4-1] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
   2017-09-20 03:30:51.391 -03 [84739]: [6-1] LOG:  invalid resource manager ID 55 at 23AF/68000098
   2017-09-20 03:30:54.525 -03 [50377]: [1-1] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
           This probably means the server terminated abnormally
           before or while processing the request.
   2017-09-20 03:30:56.402 -03 [50378]: [1-1] LOG:  started streaming WAL from primary at 23AF/68000000 on timeline 2
   2017-09-20 03:30:56.411 -03 [84739]: [7-1] FATAL:  hot standby is not possible because max_worker_processes = 8 is a lower setting than on the master server (its value was 10)
   2017-09-20 03:30:56.411 -03 [84739]: [8-1] CONTEXT:  xlog redo at 23AF/68000098 for XLOG/PARAMETER_CHANGE: max_connections=3000 max_worker_processes=10 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off track_commit_timestamp=on
   2017-09-20 03:30:57.261 -03 [84736]: [6-1] LOG:  startup process (PID 84739) exited with exit code 1
   2017-09-20 03:30:57.261 -03 [84736]: [7-1] LOG:  terminating any other active server processes
   2017-09-20 03:30:57.261 -03 [50376]: [3-1] WARNING:  terminating connection because of crash of another server process
   2017-09-20 03:30:57.261 -03 [50376]: [4-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
   2017-09-20 03:30:57.261 -03 [50376]: [5-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
   ... => many of the above (for all connections)
   2017-09-20 03:31:00.509 -03 [84736]: [8-1] LOG:  database system is shut down

After that, we adjusted all the new parameters on the standby servers to the
same values as on the primary and tried to start them again, with no success:
   2017-09-20 03:46:36.041 -03 [50892]: [1-1] LOG:  database system was interrupted while in recovery at log time 2017-09-20 03:17:15 -03
   2017-09-20 03:46:36.041 -03 [50892]: [2-1] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
   2017-09-20 03:46:36.598 -03 [50892]: [3-1] LOG:  entering standby mode
   2017-09-20 03:46:36.607 -03 [50892]: [4-1] LOG:  redo starts at 23AF/47E40620
   2017-09-20 03:46:36.621 -03 [50892]: [5-1] FATAL:  could not access status of transaction 1556437428
   2017-09-20 03:46:36.621 -03 [50892]: [6-1] DETAIL:  Could not read from file "pg_commit_ts/E7FB" at offset 229376: Success.
   2017-09-20 03:46:36.621 -03 [50892]: [7-1] CONTEXT:  xlog redo at 23AF/481CE818 for Transaction/COMMIT: 2017-09-20 03:17:18.239579-03
   2017-09-20 03:46:36.623 -03 [50889]: [3-1] LOG:  startup process (PID 50892) exited with exit code 1
   2017-09-20 03:46:36.623 -03 [50889]: [4-1] LOG:  terminating any other active server processes
   2017-09-20 03:46:36.721 -03 [50889]: [5-1] LOG:  database system is shut down

Now any attempt to start either of the standby servers gives the above errors
(the transaction and offset are exactly the same). Do you think it may be a
bug? Notice that I have `pglogical` loaded, which I don't think is the cause
of the trouble, but it may be important information. It seems to be something
related to `track_commit_timestamp` or to using `wal_level = 'logical'`, but
I'm not sure how to debug it.

I have checked, and the file `pg_commit_ts/E7FB` is in the data directory
with a size of 229376 bytes, so it looks like recovery is trying to read past
the end of the file, hence the above error.
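
For what it's worth, the reported segment and offset line up with the
pg_commit_ts SLRU layout. A minimal sketch of that arithmetic (assuming the
stock 8192-byte pages, 10-byte commit-timestamp entries and 32 pages per SLRU
segment; verify against your build):

    # Map an xid to its pg_commit_ts segment file and byte offset,
    # using the default constants from commit_ts.c (illustrative only).
    BLCKSZ = 8192
    ENTRY_SIZE = 10                               # 8-byte timestamp + 2-byte origin id
    XACTS_PER_PAGE = BLCKSZ // ENTRY_SIZE         # 819
    PAGES_PER_SEGMENT = 32

    def commit_ts_location(xid):
        page = xid // XACTS_PER_PAGE
        segment = "%04X" % (page // PAGES_PER_SEGMENT)
        offset = (page % PAGES_PER_SEGMENT) * BLCKSZ
        return segment, offset

    print(commit_ts_location(1556437428))         # -> ('E7FB', 229376)

Since the on-disk file is exactly 229376 bytes (28 full pages), the failing
read lands on the very first byte of a page that was never written, which
matches the DETAIL above.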

I have resynchronized one of the standby servers, but I left the other one in
this state so I can help with any further debugging if needed. The instance is
about 1.2 TB, so trial and error with dd or similar tools is not that simple.
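
For anyone who does want to experiment on a throwaway copy of this data
directory, the dd-style poking alluded to above would amount to appending one
zeroed page to the short segment so that the read at offset 229376 no longer
falls past end-of-file. A hedged sketch only, and emphatically not a fix for
whatever left the page missing:

    # Run only against a COPY of the data directory.  Appends one zeroed
    # 8192-byte page to the truncated pg_commit_ts segment.
    BLCKSZ = 8192
    path = "pg_commit_ts/E7FB"        # relative to the copied data directory

    with open(path, "ab") as f:
        f.write(b"\x00" * BLCKSZ)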



On 9/20/17 02:26, matioli.matheus@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14820
> ...
> 2017-09-20 03:46:36.598 -03 [50892]: [3-1] LOG:  entering standby mode
> 2017-09-20 03:46:36.607 -03 [50892]: [4-1] LOG:  redo starts at 23AF/47E40620
> 2017-09-20 03:46:36.621 -03 [50892]: [5-1] FATAL:  could not access status of transaction 1556437428
> 2017-09-20 03:46:36.621 -03 [50892]: [6-1] DETAIL:  Could not read from file "pg_commit_ts/E7FB" at offset 229376: Success.
> 2017-09-20 03:46:36.621 -03 [50892]: [7-1] CONTEXT:  xlog redo at 23AF/481CE818 for Transaction/COMMIT: 2017-09-20 03:17:18.239579-03
> 2017-09-20 03:46:36.623 -03 [50889]: [3-1] LOG:  startup process (PID 50892) exited with exit code 1
> 2017-09-20 03:46:36.623 -03 [50889]: [4-1] LOG:  terminating any other active server processes
> 2017-09-20 03:46:36.721 -03 [50889]: [5-1] LOG:  database system is shut down
>
> Now any attempt to start either of the standby servers gives the above errors
> (the transaction and offset are exactly the same). Do you think it may be a
> bug? Notice that I have `pglogical` loaded, which I don't think is the cause
> of the trouble, but it may be important information. It seems to be something
> related to `track_commit_timestamp` or to using `wal_level = 'logical'`, but
> I'm not sure how to debug it.
>
> I have checked, and the file `pg_commit_ts/E7FB` is in the data directory
> with a size of 229376 bytes, so it looks like recovery is trying to read past
> the end of the file, hence the above error.

I recently saw a problem that looked very similar to this; it seems worth reporting on the same thread.

PostgreSQL version: 10.3
Operating system: Linux

On startup, while attempting recovery, PostgreSQL failed to start up with the following error:

FATAL: could not access status of transaction 68544569
DETAIL: Could not read from file "pg_commit_ts/0A37" at offset 106496: Success.
CONTEXT: WAL redo at 70E/E5FE2030 for Transaction/COMMIT: 2018-xx-xx xx:xx:xx.xxxxxx-xx; subxacts: 68544570
LOG: startup process (PID 100292) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

This is on a primary, not a standby.  track_commit_timestamp is on, wal_level is replica and there are no shared_preload_libraries.

The pg_commit_ts directory was actually empty when recovery started, and in the process of recovery PostgreSQL recreated the 0A37 file.  However, it consistently created a 106496-byte file (entirely filled with zeros/nulls) and then threw the error above when trying to read beyond the end.  If we removed the 0A37 file and restarted, then it would again recreate the 106496-byte file and error out in the same way.
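
The same SLRU arithmetic sketched upthread for the first report seems to line
up here as well (again assuming the stock 8192-byte pages, 10-byte commit_ts
entries and 32-page segments; illustrative only):

    # xid 68544569 -> pg_commit_ts segment/offset with the stock constants
    BLCKSZ, ENTRY_SIZE, PAGES_PER_SEGMENT = 8192, 10, 32
    page = 68544569 // (BLCKSZ // ENTRY_SIZE)            # 83693
    segment = "%04X" % (page // PAGES_PER_SEGMENT)       # '0A37'
    offset = (page % PAGES_PER_SEGMENT) * BLCKSZ         # 106496
    print(segment, offset)

So the recreated 106496-byte (13-page) file ends exactly where the page
holding this transaction's entry should begin, consistent with the read
failing right at end-of-file.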

Here's the WAL it was trying to apply at failure, with preceding and following entries:

rmgr: Heap        len (rec/tot):     71/    71, tx:   68544569, lsn: 70E/E5FE1FD0, prev 70E/E5FE1F98, desc: HOT_UPDATE off 2 xmax 68544569 ; new off 10 xmax 0, blkref #0: rel 1663/3850018/16485 blk 295474
rmgr: Transaction len (rec/tot):     46/    46, tx:   68544569, lsn: 70E/E5FE2030, prev 70E/E5FE1FD0, desc: COMMIT 2018-xx-xx xx:xx:xx.xxxxxx UTC; subxacts: 68544570
rmgr: Heap        len (rec/tot):     54/    54, tx:   68539960, lsn: 70E/E5FE2060, prev 70E/E5FE2030, desc: LOCK off 12: xid 68539960: flags 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/11038913/16866 blk 6135

Unfortunately the WAL files from the time of the issue are no longer available, so I'm unable to go back and do any additional analysis of the original crash.  (Like looking more at those transactions.)  But I wanted to at least get a record over here of what happened.

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services
On 5/23/18 14:20, Jeremy Schneider wrote:
> On 9/20/17 02:26, matioli.matheus@gmail.com wrote:
>> I have checked, and the file `pg_commit_ts/E7FB` is in the data directory
>> with a size of 229376 bytes, so it looks like recovery is trying to read
>> past the end of the file, hence the above error.
> 
> I recently saw a problem that looked very similar to this; it seems worth
> reporting on the same thread.
> ...
> Unfortunately the WAL files from the time of the issue are no longer
> available, so I'm unable to go back and do any additional analysis of
> the original crash.  (Like looking more at those transactions.)  But I
> wanted to at least get a record over here of what happened.

I was able to get a little more information about this.

We can see that:
* a vacuum was running on TABLE1 at the time of the problem
* there were many transactions containing sub-transactions (inserting into
TABLE1) at the time of the problem
* vacuum activity was seen on TABLE1 concurrently for quite some time before
the problem
* nearly all DB activity, including a stream of inserts into TABLE2, stopped
abruptly at TIME1
* at TIME1 there were 9 xacts in flight; 3 of these aborted around this time
* after TIME1, one new transaction started, inserting into TABLE3 (with a
subxact in TABLE1). That xact committed at 06:04:04.015169 UTC.
* the vacuum on TABLE1 continued just past the new transaction's commit. It
was the last observed activity in the DB before the crash.
* the underlying I/O system looked healthy during this entire period

Not enough information to get a root cause, but I wanted to record it in
case someone hits something similar in the future.

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services