Discussion: BUG #16961: Could not access status of transaction
The following bug has been logged on the website:

Bug reference:      16961
Logged by:          Sergey Zhuravlev
Email address:      sergii.zhuravlev@smartnet.ua
PostgreSQL version: 13.2
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

Hello

Command - LISTEN missed_trades_empty_instrument

ERROR: could not access status of transaction 1954017648
DETAIL: Could not open file "pg_xact/0747": No such file or directory.
STATEMENT: LISTEN missed_trades_empty_instrument

Current transaction:

# select txid_current();
 txid_current
--------------
   6985716158

Parameter - autovacuum_freeze_max_age = 200000000
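As a quick sanity check on the report, the failing xid really does map to that pg_xact segment and is far older than the current 32-bit xid; a back-of-the-envelope calculation in SQL, assuming the default 8 kB block size:

-- pg_xact keeps 2 status bits per xid: 4 xids per byte, 32768 xids per 8 kB page,
-- 32 pages per segment file, i.e. 1048576 xids per segment.
SELECT lpad(to_hex(1954017648 / (8192 * 4 * 32)), 4, '0') AS pg_xact_segment;
--> 0747

-- txid_current() reported 6985716158; stripping the epoch gives 32-bit xid
-- 2690748862, so the failing xid is about 737 million transactions in the past,
-- well beyond autovacuum_freeze_max_age = 200000000.
SELECT 6985716158 % (1::bigint << 32) - 1954017648 AS xids_behind_current;
--> 736731214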
From: "PG Bug reporting form" <noreply@postgresql.org>
Date: 13 April 2021, 18:50:26
The following bug has been logged on the website:

Bug reference:      16961
Logged by:          Sergey Zhuravlev
Email address:      sergii.zhuravlev@smartnet.ua
PostgreSQL version: 13.2
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

Hello

Command - LISTEN missed_trades_empty_instrument

ERROR: could not access status of transaction 1954017648
DETAIL: Could not open file "pg_xact/0747": No such file or directory.
STATEMENT: LISTEN missed_trades_empty_instrument

Current transaction:

# select txid_current();
 txid_current
--------------
   6985716158

Parameter - autovacuum_freeze_max_age = 200000000
The database was rebooted and the issue disappeared
Hopefully it was a one-time issue.
From: Stepan Yankevych <stepya@ukr.net>
Sent: Wednesday, April 14, 2021 7:12 PM
To: sergii.zhuravlev@smartnet.ua; pgsql-bugs@lists.postgresql.org
Cc: sergii.zhuravlev@smartnet.ua
Subject: Re: BUG #16961: Could not access status of transaction
Hi Guys!
Let me clarify a few things about the issue.
The issue happening each morning when application starts on the production DataBase during about a month.
Always the same transaction id is mentioned in the error (1954017648)
We tried to do UNLISTEN - no changes. the same issue.
LISTEN works good for any other channels.
Could it be related to some hung transaction (1954017648)? (For example, after some network interruption.)
Is it possible to kill/clean it somehow without DB restart?
Can it be related to some non-vacuumed system table or so?
Any other ideas?
Thanks!
--- Original message ---
From: "PG Bug reporting form" <noreply@postgresql.org>
Date: 13 April 2021, 18:50:26
The following bug has been logged on the website:

Bug reference:      16961
Logged by:          Sergey Zhuravlev
Email address:      sergii.zhuravlev@smartnet.ua
PostgreSQL version: 13.2
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

Hello

Command - LISTEN missed_trades_empty_instrument

ERROR: could not access status of transaction 1954017648
DETAIL: Could not open file "pg_xact/0747": No such file or directory.
STATEMENT: LISTEN missed_trades_empty_instrument

Current transaction:

# select txid_current();
 txid_current
--------------
   6985716158

Parameter - autovacuum_freeze_max_age = 200000000
On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
> The database was rebooted and the issue disappeared

Each postmaster restart clears the LISTEN/NOTIFY queue, so that fits.

> The issue happening each morning when application starts on the production DataBase during about a month.
> Always the same transaction id is mentioned in the error (1954017648)
> We tried to do UNLISTEN - no changes. the same issue.
> LISTEN works good for any other channels.
>
> Can it be related to some hanged transaction? 1954017648? (for example while some network interruption)

Something like that; see below.

> Is it possible to kill/clean it somehow without DB restart?

Not to my knowledge.

> Can it be related to some non-vacuumed system table or so?

Probably not.

> Command - LISTEN missed_trades_empty_instrument
>
> ERROR: could not access status of transaction 1954017648
> DETAIL: Could not open file "pg_xact/0747": No such file or directory.
> STATEMENT: LISTEN missed_trades_empty_instrument

The LISTEN/NOTIFY queue stores a transaction id for each notification (internally, each AsyncQueueEntry). I can imagine the "could not access status" happening if a sequence of events like this happened since the last postmaster restart:

backend 1: LISTEN missed_trades_empty_instrument
backend 2: BEGIN; NOTIFY missed_trades_empty_instrument [TransactionId N]
backend 1: BEGIN
backend 2: COMMIT
backend 1: CREATE TEMP TABLE x (); [sets TransactionId N+K]
autovacuum: freezes tuples, deletes pg_xact data covering TransactionId N
backend 1: COMMIT

On the other hand, if that's what happened, your report that "LISTEN works good for any other channels" surprises me. Perhaps something completely different happened on your system.

We could prevent the trouble if vac_truncate_clog() had access to the oldest xid in the notification queue; it would set frozenXID to that value if frozenXID would otherwise be older.
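For concreteness, the hypothesized interleaving above written out as plain SQL; the channel and temp-table names are the ones from the report and the sequence, and the autovacuum step happens outside either session:

-- Session A (the listener):
LISTEN missed_trades_empty_instrument;

-- Session B (the notifier):
BEGIN;
NOTIFY missed_trades_empty_instrument;   -- the queue entry records this transaction's xid, call it N

-- Session A:
BEGIN;

-- Session B:
COMMIT;

-- Session A:
CREATE TEMP TABLE x ();                  -- forces assignment of a newer xid, N+K

-- (elsewhere: autovacuum freezes tuples and the pg_xact data covering xid N is removed)

-- Session A:
COMMIT;

-- Any backend that later runs LISTEN and has to scan the queue from its tail will
-- call TransactionIdDidCommit() on xid N and fail with
-- "could not access status of transaction N".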
Noah Misch <noah@leadboat.com> writes:
> On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
>> The issue happening each morning when application starts on the production DataBase during about a month.
>> Always the same transaction id is mentioned in the error (1954017648)
>> We tried to do UNLISTEN - no changes. the same issue.
>> LISTEN works good for any other channels.

> ...
> On the other hand, if that's what happened, your report that "LISTEN works
> good for any other channels" surprises me. Perhaps something completely
> different happened on your system.

I suspect the true state of affairs was more like "LISTEN works good in
any other databases". The described symptoms are consistent with there
being a message in the NOTIFY queue that has a pruned-away xid. An
incoming listener would try to scan over already-committed notify
messages, but testing to see whether this stale message is committed
would fail. However, since asyncQueueProcessPageEntries just ignores
messages not targeted for the current database, incoming listeners
in other databases wouldn't notice the problem.

ISTM the interesting question here is what was holding back truncation
of the NOTIFY queue. Could there have been an open transaction somewhere
that was failing to collect NOTIFY data? But a transaction sitting open
for a month is likely to cause far more severe problems than that one.

> We could prevent the trouble if vac_truncate_clog() had access to the oldest
> xid in the notification queue; it would set frozenXID to that value if
> frozenXID would otherwise be older.

Perhaps. I'm not sure how hard it is to extract the oldest xid in the
queue (where "oldest" is defined as "numerically smallest"). The entries
are in xid commit order which is a different thing.

			regards, tom lane
On Sat, Jun 05, 2021 at 05:25:39PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
> >> The issue happening each morning when application starts on the production DataBase during about a month.
> >> Always the same transaction id is mentioned in the error (1954017648)
> >> We tried to do UNLISTEN - no changes. the same issue.
> >> LISTEN works good for any other channels.
>
> > ...
> > On the other hand, if that's what happened, your report that "LISTEN works
> > good for any other channels" surprises me. Perhaps something completely
> > different happened on your system.
>
> I suspect the true state of affairs was more like "LISTEN works good in
> any other databases". The described symptoms are consistent with there
> being a message in the NOTIFY queue that has a pruned-away xid. An
> incoming listener would try to scan over already-committed notify
> messages, but testing to see whether this stale message is committed
> would fail. However, since asyncQueueProcessPageEntries just ignores
> messages not targeted for the current database, incoming listeners
> in other databases wouldn't notice the problem.
>
> ISTM the interesting question here is what was holding back truncation
> of the NOTIFY queue. Could there have been an open transaction somewhere
> that was failing to collect NOTIFY data?

Open transactions always decline to collect notify data, don't they? See
ProcessNotifyInterrupt().

> But a transaction sitting open
> for a month is likely to cause far more severe problems than that one.

True.

> > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > xid in the notification queue; it would set frozenXID to that value if
> > frozenXID would otherwise be older.
>
> Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> queue (where "oldest" is defined as "numerically smallest"). The entries
> are in xid commit order which is a different thing.

Yeah, it wouldn't be cheap in the general case. The value could be a field in
pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
run roughly as often as autovacuum visits template0.
Hi,

On Wed, Jun 25, 2025 at 10:29 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sat, Jun 05, 2021 at 05:25:39PM -0400, Tom Lane wrote:
> >
> > ISTM the interesting question here is what was holding back truncation
> > of the NOTIFY queue. Could there have been an open transaction somewhere
> > that was failing to collect NOTIFY data?
>
> Open transactions always decline to collect notify data, don't they? See
> ProcessNotifyInterrupt().
>
> > But a transaction sitting open
> > for a month is likely to cause far more severe problems than that one.
>
> True.

One of our clients faced this problem, and I found out that long transactions may have nothing to do with it.

We have the following logic in the notify queue: if there are no listeners within all databases and we are calling LISTEN, then we must iterate from 'tail' to 'head' of the queue and check the statuses of transactions (see Exec_ListenPreCommit). If there is a pruned-away xid in the queue, we will try to access its status and get an error. Because the tail of the queue is not necessarily always advanced forward by the listeners, we can get such an error without any long-lived transactions.

> > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > xid in the notification queue; it would set frozenXID to that value if
> > > frozenXID would otherwise be older.
> >
> > Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> > queue (where "oldest" is defined as "numerically smallest"). The entries
> > are in xid commit order which is a different thing.
>
> Yeah, it wouldn't be cheap in the general case. The value could be a field in
> pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> run roughly as often as autovacuum visits template0.

As a temporary solution, which almost completely eliminates the possibility of such a situation, I suggest adding an asyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid); the 'static asyncQueueAdvanceTail()' would need to be made external. Thus, we can get rid of the problem that I described above.

But there is one more: if the listener client is lagging, the asyncQueueAdvanceTail call will not save us (because it cannot advance the tail any further than the listener's position in the queue). Again, this may be due to a very high load, but not because the client keeps a transaction open for a very long time.

The best solution is to teach vacuum to recognize the minimum xid in constant time, but I didn't come up with any sane implementation.

What do you think?

--
Best regards,
Daniil Davydov
Postgres Professional
A customer encountered this issue, and thanks to Andrei Varashen's
very well described steps in the duplicate thread[1], I can reproduce
it on 16.3 and on the master branch.
> One of our clients faced this problem, and I found out that long
> transactions may have nothing to do with it.
> We have the following logic in the notify queue :
> If there are no listeners within all databases, and we are calling
> LISTEN, then we must iterate from 'tail' to 'head' of the queue and
> check statuses of transactions (see Exec_ListenPreCommit).
> If there is a pruned-away xid in the queue, we will try to access its
> status and get an error.
> Because the tail of the queue is not necessarily always advanced
> forward by the listeners, we can get such error without any long lived
> transactions.
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
* frame #0: 0x0000000102c958c8 postgres`SlruReportIOError(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:1084:4
frame #1: 0x0000000102c952b0 postgres`SimpleLruReadPage(ctl=0x00000001037aa5e0, pageno=0, write_ok=true, xid=757) at slru.c:603:4
frame #2: 0x0000000102c95f7c postgres`SimpleLruReadPage_ReadOnly(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:661:9
frame #3: 0x0000000102c866bc postgres`TransactionIdGetStatus(xid=757, lsn=0x000000016d27b5e8) at clog.c:745:11
frame #4: 0x0000000102c9a924 postgres`TransactionLogFetch(transactionId=757) at transam.c:79:14
frame #5: 0x0000000102c9a74c postgres`TransactionIdDidCommit(transactionId=757) at transam.c:130:14
frame #6: 0x0000000102de4a2c postgres`asyncQueueProcessPageEntries(current=0x000000016d27b720, stop=QueuePosition @ 0x000000016d27b690, page_buffer="\U0000001c", snapshot=0x00000001310439a0) at async.c:2069:13
frame #7: 0x0000000102de47ec postgres`asyncQueueReadAllNotifications at async.c:1981:18
frame #8: 0x0000000102de29f4 postgres`Exec_ListenPreCommit at async.c:1127:3
frame #9: 0x0000000102de24e4 postgres`PreCommit_Notify at async.c:881:6
frame #10: 0x0000000102ca877c postgres`CommitTransaction at xact.c:2341:2
frame #11: 0x0000000102ca3b2c postgres`CommitTransactionCommandInternal at xact.c:3214:4
frame #12: 0x0000000102ca3a44 postgres`CommitTransactionCommand at xact.c:3175:10
frame #13: 0x0000000103238d1c postgres`finish_xact_command at postgres.c:2833:3
frame #14: 0x00000001032368d4 postgres`exec_simple_query(query_string="LISTEN test_chan;") at postgres.c:1298:4
frame #15: 0x000000010323599c postgres`PostgresMain(dbname="test", username="alex.wang") at postgres.c:4767:7
frame #16: 0x000000010322cca8 postgres`BackendMain(startup_data=0x000000016d27de48, startup_data_len=24) at backend_startup.c:124:2
frame #17: 0x0000000103104bcc postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=78, startup_data=0x000000016d27de48, startup_data_len=24, client_sock=0x000000016d27ded8) at launch_backend.c:290:3
frame #18: 0x000000010310c5a0 postgres`BackendStartup(client_sock=0x000000016d27ded8) at postmaster.c:3587:8
frame #19: 0x000000010310a404 postgres`ServerLoop at postmaster.c:1702:6
frame #20: 0x0000000103108ebc postgres`PostmasterMain(argc=3, argv=0x00006000018854e0) at postmaster.c:1400:11
frame #21: 0x0000000102f8f6b8 postgres`main(argc=3, argv=0x00006000018854e0) at main.c:231:4
frame #22: 0x00000001940a2b98 dyld`start + 6076
The backtrace above is from a new backend running LISTEN on the channel. This new listener needs to process previously committed notifications. If no other active connections are listening to the same channel, it starts processing from the tail of the queue.
This problem occurs when an AsyncQueueEntry is still present but the sender's xid it contains is older than the database's frozenxid. This can happen if a previous listener disconnects, there are not enough messages for the notifier to advance the async queue's tail, but enough transactions pass for VACUUM to truncate the pg_xact logs.
> > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > xid in the notification queue; it would set frozenXID to that value if
> > > frozenXID would otherwise be older.
> >
> > Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> > queue (where "oldest" is defined as "numerically smallest"). The entries
> > are in xid commit order which is a different thing.
>
> Yeah, it wouldn't be cheap in the general case. The value could be a field in
> pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> run roughly as often as autovacuum visits template0.
> As a temporary solution, which almost completely eliminates the
> possibility of such a situation, I suggest adding a
> AsyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid).
> I mean 'static asyncQueueAdvanceTail()' that should be made
> 'external'. Thus, we can get rid of the problem that I described
> above.
> But there is one more : if the listener client is lagging, the
> AsyncQueueAdvanceTail call will not save us (because it cannot advance
> tail any further than the listener's position in the queue).
> Again, this may be due to a very high load, but not because the client
> keeps the transaction open for a very long time.
> The best solution is to teach vacuum to recognize the minimum xid in
> constant time, but I didn't come up with any sane implementations.
> What do you think?
I'm not sure what's the best solution here. When ComputeXidHorizons() determines the cutoff xid, it doesn't check asyncQueueControl for the sender's xids stored in each AsyncQueueEntry. It makes sense that the async queue itself isn't expected to survive a database restart, but should the sender's xids it stores be considered?

What are your thoughts?
[1] https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2%40postgresql.org
PS: To reproduce the bug, you must first disconnect any backend
listening on the channel (backend 1 in Andrei Varashen's repro) before
registering the new listener that triggers the error.
Hi,
On Fri, Aug 1, 2025 at 10:41 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
>
> I'm not sure what's the best solution here. When ComputeXidHorizons()
> determines the cutoff xid, it doesn't check asyncQueueControl for the
> sender's xids stored in each AsyncQueueEntry. It make sense that the
> async queue itself isn't expected to survive a database restart, but
> should the sender's xids it stores be considered?
>
> What are your thoughts?
>
Finding the minimum xid in the async queue may consume a lot of time, so I don't think we should teach ComputeXidHorizons to look into the async queue:
1) It is a pretty "hot" function.
2) We don't need the minimal xid from the async queue every time ComputeXidHorizons is called.

IMO, the best solution is to create something like an "AsyncQueueMinXid" function and call it at the beginning of vac_update_datfrozenxid. Thus, newFrozenXid will be capped by the minimal sender's xid and autovacuum will not advance it too far.

If we want to find the minimum xid in a reasonable amount of time, we need something like a tree structure for the xids from the async queue (which would be maintained together with the SLRU structures). I still don't have a solution for that. Maybe we should write to the pgsql-hackers mailing list about this problem?

BTW, I'll attach an interim solution to this letter (explicitly advancing the queue tail before advancing datfrozenxid).
--
Best regards,
Daniil Davydov
Thank you for sharing your interim solution! Inspired by your solution, I found that there is an existing built-in function "pg_notification_queue_usage()"[1]:
> pg_notification_queue_usage () → double precision
> Returns the fraction (0–1) of the asynchronous notification queue's maximum
> size that is currently occupied by notifications that are waiting to be
> processed. See LISTEN and NOTIFY for more information.
This function calls asyncQueueAdvanceTail(). I think it's a similar workaround to your patch, but without code change.

What do you think?
Hi,

On Wed, Aug 6, 2025 at 6:30 AM Alexandra Wang <alexandra.wang.oss@gmail.com> wrote:
>
> Thank you for sharing your interim solution! Inspired by your solution, I found
> that there is an existing built-in function "pg_notification_queue_usage()"[1]
>
>> pg_notification_queue_usage () → double precision
>> Returns the fraction (0–1) of the asynchronous notification queue's maximum
>> size that is currently occupied by notifications that are waiting to be
>> processed. See LISTEN and NOTIFY for more information.
>
> This function calls asyncQueueAdvanceTail(). I think it's a similar workaround
> to your patch, but without code change.
>
> What do you think?

Yep, it can be used as a workaround. But obviously the user doesn't know when this function should be called - it will become clear only when an error occurs, that is, post factum.

Thus, I think that it is better to add such functionality to the autovacuum. This simple code will ensure that there are no errors in most cases.

--
Best regards,
Daniil Davydov
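Until something like that lands, a sketch of the stop-gap is simply to have a scheduled job call the function periodically, since (as noted above) reading the usage fraction advances the queue tail as a side effect; the job name and the five-minute interval below are arbitrary, and pg_cron is only one possible scheduler:

-- From any scheduler (cron + psql, or the pg_cron extension if installed):
SELECT pg_notification_queue_usage();

-- e.g. with pg_cron:
SELECT cron.schedule('advance-notify-tail', '*/5 * * * *',
                     $$SELECT pg_notification_queue_usage()$$);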