Thread: BUG #16961: Could not access status of transaction


BUG #16961: Could not access status of transaction

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16961
Logged by:          Sergey Zhuravlev
Email address:      sergii.zhuravlev@smartnet.ua
PostgreSQL version: 13.2
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

Hello

Command - LISTEN missed_trades_empty_instrument
"
ERROR:  could not access status of transaction 1954017648
DETAIL:  Could not open file "pg_xact/0747": No such file or directory.
STATEMENT:  LISTEN missed_trades_empty_instrument
"
Current transaction
# select txid_current();
 txid_current
--------------
   6985716158

Parameter - autovacuum_freeze_max_age = 200000000
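
A quick consistency check on these numbers: each pg_xact segment covers
1048576 xids (4 xids per byte * 8192-byte pages * 32 pages per segment),
and 1954017648 / 1048576 = 1863 = 0x747, which is exactly the missing
file. txid_current() = 6985716158 = 1 * 2^32 + 2690748862, so the stale
xid lies 2690748862 - 1954017648 = 736731214 transactions in the past,
far beyond autovacuum_freeze_max_age = 200000000, consistent with that
segment having been truncated away.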


Re: BUG #16961: Could not access status of transaction

From: Stepan Yankevych
Date:
Hi Guys!

Let me clarify a few things about the issue.

The issue has been happening each morning when the application starts on the production database, for about a month now.
The same transaction id is always mentioned in the error (1954017648).
We tried UNLISTEN - no change, the same issue.
LISTEN works fine for any other channel.

Could it be related to some hung transaction (1954017648)? (For example, after a network interruption.)
Is it possible to kill/clean it somehow without a DB restart?
Could it be related to some non-vacuumed system table or similar?

Any other ideas?

Thanks!

--- Original message ---
From: "PG Bug reporting form" <noreply@postgresql.org>
Date: 13 April 2021, 18:50:26

[...]

RE: BUG #16961: Could not access status of transaction

From: Stepan Yankevych
Date:

The database was rebooted and the issue disappeared.

Hopefully it was a one-time issue.

Re: BUG #16961: Could not access status of transaction

From: Noah Misch
Date:
On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
> The database was rebooted and the issue disappeared.

Each postmaster restart clears the LISTEN/NOTIFY queue, so that fits.

> The issue has been happening each morning when the application starts on the production database, for about a month now.
> The same transaction id is always mentioned in the error (1954017648).
> We tried UNLISTEN - no change, the same issue.
> LISTEN works fine for any other channel.
> 
> 
> Could it be related to some hung transaction (1954017648)? (For example, after a network interruption.)

Something like that; see below.

> Is it possible to kill/clean it somehow without a DB restart?

Not to my knowledge.

> Could it be related to some non-vacuumed system table or similar?

Probably not.

> Command  - LISTEN missed_trades_empty_instrument
> 
> ERROR:  could not access status of transaction 1954017648
> DETAIL:  Could not open file "pg_xact/0747": No such file or directory.
> STATEMENT:  LISTEN missed_trades_empty_instrument

The LISTEN/NOTIFY queue stores a transaction id for each notification
(internally, each AsyncQueueEntry).  I can imagine the "could not access
status" happening if a sequence of events like this happened since the last
postmaster restart:

backend 1: LISTEN missed_trades_empty_instrument
backend 2: BEGIN; NOTIFY missed_trades_empty_instrument [TransactionId N]
backend 1: BEGIN
backend 2: COMMIT
backend 1: CREATE TEMP TABLE x (); [sets TransactionId N+K]
autovacuum: freezes tuples, deletes pg_xact data covering TransactionId N
backend 1: COMMIT
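
Rendered as a SQL transcript of the same hypothetical interleaving (the
channel name is taken from the report; the autovacuum step can only be
described as a comment):

    -- backend 1
    LISTEN missed_trades_empty_instrument;

    -- backend 2
    BEGIN;
    NOTIFY missed_trades_empty_instrument;  -- the queue entry will record xid N

    -- backend 1
    BEGIN;

    -- backend 2
    COMMIT;

    -- backend 1
    CREATE TEMP TABLE x ();                 -- assigns TransactionId N+K

    -- (autovacuum, not SQL: freezes tuples, deletes pg_xact data covering xid N)

    -- backend 1
    COMMIT;                                 -- the queue still holds the entry
                                            -- carrying the now-pruned xid N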

On the other hand, if that's what happened, your report that "LISTEN
works fine for any other channel" surprises me.  Perhaps something
completely different happened on your system.

We could prevent the trouble if vac_truncate_clog() had access to the oldest
xid in the notification queue; it would set frozenXID to that value if
frozenXID would otherwise be older.



Re: BUG #16961: Could not access status of transaction

From: Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
>> The issue has been happening each morning when the application starts on the production database, for about a month now.
>> The same transaction id is always mentioned in the error (1954017648).
>> We tried UNLISTEN - no change, the same issue.
>> LISTEN works fine for any other channel.

> ...
> On the other hand, if that's what happened, your report that "LISTEN
> works fine for any other channel" surprises me.  Perhaps something
> completely different happened on your system.

I suspect the true state of affairs was more like "LISTEN works fine
in any other database".  The described symptoms are consistent with there
being a message in the NOTIFY queue that has a pruned-away xid.  An
incoming listener would try to scan over already-committed notify
messages, but testing to see whether this stale message is committed
would fail.  However, since asyncQueueProcessPageEntries just ignores
messages not targeted for the current database, incoming listeners
in other databases wouldn't notice the problem.

ISTM the interesting question here is what was holding back truncation
of the NOTIFY queue.  Could there have been an open transaction somewhere
that was failing to collect NOTIFY data?  But a transaction sitting open
for a month is likely to cause far more severe problems than that one.

> We could prevent the trouble if vac_truncate_clog() had access to the oldest
> xid in the notification queue; it would set frozenXID to that value if
> frozenXID would otherwise be older.

Perhaps.  I'm not sure how hard it is to extract the oldest xid in the
queue (where "oldest" is defined as "numerically smallest").  The entries
are in xid commit order which is a different thing.

            regards, tom lane



Re: BUG #16961: Could not access status of transaction

From: Noah Misch
Date:
On Sat, Jun 05, 2021 at 05:25:39PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Apr 19, 2021 at 06:49:15PM +0000, Stepan Yankevych wrote:
> >> The issue has been happening each morning when the application starts on the production database, for about a month now.
> >> The same transaction id is always mentioned in the error (1954017648).
> >> We tried UNLISTEN - no change, the same issue.
> >> LISTEN works fine for any other channel.
> 
> > ...
> > On the other hand, if that's what happened, your report that "LISTEN
> > works fine for any other channel" surprises me.  Perhaps something
> > completely different happened on your system.
> 
> I suspect the true state of affairs was more like "LISTEN works fine
> in any other database".  The described symptoms are consistent with there
> being a message in the NOTIFY queue that has a pruned-away xid.  An
> incoming listener would try to scan over already-committed notify
> messages, but testing to see whether this stale message is committed
> would fail.  However, since asyncQueueProcessPageEntries just ignores
> messages not targeted for the current database, incoming listeners
> in other databases wouldn't notice the problem.
> 
> ISTM the interesting question here is what was holding back truncation
> of the NOTIFY queue.  Could there have been an open transaction somewhere
> that was failing to collect NOTIFY data?

Open transactions always decline to collect notify data, don't they?  See
ProcessNotifyInterrupt().
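
That matches the user-visible behavior; a sketch in psql terms
(hypothetical channel name):

    -- session 1 (psql)
    LISTEN ch;
    BEGIN;
    -- session 2
    NOTIFY ch;   -- session 1 shows nothing while its transaction is open
    -- session 1
    COMMIT;      -- only now does psql print:
                 -- Asynchronous notification "ch" received from server ...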

> But a transaction sitting open
> for a month is likely to cause far more severe problems than that one.

True.

> > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > xid in the notification queue; it would set frozenXID to that value if
> > frozenXID would otherwise be older.
> 
> Perhaps.  I'm not sure how hard it is to extract the oldest xid in the
> queue (where "oldest" is defined as "numerically smallest").  The entries
> are in xid commit order which is a different thing.

Yeah, it wouldn't be cheap in the general case.  The value could be a field in
pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
run roughly as often as autovacuum visits template0.



Re: BUG #16961: Could not access status of transaction

From: Daniil Davydov
Date:
Hi,

On Wed, Jun 25, 2025 at 10:29 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sat, Jun 05, 2021 at 05:25:39PM -0400, Tom Lane wrote:
> >
> > ISTM the interesting question here is what was holding back truncation
> > of the NOTIFY queue.  Could there have been an open transaction somewhere
> > that was failing to collect NOTIFY data?
>
> Open transactions always decline to collect notify data, don't they?  See
> ProcessNotifyInterrupt().
>
> > But a transaction sitting open
> > for a month is likely to cause far more severe problems than that one.
>
> True.
>

One of our clients faced this problem, and I found out that long
transactions may have nothing to do with it.

We have the following logic in the notify queue:
if there are no listeners in any database and we call LISTEN, we must
iterate from the tail to the head of the queue and check the statuses
of the transactions we encounter (see Exec_ListenPreCommit). If there
is a pruned-away xid in the queue, we will try to access its status
and get an error.

Because the tail of the queue is not necessarily always advanced by
the listeners, we can get such an error without any long-lived
transactions.

> > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > xid in the notification queue; it would set frozenXID to that value if
> > > frozenXID would otherwise be older.
> >
> > Perhaps.  I'm not sure how hard it is to extract the oldest xid in the
> > queue (where "oldest" is defined as "numerically smallest").  The entries
> > are in xid commit order which is a different thing.
>
> Yeah, it wouldn't be cheap in the general case.  The value could be a field in
> pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> run roughly as often as autovacuum visits template0.

As a temporary solution, which almost completely eliminates the
possibility of such a situation, I suggest adding an
asyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid);
the currently static asyncQueueAdvanceTail() would need to be made
external. Thus, we can get rid of the problem that I described above.

But there is one more problem: if the listener client is lagging, the
asyncQueueAdvanceTail call will not save us, because it cannot advance
the tail any further than the listener's position in the queue.
Again, this can happen under very high load, not only when a client
keeps a transaction open for a very long time.

The best solution would be to teach vacuum to recognize the minimum
xid in constant time, but I haven't come up with any sane
implementation. What do you think?

--
Best regards,
Daniil Davydov
Postgres Professional



Re: BUG #16961: Could not access status of transaction

From: Alexandra Wang
Date:
Hello,

A customer encountered this issue, and thanks to Andrei Varashen's
very well-described steps in the duplicate thread [1], I can reproduce
it on 16.3 and on the master branch.

On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> One of our clients faced this problem, and I found out that long
> transactions may have nothing to do with it.
>
> We have the following logic in the notify queue:
> if there are no listeners in any database and we call LISTEN, we must
> iterate from the tail to the head of the queue and check the statuses
> of the transactions we encounter (see Exec_ListenPreCommit). If there
> is a pruned-away xid in the queue, we will try to access its status
> and get an error.
>
> Because the tail of the queue is not necessarily always advanced by
> the listeners, we can get such an error without any long-lived
> transactions.

Thank you Daniil, you are exactly correct!

Here's the stack trace on the master branch:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
  * frame #0: 0x0000000102c958c8 postgres`SlruReportIOError(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:1084:4
    frame #1: 0x0000000102c952b0 postgres`SimpleLruReadPage(ctl=0x00000001037aa5e0, pageno=0, write_ok=true, xid=757) at slru.c:603:4
    frame #2: 0x0000000102c95f7c postgres`SimpleLruReadPage_ReadOnly(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:661:9
    frame #3: 0x0000000102c866bc postgres`TransactionIdGetStatus(xid=757, lsn=0x000000016d27b5e8) at clog.c:745:11
    frame #4: 0x0000000102c9a924 postgres`TransactionLogFetch(transactionId=757) at transam.c:79:14
    frame #5: 0x0000000102c9a74c postgres`TransactionIdDidCommit(transactionId=757) at transam.c:130:14
    frame #6: 0x0000000102de4a2c postgres`asyncQueueProcessPageEntries(current=0x000000016d27b720, stop=QueuePosition @ 0x000000016d27b690, page_buffer="\U0000001c", snapshot=0x00000001310439a0) at async.c:2069:13
    frame #7: 0x0000000102de47ec postgres`asyncQueueReadAllNotifications at async.c:1981:18
    frame #8: 0x0000000102de29f4 postgres`Exec_ListenPreCommit at async.c:1127:3
    frame #9: 0x0000000102de24e4 postgres`PreCommit_Notify at async.c:881:6
    frame #10: 0x0000000102ca877c postgres`CommitTransaction at xact.c:2341:2
    frame #11: 0x0000000102ca3b2c postgres`CommitTransactionCommandInternal at xact.c:3214:4
    frame #12: 0x0000000102ca3a44 postgres`CommitTransactionCommand at xact.c:3175:10
    frame #13: 0x0000000103238d1c postgres`finish_xact_command at postgres.c:2833:3
    frame #14: 0x00000001032368d4 postgres`exec_simple_query(query_string="LISTEN test_chan;") at postgres.c:1298:4
    frame #15: 0x000000010323599c postgres`PostgresMain(dbname="test", username="alex.wang") at postgres.c:4767:7
    frame #16: 0x000000010322cca8 postgres`BackendMain(startup_data=0x000000016d27de48, startup_data_len=24) at backend_startup.c:124:2
    frame #17: 0x0000000103104bcc postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=78, startup_data=0x000000016d27de48, startup_data_len=24, client_sock=0x000000016d27ded8) at launch_backend.c:290:3
    frame #18: 0x000000010310c5a0 postgres`BackendStartup(client_sock=0x000000016d27ded8) at postmaster.c:3587:8
    frame #19: 0x000000010310a404 postgres`ServerLoop at postmaster.c:1702:6
    frame #20: 0x0000000103108ebc postgres`PostmasterMain(argc=3, argv=0x00006000018854e0) at postmaster.c:1400:11
    frame #21: 0x0000000102f8f6b8 postgres`main(argc=3, argv=0x00006000018854e0) at main.c:231:4
    frame #22: 0x00000001940a2b98 dyld`start + 6076

This stack trace is from a LISTEN command issued by a new listener on
the channel. This new listener needs to process previously committed
notifications. If no other active connections are listening on the
same channel, it starts processing from the tail of the queue.

This problem occurs when an AsyncQueueEntry is still present, but the
sender's xid it contains is older than the database's frozenxid. This
can happen if a previous listener disconnects, not enough messages
follow for the notifier to advance the async queue's tail, but enough
transactions pass for VACUUM to truncate the pg_xact logs.
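
A condensed repro sketch along those lines (hypothetical channel name;
the xid consumption and vacuum steps are summarized as comments):

    -- session A
    LISTEN ch;

    -- session B
    NOTIFY ch;   -- autocommits; the queue entry stores session B's xid

    -- session A disconnects without reading the entry; too few queue
    -- pages have been filled since, so the tail is never advanced

    -- ... enough transactions pass that VACUUM truncates pg_xact
    -- beyond B's xid ...

    -- session C (at this point no listener exists anywhere)
    LISTEN ch;   -- scans the queue from the tail and fails with
                 -- ERROR:  could not access status of transaction ...
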
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > xid in the notification queue; it would set frozenXID to that value if
> > > frozenXID would otherwise be older.
> >
> > Perhaps.  I'm not sure how hard it is to extract the oldest xid in the
> > queue (where "oldest" is defined as "numerically smallest").  The entries
> > are in xid commit order which is a different thing.
>
> Yeah, it wouldn't be cheap in the general case.  The value could be a field in
> pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> run roughly as often as autovacuum visits template0.

> As a temporary solution, which almost completely eliminates the
> possibility of such a situation, I suggest adding an
> asyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid);
> the currently static asyncQueueAdvanceTail() would need to be made
> external. Thus, we can get rid of the problem that I described above.
>
> But there is one more problem: if the listener client is lagging, the
> asyncQueueAdvanceTail call will not save us, because it cannot advance
> the tail any further than the listener's position in the queue.
> Again, this can happen under very high load, not only when a client
> keeps a transaction open for a very long time.
>
> The best solution would be to teach vacuum to recognize the minimum
> xid in constant time, but I haven't come up with any sane
> implementation. What do you think?

I'm not sure what the best solution is here. When ComputeXidHorizons()
determines the cutoff xid, it doesn't check asyncQueueControl for the
sender's xids stored in each AsyncQueueEntry.  It makes sense that the
async queue itself isn't expected to survive a database restart, but
should the sender's xids it stores be considered?

What are your thoughts?

Best,
Alex
 

Re: BUG #16961: Could not access status of transaction

From: Alexandra Wang
Date:
Forgot to paste the reference link:

[1] https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2%40postgresql.org

PS: To reproduce the bug, you must first disconnect any backend
listening on the channel (backend 1 in Andrei Varashen's repro) before
registering the new listener that triggers the error.

Best,
Alex

Re: BUG #16961: Could not access status of transaction

From: Daniil Davydov
Date:
Hi,

On Fri, Aug 1, 2025 at 10:41 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
>
> I'm not sure what the best solution is here. When ComputeXidHorizons()
> determines the cutoff xid, it doesn't check asyncQueueControl for the
> sender's xids stored in each AsyncQueueEntry.  It makes sense that the
> async queue itself isn't expected to survive a database restart, but
> should the sender's xids it stores be considered?
>
> What are your thoughts?
>

Finding the minimum xid in the async queue may consume a lot of time,
so I don't think we should teach ComputeXidHorizons to look into the
async queue:
1) It is a pretty "hot" function.
2) We don't need the minimal xid from the async queue each time
ComputeXidHorizons is called.

IMO, the best solution is to create something like an
"AsyncQueueMinXid" function and call it at the beginning of
vac_update_datfrozenxid. Thus, newFrozenXid will be capped by the
minimal sender's xid and autovacuum will not be able to advance it
too far.

If we want to find the minimum xid in a reasonable amount of time, we
need something like a tree structure for the xids in the async queue
(maintained together with the SLRU structures).

I still don't have a solution for that. Maybe we should raise this
problem on the pgsql-hackers mailing list?

BTW, I'm attaching an interim solution for this problem to this
message (it explicitly advances the queue tail before datfrozenxid
is advanced).

--
Best regards,
Daniil Davydov

Attachments

Re: BUG #16961: Could not access status of transaction

From: Alexandra Wang
Date:
Hi Daniil,

On Mon, Aug 4, 2025 at 4:08 AM Daniil Davydov <3danissimo@gmail.com> wrote:
> [...]
> BTW, I'm attaching an interim solution for this problem to this
> message (it explicitly advances the queue tail before datfrozenxid
> is advanced).

Thank you for sharing your interim solution! Inspired by it, I found
that there is an existing built-in function,
pg_notification_queue_usage() [1]:

pg_notification_queue_usage () → double precision
Returns the fraction (0–1) of the asynchronous notification queue's
maximum size that is currently occupied by notifications that are
waiting to be processed. See LISTEN and NOTIFY for more information.

This function calls asyncQueueAdvanceTail(). I think it's a similar
workaround to your patch, but without a code change.
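
So the workaround can be run from any session; a minimal example (the
tail-advancing side effect is per the code reading above, not something
the documentation promises):

    -- returns the fraction (0-1) of the queue currently in use and, as
    -- a side effect, advances the queue tail as far as readers allow
    SELECT pg_notification_queue_usage();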

What do you think?


Best,
Alex

Re: BUG #16961: Could not access status of transaction

From: Daniil Davydov
Date:
Hi,

On Wed, Aug 6, 2025 at 6:30 AM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
>
> Thank you for sharing your interim solution! Inspired by it, I found
> that there is an existing built-in function, pg_notification_queue_usage() [1]:
>
>> pg_notification_queue_usage () → double precision
>> Returns the fraction (0–1) of the asynchronous notification queue's maximum
>> size that is currently occupied by notifications that are waiting to be
>> processed. See LISTEN and NOTIFY for more information.
>
>
> This function calls asyncQueueAdvanceTail(). I think it's a similar workaround
> to your patch, but without a code change.
>
> What do you think?
>

Yep, it can be used as a workaround. But obviously the user doesn't
know when this function should be called - that only becomes clear
once an error occurs, that is, post factum. Thus, I think it is
better to add such functionality to autovacuum. This simple change
would ensure that there are no errors in most cases.
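
In the meantime, a site could at least run the workaround on a schedule
rather than post factum. A hedged sketch, assuming the third-party
pg_cron extension is installed (job name and interval are made up):

    -- run the tail-advancing built-in every five minutes
    SELECT cron.schedule(
      'advance-notify-tail',
      '*/5 * * * *',
      $$SELECT pg_notification_queue_usage()$$
    );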

--
Best regards,
Daniil Davydov



Re: BUG #16961: Could not access status of transaction

From: Alexandra Wang
Date:
I started a thread on the pgsql-hackers list, hoping to get more attention:

Alex