Discussion: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries


BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries

From: PG Bug reporting form
Date: Wed, 02 Sep 2020 10:58:49 +0000
The following bug has been logged on the website:

Bug reference:      16605
Logged by:          Tushar Takate
Email address:      tushar11.takate@gmail.com
PostgreSQL version: 10.5
Operating system:   OEL 6.9
Description:

Hi Team,

Problem statement: the PostgreSQL recovery (startup) process is blocking
application queries.

Environment details:

PostgreSQL community version: 10.5
OS: OEL 6.9
Replication type: streaming, asynchronous

DB parameters on the replica side:

max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = off

Startup process as shown by ps -ef:

postgres           24489 24487  0  2019 ?        1-20:23:35 postgres:
startup process   recovering 00000006000021D4000000ED waiting

Startup process from pg_stat_activity [here it does not show the process as
waiting]:

postgres=# select * from pg_stat_activity where pid='24489';
-[ RECORD 1 ]----+------------------------------
datid            | 
datname          | 
pid              | 24489
usesysid         | 
usename          | 
application_name | 
client_addr      | 
client_hostname  | 
client_port      | 
backend_start    | 2019-06-17 22:45:33.679547+00
xact_start       | 
query_start      | 
state_change     | 
wait_event_type  | 
wait_event       | 
state            | 
backend_xid      | 
backend_xmin     | 
query            | 
backend_type     | startup


The startup process has been blocking all application queries for more than 4 hours:

2020-08-29 09:50:07 UTC [4596]: [3-1]
user=postgres,db=pgprd_production,app=psql,client=[local] LOG:  process 4596
still waiting for AccessShareLock on relation 17280 of database 16408 after
1000.076 ms at character 15
2020-08-29 09:50:07 UTC [4596]: [4-1]
user=postgres,db=pgprd_production,app=psql,client=[local] DETAIL:  Process
holding the lock: 24489. Wait queue: 25091, 27676, 27186, 28443, 24754,
27109, 27679, 26314, 28682, 24756, 25348, 26938, 30990, 29405, 28770, 29408,
25351, 27625, 26091, 26470, 29170, 30756, 28683, 28589, 27195, 29336, 25694,
30637, 25191, 25346, 29660, 27872, 6563, 6625, 26491, 29492, 30443, 28767,
27742, 28877, 25629, 5941, 26311, 27993, 6951, 25026, 30581, 27746, 30527,
30757, 6564, 6566, 29582, 27188, 30085, 25021, 27041, 25352, 27187, 27042,
29173, 26028, 25073, 29649, 26793, 25194, 28590, 7666, 11080, 26808, 26798,
28528, 27499, 24948, 29694, 13580, 6444, 30684, 6565, 30532, 4596.
2020-08-29 09:50:07 UTC [4596]: [5-1]
user=postgres,db=pgprd_production,app=psql,client=[local] STATEMENT:  select
* from mytable limit 1;

All the PIDs in the wait queue are executing SELECT queries on the same table
(mytable).

Could you please help us understand the issue and suggest a possible
solution/workaround/fix for it?

As a temporary fix we restarted the DB service (before that we tried to kill
the process with signal -15, which did not work).

Thanks,
Tushar Takate.


Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries

From: Kyotaro Horiguchi
Date: Thu, 3 Sep 2020
At Wed, 02 Sep 2020 10:58:49 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> The following bug has been logged on the website:
> 
> Bug reference:      16605
> Logged by:          Tushar Takate
> Email address:      tushar11.takate@gmail.com
> PostgreSQL version: 10.5
> Operating system:   OEL 6.9
> Description:        
> 
> Hi Team,
> 
> Problem statement: the PostgreSQL recovery (startup) process is blocking
> application queries.
> 
> Environment details:
> 
> PostgreSQL community version: 10.5
> OS: OEL 6.9
> Replication type: streaming, asynchronous
> 
> DB parameters on the replica side:
> 
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> hot_standby_feedback = off
> 
> Startup process as shown by ps -ef:
> 
> postgres           24489 24487  0  2019 ?        1-20:23:35 postgres:
> startup process   recovering 00000006000021D4000000ED waiting

This is the cause of the entire blocking chain that follows. "recovering xxx
waiting" means the startup process is waiting for a recovery conflict to be
resolved; in other words, the startup process is being blocked by some
backend. In your case the startup process should have taken an
AccessExclusiveLock on relation 17280 before being blocked. With
max_standby_streaming_delay = -1, the startup process waits forever.
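
For example (a minimal sketch, not part of the original report), the lock the
startup process has taken can be confirmed on the standby from pg_locks, using
the pid shown by ps:

=# select locktype, database, relation, mode, granted
     from pg_locks
    where pid = 24489;   -- 24489 is the startup process pid from the ps output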

The most common cause of a recovery conflict when hot_standby_feedback = off
is a snapshot conflict: a vacuum on the primary side cannot be replayed on
the standby because some transaction on the standby side may still need the
to-be-vacuumed rows.
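
When that happens, the standby transactions holding back the snapshot horizon
can usually be spotted by their old backend_xmin. A rough sketch (not from the
original report):

=# select pid, backend_xmin, state, xact_start
     from pg_stat_activity
    where backend_xmin is not null
    order by age(backend_xmin) desc;   -- oldest snapshots first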

I'm not sure there is an easier way to confirm the exact cause, but it can be
examined with the following steps.

=# select sent_lsn, replay_lsn, file, upper(to_hex(off)) from pg_stat_replication,
   lateral pg_walfile_name_offset(replay_lsn) as o(file, off);
 sent_lsn  | replay_lsn |           file           | to_hex 
-----------+------------+--------------------------+--------
 0/5882B30 | 0/5874878  | 000000010000000000000005 | 874878

You will see that replay_lsn is behind sent_lsn, and the last two columns
show the location of the record that is blocked from being replayed.

$ pg_waldump .../000000010000000000000005 | grep 874878
rmgr: Heap2       len (rec/tot):    506/   506, tx:          0, lsn: 0/05874878, prev 0/05874850, desc: CLEAN remxid 600, blkref #0: rel 1663/12348/16385 blk 0

The Heap2-CLEAN record is emitted by vacuum. You can avoid this kind of
conflict by turning hot_standby_feedback on, and/or (preferably) setting
max_standby_streaming_delay to a finite value, which resolves any kind of
conflict by terminating the conflicting backends.  As a workaround after a
snapshot conflict has happened, manually terminate the backends with
backend_xmin <= 600, and then the startup process will continue recovery.
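
For illustration only, such a manual termination could look like the sketch
below; 600 is the remxid from the example pg_waldump output and must be
replaced with the value you actually see, and the candidate list should be
double-checked before terminating anything:

=# select pid, backend_xmin, query
     from pg_stat_activity
    where backend_xmin is not null
      and age(backend_xmin) >= age('600'::xid);   -- i.e. backend_xmin <= 600
=# select pg_terminate_backend(pid)
     from pg_stat_activity
    where backend_xmin is not null
      and age(backend_xmin) >= age('600'::xid);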

Another common cause is an access-exclusive lock that is blocked by a
standby transaction. In that case you will see a Standby-LOCK record as the
problematic WAL record, and pg_locks also shows such lock conflicts.
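
For that case, a rough sketch like the following shows which standby backend
holds the lock the startup process is waiting for (and vice versa):

=# select l.pid, a.backend_type, l.locktype, l.relation, l.mode, l.granted
     from pg_locks l
     left join pg_stat_activity a using (pid)
    where l.relation is not null
    order by l.relation, l.granted desc;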

I think other kinds of recovery conflicts rarely happen.

> All the PIDs in the wait queue are executing SELECT queries on the same
> table (mytable).
> 
> Could you please help us understand the issue and suggest a possible
> solution/workaround/fix for it?
> 
> As a temporary fix we restarted the DB service (before that we tried to
> kill the process with signal -15, which did not work).

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries

From: Tushar Takate
Date: Fri, 11 Sep 2020 22:18:27 +0530
Hi Kyotaro,

Thanks for sharing your thoughts on this issue.

Is this expected to happen again with the settings below, or is it a bug with
these settings?

hot_standby_feedback = off and max_standby_streaming_delay = -1

We have kept these settings intentionally on this server to avoid bloat and
query termination after some set time.

If it is a bug, please let me know which version contains the fix, or in
which version the fix is going to arrive.

 

-
Thanks & Regards,
Tushar K Takate.





Re: BUG #16605: PostgreSQL recovery startup process waiting and blocking to application queries

From: Kyotaro Horiguchi
Date:
At Fri, 11 Sep 2020 22:18:27 +0530, Tushar Takate <tushar11.takate@gmail.com> wrote in 
> Hi Kyotaro,
> 
> Thanks for sharing your thoughts on this issue.
> 
> Is this expected to happen again with the settings below, or is it a bug
> with these settings?
> 
> hot_standby_feedback = off and max_standby_streaming_delay = -1
> 
> We have kept these settings intentionally on this server to avoid bloat
> and query termination after some set time.

The table bloat means there are long transactions on the standby. With
hot_standby_feedback = off, such long transactions are not protected from
query cancellation caused by rows vacuumed on the primary. Even if you got
such bloat by setting it to "on", that only means you would have had the same
bloat if those transactions had been run on the primary.

When "resolving" such bloat by setting hot_standby_feedback to off,
max_standby_streaming_delay = -1 makes things worse by preventing the system
from resolving the replication stall.
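
If the goal is to bound how long the standby can stall replay while still
limiting query cancellation, a middle-ground standby configuration could look
like the following sketch (the values are only illustrative, not a
recommendation from this thread):

hot_standby_feedback = on             # primary retains rows still visible to standby snapshots
max_standby_streaming_delay = 5min    # cancel conflicting standby queries after at most 5 minutes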

> If it is a bug, please let me know which version contains the fix, or in
> which version the fix is going to arrive.

No. As far as I can see in this report, it is the designed behavior, or
rather a restriction of streaming replication.

https://www.postgresql.org/docs/10/hot-standby.html

Since PG12, vacuum can run without file truncation (which is what takes the
access-exclusive lock; see the vacuum_truncate table option), so only the
lock conflict (and possible deadlock) caused by truncation might be avoided
that way. However, even with that option, replication does not advance until
the transaction causing the replication stall is gone.

https://www.postgresql.org/docs/13/sql-createtable.html
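
As a sketch of that option (mytable here is just the table name from the
original report), on PG12 or later the truncation step that takes the
AccessExclusiveLock can be disabled per table:

=# alter table mytable set (vacuum_truncate = off);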

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center