Re: Backends stalled in 'startup' state

Поиск
Список
Период
Сортировка
От Ashwin Agrawal
Тема Re: Backends stalled in 'startup' state
Дата
Msg-id CAKSySwc5OdOWuvWCu+-njdDNjwW=U7WHbxri+5ft5mVRrhB8nA@mail.gmail.com
обсуждение исходный текст
Ответ на Backends stalled in 'startup' state  (Ashwin Agrawal <ashwinstar@gmail.com>)
Список pgsql-hackers
On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal <ashwinstar@gmail.com> wrote:

We recently saw many backends (close to max_connection limit) get stalled in 'startup' in one of the production environments for Greenplum (fork of PostgreSQL). Tracing the reason, it was found all the tuples created by bootstrap (xmin=1) in pg_attribute were at super high block numbers (for example beyond 30,000). Tracing the reason for the backend startup stall exactly matched Tom's reasoning in [1]. Stalls became much longer in presence of sub-transaction overflow or presence of long running transactions as tuple visibility took longer. The thread ruled out the possibility of system catalog rows to be present in higher block numbers instead of in front for pg_attribute.

This thread provides simple reproduction on the latest version of PostgreSQL and RCA for how bootstrap catalog entries can move to higher blocks and as a result cause stalls for backend starts. Simple fix to avoid the issue provided at the end.

The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites the table by performing the seqscan as well. And heapam_relation_copy_for_cluster() conveys feel free to use syncscan. Hence logic to not start from block 0 instead some other block already in cache is possible and opens the possibility to move the bootstrap tuples to anywhere else in the table.

------------------------------------------------------------------
Repro
------------------------------------------------------------------
-- create database to play
drop database if exists test;
create database test;
\c test

-- function just to create many tables to increase pg_attribute size
-- (ideally many column table might do the job more easily)
CREATE OR REPLACE FUNCTION public.f(id integer)  
 RETURNS void  
 LANGUAGE plpgsql  
 STRICT  
AS $function$  
declare  
  sql text;  
  i int;  
begin  
  for i in id..id+9999 loop  
    sql='create table if not exists tbl'||i||' (id int)';  
    execute sql;  
  end loop;  
end;  
$function$;

select f(10000);
select f(20000);
select f(30000);
select f(40000);

-- validate pg_attribute size is greater than 1/4 of shared_buffers
-- for syncscan to triggger
show shared_buffers;
select pg_size_pretty(pg_relation_size('pg_attribute'));
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;

-- perform seq scan of pg_attribute to page past bootstrapped tuples
copy (select * from pg_attribute limit 2000) to '/tmp/p';

-- this will internally use syncscan starting with block after bootstrap tuples 
-- and hence move bootstrap tuples last to higher block numbers
vacuum full pg_attribute;

------------------------------------------------------------------
Sample run
------------------------------------------------------------------
show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

select pg_size_pretty(pg_relation_size('pg_attribute'));
 pg_size_pretty
----------------
 40 MB
(1 row)

select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
 ctid  | xmin | attrelid |   attname    
-------+------+----------+--------------
 (0,1) |    1 |     1255 | oid
 (0,2) |    1 |     1255 | proname
 (0,3) |    1 |     1255 | pronamespace
 (0,4) |    1 |     1255 | proowner
 (0,5) |    1 |     1255 | prolang
(5 rows)

copy (select * from pg_attribute limit 2000) to '/tmp/p';
COPY 2000
vacuum full pg_attribute;
VACUUM
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
   ctid    | xmin | attrelid |   attname    
-----------+------+----------+--------------
 (5115,14) |    1 |     1255 | oid
 (5115,15) |    1 |     1255 | proname
 (5115,16) |    1 |     1255 | pronamespace
 (5115,17) |    1 |     1255 | proowner
 (5115,18) |    1 |     1255 | prolang
(5 rows)


Note:
-- used logic causing the problem to fix it as well on the system :-)
-- scan till block where bootstrap tuples are located
select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
-- now due to syncscan triggering it will pick the blocks with bootstrap tuples first and help to bring them back to front
vacuum full pg_attribute;

------------------------------------------------------------------
Patch to avoid the problem:
------------------------------------------------------------------
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a3414a76e8..4c031914a3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
                pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
                                                                         PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
 
-               tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+               /*
+                * For system catalog tables avoid syncscan, so that scan always
+                * starts from block 0 during rewrite and helps retain bootstrap
+                * tuples in initial pages only. If using syncscan, then bootstrap
+                * tuples may move to higher blocks, which will lead to degraded
+                * performance for relcache initialization during connection starts.
+                */
+               if (is_system_catalog)
+                       tableScan = table_beginscan_strat(OldHeap, SnapshotAny, 0, (ScanKey) NULL, true, false);
+               else
+                       tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
                heapScan = (HeapScanDesc) tableScan;
                indexScan = NULL;
------------------------------------------------------------------



Missed to receive comment/reply to earlier email on pgsql-hackers@lists.postgresql.org hence trying via pgsql-hackers@postgresql.org this time (as not sure was missed or no interest).

Also, I wish to add more scenarios where the problem manifests. During RelationCacheInitializePhase3() -> load_critical_index() performs sequential search for tuples in pg_class for ClassOidIndexId, AttributeRelidNumIndexId, IndexRelidIndexId, OpclassOidIndexId, AccessMethodProcedureIndexId, RewriteRelRulenameIndexId and TriggerRelidNameIndexId. We found on systems that tuples corresponding to these indexes are not always present in starting blocks of pg_class. Specially for pg_opclass_oid_index, pg_rewrite_rel_rulename_index, pg_amproc_fam_proc_index, pg_trigger_tgrelid_tgname_index, pg_index_indexrelid_index to be present many times in block numbers over 2000 and such. Not fully sure on reasoning for this - maybe REINDEX (moves them to higher block numbers). Under any situation where tuple visibility slows down (let's say due to sub-transaction overflow) and relcache is invalidated, a lot of backends were seen stalled in the "startup" phase.


-- 
Ashwin Agrawal (VMware)

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Justin Pryzby
Дата:
Сообщение: Re: CI and test improvements
Следующее
От: Andrey Chudnovsky
Дата:
Сообщение: Re: [PoC] Federated Authn/z with OAUTHBEARER