Discussion: Hot Standby Conflict on pg_attribute
Hello,
A client has recently had a couple of hot standby query conflict pile-ups around AccessShare lock waits on pg_attribute. Here is an example from the log:
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-1] sql_error_code = 00000 LOG: process 118946 still waiting for AccessShareLock on relation 1249 of database 16401 after 1000.127 ms at character 92
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-2] sql_error_code = 00000 DETAIL: Process holding the lock: 9. Wait queue: 118948, 118950, 118708, 118818, 118886, 118961, 118960, 118806, 118963, 118959, 118881, 118887, 118878, 118896, 118964, 118965, 118945, 118949, 118946, 118743, 118966, 118947, 118967, 118968.
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-3] sql_error_code = 00000 STATEMENT: SELECT uc.id,
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-4] uc.some_id,
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-5] uc.utr_id,
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-6] utr.name
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-7] FROM usertable1 uc
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-8] INNER JOIN usertable2 utr
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-9] ON uc.utr_id = utr.id
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-10] WHERE uc.some_id = $1
Mar 27 12:06:37 ip-10-0-125-5 7dc68e48_fbd9_41d7_9ab1_65599036dd75[118946]: [9-11] ORDER BY name
Relation 1249 is pg_attribute, and process 9, which was holding the lock, was the RecoveryWalAll process. I've confirmed that autovacuum had removed some pages from pg_attribute shortly before this. That happens somewhat regularly, since this client runs a couple thousand REFRESH MATERIALIZED VIEW queries per day, which appear to cause inserts and deletes there, so its holding an exclusive lock on pg_attribute makes sense.
The question then is: Why would these user queries be waiting on an AccessShare lock on pg_attribute? Thus far we've been unable to recreate any transactions with the above query (and others) that show any pg_attribute locks. There is no ORM in play here, and these queries are being sent as single-query transactions via this Node.js postgres adapter: https://github.com/brianc/node-postgres which is pretty bare bones.
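For triaging pile-ups like the one in the log excerpt above, here is a minimal sketch that pulls the waiting PID, lock mode, relation OID, lock holders, and wait queue out of a LOG/DETAIL pair. The function and pattern names are made up for illustration, and the regular expressions are written only against the message wording shown in the excerpt, so treat them as an assumption about the log format rather than a general parser.

```python
import re

# Matches the "still waiting" LOG message as it appears in the excerpt.
WAIT_RE = re.compile(
    r"process (?P<pid>\d+) still waiting for (?P<mode>\w+) on relation "
    r"(?P<relation>\d+) of database (?P<database>\d+) "
    r"after (?P<waited_ms>[\d.]+) ms"
)

# Matches the accompanying DETAIL message listing holders and waiters.
HOLDER_RE = re.compile(
    r"Process(?:es)? holding the lock: (?P<holders>[\d, ]+)\. "
    r"Wait queue: (?P<queue>[\d, ]+)\."
)


def _pids(text):
    """Turn '118948, 118950' into [118948, 118950]."""
    return [int(part) for part in text.split(",") if part.strip()]


def parse_lock_wait(log_line, detail_line):
    """Return a dict describing one lock wait, or None if either line does not match."""
    wait = WAIT_RE.search(log_line)
    detail = HOLDER_RE.search(detail_line)
    if wait is None or detail is None:
        return None
    return {
        "pid": int(wait.group("pid")),
        "mode": wait.group("mode"),
        "relation": int(wait.group("relation")),
        "database": int(wait.group("database")),
        "waited_ms": float(wait.group("waited_ms")),
        "holders": _pids(detail.group("holders")),
        "queue": _pids(detail.group("queue")),
    }
```

Grouping the parsed records by relation and holder makes it easy to see whether every waiter is queued behind the same recovery-side lock, as was the case here.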
Hi Andres,
Thank you very much! That's exactly what I needed.
On Fri, May 10, 2019 at 12:14 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-05-09 13:03:50 -0700, Erik Jones wrote:
> The question then is: Why would these user queries be waiting on an
> AccessShare lock on pg_attribute? Thus far we've been unable to recreate
> any transactions with the above query (and others) that show any
> pg_attribute locks. There is no ORM in play here and these queries are
> being sent as single query transactions via this Node.js postgres adapter:
> https://github.com/brianc/node-postgres which is pretty bare bones.
Queries that access a table for the *first* time after DDL happened
(including truncating the relation), need an AccessShareLock on
pg_attribute (and pg_class, pg_index, ...) for a short time.
You can reproduce that fairly easily:
S1: CREATE TABLE foo();
S2: BEGIN; LOCK pg_attribute;
S1: SELECT * FROM foo;
S2: COMMIT;
S1 could execute the select, because it has a cached view of the way the
relation looks.
S2: ALTER TABLE foo ADD COLUMN bar INT;
S2: BEGIN; LOCK pg_attribute;
S1: SELECT * FROM foo;
Here S1 is blocked, because it needs to look at pg_attribute to figure
out the "shape" of the table, but it's currently locked.
Greetings,
Andres Freund
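The reasoning behind the two-session reproduction above can be illustrated with a toy model. This is only a sketch of the idea (a per-session cache that is consulted first, with a catalog read needed after invalidation). It is not how PostgreSQL implements its relation cache, and every class and method name below is hypothetical.

```python
class ToyCatalog:
    """Stand-in for pg_attribute: column lists plus one exclusive-lock flag."""

    def __init__(self):
        self.columns = {}              # table name -> list of column names
        self.exclusively_locked = False


class ToySession:
    """Stand-in for a backend with a private cache of table definitions."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {}

    def invalidate(self, table):
        """Model the effect of DDL: the cached definition is thrown away."""
        self.cache.pop(table, None)

    def select(self, table):
        """Return ("ok", columns), or ("blocked", None) if the catalog must be
        read while it is exclusively locked."""
        if table not in self.cache:
            # Cache miss: the session has to read the catalog, which needs
            # a shared lock that conflicts with the exclusive one.
            if self.catalog.exclusively_locked:
                return ("blocked", None)
            self.cache[table] = list(self.catalog.columns[table])
        return ("ok", self.cache[table])
```

In this model a session with a warm cache keeps working while the catalog is locked, and the first access after an invalidation is the one that queues up, which matches the behavior described for S1.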
On Fri, May 10, 2019 at 12:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-09 13:03:50 -0700, Erik Jones wrote:
>> The question then is: Why would these user queries be waiting on an
>> AccessShare lock on pg_attribute?
> Queries that access a table for the *first* time after DDL happened
> (including truncating the relation), need an AccessShareLock on
> pg_attribute (and pg_class, pg_index, ...) for a short time.
Also, it seems likely that what's really triggering the issue is
autovacuum on pg_attribute trying to truncate off empty pages
in pg_attribute (after a bunch of dead rows were generated there
by DDL activity). That requires exclusive lock on pg_attribute,
which would propagate down to the standby.
regards, tom lane
Right, that part I understood after checking pg_attribute's insert/delete counts in pg_stat_sys_tables before and after some REFRESH MATERIALIZED VIEW runs on an otherwise idle server. With them running 2k+ refreshes per day, autovac is regularly working on their catalog tables.
Thanks!
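The before/after check described above could be scripted roughly as follows. The query text is an assumption about what was run (the thread only names the view), the helper name is hypothetical, and the snapshots are expected as plain dicts fetched by whatever client is in use.

```python
# Assumed snapshot query; n_tup_ins and n_tup_del are cumulative counters
# exposed by pg_stat_sys_tables.
SNAPSHOT_SQL = """
SELECT relname, n_tup_ins, n_tup_del
FROM pg_stat_sys_tables
WHERE relname = 'pg_attribute'
"""


def churn_per_refresh(before, after, refreshes):
    """Average rows inserted into and deleted from the catalog per refresh.

    `before` and `after` are dicts with 'n_tup_ins' and 'n_tup_del' taken
    around `refreshes` REFRESH MATERIALIZED VIEW runs on an idle server.
    """
    if refreshes <= 0:
        raise ValueError("refreshes must be positive")
    return {
        "inserted": (after["n_tup_ins"] - before["n_tup_ins"]) / refreshes,
        "deleted": (after["n_tup_del"] - before["n_tup_del"]) / refreshes,
    }
```

Multiplying the per-refresh figures by the daily refresh count gives a rough idea of how many dead pg_attribute rows autovacuum has to clean up each day.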
On Sat, May 11, 2019 at 9:15 AM Jeremy Schneider <schneider@ardentperf.com> wrote:
Just a quick footnote: If autovac truncations are frequently causing replica lag, and if this is a problem for you, IIUC one way you can stop autovac from doing the truncations even on older versions is setting old_snapshot_threshold to any value at all besides zero. (On 12+ you can directly control the truncation behavior.)
-Jeremy
Thanks for the tip!
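As a rough way to decide which of the two options in the footnote applies to a given server, here is a small sketch keyed on server_version_num. The thread does not name the 12+ mechanism; this assumes it refers to the vacuum_truncate storage parameter and the VACUUM (TRUNCATE) option added in PostgreSQL 12. The function name and return labels are hypothetical, and whether either setting is practical for a system catalog such as pg_attribute is not covered in the thread and should be verified separately.

```python
def truncation_mitigation(server_version_num):
    """Suggest which knob to look at for avoiding autovacuum truncation.

    `server_version_num` is the integer reported by SHOW server_version_num,
    for example 110005 for 11.5 or 120000 for 12.0.
    """
    if server_version_num >= 120000:
        # Assumption: the "12+" remark refers to vacuum_truncate.
        return "vacuum_truncate"
    if server_version_num >= 90600:
        # old_snapshot_threshold first appeared in 9.6.
        return "old_snapshot_threshold"
    # Neither option from the footnote exists on older releases.
    return "none"
```

Note that old_snapshot_threshold changes snapshot behavior for the whole server, so using it purely to suppress truncation is a side-effect workaround and worth testing before rolling out.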