Обсуждение: Vacuum Issues
Hi All,
Having an issue where vacuum has been unable to clear dead rows since March 12. We are running an older version 9.2.4 with three hot standby servers.
We have standby feedback enabled, but according to pg_stat_activity there are no long running queries on the master or standby servers.
Also, we have no hung transactions on the master server:
SELECT * FROM pg_stat_activity where xact_start < (now() - interval '5 mins') ORDER BY xact_start NULLS LAST LIMIT 1;
Produces no results.
Anyone have any thoughts on what we should be checking next?
Thanks,
Darron Harrison
On Mar 26, 2020, at 2:10 PM, Darron Harrison <darron@realtyserver.com> wrote:Anyone have any thoughts on what we should be checking next?
Check the backend_xmin in pg_stat_activity; is the holdup in one of replicas or on the master? 
select pid
 , usename
 , client_addr
 , state
 , backend_type
 , backend_xid
 , backend_xmin
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc
limit 10
;
Two-phase transactions:
select * from pg_prepared_xacts;
Replication slots: 
select * from pg_replication_slots;
Unable to run the first query and third query. I believe it's a version issue, as we are running 9.2.4.
No results from select * from pg_prepared_xacts on any of the servers.
Darron
---- On Thu, 26 Mar 2020 11:50:10 -0700 Rui DeSousa <rui@crazybean.net> wrote ----
On Mar 26, 2020, at 2:10 PM, Darron Harrison <darron@realtyserver.com> wrote:Anyone have any thoughts on what we should be checking next?Check the backend_xmin in pg_stat_activity; is the holdup in one of replicas or on the master?select pid, usename, client_addr, state, backend_type, backend_xid, backend_xminfrom pg_stat_activitywhere backend_xmin is not nullorder by age(backend_xmin) desclimit 10;Two-phase transactions:select * from pg_prepared_xacts;Replication slots:select * from pg_replication_slots;
On Mar 26, 2020, at 3:01 PM, Darron Harrison <darron@realtyserver.com> wrote:Unable to run the first query and third query. I believe it's a version issue, as we are running 9.2.4.
Looks like backend_xmin was added in 9.4; shoot
Replication slots was added in 9.4, too; no slots to worry about.
Since you don’t have access to backend_xmin; you have check the master and all the replicas as you have feedback enabled.  
I would look at all non “idle” sessions — i.e. “idle in tran” could be holding on to an old xmin. 
select pid
  , usename
  , client_addr
  , client_port
  , state
  , age(now(), state_change) as state_time
  , age(now(), xact_start) as xact_time
from pg_stat_activity
where state != 'idle'
order by xact_time desc
;
I am not seeing anything suspicious on the master or slaves.
master:
  pid  |     usename      |  client_addr  | client_port | state  |    state_time    |    xact_time 
-------+------------------+---------------+-------------+--------+------------------+------------------
17627 | xposure          | 64.40.104.231 |       13231 | active | 00:00:02.11561   | 00:00:02.116436
31197 | vancouver_island | 64.40.104.242 |        8174 | active | 00:00:00.067581  | 00:00:00.068289
17616 | xposure          | 64.40.104.231 |       13216 | active | 00:00:00.015216  | 00:00:00.015392
17631 | xposure          | 64.40.104.231 |       13230 | active | 00:00:00.007903  | 00:00:00.010017
  4589 | postgres         |               |          -1 | active | -00:00:00.000002 | 00:00:00
31578 | south_okanagan   | 64.40.104.230 |       32926 | active | -00:00:00.00138  | -00:00:00.001307
Slave:
  pid  |     usename      |  client_addr  | client_port | state  |    state_time    |    xact_time
-------+------------------+---------------+-------------+--------+------------------+-----------------
187058 | postgres |             |          -1 | active | -00:00:00.000004 | 00:00:00
Slave:
  pid  |     usename      |  client_addr  | client_port | state  |    state_time    |    xact_time 
-------+------------------+---------------+-------------+--------+------------------+-----------------
19411 | vancouver_island | 64.40.104.242 |       30274 | active | 00:00:00.08489   | 00:00:00.085521
18180 | pgsql            |               |          -1 | active | -00:00:00.000003 | 00:00:00
Slave:
  pid  | usename  | client_addr | client_port | state  |    state_time    | xact_time 
-------+----------+-------------+-------------+--------+------------------+-----------
29524 | postgres |             |          -1 | active | -00:00:00.000004 | 00:00:00
Darron
---- On Thu, 26 Mar 2020 12:22:57 -0700 Rui DeSousa <rui@crazybean.net> wrote ----
On Mar 26, 2020, at 3:01 PM, Darron Harrison <darron@realtyserver.com> wrote:Unable to run the first query and third query. I believe it's a version issue, as we are running 9.2.4.Looks like backend_xmin was added in 9.4; shootReplication slots was added in 9.4, too; no slots to worry about.Since you don’t have access to backend_xmin; you have check the master and all the replicas as you have feedback enabled.I would look at all non “idle” sessions — i.e. “idle in tran” could be holding on to an old xmin.select pid, usename, client_addr, client_port, state, age(now(), state_change) as state_time, age(now(), xact_start) as xact_timefrom pg_stat_activitywhere state != 'idle'order by xact_time desc;
On Mar 26, 2020, at 4:18 PM, Darron Harrison <darron@realtyserver.com> wrote:I am not seeing anything suspicious on the master or slaves.
I would ask what replication timeout values are but those where added in 9.3 (wal_reciver_timeout and wal_sender_timeout).
Are all the replicas receiving updates and current or are they lagging? Does replication traverse any firewalls?
There is no current lag on the replicas. Replication does traverse a firewall, but we have made no changes recently.
I will say that one of the hot standbys was only recently attached, and it seems like the issues started when we began sending some longer running queries it's way. We have since placed those queries back on the master, but the vacuum issues remain.
One side effect of whatever is happening, is that nightly backups are taking twice as long as normal.
Darron
---- On Thu, 26 Mar 2020 13:47:21 -0700 Rui DeSousa <rui@crazybean.net> wrote ----
On Mar 26, 2020, at 4:18 PM, Darron Harrison <darron@realtyserver.com> wrote:I am not seeing anything suspicious on the master or slaves.I would ask what replication timeout values are but those where added in 9.3 (wal_reciver_timeout and wal_sender_timeout).Are all the replicas receiving updates and current or are they lagging? Does replication traverse any firewalls?
On Mar 26, 2020, at 5:40 PM, Darron Harrison <darron@realtyserver.com> wrote:There is no current lag on the replicas. Replication does traverse a firewall, but we have made no changes recently.I will say that one of the hot standbys was only recently attached, and it seems like the issues started when we began sending some longer running queries it's way. We have since placed those queries back on the master, but the vacuum issues remain.One side effect of whatever is happening, is that nightly backups are taking twice as long as normal.
It could sill be a bad/sale replication session.  If the firewall drops the replication stream and does not send a reset packets (bad pratice) then the replication session might still be lingering on the Postgres server and holding on to a very old xmin.  Do you know if you have TCP/IP Keepalive enabled? I don’t think in 9.2 replication sessions are listed in pg_stat_activity; thus, you’ll have to look for TCP/IP connections to the replics that should not exist. 
Check all the upstream servers for stale TCP/IP replication connections; using netstat.  I would also look at the system’s process list for walsender processes to see if there more more than there should be; i.e. two of them for single replica.
If you do find one; the best way to terminate it is to drop the TCP/IP connection.  i.e. In FreeBSD it would be the command “tcpdrop”; for Linux there are few utilities that do same -- I just don’t recall the name of them.
I am currently seeing three WAL sender processes on the master, and no stale replication connections on the slaves. 
To clarify, are you saying there should be two WAL sender processes for each slave for a total of six?
Darron  
---- On Thu, 26 Mar 2020 15:26:01 -0700 Rui DeSousa <rui@crazybean.net> wrote ----
On Mar 26, 2020, at 5:40 PM, Darron Harrison <darron@realtyserver.com> wrote:There is no current lag on the replicas. Replication does traverse a firewall, but we have made no changes recently.I will say that one of the hot standbys was only recently attached, and it seems like the issues started when we began sending some longer running queries it's way. We have since placed those queries back on the master, but the vacuum issues remain.One side effect of whatever is happening, is that nightly backups are taking twice as long as normal.It could sill be a bad/sale replication session. If the firewall drops the replication stream and does not send a reset packets (bad pratice) then the replication session might still be lingering on the Postgres server and holding on to a very old xmin. Do you know if you have TCP/IP Keepalive enabled? I don’t think in 9.2 replication sessions are listed in pg_stat_activity; thus, you’ll have to look for TCP/IP connections to the replics that should not exist.Check all the upstream servers for stale TCP/IP replication connections; using netstat. I would also look at the system’s process list for walsender processes to see if there more more than there should be; i.e. two of them for single replica.If you do find one; the best way to terminate it is to drop the TCP/IP connection. i.e. In FreeBSD it would be the command “tcpdrop”; for Linux there are few utilities that do same -- I just don’t recall the name of them.
Darron Harrison <darron@realtyserver.com> writes: > I am currently seeing three WAL sender processes on the master, and > no stale replication connections on the slaves. > > To clarify, are you saying there should be two WAL sender processes > for each slave for a total of six? > Silly question... Have you verified that autovac is actually running? launcher still alive, not malconfig'd as to do nothing... etc? Silly question #2; supposing autovac is trying to do as intended, have you inspected the logs to insure that it's not erroring out on one or more tables, catalogs, indexes... etc? I've reade most of this thread and not seeing any indication if the problem is confined to just one object or several. FWIW > Darron > > > ---- On Thu, 26 Mar 2020 15:26:01 -0700 Rui DeSousa > <rui@crazybean.net> wrote ---- > > > > > > On Mar 26, 2020, at 5:40 PM, Darron Harrison < > darron@realtyserver.com> wrote: > > There is no current lag on the replicas. Replication does > traverse a firewall, but we have made no changes recently. > > I will say that one of the hot standbys was only recently > attached, and it seems like the issues started when we began > sending some longer running queries it's way. We have since > placed those queries back on the master, but the vacuum > issues remain. > > One side effect of whatever is happening, is that nightly > backups are taking twice as long as normal. > > > It could sill be a bad/sale replication session. If the firewall > drops the replication stream and does not send a reset packets > (bad pratice) then the replication session might still be > lingering on the Postgres server and holding on to a very old > xmin. Do you know if you have TCP/IP Keepalive enabled? I don’t > think in 9.2 replication sessions are listed in pg_stat_activity; > thus, you’ll have to look for TCP/IP connections to the replics > that should not exist. > > Check all the upstream servers for stale TCP/IP replication > connections; using netstat. I would also look at the system’s > process list for walsender processes to see if there more more > than there should be; i.e. two of them for single replica. > > If you do find one; the best way to terminate it is to drop the > TCP/IP connection. i.e. In FreeBSD it would be the command > “tcpdrop”; for Linux there are few utilities that do same -- I > just don’t recall the name of them. > > > > > > > > -- Jerry Sievers Postgres DBA/Development Consulting e: postgres.consulting@comcast.net
On Thu, Mar 26, 2020 at 8:05 PM Jerry Sievers <gsievers19@comcast.net> wrote:
Darron Harrison <darron@realtyserver.com> writes:
> I am currently seeing three WAL sender processes on the master, and
> no stale replication connections on the slaves.
>
> To clarify, are you saying there should be two WAL sender processes
> for each slave for a total of six?
>
Silly question...
Have you verified that autovac is actually running? launcher still
alive, not malconfig'd as to do nothing... etc?
Silly question #2; supposing autovac is trying to do as intended, have
you inspected the logs to insure that it's not erroring out on one or
more tables, catalogs, indexes... etc?
I've reade most of this thread and not seeing any indication if the
problem is confined to just one object or several.
FWIW
Also, to clarify, are you seeing actual dead rows not being recycled or just free space not being recovered? These are two completely different things. Actual dead tuples will show up for tables in the pg_stat_all_tables system catalog under the n_dead_tup column. Note also that this is just an estimate since the last time an analyze was run on that table, also available from the same system catalog via the last_analyze or last_autoanalyze columns. I'd recommend running an ANALYZE on the table(s) in question and checking this catalog again.
For actual data on dead rows or whether this is just free space, you can use the pgstattuple contrib module: https://www.postgresql.org/docs/current/pgstattuple.html
For large tables, this can take a while to run because it is gathering the actual statistics of used space, dead rows and free space, not just the estimates that analyze does.
If it's actual dead tuples and vacuum is not clearing it up, you are running an early version of 9.2 which is 20 release behind the latest (9.2.24). There could very well be a bug if you've exhausted all other possible reasons for the dead tuples sticking around. If you're not able to upgrade to a more recent major version, I would highly recommend at least upgrading to the latest 9.2.
If it's just free space (also called bloat), you will have to perform a VACUUM FULL to full return the disk space to the operating system. This will lock the table for the duration and completely rewrite it and all its indexes. However, once complete, if the table was bloated you should see your backups return to their normal runtimes.
I've also written another tool for making bloat monitoring easier if you find this is a frequent problem: https://github.com/keithf4/pg_bloat_check
Keith
On Mar 26, 2020, at 7:24 PM, Darron Harrison <darron@realtyserver.com> wrote:To clarify, are you saying there should be two WAL sender processes for each slave for a total of six?
Nope, only should see one process for each slave.