Обсуждение: [ADMIN] VACUUM ANALYZE Issues

Поиск
Список
Период
Сортировка

[ADMIN] VACUUM ANALYZE Issues

От
Harold Falkmeyer
Дата:
Greetings,

My firm manages a somewhat large PostgreSQL database (400'ish tables, 5 TiB'ish on disk, 200'ish GiB pg_dumps, etc.).  We're presently running PostgreSQL 8.4.22 on Linux 2.6.32 with Slony-I 2.0.8 for replication.

Over the past several weeks, our weekly VACUUM ANALYZE job has become disruptive to other queries, seemingly spinning on one particular table (one of our largest in rows, columns, indexes, foreign keys, and general use), ultimately requiring that the job be terminated.  With the number of INSERTs, UPDATEs, and DELETEs going through our system, it's imperative that VACUUM ANALYZEs be completed periodically.  In the past, and given our usage patterns, we've opted for weekly VACUUM ANALYZEs in lieu of autovacuum, which is presently disabled.  If necessary, we can schedule downtime, dereplicate the affected table, entirely rebuild the affected table, and rereplicate.  Though, if possible, we're looking for a solution without significant downtime.

After about 60-90 minutes working on the aforementioned table, and with the VACUUM ANALYZE at the top of our longest running queries list, other operations attempting to read from or write to on that table either complete extremely slowly or are awaiting locks held by the VACUUM ANALYZE.  Sampling during the most recent disruptive period, the VACUUM ANALYZE pid held 20 locks: 17 RowExclusiveLocks, one AccessExclusiveLock, and one ShareUpdateExclusive, all on the aforementioned table, and one ExclusiveLock for the containing transaction.  After sending SIGTERMs to all other pids enumerated in pg_locks as holding or wanting locks on that table, the VACUUM ANALYZE continued without apparent progress.  At the time, the containing process was consuming a rough average of about 6% of one CPU and 4.9% memory (no swapping), was holding 641 descriptors, was in a seeming loop of lseek, read, memcpy, and memcmp, with occasional semop calls.  During this process, we attempted to SIGTERM all other connections to ensure no stale transactions, locks, etc. were preventing the VACUUM ANALYZE from finishing.  Overall system load and I/O activity was quite low.

Ultimately, we've felt forced to terminate the VACUUM ANALYZEs.  After sending it a SIGTERM and after about 30 seconds of seeing no change, we attempted a service postgresql-8.4 stop.  After 30 additional seconds, we sent a SIGQUIT to the VACUUM ANALYZE process (which we always try to avoid), which terminated it and, in turn, PostgreSQL stopped.  Thereafter, the restarts seemed quick and uneventful, with cleanup happening without reported exception.  Only once in the last month or so have we seen a VACUUM ANALYZE on this table complete.

Thank you in advance,

Harold Falkmeyer
hfalkmeyer@yahoo.com

Re: [ADMIN] VACUUM ANALYZE Issues

От
Andres Freund
Дата:
Hi,

On 2017-06-05 14:20:12 -0700, Harold Falkmeyer wrote:
> My firm manages a somewhat large PostgreSQL database (400'ish tables, 5
> TiB'ish on disk, 200'ish GiB pg_dumps, etc.).  We're presently running
> PostgreSQL 8.4.22 on Linux 2.6.32 with Slony-I 2.0.8 for replication.

First off, you should seriously consider upgrading. The situation has
gotten considerably better since 8.4, particularly in 9.6.  Also the 8.4
branch hasn't received bugfixed (including security fixes) for nearly
three years.

> Over the past several weeks, our weekly VACUUM ANALYZE job has become
> disruptive to other queries, seemingly spinning on one particular table
> (one of our largest in rows, columns, indexes, foreign keys, and general
> use), ultimately requiring that the job be terminated.  With the number of
> INSERTs, UPDATEs, and DELETEs going through our system, it's imperative
> that VACUUM ANALYZEs be completed periodically.  In the past, and given our
> usage patterns, we've opted for weekly VACUUM ANALYZEs in lieu of
> autovacuum, which is presently disabled.

That sounds like it might seriously exascerbate the problem - weekly is
really infrequent for both vacuum and analyze.


> After about 60-90 minutes working on the aforementioned table, and with the
> VACUUM ANALYZE at the top of our longest running queries list, other
> operations attempting to read from or write to on that table either
> complete extremely slowly or are awaiting locks held by the VACUUM
> ANALYZE.  Sampling during the most recent disruptive period, the VACUUM
> ANALYZE pid held 20 locks: 17 RowExclusiveLocks, one AccessExclusiveLock,
> and one ShareUpdateExclusive, all on the aforementioned table, and one
> ExclusiveLock for the containing transaction.

An ExclusiveLock on the transaction is completely normal, and only used
for other sessions to be able to wait for the completion of the the
holder's transaction.

It does not entirely sound like you're suffering from locking problems.
Are there any locks that are not granted (e.g. SELECT * FROM pg_locks
WHERE NOT granted;) during that period?   If so, what commands are
using these locks, you should be able to do normal table manipulations
while vacuum is in progress.


> After sending SIGTERMs to
> all other pids enumerated in pg_locks as holding or wanting locks on that
> table, the VACUUM ANALYZE continued without apparent progress.  At the
> time, the containing process was consuming a rough average of about 6% of
> one CPU and 4.9% memory (no swapping), was holding 641 descriptors, was in
> a seeming loop of lseek, read, memcpy, and memcmp, with occasional semop
> calls.  During this process, we attempted to SIGTERM all other connections
> to ensure no stale transactions, locks, etc. were preventing the VACUUM
> ANALYZE from finishing.  Overall system load and I/O activity was quite low.

> Ultimately, we've felt forced to terminate the VACUUM ANALYZEs.  After
> sending it a SIGTERM and after about 30 seconds of seeing no change, we
> attempted a service postgresql-8.4 stop.  After 30 additional seconds, we
> sent a SIGQUIT to the VACUUM ANALYZE process (which we always try to
> avoid), which terminated it and, in turn, PostgreSQL stopped.  Thereafter,
> the restarts seemed quick and uneventful, with cleanup happening without
> reported exception.  Only once in the last month or so have we seen a
> VACUUM ANALYZE on this table complete.

Well, that'll make the problem worse and worse.  I suspect your setting
might not be appropriately aggressive to be able to collect the
accumulated cruft.  What's your maintenance_work_mem setting, and what
are your vacuum_cost_delay/vacuum_cost_limit settings?

Greetings,

Andres Freund


Re: [ADMIN] VACUUM ANALYZE Issues

От
Harold Falkmeyer
Дата:
Andres,

Thank you for your reply.  Please find my comments inline below.

On Mon, Jun 5, 2017 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-06-05 14:20:12 -0700, Harold Falkmeyer wrote:
> My firm manages a somewhat large PostgreSQL database (400'ish tables, 5
> TiB'ish on disk, 200'ish GiB pg_dumps, etc.).  We're presently running
> PostgreSQL 8.4.22 on Linux 2.6.32 with Slony-I 2.0.8 for replication.

First off, you should seriously consider upgrading. The situation has
gotten considerably better since 8.4, particularly in 9.6.  Also the 8.4
branch hasn't received bugfixed (including security fixes) for nearly
three years.

Agreed.  We plan on upgrading to at least 9.4.9, if not 9.6.x within the next several months; however, we're not quite there yet.

> Over the past several weeks, our weekly VACUUM ANALYZE job has become
> disruptive to other queries, seemingly spinning on one particular table
> (one of our largest in rows, columns, indexes, foreign keys, and general
> use), ultimately requiring that the job be terminated.  With the number of
> INSERTs, UPDATEs, and DELETEs going through our system, it's imperative
> that VACUUM ANALYZEs be completed periodically.  In the past, and given our
> usage patterns, we've opted for weekly VACUUM ANALYZEs in lieu of
> autovacuum, which is presently disabled.

That sounds like it might seriously exascerbate the problem - weekly is
really infrequent for both vacuum and analyze.
 
Agreed.  The intention for the new setup is to enable autovacuum.

> After about 60-90 minutes working on the aforementioned table, and with the
> VACUUM ANALYZE at the top of our longest running queries list, other
> operations attempting to read from or write to on that table either
> complete extremely slowly or are awaiting locks held by the VACUUM
> ANALYZE.  Sampling during the most recent disruptive period, the VACUUM
> ANALYZE pid held 20 locks: 17 RowExclusiveLocks, one AccessExclusiveLock,
> and one ShareUpdateExclusive, all on the aforementioned table, and one
> ExclusiveLock for the containing transaction.

An ExclusiveLock on the transaction is completely normal, and only used
for other sessions to be able to wait for the completion of the the
holder's transaction.

It does not entirely sound like you're suffering from locking problems.
Are there any locks that are not granted (e.g. SELECT * FROM pg_locks
WHERE NOT granted;) during that period?   If so, what commands are
using these locks, you should be able to do normal table manipulations
while vacuum is in progress.

Many other processes were listed as having ungranted locks, including many SELECT statements having ungranted AccessShareLocks for the affected relation.  In fact, the ONLY granted locks for the affected relation were the AccessExclusiveLock and ShareUpdateExclusive locks held by the VACUUM ANALYZE.  Under what circumstances does VACUUM ANALYZE attempt to acquire an AccessExclusiveLock?

> After sending SIGTERMs to
> all other pids enumerated in pg_locks as holding or wanting locks on that
> table, the VACUUM ANALYZE continued without apparent progress.  At the
> time, the containing process was consuming a rough average of about 6% of
> one CPU and 4.9% memory (no swapping), was holding 641 descriptors, was in
> a seeming loop of lseek, read, memcpy, and memcmp, with occasional semop
> calls.  During this process, we attempted to SIGTERM all other connections
> to ensure no stale transactions, locks, etc. were preventing the VACUUM
> ANALYZE from finishing.  Overall system load and I/O activity was quite low.

> Ultimately, we've felt forced to terminate the VACUUM ANALYZEs.  After
> sending it a SIGTERM and after about 30 seconds of seeing no change, we
> attempted a service postgresql-8.4 stop.  After 30 additional seconds, we
> sent a SIGQUIT to the VACUUM ANALYZE process (which we always try to
> avoid), which terminated it and, in turn, PostgreSQL stopped.  Thereafter,
> the restarts seemed quick and uneventful, with cleanup happening without
> reported exception.  Only once in the last month or so have we seen a
> VACUUM ANALYZE on this table complete.

Well, that'll make the problem worse and worse.  I suspect your setting
might not be appropriately aggressive to be able to collect the
accumulated cruft.  What's your maintenance_work_mem setting, and what
are your vacuum_cost_delay/vacuum_cost_limit settings?
 
In what way do the SIGQUITs aggravate problems?  I was under the impression that any residual artifacts left behind by a SIGQUIT (or immediate shutdown) are resolved when the database is next started!?

Our maintenance_work_mem is 16 GiB, our vacuum_cost_delay is 0 ms, and our vacuum_cost_limit is 200.  Host memory is 126 GiB.  Perhaps a significant increase to maintenance_work_mem is warranted (and safe, given that we don't presently have the risk of multiple autovacuum workers).  Do you have general rule-of-thumb recommendations for these settings (e.g., relative to other settings)?

Are there other sources I should be querying for useful information?  For example, I understand that pg_locks doesn't show all PostgreSQL locks that could cause interference (e.g., as mentioned in http://rhaas.blogspot.com/2011/03/troubleshooting-stuck-vacuums.html)!?

Greetings,

Andres Freund

Re: [ADMIN] VACUUM ANALYZE Issues

От
Harold Falkmeyer
Дата:
I'd forgotten to mention that the VACUUM VERBOSE ANALYZEs were/are consistently completing their passes over indices.  While the VACUUM ANALYZEs were disruptive, and looping as described above, the VERBOSE progress report was last indicating that "DETAIL:  N dead row versions cannot be removed yet."

Harold

On Mon, Jun 5, 2017 at 7:18 PM, Harold Falkmeyer <hfalkmeyer@gmail.com> wrote:
Andres,

Thank you for your reply.  Please find my comments inline below.

On Mon, Jun 5, 2017 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-06-05 14:20:12 -0700, Harold Falkmeyer wrote:
> My firm manages a somewhat large PostgreSQL database (400'ish tables, 5
> TiB'ish on disk, 200'ish GiB pg_dumps, etc.).  We're presently running
> PostgreSQL 8.4.22 on Linux 2.6.32 with Slony-I 2.0.8 for replication.

First off, you should seriously consider upgrading. The situation has
gotten considerably better since 8.4, particularly in 9.6.  Also the 8.4
branch hasn't received bugfixed (including security fixes) for nearly
three years.

Agreed.  We plan on upgrading to at least 9.4.9, if not 9.6.x within the next several months; however, we're not quite there yet.

> Over the past several weeks, our weekly VACUUM ANALYZE job has become
> disruptive to other queries, seemingly spinning on one particular table
> (one of our largest in rows, columns, indexes, foreign keys, and general
> use), ultimately requiring that the job be terminated.  With the number of
> INSERTs, UPDATEs, and DELETEs going through our system, it's imperative
> that VACUUM ANALYZEs be completed periodically.  In the past, and given our
> usage patterns, we've opted for weekly VACUUM ANALYZEs in lieu of
> autovacuum, which is presently disabled.

That sounds like it might seriously exascerbate the problem - weekly is
really infrequent for both vacuum and analyze.
 
Agreed.  The intention for the new setup is to enable autovacuum.

> After about 60-90 minutes working on the aforementioned table, and with the
> VACUUM ANALYZE at the top of our longest running queries list, other
> operations attempting to read from or write to on that table either
> complete extremely slowly or are awaiting locks held by the VACUUM
> ANALYZE.  Sampling during the most recent disruptive period, the VACUUM
> ANALYZE pid held 20 locks: 17 RowExclusiveLocks, one AccessExclusiveLock,
> and one ShareUpdateExclusive, all on the aforementioned table, and one
> ExclusiveLock for the containing transaction.

An ExclusiveLock on the transaction is completely normal, and only used
for other sessions to be able to wait for the completion of the the
holder's transaction.

It does not entirely sound like you're suffering from locking problems.
Are there any locks that are not granted (e.g. SELECT * FROM pg_locks
WHERE NOT granted;) during that period?   If so, what commands are
using these locks, you should be able to do normal table manipulations
while vacuum is in progress.

Many other processes were listed as having ungranted locks, including many SELECT statements having ungranted AccessShareLocks for the affected relation.  In fact, the ONLY granted locks for the affected relation were the AccessExclusiveLock and ShareUpdateExclusive locks held by the VACUUM ANALYZE.  Under what circumstances does VACUUM ANALYZE attempt to acquire an AccessExclusiveLock?

> After sending SIGTERMs to
> all other pids enumerated in pg_locks as holding or wanting locks on that
> table, the VACUUM ANALYZE continued without apparent progress.  At the
> time, the containing process was consuming a rough average of about 6% of
> one CPU and 4.9% memory (no swapping), was holding 641 descriptors, was in
> a seeming loop of lseek, read, memcpy, and memcmp, with occasional semop
> calls.  During this process, we attempted to SIGTERM all other connections
> to ensure no stale transactions, locks, etc. were preventing the VACUUM
> ANALYZE from finishing.  Overall system load and I/O activity was quite low.

> Ultimately, we've felt forced to terminate the VACUUM ANALYZEs.  After
> sending it a SIGTERM and after about 30 seconds of seeing no change, we
> attempted a service postgresql-8.4 stop.  After 30 additional seconds, we
> sent a SIGQUIT to the VACUUM ANALYZE process (which we always try to
> avoid), which terminated it and, in turn, PostgreSQL stopped.  Thereafter,
> the restarts seemed quick and uneventful, with cleanup happening without
> reported exception.  Only once in the last month or so have we seen a
> VACUUM ANALYZE on this table complete.

Well, that'll make the problem worse and worse.  I suspect your setting
might not be appropriately aggressive to be able to collect the
accumulated cruft.  What's your maintenance_work_mem setting, and what
are your vacuum_cost_delay/vacuum_cost_limit settings?
 
In what way do the SIGQUITs aggravate problems?  I was under the impression that any residual artifacts left behind by a SIGQUIT (or immediate shutdown) are resolved when the database is next started!?

Our maintenance_work_mem is 16 GiB, our vacuum_cost_delay is 0 ms, and our vacuum_cost_limit is 200.  Host memory is 126 GiB.  Perhaps a significant increase to maintenance_work_mem is warranted (and safe, given that we don't presently have the risk of multiple autovacuum workers).  Do you have general rule-of-thumb recommendations for these settings (e.g., relative to other settings)?

Are there other sources I should be querying for useful information?  For example, I understand that pg_locks doesn't show all PostgreSQL locks that could cause interference (e.g., as mentioned in http://rhaas.blogspot.com/2011/03/troubleshooting-stuck-vacuums.html)!?

Greetings,

Andres Freund


Re: [ADMIN] VACUUM ANALYZE Issues

От
Harold Falkmeyer
Дата:
What would cause a VACUUM ANALYZE to acquire an ACCESS EXCLUSIVE lock (not running as a VACUUM FULL)?

Harold Falkmeyer

On Tue, Jun 6, 2017 at 11:43 AM, Harold Falkmeyer <hfalkmeyer@gmail.com> wrote:
I'd forgotten to mention that the VACUUM VERBOSE ANALYZEs were/are consistently completing their passes over indices.  While the VACUUM ANALYZEs were disruptive, and looping as described above, the VERBOSE progress report was last indicating that "DETAIL:  N dead row versions cannot be removed yet."

Harold

On Mon, Jun 5, 2017 at 7:18 PM, Harold Falkmeyer <hfalkmeyer@gmail.com> wrote:
Andres,

Thank you for your reply.  Please find my comments inline below.

On Mon, Jun 5, 2017 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-06-05 14:20:12 -0700, Harold Falkmeyer wrote:
> My firm manages a somewhat large PostgreSQL database (400'ish tables, 5
> TiB'ish on disk, 200'ish GiB pg_dumps, etc.).  We're presently running
> PostgreSQL 8.4.22 on Linux 2.6.32 with Slony-I 2.0.8 for replication.

First off, you should seriously consider upgrading. The situation has
gotten considerably better since 8.4, particularly in 9.6.  Also the 8.4
branch hasn't received bugfixed (including security fixes) for nearly
three years.

Agreed.  We plan on upgrading to at least 9.4.9, if not 9.6.x within the next several months; however, we're not quite there yet.

> Over the past several weeks, our weekly VACUUM ANALYZE job has become
> disruptive to other queries, seemingly spinning on one particular table
> (one of our largest in rows, columns, indexes, foreign keys, and general
> use), ultimately requiring that the job be terminated.  With the number of
> INSERTs, UPDATEs, and DELETEs going through our system, it's imperative
> that VACUUM ANALYZEs be completed periodically.  In the past, and given our
> usage patterns, we've opted for weekly VACUUM ANALYZEs in lieu of
> autovacuum, which is presently disabled.

That sounds like it might seriously exascerbate the problem - weekly is
really infrequent for both vacuum and analyze.
 
Agreed.  The intention for the new setup is to enable autovacuum.

> After about 60-90 minutes working on the aforementioned table, and with the
> VACUUM ANALYZE at the top of our longest running queries list, other
> operations attempting to read from or write to on that table either
> complete extremely slowly or are awaiting locks held by the VACUUM
> ANALYZE.  Sampling during the most recent disruptive period, the VACUUM
> ANALYZE pid held 20 locks: 17 RowExclusiveLocks, one AccessExclusiveLock,
> and one ShareUpdateExclusive, all on the aforementioned table, and one
> ExclusiveLock for the containing transaction.

An ExclusiveLock on the transaction is completely normal, and only used
for other sessions to be able to wait for the completion of the the
holder's transaction.

It does not entirely sound like you're suffering from locking problems.
Are there any locks that are not granted (e.g. SELECT * FROM pg_locks
WHERE NOT granted;) during that period?   If so, what commands are
using these locks, you should be able to do normal table manipulations
while vacuum is in progress.

Many other processes were listed as having ungranted locks, including many SELECT statements having ungranted AccessShareLocks for the affected relation.  In fact, the ONLY granted locks for the affected relation were the AccessExclusiveLock and ShareUpdateExclusive locks held by the VACUUM ANALYZE.  Under what circumstances does VACUUM ANALYZE attempt to acquire an AccessExclusiveLock?

> After sending SIGTERMs to
> all other pids enumerated in pg_locks as holding or wanting locks on that
> table, the VACUUM ANALYZE continued without apparent progress.  At the
> time, the containing process was consuming a rough average of about 6% of
> one CPU and 4.9% memory (no swapping), was holding 641 descriptors, was in
> a seeming loop of lseek, read, memcpy, and memcmp, with occasional semop
> calls.  During this process, we attempted to SIGTERM all other connections
> to ensure no stale transactions, locks, etc. were preventing the VACUUM
> ANALYZE from finishing.  Overall system load and I/O activity was quite low.

> Ultimately, we've felt forced to terminate the VACUUM ANALYZEs.  After
> sending it a SIGTERM and after about 30 seconds of seeing no change, we
> attempted a service postgresql-8.4 stop.  After 30 additional seconds, we
> sent a SIGQUIT to the VACUUM ANALYZE process (which we always try to
> avoid), which terminated it and, in turn, PostgreSQL stopped.  Thereafter,
> the restarts seemed quick and uneventful, with cleanup happening without
> reported exception.  Only once in the last month or so have we seen a
> VACUUM ANALYZE on this table complete.

Well, that'll make the problem worse and worse.  I suspect your setting
might not be appropriately aggressive to be able to collect the
accumulated cruft.  What's your maintenance_work_mem setting, and what
are your vacuum_cost_delay/vacuum_cost_limit settings?
 
In what way do the SIGQUITs aggravate problems?  I was under the impression that any residual artifacts left behind by a SIGQUIT (or immediate shutdown) are resolved when the database is next started!?

Our maintenance_work_mem is 16 GiB, our vacuum_cost_delay is 0 ms, and our vacuum_cost_limit is 200.  Host memory is 126 GiB.  Perhaps a significant increase to maintenance_work_mem is warranted (and safe, given that we don't presently have the risk of multiple autovacuum workers).  Do you have general rule-of-thumb recommendations for these settings (e.g., relative to other settings)?

Are there other sources I should be querying for useful information?  For example, I understand that pg_locks doesn't show all PostgreSQL locks that could cause interference (e.g., as mentioned in http://rhaas.blogspot.com/2011/03/troubleshooting-stuck-vacuums.html)!?

Greetings,

Andres Freund



Re: [ADMIN] VACUUM ANALYZE Issues

От
Andres Freund
Дата:
Hi,

Please don't full-quote on this list.

On 2017-06-08 21:52:38 -0700, Harold Falkmeyer wrote:
> What would cause a VACUUM ANALYZE to acquire an ACCESS EXCLUSIVE lock (not
> running as a VACUUM FULL)?

It does so when truncating the relation, to remove free space at the end
of the relation. If there's a lot of free space, that can take a while.
This has been much improved in releases since 8.4...

- Andres


Re: [ADMIN] VACUUM ANALYZE Issues

От
Harold Falkmeyer
Дата:
On Thu, Jun 8, 2017 at 10:25 PM, Andres Freund <andres@anarazel.de> wrote:

It does so when truncating the relation, to remove free space at the end
of the relation. If there's a lot of free space, that can take a while.
This has been much improved in releases since 8.4.

Thank you for this information.  Is it correct to assume that the VACUUM would NOT release its AccessExclusive lock during a vacuum_cost_delay sleep?  Other than reading the PostgreSQL manual regarding Routine Vacuuming and the SQL reference for VACUUM, are there other documents or books you could recommend that would cover VACUUM internals (other than attempting to read source :-) )?

Thank you again,

Harold Falkmeyer