Discussion: PG 7.2.4 and waiting backends

PG 7.2.4 and waiting backends

From:
Rob Fielding
Date:
Hi,

We are currently experiencing a problem, apparently triggered by
vacuumdb -a -f, which results in all backends (including the vacuum
itself) becoming locked with a status of 'waiting'.

The intention is to migrate the DB from 7.2 to a later version. The
interim step has been to get up to the latest 7.2.4 release in order
to circumvent a vacuumdb corner case which was causing a FULL vacuum
to fail. The upgrade to 7.2.4 resolved this, only to present a new
issue: after a significant runtime (300 minutes), all backends
processing selects, inserts, and even the vacuum itself get stuck
'waiting'. The result is that all backends become stuck and we run
out of backends.
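
For reference, this is the kind of pg_locks query I intend to run the
next time the backends pile up, to see which lock they are all queued
behind. A minimal sketch only -- pg_locks is present in 7.2, but the
exact column names may differ slightly from later releases:

-- show lock requests that have not been granted, i.e. the backends
-- stuck 'waiting'; join pg_class to resolve the relation OID to a name
SELECT l.pid, c.relname, l.mode, l.granted
FROM pg_locks l
     LEFT JOIN pg_class c ON c.oid = l.relation
WHERE NOT l.granted;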

The reason we need to perform a full vacuum is that a problem with
one of our own subsystems handling data inserts caused the DB size to
rise from 25GB to 70GB. Unfortunately, the nature of the data means
duplicates are a valid case. We have run a script which hunts down
and removes this additional data, and it has completed, but
PostgreSQL won't free the space back to the filesystem until a full
vacuum is performed. I'd like to see where we are up to by looking at
the real space taken up by the system. After 5 hours of running
vacuum full, before the freeze, the DB is 68GB, so either we still
have lots of bogus dupes or we are nowhere near completing the
vacuum.
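
In the meantime, one rough way to see where the space is actually
going without waiting for the full vacuum to finish is to look at
relpages in pg_class. A sketch only -- relpages is just an estimate,
refreshed by VACUUM/ANALYZE, and this assumes the standard 8kB page
size:

-- largest relations (tables, indexes, toast tables) by estimated
-- on-disk size; the figure is only as fresh as the last VACUUM or
-- ANALYZE on each relation
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
ORDER BY relpages DESC
LIMIT 20;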

The system is live, so I don't have too much to go on right now, as
I've only tried the full vacuum 3 times due to the long runtime and
the effects being quite catastrophic if not caught at the right time.
I'm at the first stage of attempting to diagnose the problem, and I
understand the details I've provided aren't particularly scientific,
but I'd like to try to gain some further insight.

The final logs follow. Notice that prior to the problem, the same
pg_toast_64432 object is being processed. I'm uncertain of the
significance of this, as I don't fully understand the log output.
Interestingly, we suddenly get a lot of log recycling right after
this point (see the checkpoint_segments setting). It is after this
point that everything stops, at which point I manually perform a fast
shutdown, as indicated in the logs after my snip. Restarting the
database afterwards is fine.


2004-03-29 13:50:58 [14147]  NOTICE:  --Relation pg_toast_64432--
2004-03-29 13:50:59 [14147]  NOTICE:  Pages 975: Changed 334, reaped
275, Empty 0, New 0; Tup 4562: Vac 0, Keep/VTL 630/0, UnUsed 338, MinLen
45, MaxLen 2034; Re-using: Free/Avail. Space 376416/372248;
EndEmpty/Avail. Pages 0/673.
         CPU 0.06s/0.00u sec elapsed 1.33 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Index pg_toast_64432_idx: Pages
161; Tuples 4562: Deleted 0.
         CPU 0.01s/0.00u sec elapsed 0.76 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Rel pg_toast_64432: Pages: 975 -->
974; Tuple(s) moved: 5.
         CPU 0.01s/0.05u sec elapsed 0.07 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Index pg_toast_64432_idx: Pages
161; Tuples 4562: Deleted 5.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 13:51:00 [14147]  NOTICE:  Analyzing mail_16
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file
000000AB00000001
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file
000000AB00000002
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file
000000AB00000003
2004-03-29 13:56:42 [25752]  DEBUG:  recycled transaction log file
000000AB00000004
<snip total of 32 recycled logs>
2004-03-29 13:58:30 [17415]  ERROR:  Query was cancelled.
2004-03-29 13:59:19 [23878]  DEBUG:  fast shutdown request
2004-03-29 13:59:19 [23878]  DEBUG:  aborting any active transactions

next log:

2004-03-29 17:42:12 [26480]  DEBUG:  --Relation pg_toast_64432--
2004-03-29 17:42:13 [26480]  DEBUG:  Pages 984: Changed 10, reaped 477,
Empty 0, New 0; Tup 3977: Vac 630, Keep/VTL 0/0, UnUsed 338, MinLen 45,
MaxLen 2034; Re-using: Free/Avail. Space 1465424/1462296;
EndEmpty/Avail. Pages 0/739.
         CPU 0.05s/0.00u sec elapsed 1.62 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Index pg_toast_64432_idx: Pages
161; Tuples 3977: Deleted 630.
         CPU 0.00s/0.00u sec elapsed 0.58 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Rel pg_toast_64432: Pages: 984 -->
841; Tuple(s) moved: 542.
         CPU 0.04s/0.13u sec elapsed 0.20 sec.
2004-03-29 17:42:14 [26480]  DEBUG:  Index pg_toast_64432_idx: Pages
163; Tuples 3977: Deleted 542.
         CPU 0.00s/0.00u sec elapsed 0.00 sec.
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file
000000AD0000009E
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file
000000AD0000009F
2004-03-29 17:51:44 [16371]  DEBUG:  recycled transaction log file
000000AD000000A0
<snip total of 32 recycled logs>
2004-03-29 18:32:25 [26346]  DEBUG:  fast shutdown request
2004-03-29 18:32:25 [26346]  DEBUG:  aborting any active transactions

Pertinent config:

max_fsm_relations = 1000
max_fsm_pages = 20000
vacuum_mem = 65536
effective_cache_size = 95694
random_page_cost = 2
sort_mem=65536
max_connections = 128
shared_buffers = 15732
wal_buffers = 64 # need to determine
wal_files = 64 # range 0-64
wal_sync_method = fsync   # the default varies across platforms:
wal_debug = 0             # range 0-16


# hopefully this should see less LogFlushes per LogInsert - use more WAL
though.
commit_delay = 10000     # range 0-100000
commit_siblings = 2       # range 1-1000


checkpoint_segments = 32  # in logfile segments (16MB each), min 1
checkpoint_timeout = 600  # in seconds, range 30-3600
fsync = false # scotty!
#fsync = true

Prior to this new incident, I'd been successfully playing around with
the WAL set-up thanks to the guys on the performance list (see the
post in performance titled "[PERFORM] WAL Optimisation -
configuration and usage"). I'm still experimenting with good settings
for checkpoint_segments and checkpoint_timeout; however, the
significance of the recycling and checkpoint_segments worries me
slightly. I shall try lower values in the meantime.
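
For the record, the kind of change I have in mind is along these
lines (illustrative values only -- I haven't settled on anything
yet):

# illustrative only -- lower values so checkpoints (and the burst of
# log recycling) come in smaller, more frequent chunks
checkpoint_segments = 8   # in logfile segments (16MB each), min 1
checkpoint_timeout = 300  # in seconds, range 30-3600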

Best regards,

--

Rob Fielding
rob@dsvr.net

www.dsvr.co.uk              Development             Designer Servers Ltd

Re: PG 7.2.4 and waiting backends

From:
Ericson Smith
Date:
Why not just COPY the data out to disk, truncate the table, then COPY
it back in? We do that with some pretty large tables as well, where
vacuum full is not an option.

Think of it as having a short and predictable downtime, instead of a
very long downtime while vacuum full is running.
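
Roughly like this -- a sketch only, using your mail_16 table from the
logs as an example name; COPY to a server-side file needs superuser,
the application has to be kept off the table while this runs, and
older releases may not allow TRUNCATE inside a transaction block:

COPY mail_16 TO '/tmp/mail_16.copy';   -- dump surviving rows to a flat file
TRUNCATE TABLE mail_16;                -- throw away the bloated heap in one shot
COPY mail_16 FROM '/tmp/mail_16.copy'; -- load the rows straight back in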

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+---------------------------------+
| http://www.did-it.com | "When you have to shoot, shoot, |
| eric@did-it.com       | don't talk!             - Tuco  |
| 516-255-0500          |                                 |
+-----------------------+---------------------------------+



Rob Fielding wrote:

> We are currently experiencing a problem, apparently triggered by
> vacuumdb -a -f, which results in all backends (including the vacuum
> itself) becoming locked with a status of 'waiting'.
>
> <snip -- rest of original message quoted in full above>


Re: PG 7.2.4 and waiting backends

From:
Tom Lane
Date:
Rob Fielding <rob@dsvr.net> writes:
> The intention is to migrate the DB from 7.2 to a later version. The
> interim step has been to get up to the latest 7.2.4 release in order
> to circumvent a vacuumdb corner case which was causing a FULL vacuum
> to fail.

Hmm ... why do you feel that you have to do a VACUUM FULL before you
can dump and reload to go to a later version?  AFAICS a vacuum in such a
context is just a waste of cycles.  It's certainly not going to reduce
the size of the dump.

The problems you are describing sound like index bloat to me.  If the
real issue is you haven't got enough disk space to do the dump, I'd
suggest dropping all your indexes, after making notes so you can
recreate them after finishing the upgrade.
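
Something along these lines will capture the definitions before you
drop them -- a sketch only; check that pg_indexes exposes indexdef on
your 7.2 installation before relying on it, and "some_index" below is
just a placeholder:

-- note down the CREATE INDEX statements so they can be replayed
-- after the dump/reload
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename NOT LIKE 'pg_%';

-- then drop each user index noted above
DROP INDEX some_index;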

            regards, tom lane