Thread: Vacuum Full Analyze Stalled


Vacuum Full Analyze Stalled

From:
"Jeff Kirby"
Date:
I'm going to attempt to do my best in describing the situation... so please forgive any ignorance on my part since I am
very new to the Postgres community.

We have 2 identical Postgres databases, one on a Windows 2003 platform, and the other on a SUSe 9.3 Linux platform
(both platforms have identical hardware).  They both have the same schema, the same configuration settings (as far as I
can tell), and relatively the same amount of data (approx 32GB).  OK... here is the scoop, I started a "vacuum full
analyze verbose" on both last night... the windows box completed in about 1 1/2 hours... the Linux box however is still
chugging away this morning... and appears to be stuck on vacuuming "pg_constraint_contypid_index".  How do I know...
well I don't really... I'm inferring based on the order of the log output on the Windows box.

So the question is... has anyone experienced this problem in the past?  If so, what do you recommend to rectify the
problem?  Is it normal to be running for more than 12 hours on a table that only has 35 rows?  Do you need me to attach
the log output thus far from both the Windows and Linux box?  Any help here would be greatly appreciated.  Any commands
you can suggest to run on the Linux machine... no problem.

Jeffrey Kirby
CCAP Web Team
jeff.kirby@wicourts.gov
608-264-6253


Re: Vacuum Full Analyze Stalled

From:
Tom Lane
Date:
"Jeff Kirby" <Jeff.Kirby@wicourts.gov> writes:
> the Linux box however is still chugging away this morning... and
> appears to be stuck on vacuuming "pg_constraint_contypid_index".  How
> do I know... well I don't really... I'm inferring based on the order
> of the log output on the Windows box.

Looking in pg_locks would give you a more reliable indicator of what the
VACUUM is currently working on.
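
For example, something along these lines (untested, written against the 8.1 catalogs; swap your own database name in for "dtr") resolves the relation OIDs in pg_locks into names, so you can see exactly what each backend, including the VACUUM, holds or is waiting on:

psql -d dtr -c "
SELECT l.pid, c.relname, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
ORDER BY l.pid;"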

Is the Linux box otherwise idle?  There was another report recently of a
vacuum hung up for a long time on pg_constraint_contypid_index, but it
seemed to be due to extremely heavy usage of domain types ...

            regards, tom lane

Re: Vacuum Full Analyze Stalled

From:
"Tomeh, Husam"
Date:
I'd be anxious to see any insightful help on this as I encountered this
four weeks ago (although I'm not sure at what stage it got stuck), and
had to cancel and re-run.


--
 Husam


-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Jeff Kirby
Sent: Sunday, October 02, 2005 9:43 AM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] Vacuum Full Analyze Stalled

I'm going to attempt to do my best in describing the situation... so
please forgive any ignorance on my part since I am very new to the
Postgres community.

We have 2 identical Postgres databases, one on a Windows 2003 platform,
and the other on a SUSe 9.3 Linux platform (both platforms have
identical hardware).  They both have the same schema, the same
configuration settings (as far as I can tell), and relatively the same
amount of data (approx 32GB).  OK... here is the scoop, I started a
"vacuum full analyze verbose" on both last night... the windows box
completed in about 1 1/2 hours... the Linux box however is still
chugging away this morning... and appears to be stuck on vacuuming
"pg_constraint_contypid_index".  How do I know... well I don't really...
I'm inferring based on the order of the log output on the Windows box.

So the question is... has anyone experienced this problem in the past?
If so, what do you recommend to rectify the problem?  Is it normal to be
running for more than 12 hours on a table that only has 35 rows?  Do you
need me to attach the log output thus far from both the Windows and
Linux box?  Any help here would be greatly appreciated.  Any commands
you can suggest to run on the Linux machine... no problem.

Jeffrey Kirby
CCAP Web Team
jeff.kirby@wicourts.gov
608-264-6253





Re: Vacuum Full Analyze Stalled

From:
"Uwe C. Schroeder"
Date:
On the Linux box, check whether any application left an "idle in transaction" backend
around.
just
ps -ef | grep transaction

if that shows any postgres process then you most likely found the culprit.
An application that's still running can have a bug that starts a transaction
which never finishes for some reason (commit or rollback is never called).
I had a couple of those errors in my app - always leading to a stuck vacuum
full. If that's the case just kill the backend in question and the vacuum will
continue. Don't kill -9 it!!
After that try to figure out what caused the application not to commit or
rollback a transaction. It's usually an error handler of some sort.
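
A minimal version of that check and cleanup, following the advice above (the pid shown is only an example -- use whatever the grep turns up):

# look for backends sitting in an open, unfinished transaction
ps -ef | grep "idle in transaction"

# if one shows up, terminate just that backend with a plain SIGTERM
kill 12345        # example pid only -- and never kill -9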

just an idea.

UC

On Sunday 02 October 2005 09:42, Jeff Kirby wrote:
> I'm going to attempt to do my best in describing the situation... so please
> forgive any ignorance on my part since I am very new to the Postgres
> community.
>
> We have 2 identical Postgres databases, one on a Windows 2003 platform, and
> the other on a SUSe 9.3 Linux platform (both platforms have identical
> hardware).  They both have the same schema, the same configuration settings
> (as far as I can tell), and relatively the same amount of data (approx
> 32GB).  OK... here is the scoop, I started a "vacuum full analyze verbose"
> on both last night... the windows box completed in about 1 1/2 hours... the
> Linux box however is still chugging away this morning... and appears to be
> stuck on vacuuming "pg_constraint_contypid_index".  How do I know... well I
> don't really... I'm inferring based on the order of the log output on the
> Windows box.
>
> So the question is... has anyone experienced this problem in the past?  If
> so, what do you recommend to rectify the problem?  Is it normal to be
> running for more than 12 hours on a table that only has 35 rows?  Do you
> need me to attach the log output thus far from both the Windows and Linux
> box?  Any help here would be greatly appreciated.  Any commands you can
> suggest to run on the Linux machine... no problem.
>
> Jeffrey Kirby
> CCAP Web Team
> jeff.kirby@wicourts.gov
> 608-264-6253
>
>


--
Open Source Solutions 4U, LLC    2570 Fleetwood Drive
Phone:  +1 650 872 2425        San Bruno, CA 94066
Cell:   +1 650 302 2405        United States
Fax:    +1 650 872 2417

Re: Vacuum Full Analyze Stalled

From:
"Kevin Grittner"
Date:
The previous report was from the same databases -- I'm a consultant
tasked with evaluating the various open source options and making
one of them work with our existing framework.  Jeff developed a new
app (which is in beta testing) which is our first use of PostgreSQL
with real production load for prolonged periods of time.

Jeff was monitoring the app over the weekend, and posted here when
he had problems.  Unfortunately, he didn't capture enough information
to diagnose much, but I'm passing along the sequence of events
since they led up to a failure this morning for which we have more
detail.

Saturday, 9:00 a.m. -- Noticed performance beginning to degrade.
Saturday, 1:00 p.m. -- After continued degradation, stops client app,
   and runs vacuum full analyze on both Windows and Linux boxes.
Saturday, 3:00 p.m. -- Windows box completes the vacuum.
Sunday, 1:00 p.m. -- Vacuum on Linux still not done.  CPU on Linux
   box is completely idle.  Killed the vacuum process.  Ran vacuum
   full analyze on each table individually.  All fine except for the 72 row
   table with very frequent updates.  Left this one running.
Sunday, 9:00 p.m. -- Vacuum of 72 row table still running.  Killed all
   processes and restarted postgres.  Started a vacuum full analyze
   of the 72 row table again.
Sunday, 11:00 p.m. -- Vacuum of 72 row table finishes.  Started
   applications, which had been stopped since Saturday, 1:00 p.m.
Monday, 9:20 a.m. -- Performance is beginning to degrade again.
   Detailed information from this point below.

dtr=# select * from pg_locks order by pid;
   locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid  |           mode           | granted
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
 transactionid |          |          |      |       |      36430565 |         |       |          |    36430565 | 2771 | ExclusiveLock            | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771 | AccessShareLock          | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771 | RowExclusiveLock         | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771 | AccessShareLock          | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771 | RowExclusiveLock         | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777 | AccessShareLock          | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777 | RowExclusiveLock         | t
 transactionid |          |          |      |       |      36430561 |         |       |          |    36430561 | 2777 | ExclusiveLock            | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777 | AccessShareLock          | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777 | RowExclusiveLock         | t
 transactionid |          |          |      |       |      36430563 |         |       |          |    36430563 | 2783 | ExclusiveLock            | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783 | AccessShareLock          | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783 | RowExclusiveLock         | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783 | AccessShareLock          | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783 | RowExclusiveLock         | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790 | AccessShareLock          | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790 | RowExclusiveLock         | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790 | AccessShareLock          | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790 | RowExclusiveLock         | t
 transactionid |          |          |      |       |      36430559 |         |       |          |    36430559 | 2790 | ExclusiveLock            | t
 relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
 transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158 | ExclusiveLock            | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
 relation      |    30793 |    10340 |      |       |               |         |       |          |    36430567 | 4238 | AccessShareLock          | t
 transactionid |          |          |      |       |      36430567 |         |       |          |    36430567 | 4238 | ExclusiveLock            | t
(25 rows)

dtr=# \q
postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
postgres 28937  0.0  0.0  2436  992 pts/1    S    Sep30   0:00 su postgres
postgres 28938  0.0  0.0  2944 1620 pts/1    S    Sep30   0:00 bash
postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:08 postgres: writer process
postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:27 postgres: stats buffer process
postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:22 postgres: stats collector process
postgres  2724  0.0  0.0 171828 7156 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32986) idle
postgres  2771  6.4  2.0 171996 167176 ?     S    Oct02  44:33 postgres: dtr dtr 127.0.0.1(32990) idle
postgres  2773  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32991) idle
postgres  2774  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32992) idle
postgres  2775  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32993) idle
postgres  2776  0.0  0.0 171720 6848 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32994) idle
postgres  2777  6.4  2.0 171996 167228 ?     R    Oct02  44:29 postgres: dtr dtr 127.0.0.1(32995) UPDATE
postgres  2778  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32996) idle
postgres  2779  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32997) idle
postgres  2780  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32998) idle
postgres  2781  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32999) idle
postgres  2782  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33000) idle
postgres  2783  6.4  2.0 172016 167292 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33001) idle
postgres  2784  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33002) idle
postgres  2785  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33003) idle
postgres  2786  6.4  2.0 172000 167216 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33004) idle
postgres  2787  6.4  2.0 172000 167236 ?     R    Oct02  44:33 postgres: dtr dtr 127.0.0.1(33005) UPDATE
postgres  2788  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33006) idle
postgres  2789  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33007) idle
postgres  2790  6.4  2.0 171992 167280 ?     R    Oct02  44:27 postgres: dtr dtr 127.0.0.1(33008) UPDATE
postgres  2791  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33009) idle
postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
postgres  4289  0.0  0.0  2148  672 pts/1    R+   09:21   0:00 ps aux
postgres  4290  0.0  0.0  1688  688 pts/1    R+   09:21   0:00 grep ^postgres

oid 30871 is the 72 row table.
oid 30976 is the index for its primary key.

We stopped the applications, and the autovacuum process didn't
wake up.

dtr=# select * from pg_locks order by pid;
   locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid  |           mode           | granted
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
 relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
 transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158 | ExclusiveLock            | t
 relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
 relation      |    30793 |    10340 |      |       |               |         |       |          |    36600517 | 4498 | AccessShareLock          | t
 transactionid |          |          |      |       |      36600517 |         |       |          |    36600517 | 4498 | ExclusiveLock            | t
(5 rows)

dtr=# \q
postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:10 postgres: writer process
postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:32 postgres: stats buffer process
postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:26 postgres: stats collector process
postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
postgres  4481  0.0  0.0  2432  988 pts/1    S    10:48   0:00 su postgres
postgres  4482  0.0  0.0  2944 1604 pts/1    S    10:48   0:00 bash
postgres  4499  0.0  0.0  2148  672 pts/1    R+   10:49   0:00 ps aux
postgres  4500  0.0  0.0  1680  636 pts/1    S+   10:49   0:00 grep ^postgres
postgres@linpost:/var/pgsql/data> uptime
 10:48am  up 6 days 19:20,  2 users,  load average: 0.00, 0.28, 0.73
postgres@linpost:/var/pgsql/data> vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      8 278376  62676 7770684    0    0    15     7   17    18 11  1 87  1
 0  0      8 278376  62700 7770660    0    0     0    11 1035     0  1  0 99  0
 0  0      8 278440  62700 7770660    0    0     0     0 1013     0  0  0 100  0
 0  0      8 278440  62700 7770660    0    0     0     0 1016     0  1  0 99  0
 0  0      8 278440  62700 7770660    0    0     0     4 1018     0  1  0 99  0

We found the 72 row table somewhat bloated, so we cleaned it up.

dtr=# vacuum analyze verbose "DbTranImageStatus";
INFO:  vacuuming "public.DbTranImageStatus"
INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6665 pages
DETAIL:  6555 index pages have been deleted, 6555 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 479137 unused item pointers.
0 pages are entirely empty.
CPU 0.01s/0.01u sec elapsed 0.02 sec.
INFO:  vacuuming "pg_toast.pg_toast_272793"
INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.DbTranImageStatus"
INFO:  "DbTranImageStatus": scanned 3000 of 4244 pages, containing 68 live rows and 0 dead rows; 68 rows in sample, 96 estimated total rows
VACUUM
dtr=# vacuum full analyze verbose "DbTranImageStatus";
INFO:  vacuuming "public.DbTranImageStatus"
INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 68 to 68 bytes long.
There were 479137 unused item pointers.
Total free space (including removable row versions) is 32760236 bytes.
4227 pages are or will become empty, including 0 at the end of the table.
4244 pages containing 32760236 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
DETAIL:  0 index row versions were removed.
6544 index pages have been deleted, 6544 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "DbTranImageStatus": moved 22 row versions, truncated 4244 to 1 pages
DETAIL:  CPU 0.00s/0.02u sec elapsed 0.03 sec.
INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
DETAIL:  22 index row versions were removed.
6544 index pages have been deleted, 6544 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_272793"
INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.DbTranImageStatus"
INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
VACUUM
dtr=# reindex table "DbTranImageStatus";
REINDEX
dtr=# vacuum analyze verbose "DbTranImageStatus";
INFO:  vacuuming "public.DbTranImageStatus"
INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 2 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 48 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_272793"
INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.DbTranImageStatus"
INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
VACUUM

These all ran sub-second.  We updated postgresql.conf for more
aggressive autovacuum, and restarted postgres, then restarted the
application.
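
For reference, "more aggressive" on 8.1 generally means enabling autovacuum and lowering its thresholds in postgresql.conf, roughly along these lines (illustrative values only, not necessarily the exact settings used here; adjust the path to your data directory):

cat >> "$PGDATA/postgresql.conf" <<'EOF'
# stats collection must be on for autovacuum to work in 8.1
stats_start_collector = on
stats_row_level = on
autovacuum = on
autovacuum_naptime = 60                  # seconds between autovacuum cycles
autovacuum_vacuum_threshold = 200        # well below the 8.1 default of 1000
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_threshold = 100
autovacuum_analyze_scale_factor = 0.1
EOF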

This failure seems rather different from the one I previously posted,
since it was blocking on the application table, rather than
pg_constraint_contypid_index, and it did not wake up when all
other processes were stopped.

As before, both boxes are running 8.1beta2.  Windows is not showing
these problems with the autovacuum blocking; just Linux.

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 10/02/05 8:53 PM >>>
"Jeff Kirby" <Jeff.Kirby@wicourts.gov> writes:
> the Linux box however is still chugging away this morning... and
> appears to be stuck on vacuuming "pg_constraint_contypid_index".  How
> do I know... well I don't really... I'm inferring based on the order
> of the log output on the Windows box.

Looking in pg_locks would give you a more reliable indicator of what the
VACUUM is currently working on.

Is the Linux box otherwise idle?  There was another report recently of a
vacuum hung up for a long time on pg_constraint_contypid_index, but it
seemed to be due to extremely heavy usage of domain types ...

            regards, tom lane



Re: Vacuum Full Analyze Stalled

From:
"Kevin Grittner"
Date:
In my previous post I failed to mention that after we stopped the
applications and found that autovacuum remained idle for several
minutes, we restarted postgres before cleaning up the bloat on
the problem table.

Also, the log shows that autovacuum stopped kicking in after
4:43 p.m. on Friday.  It was at about this point that I was
dropping and creating indexes through psql.  I did not use
BEGIN TRANSACTION, but I went home with an index being
built on a large table, so that connection remained open until
Sunday at 9:00 p.m.  I don't know if that matters, but I'm trying
to be thorough.

-Kevin




Re: Vacuum Full Analyze Stalled

From:
"Jim C. Nasby"
Date:
I haven't looked in detail at autovacuum in 8.1 yet, but I know that in
previous versions it was a bad idea to depend on it to vacuum a small
table that has a lot of update activity frequently enough. The issue is
that since it's single-threaded if it starts a vacuum on a large table
it could be hours before it gets around to vacuuming the small table
again.

In a case like this, your best bet is to either have the application
periodically vacuum the small table or just do it once a minute via
cron. It should take next to no time to vacuum a 72 row table, so
running vacuum once a minute shouldn't really present much overhead when
the vacuum isn't needed.
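
A crude version of the cron approach, using the database and table names from earlier in this thread (adjust the psql path, and put it in the postgres user's crontab):

# m h dom mon dow  command
* * * * * /usr/local/pgsql/bin/psql -d dtr -c 'VACUUM ANALYZE "DbTranImageStatus";' >/dev/null 2>&1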

On Mon, Oct 03, 2005 at 02:09:37PM -0500, Kevin Grittner wrote:
> The previous report was from the same databases -- I'm a consultant
> tasked with evaluating the various open source options and making
> one of them work with our existing framework.  Jeff developed a new
> app (which is in beta testing) which is our first use of PostgreSQL
> with real production load for prolonged periods of time.
>
> Jeff was monitoring the app over the weekend, and posted here when
> he had problems.  Unfortunately, he didn't capture enough information
> to diagnose much, but I'm passing along the sequence of events
> since they lead up to a failure this morning for which we have more
> detail.
>
> Saturday, 9:00 a.m. -- Noticed performace beginning to degrade.
> Saturday, 1:00 p.m. -- After continued degradation, stops client app,
>    and runs vacuum full analyze on both Windows and Linux boxes.
> Saturday, 3:00 p.m. -- Windows box completes the vacuum.
> Sunday, 1:00 p.m. -- Vacuum on Linux still not done.  CPU on Linux
>    box is completely idle.  Killed the vacuum process.  Ran vacuum
>    full analyze on each table individually.  All fine except for the 72 row
>    table with very frequent updates.  Left this one running.
> Sunday, 9:00 p.m. -- Vacuum of 72 row table still running.  Killed all
>    processes and restarted postgres.  Started a vacuum full analyze
>    of the 72 row table again.
> Sunday, 11:00 p.m. -- Vacuum of 72 row table finishes.  Started
>    applications, which had been stopped since Saturday, 1:00 p.m.
> Monday, 9:20 a.m. -- Performance is beginning to degrade again.
>    Detailed information from this point below.
>
> dtr=# select * from pg_locks order by pid;
>    locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid
|          mode           | granted 
>
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
>  transactionid |          |          |      |       |      36430565 |         |       |          |    36430565 | 2771
|ExclusiveLock            | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771
|AccessShareLock          | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771
|RowExclusiveLock         | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771
|AccessShareLock          | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771
|RowExclusiveLock         | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777
|AccessShareLock          | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777
|RowExclusiveLock         | t 
>  transactionid |          |          |      |       |      36430561 |         |       |          |    36430561 | 2777
|ExclusiveLock            | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777
|AccessShareLock          | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777
|RowExclusiveLock         | t 
>  transactionid |          |          |      |       |      36430563 |         |       |          |    36430563 | 2783
|ExclusiveLock            | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783
|AccessShareLock          | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783
|RowExclusiveLock         | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783
|AccessShareLock          | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783
|RowExclusiveLock         | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790
|AccessShareLock          | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790
|RowExclusiveLock         | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790
|AccessShareLock          | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790
|RowExclusiveLock         | t 
>  transactionid |          |          |      |       |      36430559 |         |       |          |    36430559 | 2790
|ExclusiveLock            | t 
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158
|ShareUpdateExclusiveLock | t 
>  transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158
|ExclusiveLock            | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158
|ShareUpdateExclusiveLock | t 
>  relation      |    30793 |    10340 |      |       |               |         |       |          |    36430567 | 4238
|AccessShareLock          | t 
>  transactionid |          |          |      |       |      36430567 |         |       |          |    36430567 | 4238
|ExclusiveLock            | t 
> (25 rows)
>
> dtr=# \q
> postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres 28937  0.0  0.0  2436  992 pts/1    S    Sep30   0:00 su postgres
> postgres 28938  0.0  0.0  2944 1620 pts/1    S    Sep30   0:00 bash
> postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
> postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:08 postgres: writer process
> postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:27 postgres: stats buffer process
> postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:22 postgres: stats collector process
> postgres  2724  0.0  0.0 171828 7156 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32986) idle
> postgres  2771  6.4  2.0 171996 167176 ?     S    Oct02  44:33 postgres: dtr dtr 127.0.0.1(32990) idle
> postgres  2773  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32991) idle
> postgres  2774  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32992) idle
> postgres  2775  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32993) idle
> postgres  2776  0.0  0.0 171720 6848 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32994) idle
> postgres  2777  6.4  2.0 171996 167228 ?     R    Oct02  44:29 postgres: dtr dtr 127.0.0.1(32995) UPDATE
> postgres  2778  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32996) idle
> postgres  2779  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32997) idle
> postgres  2780  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32998) idle
> postgres  2781  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32999) idle
> postgres  2782  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33000) idle
> postgres  2783  6.4  2.0 172016 167292 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33001) idle
> postgres  2784  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33002) idle
> postgres  2785  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33003) idle
> postgres  2786  6.4  2.0 172000 167216 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33004) idle
> postgres  2787  6.4  2.0 172000 167236 ?     R    Oct02  44:33 postgres: dtr dtr 127.0.0.1(33005) UPDATE
> postgres  2788  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33006) idle
> postgres  2789  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33007) idle
> postgres  2790  6.4  2.0 171992 167280 ?     R    Oct02  44:27 postgres: dtr dtr 127.0.0.1(33008) UPDATE
> postgres  2791  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33009) idle
> postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
> postgres  4289  0.0  0.0  2148  672 pts/1    R+   09:21   0:00 ps aux
> postgres  4290  0.0  0.0  1688  688 pts/1    R+   09:21   0:00 grep ^postgres
>
> oid 30871 is the 72 row table.
> oid 30976 is the index for its primary key.
>
> We stopped the applications, and the autovacuum process didn't
> wake up.
>
> dtr=# select * from pg_locks order by pid;
>    locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid
|          mode           | granted 
>
---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158
|ShareUpdateExclusiveLock | t 
>  transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158
|ExclusiveLock            | t 
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158
|ShareUpdateExclusiveLock | t 
>  relation      |    30793 |    10340 |      |       |               |         |       |          |    36600517 | 4498
|AccessShareLock          | t 
>  transactionid |          |          |      |       |      36600517 |         |       |          |    36600517 | 4498
|ExclusiveLock            | t 
> (5 rows)
>
> dtr=# \q
> postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
> postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:10 postgres: writer process
> postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:32 postgres: stats buffer process
> postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:26 postgres: stats collector process
> postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
> postgres  4481  0.0  0.0  2432  988 pts/1    S    10:48   0:00 su postgres
> postgres  4482  0.0  0.0  2944 1604 pts/1    S    10:48   0:00 bash
> postgres  4499  0.0  0.0  2148  672 pts/1    R+   10:49   0:00 ps aux
> postgres  4500  0.0  0.0  1680  636 pts/1    S+   10:49   0:00 grep ^postgres
> postgres@linpost:/var/pgsql/data> uptime
>  10:48am  up 6 days 19:20,  2 users,  load average: 0.00, 0.28, 0.73
> postgres@linpost:/var/pgsql/data> vmstat 10
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  0      8 278376  62676 7770684    0    0    15     7   17    18 11  1 87  1
>  0  0      8 278376  62700 7770660    0    0     0    11 1035     0  1  0 99  0
>  0  0      8 278440  62700 7770660    0    0     0     0 1013     0  0  0 100  0
>  0  0      8 278440  62700 7770660    0    0     0     0 1016     0  1  0 99  0
>  0  0      8 278440  62700 7770660    0    0     0     4 1018     0  1  0 99  0
>
> We found the 72 row table somewhat bloated, so we cleaned it up.
>
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6665 pages
> DETAIL:  6555 index pages have been deleted, 6555 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.01 sec.
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 479137 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.01s/0.01u sec elapsed 0.02 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 3000 of 4244 pages, containing 68 live rows and 0 dead rows; 68 rows in sample, 96 estimated total rows
> VACUUM
> dtr=# vacuum full analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 68 to 68 bytes long.
> There were 479137 unused item pointers.
> Total free space (including removable row versions) is 32760236 bytes.
> 4227 pages are or will become empty, including 0 at the end of the table.
> 4244 pages containing 32760236 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL:  0 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "DbTranImageStatus": moved 22 row versions, truncated 4244 to 1 pages
> DETAIL:  CPU 0.00s/0.02u sec elapsed 0.03 sec.
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL:  22 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 0 to 0 bytes long.
> There were 0 unused item pointers.
> Total free space (including removable row versions) is 0 bytes.
> 0 pages are or will become empty, including 0 at the end of the table.
> 0 pages containing 0 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
> VACUUM
> dtr=# reindex table "DbTranImageStatus";
> REINDEX
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 2 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 1 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 48 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
> VACUUM
>
> These all ran sub-second.  We updated postgresql.conf for more
> aggressive autovacuum, and restarted postgres, then restarted the
> application.
>
> This failure seems rather different from the one I previously posted,
> since it was blocking on the application table, rather than
> pg_constraint_contypid_index, and it did not wake up when all
> other processes where stopped.
>
> As before, both boxes are running 8.1beta2.  Windows is not showing
> these problems with the autovacuum blocking; just Linux.
>
> -Kevin
>
>
> >>> Tom Lane <tgl@sss.pgh.pa.us> 10/02/05 8:53 PM >>>
> "Jeff Kirby" <Jeff.Kirby@wicourts.gov> writes:
> > the Linux box however is still chugging away this morning... and
> > appears to be stuck on vacuuming "pg_constraint_contypid_index".  How
> > do I know... well I don't really... I'm inferring based on the order
> > of the log output on the Windows box.
>
> Looking in pg_locks would give you a more reliable indicator of what the
> VACUUM is currently working on.
>
> Is the Linux box otherwise idle?  There was another report recently of a
> vacuum hung up for a long time on pg_constraint_contypid_index, but it
> seemed to be due to extremely heavy usage of domain types ...
>
>             regards, tom lane
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Vacuum Full Analyze Stalled

From:
Alvaro Herrera
Date:
On Mon, Oct 03, 2005 at 02:41:14PM -0500, Jim C. Nasby wrote:
> I haven't looked in detail at autovacuum in 8.1 yet, but I know that in
> previous versions it was a bad idea to depend on it to vacuum a small
> table that has a lot of update activity frequently enough. The issue is
> that since it's single-threaded if it starts a vacuum on a large table
> it could be hours before it gets around to vacuuming the small table
> again.

Does it really make a lot of difference?  While the big vacuum is
running, old tuples accumulate on the small table, because it shows as a
running transaction and the small vacuum won't delete them.  The new
autovacuum is no different than the old one in this regard.  (There's a
patch by Hannu Krosing IIRC that, if included in 8.2, will make things
better.)

However, one thing that is different from old autovacuum is that you can
set better thresholds for small tables.  You don't have to wait until a
thousand tuples have accumulated -- you can change the threshold by
inserting a tuple in pg_autovacuum.  In the 72-tuple case, I'd try
setting the vacuum threshold to 100 and see if it improves matters.
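
For the 72-row table here, that would be something like this (column list recalled from the 8.1 pg_autovacuum catalog -- check it with \d pg_autovacuum -- and -1 in the cost columns means "use the server default"):

psql -d dtr <<'SQL'
INSERT INTO pg_autovacuum
       (vacrelid, enabled,
        vac_base_thresh, vac_scale_factor,
        anl_base_thresh, anl_scale_factor,
        vac_cost_delay, vac_cost_limit)
VALUES ('"DbTranImageStatus"'::regclass, true,
        100, 0.2,
        100, 0.1,
        -1, -1);
SQL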


However, I'm looking at the autovacuum code to see why it's sitting
holding locks on the small table and not vacuuming it.  I see on the
pg_locks output that process 3158 (autovacuum) has got locks on the
table and index, but it apparently isn't vacuuming the table.  If this
is correct, it's a bug.  However I can't seem to find out why this
happens.

Kevin, Jeff, next time this happens please attach gdb to the autovacuum
process and get a stack trace ("bt" to gdb), if at all possible, and/or
strace it to see what it's doing.
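
Roughly, with pid 3158 being the autovacuum backend from the earlier ps listing (substitute whichever backend is stuck):

gdb /usr/local/pgsql/bin/postmaster 3158    # then at the (gdb) prompt: bt, detach, quit
strace -p 3158                              # shows the syscall it is blocked in; Ctrl-C to stop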

--
Alvaro Herrera                                http://www.PlanetPostgreSQL.org
"I personally became interested in Linux while I was dating an English major
who wouldn't know an operating system if it walked up and bit him."
(Val Henson)

Re: Vacuum Full Analyze Stalled

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> However, I'm looking at the autovacuum code to see why it's sitting
> holding locks on the small table and not vacuuming it.  I see on the
> pg_locks output that process 3158 (autovacuum) has got locks on the
> table and index, but it apparently isn't vacuuming the table.  If this
> is correct, it's a bug.  However I can't seem to find out why this
> happens.

We can see clearly from the pg_locks output that VACUUM isn't waiting
for an lmgr lock, so the problem must be at a lower level.  The
hypothesis I'm thinking about is that VACUUM is trying to do
LockBufferForCleanup() and for some reason it never finishes.  There are
a number of possible scenarios that could explain this: leaked buffer
pin, dropped signal, etc.

> Kevin, Jeff, next time this happens please attach gdb to the autovacuum
> process and get a stack trace ("bt" to gdb), if at all possible, and/or
> strace it to see what it's doing.

Please!

Also, we need to keep a little clarity about what we are dealing with.
This thread has mentioned hangups in both plain vacuum (autovacuum) and
VACUUM FULL.  It seems very likely to me that there are different
mechanisms involved --- since VACUUM FULL takes an exclusive lock on the
whole table, that eliminates an entire class of possible explanations
for the plain-VACUUM case, while introducing a whole new set of
explanations having to do with the VACUUM being queued up behind
ordinary table locks.  Please be perfectly clear about which scenario
each report is about.

Finally, I'm wondering whether this bug is new in 8.1 or is
pre-existing.  Has this same application been running successfully
in 8.0?

            regards, tom lane

Re: Vacuum Full Analyze Stalled

From:
Simon Riggs
Date:
On Mon, 2005-10-03 at 16:48 -0400, Tom Lane wrote:
> Finally, I'm wondering whether this bug is new in 8.1 or is
> pre-existing.  Has this same application been running successfully
> in 8.0?

My money would be on a pre-existing bug. I recall two cases of
unexplained VACUUM hang in the last year.

Best Regards, Simon Riggs


Re: Vacuum Full Analyze Stalled

From:
"Jim C. Nasby"
Date:
On Mon, Oct 03, 2005 at 04:12:25PM -0400, Alvaro Herrera wrote:
> On Mon, Oct 03, 2005 at 02:41:14PM -0500, Jim C. Nasby wrote:
> > I haven't looked in detail at autovacuum in 8.1 yet, but I know that in
> > previous versions it was a bad idea to depend on it to vacuum a small
> > table that has a lot of update activity frequently enough. The issue is
> > that since it's single-threaded if it starts a vacuum on a large table
> > it could be hours before it gets around to vacuuming the small table
> > again.
>
> Does it really make a lot of difference?  While the big vacuum is
> running, old tuples accumulate on the small table, because it shows as a
> running transaction and the small vacuum won't delete them.  The new
> autovacuum is no different than the old one in this regard.  (There's a
> patch by Hannu Krossing IIRC that, if included in 8.2, will make things
> better.)

Hrm, http://archives.postgresql.org/pgsql-patches/2005-07/msg00086.php
indicates that multiple transactions are used during vacuum (see item
1). If that's the case, it is still better to run a stand-alone vacuum
if there are tables being autovacuumed that are large and have indexes.

In any case, Tom's got some serious objections to that patch, so it
might never make it in. I'd like to see an item on the TODO so this
doesn't slip through the cracks. It's also possible that a simpler,
cleaner version of this would be to have vacuum do its work in batches
rather than messing around with visibility code (which I agree is
dangerous).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Vacuum Full Analyze Stalled

From:
"Kevin Grittner"
Date:
We will use gdb and strace the next time we see this.

I've tried to be specific about which vacuum is running in all cases.  If
the posts have been confusing on that issue, I apologize.  I'll try to be
clear on this in future posts.

To summarize past events, the case involving the constraint index
was indeed a "vacuum full" of the entire database under heavy load.
Autovacuum failed to keep the small, high-update table clean in that
scenario, but I am not sure whether that caused the failure of the
vacuum full, or was the result of it.  This weekend, it seemed like the
first thing which failed (and the last) were autovacuum attempts.
Vacuum full was run through psql during attempts to recover
performance after the failure of autovacuum caused performance
to slow noticeably.  We didn't capture info which would tell us whether
the explicit vacuum was blocked by an autovacuum process.

There were a few very small single-source tests under 8.0.3, but all
tests involving any significant load were under 8.1beta1 or 8.1beta2.
We did not see this in any of those small tests under 8.0.3.

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 10/03/05 3:48 PM >>>
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> However, I'm looking at the autovacuum code to see why it's sitting
> holding locks on the small table and not vacuuming it.  I see on the
> pg_locks output that process 3158 (autovacuum) has got locks on the
> table and index, but it apparently isn't vacuuming the table.  If this
> is correct, it's a bug.  However I can't seem to find out why this
> happens.

We can see clearly from the pg_locks output that VACUUM isn't waiting
for an lmgr lock, so the problem must be at a lower level.  The
hypothesis I'm thinking about is that VACUUM is trying to do
LockBufferForCleanup() and for some reason it never finishes.  There are
a number of possible scenarios that could explain this: leaked buffer
pin, dropped signal, etc.

> Kevin, Jeff, next time this happens please attach gdb to the autovacuum
> process and get a stack trace ("bt" to gdb), if at all possible, and/or
> strace it to see what it's doing.

Please!

Also, we need to keep a little clarity about what we are dealing with.
This thread has mentioned hangups in both plain vacuum (autovacuum) and
VACUUM FULL.  It seems very likely to me that there are different
mechanisms involved --- since VACUUM FULL takes an exclusive lock on the
whole table, that eliminates an entire class of possible explanations
for the plain-VACUUM case, while introducing a whole new set of
explanations having to do with the VACUUM being queued up behind
ordinary table locks.  Please be perfectly clear about which scenario
each report is about.

Finally, I'm wondering whether this bug is new in 8.1 or is
pre-existing.  Has this same application been running successfully
in 8.0?

            regards, tom lane


Re: Vacuum Full Analyze Stalled

From:
"Jim C. Nasby"
Date:
On Mon, Oct 03, 2005 at 04:37:17PM -0500, Kevin Grittner wrote:
> We will use gdb and strace the next time we see this.
>
> I've tried to be specific about which vacuum is running in all cases.  If
> the posts have been confusing on that issue, I apologize.  I'll try to be
> clear on this in future posts.
>
> To summarize past events, the case involving the constraint index
> was indeed a "vacuum full" of the entire database under heavy load.
> Autovacuum failed to keep the small, high-update table clean in that
> scenario, but I am not sure whether that caused the failure of the
> vacuum full, or was the result of it.  This weekend, it seemed like the
> first thing which failed (and the last) were autovacuum attempts.
> Vacuum full was run through psql during attempts to recover
> performance after the failure of autovacuum caused performance
> to slow noticably.  We didn't capture info which would tell us whether
> the explicit vacuum was blocked by an autovacuum process.

Keep in mind that vacuum full is *very* aggressive for use in a
production environment. It acquires exclusive locks on tables, which
means everything else will grind to a complete halt while it's running.
Unless you have a very specific reason to use vacuum full, you should
just use plain vacuum (not related to autovacuum). If you are going to
vacuum full, you should consider using the cluster command which has
some added benefits.
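
On 8.1 the cluster form of that cleanup would be something like the following, using the table and primary-key index names from earlier in the thread (note that CLUSTER also takes an exclusive lock while it rewrites the table, so schedule it accordingly):

psql -d dtr -c 'CLUSTER "DbTranImageStatusPK" ON "DbTranImageStatus";'
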
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Vacuum Full Analyze Stalled

From:
"Kevin Grittner"
Date:
My goal is to avoid vacuum full in production.  My understanding is
that it is never necessary if vacuums are done aggressively enough,
but I felt that while we were in beta test mode, it was worthwhile for
me to have it run periodically, with the verbose option, to provide
information about where we might need to adjust our vacuum
schedule or fsm settings.

Since the long-term blocking on the constraint index occurred, I have
asked that we run these during non-peak loads, and I'm getting to
the point where I think I can be satisfied with the verbose results of
a normal vacuum for these purposes, even though it provides less
detail.  Are there any tools which provide the level of detail you get
from vacuum full verbose without the problems?

When our small table has bloated, we have tried vacuum full in an
attempt to eliminate the bloat, but we have had to resort to reindex
table to clean things up adequately.  We have tried cluster, which
also worked -- but there doesn't seem to me to be any real
advantage over vacuum followed by reindex for our small, frequently
updated table, since it is rarely accessed sequentially.  Am I missing
something there?

-Kevin


Re: Vacuum Full Analyze Stalled

From
"Kevin Grittner"
Date:
I hate to clutter the list with another post on this, but I just did
exactly what Tom asked me not to do, which is to confuse what
type of vacuum was run.  The vacuum involved in the constraint
index problem was NOT a vacuum full, but a normal vacuum of
the database.  Sorry for mis-stating the issue a few minutes ago.

-Kevin


Re: Vacuum Full Analyze Stalled

From
Tom Lane
Date:
[ I just noticed that this thread is happening on pgsql-admin, which is
  completely inappropriate for discussing bugs in a beta version.
  Please redirect followups to pgsql-hackers. ]

I wrote:
> ... The hypothesis I'm thinking about is that VACUUM is trying to do
> LockBufferForCleanup() and for some reason it never finishes.

I set up a simple-minded reproduction of Kevin's situation: I did

    create domain dint as int check (value > 0);
    create table manyd  (f1 dint, f2 dint, f3 dint,
    f4 dint, f5 dint, f6 dint, f7 dint, f8 dint, f9 dint, f10 dint);

and then ran ten concurrent clients doing this continuously:

    insert into manyd values(1,2,3,4,5,6,7,8,9,10);

which should be enough to force a lot of indexscans on
pg_constraint_contypid_index.  I added an additional client doing

    create domain d1 as int check (value > 0);
    drop domain d1;

to ensure that there were dead rows needing vacuuming in pg_constraint.
(BTW, Tatsuo's new version of pg_bench lets me do all this without
writing a line of code...)
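
For anyone trying to reproduce this, the same workload could be packaged
as two pgbench custom scripts along the lines below; the file names and
run parameters are my own invention, and this assumes a pgbench build
that accepts custom script files via -f:

    -- insert_manyd.sql: run by the ten concurrent clients, e.g.
    --   pgbench -n -f insert_manyd.sql -c 10 -t 100000 testdb
    insert into manyd values(1,2,3,4,5,6,7,8,9,10);

    -- churn_domains.sql: run in parallel by one extra client to leave
    -- dead rows in pg_constraint, e.g.
    --   pgbench -n -f churn_domains.sql -c 1 -t 100000 testdb
    create domain d1 as int check (value > 0);
    drop domain d1;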

Finally, I added some debug printouts to LockBufferForCleanup so I
could see if it was being executed or not.

Then I tried both manual and autovacuum-driven vacuums of pg_constraint.
I was able to see from the debug printouts that LockBufferForCleanup was
sometimes forced to wait in both cases.  But it never got "stuck".

This eliminates one thing I was worrying about, which was the
possibility that the LockBufferForCleanup waiting path was completely
broken inside autovacuum for some reason.  But it doesn't get us a whole
lot closer to a solution.

At this point I think we need more info from Kevin and Jeff before we
can go further.  There must be some additional special feature of their
application that makes the problem appear, but what?

A stack trace of the stuck process would definitely help...

            regards, tom lane

Re: Vacuum Full Analyze Stalled

From
"Jeff Kirby"
Date:
Thanks Tom...

The Linux box was completely idle (as you already guessed).  There were multiple locks on the table(s) in question.
And to answer your question, we are heavily using domain types.  I suspect (to refresh your memory) that the problem
reported earlier was from Kevin Grittner.  Do you know if there are any plans to have better utilization of domain types
in Postgres?
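
The pg_locks check Tom suggested can be as simple as the sketch below
(column names as in the 8.x catalogs; the query is only an illustration):

    -- List locks held or awaited by each backend, with the relation
    -- name where the lock is on a table or index.
    SELECT l.pid, c.relname, l.mode, l.granted
      FROM pg_locks l
      LEFT JOIN pg_class c ON c.oid = l.relation
     ORDER BY l.pid;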

The steps I took to resolve the problem were to kill all of the processes, restart the Postgres service, and then
run "vacuum full analyze" after a successful restart.  That went through without a hitch.

Thanks again... let me know if you can think of anything that could/would prevent this problem in the future (outside
of eliminating domain types).

Jeff Kirby

>>> Tom Lane <tgl@sss.pgh.pa.us> 10/02/05 8:53 PM >>>
"Jeff Kirby" <Jeff.Kirby@wicourts.gov> writes:
> the Linux box however is still chugging away this morning... and
> appears to be stuck on vacuuming "pg_constraint_contypid_index".  How
> do I know... well I don't really... I'm inferring based on the order
> of the log output on the Windows box.

Looking in pg_locks would give you a more reliable indicator of what the
VACUUM is currently working on.

Is the Linux box otherwise idle?  There was another report recently of a
vacuum hung up for a long time on pg_constraint_contypid_index, but it
seemed to be due to extremely heavy usage of domain types ...

            regards, tom lane