Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

Поиск
Список
Период
Сортировка
От Brian Ghidinelli
Тема Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
Дата
Msg-id F7A6B776-6DD9-466E-822D-C28CBC3A8A2B@vfive.com
обсуждение исходный текст
Ответы Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
Список pgsql-bugs
=C3=81lvaro Herrera wrote:
> This smells like it's looping waiting for a multixact to be fully
> written out ... except that the uninterruptibility part of that was =
fixed in
> time for 9.4.0,
>=20
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Branch: master Release: REL9_5_BR [51f9ea25d] 2014-11-14 15:14:01 =
-0300
> Branch: REL9_4_STABLE Release: REL9_4_0 [137e4da6d] 2014-11-14 =
15:14:02 -0300
> Branch: REL9_3_STABLE Release: REL9_3_6 [d45e8dc52] 2014-11-14 =
15:14:02 -0300
>=20
> =
http://git.postgresql.org/gitweb/?p=3Dpostgresql.git;a=3Dcommit;h=3D137e4d=
a6d
>=20
>=20
> Can you attach to the looping process with gdb when it's doing the
> select() dance, and obtain a backtrace?  You need debug symbols
> installed.

Alvaro, thanks for the quick response. I=E2=80=99ve done some additional =
research and found your involvement on other threads in 2014 and 2015 =
that have allowed me to collect more supporting details. In a nutshell, =
it appears to be multixid related and /maybe/ pg_upgrade related; I =
don=E2=80=99t know if we used it on the 9.3 branch or not. I don=E2=80=99t=
 have the debug symbols but I did attach gdb and could see it was =
working on multixid. I don=E2=80=99t have the ability to compile from =
source on this production server unfortunately but I=E2=80=99m happy to =
do whatever else I can to help the team find a solution.

First, I checked the pg_upgrade changelog note in 9.4.4 and here=E2=80=99s=
 what I found in my install:

* pg_controldata DOES report =E2=80=9CLatest checkpoint=E2=80=99s =
oldestMultiXid=E2=80=9D =3D 1
* There IS a file named 0000 in PGDATA/pg_multixact/offsets

pg_controldata:

# /usr/pgsql-9.4/bin/pg_controldata ~postgres/9.4/data/
pg_control version number:            942
Catalog version number:               201409291
Database system identifier:           6123001933830237052
Database cluster state:               in production
pg_control last modified:             Thu 18 Feb 2016 09:53:29 AM PST
Latest checkpoint location:           130/6FBD9E58
Prior checkpoint location:            130/6F9ED950
Latest checkpoint's REDO location:    130/6FAEEFC0
Latest checkpoint's REDO WAL file:    00000001000001300000006F
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/90358137
Latest checkpoint's NextOID:          5523460
Latest checkpoint's NextMultiXactId:  84342
Latest checkpoint's NextMultiOffset:  167545
Latest checkpoint's oldestXID:        654
Latest checkpoint's oldestXID's DB:   16416
Latest checkpoint's oldestActiveXID:  90358137
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 16416
Time of latest checkpoint:            Thu 18 Feb 2016 09:52:31 AM PST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current wal_log_hints setting:        off
Current max_connections setting:      100
Current max_worker_processes setting: 8
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0


the autovacuum log status:

# select pg_class.relname, pg_locks.transactionid, pg_locks.mode,
   pg_locks.granted as "g", left(pg_stat_activity.query, 60) as query,
   pg_stat_activity.query_start,
   age(now(),pg_stat_activity.query_start) as "age",
   pg_stat_activity.pid=20
 from pg_stat_activity,pg_locks
 left outer join pg_class on (pg_locks.relation =3D pg_class.oid) =20
 where pg_locks.pid=3Dpg_stat_activity.pid
 and pg_stat_activity.pid =3D 14578
 order by query_start;

                 relname                 | transactionid |           =
mode           | g |                     query                     |     =
     query_start          |       age       |  pid =20
=
-----------------------------------------+---------------+----------------=
----------+---+-----------------------------------------------+-----------=
--------------------+-----------------+-------
 idx_clubmember_memberid                 |               | =
RowExclusiveLock         | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578
 idx_lookupclubmember_uidmember          |               | =
RowExclusiveLock         | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578
 idx_tbllookupclubmember_uidclubofrecord |               | =
RowExclusiveLock         | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578
 xpktbllookupclubmember_mem              |               | =
RowExclusiveLock         | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578
 tbllookupclubmember_pkey                |               | =
RowExclusiveLock         | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578
                                         |               | ExclusiveLock =
           | t | autovacuum: VACUUM public.tbllookupclubmember | =
2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
 tbllookupclubmember                     |               | =
ShareUpdateExclusiveLock | t | autovacuum: VACUUM =
public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | =
06:55:04.883172 | 14578


gdb bt for pid 14578:

Program received signal SIGINT, Interrupt.
0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
(gdb) bt
#0  0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
#1  0x000000000077aa9a in pg_usleep ()
#2  0x00000000004b66af in GetMultiXactIdMembers ()
#3  0x0000000000487b93 in heap_prepare_freeze_tuple ()
#4  0x0000000000585e07 in lazy_vacuum_rel ()
#5  0x0000000000583a75 in ?? ()
#6  0x0000000000583e37 in vacuum ()
#7  0x0000000000616a06 in ?? ()
#8  0x00000000006171db in ?? ()
#9  0x0000000000617289 in StartAutoVacWorker ()
#10 0x0000000000624cf2 in ?? ()
#11 <signal handler called>
#12 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
#13 0x0000000000621f76 in ?? ()
#14 0x000000000062619a in PostmasterMain ()
#15 0x00000000005bd52d in main ()
(gdb) cont
Continuing.

A couple other things I saw you request in other threads:

# select oid::regclass, relfrozenxid, relminmxid from pg_class where =
relname =3D 'tbllookupclubmember';
         oid         | relfrozenxid | relminmxid=20
---------------------+--------------+------------
 tbllookupclubmember |     22915818 |        616
(1 row)


# select datname, datfrozenxid, datminmxid from pg_database where =
datname =3D 'prod';
      datname       | datfrozenxid | datminmxid=20
--------------------+--------------+------------
prod |          654 |          1


# ls -al ~postgres/9.4/data/pg_multixact/offsets/
total 368
drwx------ 2 postgres postgres   4096 Mar  5  2015 .
drwx------ 4 postgres postgres   4096 Mar  5  2015 ..
-rw------- 1 postgres postgres 262144 Mar  5  2015 0000
-rw------- 1 postgres postgres  81920 Feb 18 09:47 0001



In this thread: =
http://postgresql.nabble.com/BUG-11264-Auto-vacuum-wraparound-job-blocking=
-everything-td5816237.html, I saw you recommend this fix:

> > How do we fix the current issue with this one server? Is there an =
easy=20
> >fix?=20
> > Thanks.=20
>=20
> As far as I am aware, you should=20
>   UPDATE pg_database SET datminmxid=3D20783=20
>=20
> and that should fix it.  The oldestMulti value in pg_control would get=20=

> updated by itself some time later.  If you experience stalls before=20
> oldestMulti fixes itself, you could stop the server (cleanly!) and =
then=20
> pg_resetxlog -m x,y where x is the correct nextMulti value from=20
> pg_controldata and y is 20783.=20



Is that something I should/could do here? I=E2=80=99ve temporarily =
disabled autovacuum on the offending table until we can figure it out.

Thank you!



Brian

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Jeff Janes
Дата:
Сообщение: Re: BUG #13971: SysV shm is used regardless of config
Следующее
От: Jim Nasby
Дата:
Сообщение: Re: BUG #13964: unexpected result from to_tsvector