[GENERAL] Autovacuum stuck for hours, blocking queries

Поиск
Список
Период
Сортировка
I have a postgres 9.3.4 database table which (intermittently but reliably) gets into a state where queries get blocked
indefinitely(at least for many hours) behind an automatic vacuum. I was under the impression that vacuum should never
takeany blocking locks for any significant period of time, and so would like help resolving the issue. 

The process blocking the query is:
postgres 21985 11304 98 Feb13 ?        1-14:20:52 postgres: autovacuum worker process   <db_name>
which is running the query
autovacuum: VACUUM public.<table_name>

The query being blocked is:
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP DEFAULT
(But I have seen this previously with other queries being blocked. I used the SQL in
https://wiki.postgresql.org/wiki/Lock_Monitoringto determine which queries were blocked) 

Notes:
  - This database table is used for about 6 million row writes per day, all of which are then deleted at the end of the
day.
  - We have a reasonable number of systems running identical databases and near identical load profiles, and this
problemhas only been seen on two of those systems. It occurs intermittently but reliably (once every few days). It
persistsuntil postgres is restarted (usually accomplished by a reboot) or the autovacuum thread is forcibly killed.
Althoughit often happens again after a short period of time when the autovacuum worker process starts. 
  - I don't have data on whether this vacuum ever finishes - it certainly takes longer than a few hours.
  - Our application kicks off a manual vacuum against this table each night which doesn't hit this problem, as far as
we'reaware. 
  - I have sometimes seen this behaviour when there was database corruption. In that case I see logs like the
following.But it also occurs when there are no logs indicating database corruption (which is the purpose of this email) 
    LOG:  no left sibling (concurrent deletion?) in "<index_name>"
    ERROR:  right sibling's left-link doesn't match: block 41 links to 127 instead of expected 79 in index
"<index_name>"

Things that I have tried:
  - Printing out backtrace a few times from connecting to that process with gdb (see below)
    - This indicates that the vacuum process isn't stuck, but I don't know what else it says
  - Reading the FAQ and release notes of other 9.3.x versions to see whether this looks like a known issue. I couldn't
seeanything obvious apart from an issue about 0000 files (which didn't apply to my system - there was SQL in the
releasenote to check). 
  - Forcibly killing the autovacuum thread (obviously not a good solution to the problem!). This causes the ALTER TABLE
queryto complete correctly. 

Questions:
  - Is this expected? I was under the impression that vacuum should never take any blocking locks for any significant
periodof time. 
  - Should I report this as a bug?
  - Should I change some config options to avoid this? (I know turning off autovacuum would avoid it, but that's not
goodpractice) I've put the vacuum settings for my database below. 
  - Should I avoid certain queries which autovacuum is happening?
  - Are there more diags that I should gather to help diagnose this issue?

Thanks!

Tim

--System details--
Postgres 9.3.4 running on Linux x64 VMWare virtual machine. 24 CPUs of 2.7 GHz. 48GB RAM
It is also running another application (so the database is not the only thing on the server).


---Backtrace (from gdb) ---

#0  0x00000000007328c3 in hash_search_with_hash_value ()
#1  0x00000000006369ba in BufTableLookup ()
#2  0x0000000000639192 in ?? ()
#3  0x0000000000639ade in ReadBufferExtended ()
#4  0x00000000004906d9 in _bt_getbuf ()
#5  0x00000000004910e1 in _bt_pagedel ()
#6  0x00000000004924d1 in ?? ()
#7  0x00000000004926da in ?? ()
#8  0x00000000004928fa in btbulkdelete ()
#9  0x000000000072cdbd in FunctionCall4Coll ()
#10 0x000000000048cf58 in index_bulk_delete ()
#11 0x000000000057bb85 in ?? ()
#12 0x000000000057cfe9 in lazy_vacuum_rel ()
#13 0x000000000057af56 in ?? ()
#14 0x000000000057b28c in vacuum ()
#15 0x000000000060c8fa in ?? ()
#16 0x000000000060cd96 in ?? ()
#17 0x000000000060ce66 in StartAutoVacWorker ()
#18 0x0000000000617602 in ?? ()
#19 <signal handler called>
#20 0x00007f9210c9d393 in __select_nocancel () from /lib64/libc.so.6
#21 0x0000000000618add in PostmasterMain ()
#22 0x00000000005b58d0 in main ()

#0  0x0000000000491107 in _bt_pagedel ()
#1  0x00000000004924d1 in ?? ()
#2  0x00000000004926da in ?? ()
#3  0x00000000004928fa in btbulkdelete ()
...

#0  0x000000000047a1ef in hash_any ()
#1  0x00000000007336e9 in tag_hash ()
#2  0x000000000063916c in ?? ()
#3  0x0000000000639ade in ReadBufferExtended ()
#4  0x00000000004906d9 in _bt_getbuf ()
#5  0x00000000004910e1 in _bt_pagedel ()
#6  0x00000000004924d1 in ?? ()
#7  0x00000000004926da in ?? ()
#8  0x00000000004928fa in btbulkdelete ()
...

---Vacuum settings in postgresql.conf---

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 10                  # 0-100 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
vacuum_cost_limit = 200         # 1-10000 credits

#vacuum_defer_cleanup_age = 0   # number of xacts by which cleanup is delayed

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = on                        # Enable autovacuum subprocess?  'on'
                                        # requires track_counts to also be on.
#log_autovacuum_min_duration = -1       # -1 disables, 0 logs all actions and
                                        # their durations, > 0 logs only
                                        # actions running at least this number
                                        # of milliseconds.
#autovacuum_max_workers = 3             # max number of autovacuum subprocesses
                                        # (change requires restart)
#autovacuum_naptime = 1min              # time between autovacuum runs
#autovacuum_vacuum_threshold = 50       # min number of row updates before
                                        # vacuum
#autovacuum_analyze_threshold = 50      # min number of row updates before
                                        # analyze
#autovacuum_vacuum_scale_factor = 0.2   # fraction of table size before vacuum
#autovacuum_analyze_scale_factor = 0.1  # fraction of table size before analyze
#autovacuum_freeze_max_age = 200000000  # maximum XID age before forced vacuum
                                        # (change requires restart)
#autovacuum_multixact_freeze_max_age = 400000000        # maximum Multixact age
                                        # before forced vacuum
                                        # (change requires restart)
#autovacuum_vacuum_cost_delay = 20ms    # default vacuum cost delay for
                                        # autovacuum, in milliseconds;
                                        # -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for
                                        # autovacuum, -1 means use
                                        # vacuum_cost_limit

#vacuum_freeze_min_age = 50000000
#vacuum_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_freeze_table_age = 150000000


В списке pgsql-general по дате отправления:

Предыдущее
От: Shawn Thomas
Дата:
Сообщение: Re: [GENERAL] Can't restart Postgres
Следующее
От: Adrian Klaver
Дата:
Сообщение: Re: [GENERAL] Can't restart Postgres