Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

From: Andres Freund
Subject: Cluster "stuck" in "not accepting commands to avoid wraparound data loss"
Date:
Msg-id: 20151210215527.GG14789@awork2.anarazel.de
Responses: Re: Cluster "stuck" in "not accepting commands to avoid wraparound data loss"  (Robert Haas <robertmhaas@gmail.com>)
Re: Cluster "stuck" in "not accepting commands to avoid wraparound data loss"  (Jeff Janes <jeff.janes@gmail.com>)
Re: Cluster "stuck" in "not accepting commands to avoid wraparound data loss"  (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
Hi,

I recently started a pgbench benchmark (to evaluate a piece of hardware,
not postgres) with master. Unfortunately, by accident, I started
postgres in a shell, not screen like pgbench.

Just logged back in and saw:
client 71 aborted in state 8: ERROR:  database is not accepting commands to avoid wraparound data loss in database
"postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 97
number of threads: 97
duration: 300000 s
number of transactions actually processed: 2566862424
latency average: 3.214 ms
latency stddev: 7.336 ms
tps = 30169.374133 (including connections establishing)
tps = 30169.378406 (excluding connections establishing)

Hm. Bad news. We apparently didn't keep up vacuuming. But worse news is
that even now, days later, autovacuum hasn't progressed:
postgres=# select txid_current();
ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.
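For the record, the recovery procedure the HINT refers to looks roughly like this (a non-runnable transcript, with an illustrative data directory path; stop the postmaster first, then run the stand-alone backend against the affected database):

```
$ pg_ctl stop -D /path/to/data
$ postgres --single -D /path/to/data postgres

PostgreSQL stand-alone backend
backend> VACUUM;
backend>        (exit with Ctrl-D, then restart the postmaster normally)
```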

Looking at datfrozenxid:
postgres=# select datname, datfrozenxid, age(datfrozenxid) FROM pg_database;
  datname  | datfrozenxid |    age
-----------+--------------+-----------
 template1 |   3357685367 |         0
 template0 |   3357685367 |         0
 postgres  |   3159867733 | 197817634
(3 rows)
reveals that the launcher doesn't do squat because it doesn't think it
needs to do anything.

(gdb) p *ShmemVariableCache
$3 = {nextOid = 24576, oidCount = 0, nextXid = 3357685367, oldestXid = 1211201715,
  xidVacLimit = 1411201715, xidWarnLimit = 3347685362, xidStopLimit = 3357685362,
  xidWrapLimit = 3358685362, oldestXidDB = 12380, oldestCommitTs = 0,
  newestCommitTs = 0, latestCompletedXid = 3357685366}
 

'oldestXid' shows the problem: we are indeed very close to a
wraparound.
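As a cross-check, the limits in the dump can be recomputed from oldestXid alone, following the arithmetic of SetTransactionIdLimit() in varsup.c (a sketch assuming the default autovacuum_freeze_max_age of 200 million; the skip over FirstNormalTransactionId that the real code performs on wraparound is omitted, since none of these limits wrap here):

```python
# Recompute the ShmemVariableCache limits from oldestXid, mirroring
# SetTransactionIdLimit() (toy arithmetic, not PostgreSQL code).
MAX_XID = 0xFFFFFFFF
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000   # default GUC value

oldest_xid = 1211201715
xid_wrap_limit = oldest_xid + (MAX_XID >> 1)   # 2^31 - 1 XIDs ahead
xid_stop_limit = xid_wrap_limit - 1_000_000    # refuse new XIDs here
xid_warn_limit = xid_stop_limit - 10_000_000   # start warning here
xid_vac_limit = oldest_xid + AUTOVACUUM_FREEZE_MAX_AGE  # force autovacuum

print(xid_wrap_limit, xid_stop_limit, xid_warn_limit, xid_vac_limit)
# -> 3358685362 3357685362 3347685362 1411201715, matching the gdb dump

next_xid = 3357685367
print(next_xid >= xid_stop_limit)   # -> True: new commands are refused
```

All four values match the dump exactly, so the limits themselves are consistent; it's oldestXid that has failed to advance.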


The question is, how did we get here? My current working theory, not
having any logs available, is that two autovacuum workers ran at the
same time. Both concurrently entered vac_update_datfrozenxid(). Since
neither had committed at that point, they couldn't see each other's
updates to datfrozenxid. And thus vac_truncate_clog(), called by both,
won't see a changed horizon.

Does that make sense?

If so, what can we do about it? After chatting a bit with Alvaro I can
see two avenues:

1) Hold a self-conflicting lock on pg_database in vac_truncate_clog(),
   and don't release the lock until transaction end. As the
   pg_database scan uses a fresh snapshot, that ought to guarantee
   progress.

2) Do something like vac_truncate_clog() in the autovacuum launcher,
   once every idle cycle or so. That'd then unwedge us.

Neither of these sounds particularly pretty.
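To make the race concrete, here's a toy model in plain Python (not PostgreSQL internals; vac_cycle and the two-database setup are made up for illustration). Each worker sees its own datfrozenxid update but reads the other rows through its transaction-start snapshot, so neither scan observes a moved minimum; the second scenario models approach 1, where a lock held to transaction end means the second worker's fresh snapshot includes the first's committed update:

```python
def vac_cycle(committed, snapshot, mydb, new_frozen, state):
    """Toy vac_update_datfrozenxid() + vac_truncate_clog() for one worker.

    `snapshot` is the worker's transaction-start view of pg_database;
    the worker sees its own uncommitted update but nobody else's.
    """
    local = dict(snapshot)
    local[mydb] = new_frozen                 # own, still-uncommitted update
    horizon = min(local.values())            # vac_truncate_clog's scan
    state['oldestXid'] = max(state['oldestXid'], horizon)  # only advances
    committed[mydb] = new_frozen             # transaction commits

# Race: both workers take snapshots before either commits.
committed = {'template0': 100, 'postgres': 100}
state = {'oldestXid': 100}
snap_a = dict(committed)                     # A's snapshot
snap_b = dict(committed)                     # B's snapshot, also pre-commit
vac_cycle(committed, snap_a, 'template0', 500, state)
vac_cycle(committed, snap_b, 'postgres', 500, state)
print(committed, state['oldestXid'])
# Both rows now read 500, yet oldestXid is stuck at 100 -- and since
# every datfrozenxid looks fresh, the launcher sees nothing to do.

# Approach 1: serialized workers take their snapshot after the prior
# worker committed, so the second scan sees the moved horizon.
committed = {'template0': 100, 'postgres': 100}
state = {'oldestXid': 100}
vac_cycle(committed, dict(committed), 'template0', 500, state)
vac_cycle(committed, dict(committed), 'postgres', 500, state)
print(state['oldestXid'])                    # advances to 500
```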


Additionally, something else has to be going on here: why on earth
wasn't an autovacuum started earlier? The above kinda looks like the
vacuums on template* ran at a very similar time, and only pretty
recently.


I left the cluster hanging in its stuck state for now, so we have a
chance to continue investigating.

Greetings,

Andres Freund


