Обсуждение: Vacuum full crash

Поиск
Список
Период
Сортировка

Vacuum full crash

От
"Mikko Partio"
Дата:
Hello list

an interrupted vacuum full has just caused a PG instance to restart and recover. Background:

select version();
                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.1 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14)
(1 row)

I have a largish ( >1TB ) database which is kind of a data warehouse. Recently I had to do some major operations to some of the tables (update all rows in a table etc) which caused major bloat. To remove the bloat, I run VACUUM FULL VERBOSE on the bloated tables. Before the vacuum got finished, I had to abort it due to problems with my own laptop. When I hit ctrl+c to the vacuum, the PG instance went suddenly to recover mode. The logs showed this:

2008-03-29 22:25:15 EET [26841]: [1-1] ERROR:  canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT:  vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [1-1] ERROR:  canceling statement due to user request
2008-03-29 22:25:15 EET [26841]: [2-1] STATEMENT:  vacuum full verbose xyz ;
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC:  cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG:  server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG:  terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING:  terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [26841]: [3-1] PANIC:  cannot abort transaction 3778747509, it was already committed
2008-03-29 22:25:15 EET [6476]: [4-1] LOG:  server process (PID 26841) was terminated by signal 6: Aborted
2008-03-29 22:25:15 EET [6476]: [5-1] LOG:  terminating any other active server processes
2008-03-29 22:25:15 EET [24814]: [48-1] WARNING:  terminating connection because of crash of another server process
2008-03-29 22:25:15 EET [24814]: [49-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-03-29 22:25:15 EET [24814]: [50-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2008-03-29 22:25:15 EET [6476]: [6-1] LOG:  archiver process (PID 6489) exited with exit code 1
2008-03-29 22:25:15 EET [18228]: [1-1] FATAL:  the database system is in recovery mode
2008-03-29 22:25:16 EET [6476]: [7-1] LOG:  all server processes terminated; reinitializing
2008-03-29 22:25:16 EET [18229]: [1-1] LOG:  database system was interrupted; last known up at 2008-03-29 22:20:24 EET
2008-03-29 22:25:16 EET [18229]: [2-1] LOG:  database system was not properly shut down; automatic recovery in progress
2008-03-29 22:25:16 EET [18229]: [3-1] LOG:  redo starts at CB5/16399698
2008-03-29 22:25:22 EET [18229]: [4-1] LOG:  unexpected pageaddr CB4/76FF6000 in log file 3253, segment 35, offset 16736256
2008-03-29 22:25:22 EET [18229]: [5-1] LOG:  redo done at CB5/23FF4C80
2008-03-29 22:25:22 EET [18229]: [6-1] LOG:  last completed transaction was at log time 2008-03-29 22:22:47.931231+02
2008-03-29 22:25:23 EET [18336]: [1-1] FATAL:  the database system is in recovery mode
2008-03-29 22:25:27 EET [18337]: [1-1] FATAL:  the database system is in recovery mode
2008-03-29 22:25:30 EET [18346]: [1-1] FATAL:  the database system is in recovery mode
2008-03-29 22:25:41 EET [18424]: [1-1] FATAL:  the database system is in recovery mode
2008-03-29 22:25:43 EET [18427]: [1-1] LOG:  autovacuum launcher started
2008-03-29 22:25:43 EET [6476]: [8-1] LOG:  database system is ready to accept connections


Seems quite serious to me ("cannot abort a transaction that has already committed"), what can cause such behaviour?

Regards

Mikko

Re: Vacuum full crash

От
Tom Lane
Дата:
"Mikko Partio" <mpartio@gmail.com> writes:
> 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC:  cannot abort transaction
> 3778747509, it was already committed

Yeah, this is a well-known problem with VACUUM FULL ...

            regards, tom lane

Re: Vacuum full crash

От
"Mikko Partio"
Дата:


On Sun, Mar 30, 2008 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Mikko Partio" <mpartio@gmail.com> writes:
> 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC:  cannot abort transaction
> 3778747509, it was already committed

Yeah, this is a well-known problem with VACUUM FULL ...



Ok, that's good to know. Is there a bugfix on the way or is it more like a 'feature'?

Regards

Mikko

Re: Vacuum full crash

От
Tom Lane
Дата:
"Mikko Partio" <mpartio@gmail.com> writes:
> On Sun, Mar 30, 2008 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Mikko Partio" <mpartio@gmail.com> writes:
>>> 2008-03-29 22:25:15 EET [26841]: [3-1] PANIC:  cannot abort transaction
>>> 3778747509, it was already committed
>>
>> Yeah, this is a well-known problem with VACUUM FULL ...

> Ok, that's good to know. Is there a bugfix on the way or is it more like a
> 'feature'?

It would be good to fix it I suppose, but what with VACUUM FULL being on
the edge of deprecation anyway, it's hard to muster enthusiasm for doing
a (probably) large additional amount of work on it,  I think most
hackers would prefer to spend time on making it unnecessary than making
it more bulletproof.

Note that there is (AFAIK) no risk of loss of committed data from this
bug.  The fact that the PANIC kills all your other sessions is annoying,
but in the scenarios where this is likely to happen you're trying to
stop them all anyway ...

            regards, tom lane

Re: Vacuum full crash

От
"Mikko Partio"
Дата:


On Sun, Mar 30, 2008 at 6:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It would be good to fix it I suppose, but what with VACUUM FULL being on
the edge of deprecation anyway, it's hard to muster enthusiasm for doing
a (probably) large additional amount of work on it,  I think most
hackers would prefer to spend time on making it unnecessary than making
it more bulletproof.


Interesting. What is planned substitute for VACUUM FULL? CLUSTER?

Regards

Mikko

Re: Vacuum full crash

От
Tom Lane
Дата:
"Mikko Partio" <mpartio@gmail.com> writes:
> Interesting. What is planned substitute for VACUUM FULL? CLUSTER?

What's been discussed is something that works roughly like CLUSTER
but doesn't bother to sort the data any particular way (ie, just do
a seqscan not an indexscan in the data-copying phase).  This would
certainly dominate VACUUM FULL in contexts where you've removed a
large fraction of the rows.  In cases where you haven't, the large
transient space requirement (2x the table size) would be objectionable,
but it's not real clear why you need a VACUUM FULL rather than plain
VACUUM in such cases anyway.

A lot of the context for this arises from recent and planned
improvements, notably

(1) CLUSTER is MVCC-safe now, removing one of the core arguments
for not using it for routine cleanup.

(2) Planned fixes like the dead space map will substantially
reduce the cost of running frequent plain vacuums.  That together
with autovacuum (which we could make more aggressive in consequence)
should help to avoid getting into situations where VACUUM FULL is
needed in the first place.

We certainly couldn't have considered eliminating VACUUM FULL before
this, but it's starting to seem a viable option to me.

            regards, tom lane