Aborted VACUUM FULL -> crash + corruption (xlog non-existent)

Поиск
Список
Период
Сортировка
От Peter Schuller
Тема Aborted VACUUM FULL -> crash + corruption (xlog non-existent)
Дата
Msg-id 20080826161010.GA49017@hyperion.scode.org
обсуждение исходный текст
Ответы Re: Aborted VACUUM FULL -> crash + corruption (xlog non-existent)  (Alvaro Herrera <alvherre@commandprompt.com>)
Список pgsql-bugs
Hello,

[note: in pastes below, the only thing changes are database names and
hostnames, for privacy reasons; otherwise plain cut'n'paste including
typos...)

Very short version: I cancelled a VACUUM FULL; server crashed; won't
start again because it tries to access an obsolete WAL log file that
is likely to be from the point in time when VACUUM FULL started.

Long version:

The following refers to PostgreSQL 8.2.4 running on Linux 2.6.19-3 on
an old sarge system. I realize 8.2.4 is too old a release, but the bug
seems distinct enough that I felt I should definitely report it. I did
not find anything that looked relevant in the point release changelog
for 8.2.

Background is that I have a couple of databases on the system with >
1000 relations (if that is at all relevant). Because of autovacuum
issues (likely having to do with a too small max_fsm_relations) I
wanted to perform a VACUUM FULL.

I did, but realized I actually wanted VACUUM VERBOSE FULL. So after a
few seconds I hit ctrl-c in the psql client (run locally on the
machine as the postgres user; thus ident authed and connected as the
postgres PostgreSQL user). Output was:

contentsharing=3D# \timing
Timing is on.
contentsharing=3D# vacuum full
contentsharing-# ;
Cancel request sent
ERROR:  canceling statement due to user request
PANIC:  cannot abort transaction 830706491, it was already committed
server closed the connection unexpectedly
       This probably means the server terminated abnormally
       before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>=20
!> \q

After this I tried re-starting the database which failed (i.e., it was
not up after the start attempt). Checking the log file I had:

Aug 26 17:14:52 host postgres[5094]: [3116-1] user=3D,db=3D LOG:  archived =
transaction log file "0000000100000018000000ED"
Aug 26 17:16:06 host postgres[28467]: [2-1] user=3Dpostgres,db=3Ddatabasena=
me ERROR:  canceling statement due to user request
Aug 26 17:16:06 host postgres[28467]: [2-2] user=3Dpostgres,db=3Ddatabasena=
me STATEMENT:  vacuum full
Aug 26 17:16:06 host postgres[28467]: [2-3] ^I;
Aug 26 17:16:06 host postgres[28467]: [3-1] user=3Dpostgres,db=3Ddatabasena=
me PANIC:  cannot abort transaction 830706491, it was already committed
Aug 26 17:16:06 host postgres[5091]: [2-1] user=3D,db=3D LOG:  server proce=
ss (PID 28467) was terminated by signal 6
Aug 26 17:16:06 host postgres[5091]: [3-1] user=3D,db=3D LOG:  terminating =
any other active server processes
Aug 26 17:16:06 host postgres[16307]: [2792-1] user=3Ddatabasename,db=3Ddat=
abasename WARNING:  terminating connection because of crash of another serv=
er process
Aug 26 17:16:06 host postgres[16307]: [2792-2] user=3Ddatabasename,db=3Ddat=
abasename DETAIL:  The postmaster has commanded this server process to roll=
 back the current
Aug 26 17:16:06 host postgres[16307]: [2792-3]  transaction and exit, becau=
se another server process exited abnormally and possibly corrupted shared m=
emory.
Aug 26 17:16:06 host postgres[16307]: [2792-4] user=3Ddatabasename,db=3Ddat=
abasename HINT:  In a moment you should be able to reconnect to the databas=
e and repeat your
Aug 26 17:16:06 host postgres[16307]: [2792-5]  command.
Aug 26 17:16:06 host postgres[5091]: [4-1] user=3D,db=3D LOG:  all server p=
rocesses terminated; reinitializing
Aug 26 17:16:07 host postgres[28483]: [5-1] user=3Dpostgres,db=3Ddatabasena=
me FATAL:  the database system is starting up
Aug 26 17:16:07 host postgres[28482]: [5-1] user=3D,db=3D LOG:  database sy=
stem was interrupted at 2008-08-26 17:15:52 CEST
Aug 26 17:16:07 host postgres[28482]: [6-1] user=3D,db=3D LOG:  could not o=
pen file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No =
such file or directory
Aug 26 17:16:07 host postgres[28482]: [7-1] user=3D,db=3D LOG:  invalid pri=
mary checkpoint record
Aug 26 17:16:07 host postgres[28482]: [8-1] user=3D,db=3D LOG:  could not o=
pen file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No =
such file or directory
Aug 26 17:16:07 host postgres[28482]: [9-1] user=3D,db=3D LOG:  invalid sec=
ondary checkpoint record
Aug 26 17:16:07 host postgres[28482]: [10-1] user=3D,db=3D PANIC:  could no=
t locate a valid checkpoint record
Aug 26 17:16:07 host postgres[5091]: [5-1] user=3D,db=3D LOG:  startup proc=
ess (PID 28482) was terminated by signal 6
Aug 26 17:16:07 host postgres[5091]: [6-1] user=3D,db=3D LOG:  aborting sta=
rtup due to startup process failure

And the following is presumably from my manual start-up attempt (based on t=
imestamps):

Aug 26 17:17:11 host postgres[28511]: [1-1] user=3D,db=3D LOG:  could not b=
ind socket for statistics collector: Cannot assign requested address
Aug 26 17:17:11 host postgres[28511]: [2-1] user=3D,db=3D LOG:  disabling s=
tatistics collector for lack of working socket
Aug 26 17:17:11 host postgres[28511]: [3-1] user=3D,db=3D WARNING:  autovac=
uum not started because of misconfiguration
Aug 26 17:17:11 host postgres[28511]: [3-2] user=3D,db=3D HINT:  Enable opt=
ions "stats_start_collector" and "stats_row_level".
Aug 26 17:17:11 host postgres[28512]: [4-1] user=3D,db=3D LOG:  database sy=
stem was interrupted at 2008-08-26 17:15:52 CEST
Aug 26 17:17:11 host postgres[28512]: [5-1] user=3D,db=3D LOG:  could not o=
pen file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No =
such file or directory
Aug 26 17:17:11 host postgres[28512]: [6-1] user=3D,db=3D LOG:  invalid pri=
mary checkpoint record
Aug 26 17:17:11 host postgres[28512]: [7-1] user=3D,db=3D LOG:  could not o=
pen file "pg_xlog/0000000100000018000000ED" (log file 24, segment 237): No =
such file or directory
Aug 26 17:17:11 host postgres[28512]: [8-1] user=3D,db=3D LOG:  invalid sec=
ondary checkpoint record
Aug 26 17:17:11 host postgres[28512]: [9-1] user=3D,db=3D PANIC:  could not=
 locate a valid checkpoint record
Aug 26 17:17:11 host postgres[28511]: [4-1] user=3D,db=3D LOG:  startup pro=
cess (PID 28512) was terminated by signal 6
Aug 26 17:17:11 host postgres[28511]: [5-1] user=3D,db=3D LOG:  aborting st=
artup due to startup process failure

The pg_xlog directory contains:

-rw-------  1 postgres postgres 16777216 Aug 26 17:16 000000010000001800000=
0EE
drwx------  2 postgres postgres   305232 Aug 26 17:14 archive_status

Note that the archival of the ....ED xlog file started at 17:14:52,
and I cancelled the VACUUM FULL at 17:16:06.

It is entirely likely, though I cannot say for sure, that I started
the VACUUM FULL prior to the archival of ...ED.

I still have the entire pgsql database directory untouched except for
the manual start-up attempt. I am afraid I cannot dump the thing
somewhere for inspection (it contains confidential information), but I
can assist in inspecting the contents.

--=20
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: BUG #4281: some types of errors do not log statements
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: Aborted VACUUM FULL -> crash + corruption (xlog non-existent)