Re: PANIC during VACUUM

From: German Becker
Subject: Re: PANIC during VACUUM
Msg-id: CALyjCLvoWybyrcmvH65J=rOYPp1bO6fUbMSBEOUjh8mQd3uv1Q@mail.gmail.com
In reply to: Re: PANIC during VACUUM (Kevin Grittner <kgrittn@ymail.com>)
List: pgsql-admin
OK, I apologise for the lack of clarity in the first message. Let me summarize the steps that led me to the error.
I have 2 servers running Ubuntu 12.04 on which I am testing Postgres 9.1.9. I set up streaming replication between them (no synchronous replication).
Both servers have 4 SATA hard drives with the ext3 file system, set up as follows:

sda --> / (main OS and the database files, except for the ones defined below)
sdb --> pg_xlog directory
sdc --> one tablespace where heavy-transaction tables are stored
sdd --> another tablespace where big historic tables are stored

Archive mode is on and the archive location is on sda (and from there it goes to the hot-standby server).
For testing, I populate the database with the data currently in production (currently on Postgres 8.3).
Then I run several load tests, etc.
To tune the archiving process I needed to generate a large amount of WAL. To do so I just deleted the contents of one big table and then VACUUMed it, like this:

DELETE FROM bigtable;
VACUUM bigtable;

That is when I hit the reported error.
I repeated the whole process (creating a new cluster, populating it with data (always the same data), setting up replication) a couple of times after that and hit the error again about 90% of the time. When I deleted only a big portion of the table, the error did not appear; it only appears after deleting ALL rows. Also, in some cases I didn't run the VACUUM command manually, and the error occurred during autovacuum.
My last test, in case there was a hardware problem on the primary, was to trigger the standby server and try the VACUUM there, with the same results.
Here is a chunk of the log:

2013-04-29 17:02:21 ART [12024]: [32-1] PANIC:  XX001: corrupted item pointer: offset = 8128, size = 80
2013-04-29 17:02:21 ART [12024]: [33-1] LOCATION:  PageIndexMultiDelete, bufpage.c:779
2013-04-29 17:02:21 ART [12024]: [34-1] STATEMENT:  VACUUM callshopcdrs ;
2013-04-29 17:02:21 ART [23787]: [8-1] LOG:  server process (PID 12024) was terminated by signal 6: Aborted
2013-04-29 17:02:21 ART [23787]: [9-1] LOG:  terminating any other active server processes
2013-04-29 17:02:21 ART [7300]: [2-1] WARNING:  terminating connection because of crash of another server process
2013-04-29 17:02:21 ART [7300]: [3-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-04-29 17:02:21 ART [7300]: [4-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2013-04-29 17:02:21 ART [30304]: [1-1] FATAL:  the database system is in recovery mode
2013-04-29 17:02:21 ART [23787]: [10-1] LOG:  archiver process (PID 7301) exited with exit code 1
2013-04-29 17:02:21 ART [23787]: [11-1] LOG:  all server processes terminated; reinitializing
2013-04-29 17:02:21 ART [30305]: [1-1] LOG:  database system was interrupted; last known up at 2013-04-29 16:59:01 ART
2013-04-29 17:02:21 ART [30305]: [2-1] LOG:  database system was not properly shut down; automatic recovery in progress
2013-04-29 17:02:21 ART [30305]: [3-1] LOG:  redo starts at 11/497D4338
2013-04-29 17:02:21 ART [30305]: [4-1] LOG:  invalid magic number 0000 in log file 17, segment 73, offset 8216576
2013-04-29 17:02:21 ART [30305]: [5-1] LOG:  redo done at 11/497D4440
2013-04-29 17:02:22 ART [30308]: [1-1] LOG:  autovacuum launcher started
2013-04-29 17:02:22 ART [23787]: [12-1] LOG:  database system is ready to accept connections
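
Two numeric details in this log excerpt can be checked by hand. The sketch below is only an illustration under assumptions the post does not state: the default 8 kB block size (for both data and WAL pages), the default 16 MB WAL segment size, and the pre-9.3 "xlogid/xrecoff" WAL location format. The function names are made up for this example and are not PostgreSQL APIs. It shows that an item at offset 8128 with size 80 would end past an 8192-byte page, and that the offset in the "invalid magic number" line is the first page boundary after the "redo done" location, which would be consistent with recovery simply reaching the end of the valid WAL (an interpretation, not something the log itself states).

```python
from typing import Tuple

# Assumptions (not stated in the post): default 8 kB block size,
# default 16 MB WAL segments, pre-9.3 "xlogid/xrecoff" location format.
BLOCK_SIZE = 8192
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def item_fits_in_page(offset: int, size: int, page_size: int = BLOCK_SIZE) -> bool:
    """Return True if the byte range [offset, offset + size) lies inside the page."""
    return offset >= 0 and offset + size <= page_size


def decode_wal_location(location: str) -> Tuple[int, int, int]:
    """Split a hex 'xlogid/xrecoff' string into (log file, segment, offset in segment)."""
    xlogid_hex, xrecoff_hex = location.split("/")
    segment, offset = divmod(int(xrecoff_hex, 16), WAL_SEGMENT_SIZE)
    return int(xlogid_hex, 16), segment, offset


def next_page_boundary(offset: int, page_size: int = BLOCK_SIZE) -> int:
    """Return the first multiple of page_size strictly greater than offset."""
    return (offset // page_size + 1) * page_size


if __name__ == "__main__":
    # PANIC line: offset = 8128, size = 80 -> the item would end at byte 8208.
    print(item_fits_in_page(8128, 80))  # False

    # "redo done at 11/497D4440" -> log file 17, segment 73, offset 8209472.
    log_file, segment, offset = decode_wal_location("11/497D4440")
    print(log_file, segment, offset)

    # Next WAL page after the last replayed record: 8216576, the offset
    # reported in the "invalid magic number" line.
    print(next_page_boundary(offset))
```

The log file and segment numbers (17 and 73) match the ones printed in the "invalid magic number" message, so the two recovery lines refer to the same WAL segment.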


A core file was generated; it is 7 GB in size:

$ file core 
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'postgres: postgres tvoip3 [local] VACUUM'

Many thanks for your help, and let me know if any extra information would be useful.

--

German





On Tue, Apr 30, 2013 at 8:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
[please don't top-post]

German Becker <german.becker@gmail.com> wrote:
> Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
>> German Becker wrote:

>>> I am testing version 9.1.9 before putting it in production. One
>>> of my tests involved deleting the contents of a big table ( ~
>>> 13 GB size) and then VACUUMing it. During VACUUM it PANICs.

>> If you mess with the database files, errors like this are to be
>> expected.

> Thanks for your reply. In which sense did I mess with the
> database files?

You didn't say how you deleted the contents of that big table, and
it appears that Albe assumed you deleted or truncated the
underlying disk file rather than using the DELETE or TRUNCATE SQL
statement.

In any event, more details would help people come up with ideas on
what might be wrong.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
