Discussion: PANIC during VACUUM

PANIC during VACUUM

From
German Becker
Date:
Hi,

I am testing version 9.1.9 before putting it in production. One of my tests involved deleting the contents of a big table (~13 GB in size) and then VACUUMing it. The backend PANICs during the VACUUM. Here is the message:

PANIC:  corrupted item pointer: offset = 8128, size = 80

I hit the error a couple of times, always during VACUUM after deleting the context of the same big table (after re-populating it, of course).

The error message is always *exactly* the same i.e. the same offset and size.
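
Editorial note on the numbers: assuming the default 8192-byte block size (the thread does not state it), an item starting at offset 8128 with size 80 would end at byte 8208, which is past the end of the page. The log posted later in the thread places the error in PageIndexMultiDelete (bufpage.c). The sketch below is a simplified Python rendering of the kind of line-pointer sanity check that yields this message. The function names and the `pd_upper` value in the example are illustrative assumptions, not values read from the affected database.

```python
BLCKSZ = 8192          # default PostgreSQL block size (assumed; set at build time)
MAXIMUM_ALIGNOF = 8    # typical maximum alignment on x86-64 (assumed)


def maxalign(n: int) -> int:
    """Round n up to the next multiple of MAXIMUM_ALIGNOF."""
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def item_pointer_problems(offset: int, size: int, pd_upper: int,
                          pd_special: int = BLCKSZ) -> list:
    """Return the reasons a line pointer would be rejected (empty list if none).

    pd_upper is where item data starts on the page; pd_special is where the
    special space starts (at most BLCKSZ). Both are hypothetical inputs here.
    """
    problems = []
    if offset < pd_upper:
        problems.append("item starts before pd_upper")
    if offset + size > pd_special:
        problems.append("item ends past pd_special")
    if offset != maxalign(offset):
        problems.append("offset is not aligned")
    return problems


if __name__ == "__main__":
    # Values from the PANIC message; pd_upper=4096 is made up for illustration.
    print(item_pointer_problems(8128, 80, pd_upper=4096))
```

Whatever the real `pd_special` is, it cannot exceed the block size, so with these numbers the second condition fails on any 8192-byte page.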

When this happens the backend gets restarted and if I issue the same VACUUM command, I get the same error. 

I also tried triggering the backup server (hot standby with streaming replication) and running the VACUUM there (to see if it might be a hardware problem on the primary), and got the same issue.

What might be causing this? Should I report it as a bug? Thanks!

--
Germán

Re: PANIC during VACUUM

From
Albe Laurenz
Date:
German Becker wrote:
> I am testing version 9.1.9 before putting it in production. One of my tests involved deleting the
> contents of a big table (~13 GB in size) and then VACUUMing it. The backend PANICs during the VACUUM.
> Here is the message:
>
> PANIC:  corrupted item pointer: offset = 8128, size = 80
>
>
> I hit the error a couple of times, always during VACUUM after deleting the context of the same big
> table (after re-populating it, of course).
>
> The error message is always *exactly* the same i.e. the same offset and size.
>
> When this happens the backend gets restarted and if I issue the same VACUUM command, I get the same
> error.
>
> I also tried triggering the backup server (hot standby with streaming replication) and running the
> VACUUM there (to see if it might be a hardware problem on the primary), and got the same issue.
>
> What might be causing this? Should I report it as a bug? Thanks!

If you mess with the database files, errors like this are to be expected.
The PANIC and restart is because the error happened during a sensitive phase.

This is not a bug.

Yours,
Laurenz Albe


Re: PANIC during VACUUM

From
German Becker
Date:
Thanks for your reply. In which sense did I mess with the database files?


On Tue, Apr 30, 2013 at 4:34 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> If you mess with the database files, errors like this are to be expected.
> The PANIC and restart is because the error happened during a sensitive phase.
>
> This is not a bug.

Re: PANIC during VACUUM

From
German Becker
Date:
Just in case: there is an error in my first email. Where it says "after deleting the context of the same big table", it should say "after deleting the contents of the same big table". In essence, what I did is:

DELETE from table;
VACUUM table;

And I got the error.


On Tue, Apr 30, 2013 at 8:36 AM, German Becker <german.becker@gmail.com> wrote:
> Thanks for your reply. In which sense did I mess with the database files?


Re: PANIC during VACUUM

From
Kevin Grittner
Date:
[please don't top-post]

German Becker <german.becker@gmail.com> wrote:
> Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
>> German Becker wrote:

>>> I am testing version 9.1.9 before putting it in production. One
>>> of my tests involved deleting the contents of a big table (~13 GB
>>> in size) and then VACUUMing it. The backend PANICs during the VACUUM.

>> If you mess with the database files, errors like this are to be
>> expected.

> Thanks for your reply. In which sense did I mess with the
> database files?

You didn't say how you deleted the contents of that big table, and
it appears that Albe assumed you deleted or truncated the
underlying disk file rather than using the DELETE or TRUNCATE SQL
statement.

In any event, more details would help people come up with ideas on
what might be wrong.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PANIC during VACUUM

From
Albe Laurenz
Date:
German Becker wrote:
> Just in case: there is an error in my first email. Where it says "after deleting the context of the
> same big table", it should say "after deleting the contents of the same big table". In essence, what
> I did is:
>
> DELETE from table;
> VACUUM table;
>
> And I got the error

>> I am testing version 9.1.9 before putting it in production. One of my tests involved deleting the
>> contents of a big table (~13 GB in size) and then VACUUMing it. The backend PANICs during the VACUUM.
>> Here is the message:
>>
>> PANIC:  corrupted item pointer: offset = 8128, size = 80

Sorry for misunderstanding you.
A DELETE should definitely not cause such an error.

Can you provide a reproducible test case?
Is that a new database or could it be some prior corruption?

Yours,
Laurenz Albe


Re: PANIC during VACUUM

From
German Becker
Date:
OK, I apologise for the lack of clarity in the first message. Let me summarize the steps that led me to the error.
I have 2 servers running Ubuntu 12.04 on which I am testing Postgres 9.1.9. I set up streaming replication between them (not synchronous replication).
Both servers have 4 SATA hard drives with the ext3 file system, set up as follows:

sda --> / main OS and the database files, except for the ones defined below
sdb --> pg_xlog directory
sdc --> one tablespace where heavy transaction tables are stored
sdd --> another tablespace where big historic tables are stored

Archive mode is on and the archive location is on sda (and from there the files go to the hot-standby server).
For testing, I populate the database with the data currently in production (currently on Postgres 8.3).
Then I run several load tests, etc.
To tune and improve the archiving process, I needed to generate a large amount of WAL. To do so, I just deleted the contents of one big table and then VACUUMed it, like this:

DELETE FROM bigtable;
VACUUM bigtable;

And I hit the error reported.
I repeated the whole process (creating a new cluster, populating it with data, always the same data, and setting up replication) a couple of times after that, and I hit the error again about 90% of the time. I tried deleting a big portion of the table and the error did not appear. It only appears after deleting ALL rows. Also, in some cases I didn't run the VACUUM command manually, and the error occurred during autovacuum.
My last test, in case there was a hardware problem on the primary, was to trigger the standby server and try the VACUUM there, with the same results.
Here is a chunk of the log:

2013-04-29 17:02:21 ART [12024]: [32-1] PANIC:  XX001: corrupted item pointer: offset = 8128, size = 80
2013-04-29 17:02:21 ART [12024]: [33-1] LOCATION:  PageIndexMultiDelete, bufpage.c:779
2013-04-29 17:02:21 ART [12024]: [34-1] STATEMENT:  VACUUM callshopcdrs ;
2013-04-29 17:02:21 ART [23787]: [8-1] LOG:  server process (PID 12024) was terminated by signal 6: Aborted
2013-04-29 17:02:21 ART [23787]: [9-1] LOG:  terminating any other active server processes
2013-04-29 17:02:21 ART [7300]: [2-1] WARNING:  terminating connection because of crash of another server process
2013-04-29 17:02:21 ART [7300]: [3-1] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-04-29 17:02:21 ART [7300]: [4-1] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2013-04-29 17:02:21 ART [30304]: [1-1] FATAL:  the database system is in recovery mode
2013-04-29 17:02:21 ART [23787]: [10-1] LOG:  archiver process (PID 7301) exited with exit code 1
2013-04-29 17:02:21 ART [23787]: [11-1] LOG:  all server processes terminated; reinitializing
2013-04-29 17:02:21 ART [30305]: [1-1] LOG:  database system was interrupted; last known up at 2013-04-29 16:59:01 ART
2013-04-29 17:02:21 ART [30305]: [2-1] LOG:  database system was not properly shut down; automatic recovery in progress
2013-04-29 17:02:21 ART [30305]: [3-1] LOG:  redo starts at 11/497D4338
2013-04-29 17:02:21 ART [30305]: [4-1] LOG:  invalid magic number 0000 in log file 17, segment 73, offset 8216576
2013-04-29 17:02:21 ART [30305]: [5-1] LOG:  redo done at 11/497D4440
2013-04-29 17:02:22 ART [30308]: [1-1] LOG:  autovacuum launcher started
2013-04-29 17:02:22 ART [23787]: [12-1] LOG:  database system is ready to accept connections
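
Editorial note on the recovery lines: the "invalid magic number" message looks consistent with redo simply reaching the end of the written WAL rather than a second corruption. The sketch below checks the arithmetic, assuming the default 16 MB WAL segment size and 8192-byte WAL page size (neither is stated in the thread). The helper names are hypothetical and only decode the location strings printed in the log.

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size (assumed)
XLOG_BLCKSZ = 8192                 # default WAL page size (assumed)


def parse_lsn(lsn: str):
    """Split a 9.1-style 'logid/recoff' location into
    (log file number, segment number, byte offset within the segment)."""
    log_id_hex, rec_off_hex = lsn.split("/")
    log_id = int(log_id_hex, 16)
    rec_off = int(rec_off_hex, 16)
    return log_id, rec_off // XLOG_SEG_SIZE, rec_off % XLOG_SEG_SIZE


def next_page_start(seg_offset: int) -> int:
    """Offset of the first WAL page boundary after seg_offset."""
    return (seg_offset // XLOG_BLCKSZ + 1) * XLOG_BLCKSZ


if __name__ == "__main__":
    log_id, segment, seg_offset = parse_lsn("11/497D4440")  # "redo done at"
    print(log_id, segment, seg_offset)   # 17 73 8209472
    print(next_page_start(seg_offset))   # 8216576
```

Under those assumptions, the last replayed record sits in log file 17, segment 73, and the next WAL page begins at offset 8216576, which is exactly the offset named in the "invalid magic number" line. A zeroed page header at that point is what one would expect at the end of the written WAL.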


A core file was generated; it is 7 GB in size:

$ file core 
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'postgres: postgres tvoip3 [local] VACUUM'

Many thanks for your help, and let me know of any extra information that might be useful.

--

German





On Tue, Apr 30, 2013 at 8:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> In any event, more details would help people come up with ideas on
> what might be wrong.
>
> http://wiki.postgresql.org/wiki/Guide_to_reporting_problems