Thread: "error with invalid page header" while vacuuming pgbench data

"error with invalid page header" while vacuuming pgbench data

From: John Rouillard
Date:
Hi all:

Not sure if this is a performance question or a generic admin
question. I have the following script running on a host different from
the database to use pgbench to test the database:

  pgbench -i (initial mode)
  pgsql vacuum analyze; (and some other code to dump table sizes)
  pgbench (multiple connections, jobs etc ....)

with a loop for setting different scales ....
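
The loop described above might look roughly like the sketch below. The host name, database name, client count, and run length are placeholders, not values from the original script. By default the helper only prints the commands (DRY_RUN=1), so the sequence can be inspected before it is pointed at a real server.

```shell
#!/bin/sh
# Sketch of the benchmark loop: initialize, vacuum, run, once per scale.
# DB_HOST, DB_NAME, and the pgbench options are hypothetical placeholders.
DB_HOST=${DB_HOST:-dbhost.example.com}
DB_NAME=${DB_NAME:-pgbench}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One cycle for a single scale factor; stops at the first failing step.
bench_cycle() {
    scale=$1
    run pgbench -h "$DB_HOST" -i -s "$scale" "$DB_NAME" || return 1
    run psql -h "$DB_HOST" -d "$DB_NAME" -c 'VACUUM ANALYZE' || return 1
    run pgbench -h "$DB_HOST" -c 16 -T 600 "$DB_NAME" || return 1
}

# Run the cycle for every scale factor given on the command line.
bench_all() {
    for scale in "$@"; do
        bench_cycle "$scale" || { echo "failed at scale $scale" >&2; return 1; }
    done
}
```

Calling `bench_all 100 500 1000` walks through three scale factors. Stopping at the first failure keeps the state of the database intact for inspection when an error such as the one below appears.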

 I seem to be able to provoke this error:

   vacuum...ERROR:  invalid page header in
            block 2128910 of relation base/16385/21476

on a pgbench database created with a scale factor of 1000 relatively
reliably (2 for 2). I am not seeing any disk errors from the raid
controller or the operating system.

Running pg_dumpall to check for errors reports:

   pg_dump: Error message from server: ERROR:  invalid page header in
            block 401585 of relation base/16385/21476

which is different from the originally reported block.
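
For anyone trying to locate the damaged pages on disk, the arithmetic can be sketched as follows. It assumes a default PostgreSQL build (8192-byte blocks, 1 GB segment files), which the thread does not confirm. In the path base/16385/21476, 16385 is the database OID and 21476 is the relation's filenode; segments after the first are named 21476.1, 21476.2, and so on.

```shell
#!/bin/sh
# Map a block number from the error message to a segment file and byte offset.
# BLOCK_SIZE and SEGMENT_BYTES are the stock build defaults (an assumption here).
BLOCK_SIZE=8192
SEGMENT_BYTES=1073741824
BLOCKS_PER_SEGMENT=$((SEGMENT_BYTES / BLOCK_SIZE))

# Usage: block_location <relation path> <block number>
# Prints "<segment file> <byte offset within that file>".
block_location() {
    relpath=$1
    block=$2
    segment=$((block / BLOCKS_PER_SEGMENT))
    offset=$(( (block % BLOCKS_PER_SEGMENT) * BLOCK_SIZE ))
    if [ "$segment" -eq 0 ]; then
        echo "$relpath $offset"
    else
        echo "$relpath.$segment $offset"
    fi
}

block_location base/16385/21476 2128910
block_location base/16385/21476 401585
```

Under those assumptions the two calls print `base/16385/21476.16 260161536` and `base/16385/21476.3 68558848`, so the two reported blocks sit in different segment files. The table behind the filenode can be looked up in the affected database with `SELECT relname FROM pg_class WHERE relfilenode = 21476;`.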

Does anybody have any suggestions?

Configuration details.

OS: centos 5.5
Filesystem: data - ext4 (note 4 not 3); 6.6T formatted
            wal  - ext4; 1.5T formatted
Raid: data - level 10, 8 disk wd2003; controller LSI MegaRAID SAS 9260-4i
      wal  - level 1,  2 disk wd2003; controller LSI MegaRAID SAS 9260-4i

Could it be an ext4 issue? It seems that ext4 may still be at the
bleeding edge for postgres use.

Thanks for any thoughts, even if it's "go to the admin list."

--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

Re: "error with invalid page header" while vacuuming pgbench data

From: Kevin Grittner
Date:
John Rouillard <rouilj@renesys.com> wrote:

>  I seem to be able to provoke this error:
>
>    vacuum...ERROR:  invalid page header in
>             block 2128910 of relation base/16385/21476

What version of PostgreSQL?

-Kevin

Re: "error with invalid page header" while vacuuming pgbench data

From: John Rouillard
Date:
On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
> John Rouillard <rouilj@renesys.com> wrote:
>
> >  I seem to be able to provoke this error:
> >
> >    vacuum...ERROR:  invalid page header in
> >             block 2128910 of relation base/16385/21476
>
> What version of PostgreSQL?

Hmm, I thought I replied to this, but I haven't seen it come back to
me on list.  It's postgres version: 8.4.5.

rpm -q shows

   postgresql84-server-8.4.5-1.el5_5.1

--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

Re: "error with invalid page header" while vacuuming pgbench data

From: Kevin Grittner
Date:
John Rouillard <rouilj@renesys.com> wrote:
> On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
>> John Rouillard <rouilj@renesys.com> wrote:
>>
>> >  I seem to be able to provoke this error:
>> >
>> >    vacuum...ERROR:  invalid page header in
>> >             block 2128910 of relation base/16385/21476
>>
>> What version of PostgreSQL?
>
> Hmm, I thought I replied to this, but I haven't seen it come back
> to me on list.  It's postgres version: 8.4.5.
>
> rpm -q shows
>
>    postgresql84-server-8.4.5-1.el5_5.1

I was hoping someone else would jump in, but I see that your
previous post didn't copy the list, which solves *that* mystery.

I'm curious whether you might have enabled one of the "it's OK to
trash my database integrity to boost performance" options.  (People
with enough replication often feel that this *is* OK.)  Please run
the query on this page and post the results:

http://wiki.postgresql.org/wiki/Server_Configuration

Basically, if fsync or full_page_writes is turned off and there was
a crash, that explains it.  If not, it provides more information to
proceed.
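
A quick way to check just the two settings named here, without running the full wiki query, is sketched below. It assumes psql can connect using the usual environment variables; the PSQL variable is a hypothetical hook for substituting a different client binary or wrapper.

```shell
#!/bin/sh
# Report whether the crash-safety settings are enabled on a running server.
# PSQL is a hypothetical override hook; connection options come from the
# usual PGHOST/PGUSER/PGDATABASE environment variables.
PSQL=${PSQL:-psql}

# Print the current value of one setting, e.g. "on" or "off".
show_setting() {
    $PSQL -At -c "SHOW $1"
}

# Return 0 only if both fsync and full_page_writes are on,
# 1 if either is off, 2 if the server could not be queried.
durability_ok() {
    status=0
    for name in fsync full_page_writes; do
        value=$(show_setting "$name") || return 2
        echo "$name = $value"
        [ "$value" = on ] || status=1
    done
    return $status
}
```

A call such as `durability_ok || echo "risky settings in effect"` gives a yes/no answer. As noted above, an "off" value only explains the corruption if the server also crashed or lost power while it was in effect.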

You might want to re-start the thread on pgsql-general, though.  Not
everybody who might be able to help with a problem like this follows
the performance list.  Or, if you didn't set any of the dangerous
configuration options, this sounds like a bug -- so pgsql-bugs might
be even better.

-Kevin

Re: "error with invalid page header" while vacuuming pgbench data

From: Greg Smith
Date:
On 05/23/2011 06:16 PM, John Rouillard wrote:
> OS: centos 5.5
> Filesystem: data - ext4 (note 4 not 3); 6.6T formatted
>              wal  - ext4; 1.5T formatted
> Raid: data - level 10, 8 disk wd2003; controller LSI MegaRAID SAS 9260-4i
>        wal  - level 1,  2 disk wd2003; controller LSI MegaRAID SAS 9260-4i
>
> Could it be an ext4 issue? It seems that ext4 may still be at the
> bleeding edge for postgres use.
>

I would not trust ext4 on CentOS 5.5 at all.  ext4 support in 5.5 is
labeled by RedHat as being in "Technology Preview" state.  I believe
that if you had a real RedHat system instead of CentOS kernel, you'd
discover it's hard to even get it installed--you need to basically say
"yes, I know it's not for production, I want it anyway" to get preview
packages.  It's not really intended for production use.

What I'm hearing from people is that they run into the occasional ext4
bug with PostgreSQL, but the serious ones aren't happening very often
now, on systems running RHEL6 or Debian Squeeze.  Those kernels are way,
way ahead of the ext4 backport in RHEL5 based systems, and they're just
barely stable.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: "error with invalid page header" while vacuuming pgbench data

From: John Rouillard
Date:
On Wed, May 25, 2011 at 03:19:59PM -0500, Kevin Grittner wrote:
> John Rouillard <rouilj@renesys.com> wrote:
> > On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
> >> John Rouillard <rouilj@renesys.com> wrote:
> >>
> >> >  I seem to be able to provoke this error:
> >> >
> >> >    vacuum...ERROR:  invalid page header in
> >> >             block 2128910 of relation base/16385/21476
> >>
> >> What version of PostgreSQL?
> >
> > Hmm, I thought I replied to this, but I haven't seen it come back
> > to me on list.  It's postgres version: 8.4.5.
> >
> > rpm -q shows
> >
> >    postgresql84-server-8.4.5-1.el5_5.1
>
> I was hoping someone else would jump in, but I see that your
> previous post didn't copy the list, which solves *that* mystery.
>
> I'm curious whether you might have enabled one of the "it's OK to
> trash my database integrity to boost performance" options.  (People
> with enough replication often feel that this *is* OK.)  Please run
> the query on this page and post the results:
>
> http://wiki.postgresql.org/wiki/Server_Configuration
>
> Basically, if fsync or full_page_writes is turned off and there was
> a crash, that explains it.  If not, it provides more information to
> proceed.

Nope. Neither is turned off. I can't run the query at the moment since
the system is in the middle of a memtest86+ check of 96GB of
memory. The relevant parts from the config file from the Configuration
Management system are:

  #fsync = on                             # turns forced synchronization
                                          # on or off
  #synchronous_commit = on                # immediate fsync at commit
  #wal_sync_method = fsync                # the default is the first option

  #full_page_writes = on                  # recover from partial page writes
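
The leading # on each of these lines means the setting is commented out, so the server falls back to its built-in default, which is on for fsync, synchronous_commit, and full_page_writes. A small sketch for confirming that from a config file follows. The helper name conf_value is made up for illustration, and the parsing is deliberately simple: it only handles `name = value` lines and ignores include directives and quoted values containing #.

```shell
#!/bin/sh
# Print the value a postgresql.conf-style file sets for a parameter,
# or the supplied default when the line is absent or commented out.
# Usage: conf_value <file> <name> <default>
conf_value() {
    awk -v name="$2" -v def="$3" '
        /^[ \t]*#/ { next }                 # skip commented-out lines
        {
            line = $0
            sub(/#.*/, "", line)            # drop trailing comments
            n = split(line, kv, "=")
            key = kv[1]; gsub(/[ \t]/, "", key)
            if (n >= 2 && key == name) {
                val = kv[2]; gsub(/[ \t]/, "", val)
                found = val                 # last occurrence wins
            }
        }
        END { print (found != "" ? found : def) }
    ' "$1"
}
```

For example, `conf_value /var/lib/pgsql/data/postgresql.conf fsync on` would print `on` for the excerpt above (the path is a guess at a typical RPM layout). A live `SHOW fsync` is still the authoritative check, since settings can also come from the command line or other sources.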

This is the same setup I use on all my data warehouse systems (with
minor pgtune type changes based on amount of memory). Running the
query on another system (using ext3, centos 5.5) shows:

 version                        | PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
 archive_command                | if test ! -e /var/lib/pgsql/data/ARCHIVE_ENABLED; then exit 0; fi; test ! -f /var/bak/pgsql/%f && cp %p /var/bak/pgsql/%f
 archive_mode                   | on
 checkpoint_completion_target   | 0.9
 checkpoint_segments            | 64
 constraint_exclusion           | on
 custom_variable_classes        | pg_stat_statements
 default_statistics_target      | 100
 effective_cache_size           | 8GB
 lc_collate                     | en_US.UTF-8
 lc_ctype                       | en_US.UTF-8
 listen_addresses               | *
 log_checkpoints                | on
 log_connections                | on
 log_destination                | stderr,syslog
 log_directory                  | pg_log
 log_filename                   | postgresql-%a.log
 log_line_prefix                | %t %u@%d(%p)i:
 log_lock_waits                 | on
 log_min_duration_statement     | 2s
 log_min_error_statement        | warning
 log_min_messages               | notice
 log_rotation_age               | 1d
 log_rotation_size              | 0
 log_temp_files                 | 0
 log_truncate_on_rotation       | on
 logging_collector              | on
 maintenance_work_mem           | 1GB
 max_connections                | 300
 max_locks_per_transaction      | 128
 max_stack_depth                | 2MB
 port                           | 5432
 server_encoding                | UTF8
 shared_buffers                 | 4GB
 shared_preload_libraries       | pg_stat_statements
 superuser_reserved_connections | 3
 tcp_keepalives_count           | 0
 tcp_keepalives_idle            | 0
 tcp_keepalives_interval        | 0
 TimeZone                       | UTC
 wal_buffers                    | 32MB
 work_mem                       | 16MB

> You might want to re-start the thread on pgsql-general, though.  Not
> everybody who might be able to help with a problem like this follows
> the performance list.  Or, if you didn't set any of the dangerous
> configuration options, this sounds like a bug -- so pgsql-bugs might
> be even better.

Well I am also managing to panic the kernel on some runs as well.  So
my guess is this is not only a postgres bug (if it's a postgres issue
at all).

As Greg mentioned in another followup, ext4 under centos 5.x may be an
issue. I'll drop back to ext3 and see if I can replicate the
corruption or crashes once I rule out some potential hardware issues.

If I can replicate with ext3, then I'll follow up on -general or
-bugs.

Ext4 pgbench results complete faster, but if it's not reliable ....

Thanks for your help.

--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

Re: "error with invalid page header" while vacuuming pgbench data

From: Scott Marlowe
Date:
On Wed, May 25, 2011 at 4:07 PM, John Rouillard <rouilj@renesys.com> wrote:
> Well I am also managing to panic the kernel on some runs as well.  So
> my guess is this is not only a postgres bug (if it's a postgres issue
> at all).
>
> As gregg mentioned in another followup ext4 under centos 5.x may be an
> issue. I'll drop back to ext3 and see if I can replicate the
> corruption or crashes one I rule out some potential hardware issues.

Also do the standard memtest86+ run to ensure your memory isn't bad.
Also do a simple dd if=/dev/sda of=/dev/null to make sure the drive
has no errors.  It might be the drives.  Look in your logs again to
make sure.
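
The read test suggested here can be wrapped in a small helper like the sketch below. The function name read_scan is made up for illustration, and /dev/sda is simply the device from the example above; it may not match the actual layout. On a hardware RAID volume this reads the logical device, so a clean pass only shows that the controller returned every block without an I/O error.

```shell
#!/bin/sh
# Read a device (or file) end to end and discard the data.
# dd exits non-zero if any read fails, which is what we test for.
# Usage: read_scan /dev/sda   (needs root for raw devices)
read_scan() {
    target=$1
    if dd if="$target" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read OK: $target"
    else
        echo "read FAILED: $target (check dmesg and the controller log)" >&2
        return 1
    fi
}
```

A full pass over a multi-terabyte array takes hours, so it is best started when the benchmark is not running.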

Re: "error with invalid page header" while vacuuming pgbench data

From: Craig Ringer
Date:
On 05/26/2011 06:18 AM, Scott Marlowe wrote:
> On Wed, May 25, 2011 at 4:07 PM, John Rouillard<rouilj@renesys.com>  wrote:
>> Well I am also managing to panic the kernel on some runs as well.  So
>> my guess is this is not only a postgres bug (if it's a postgres issue
>> at all).
>>
>> As gregg mentioned in another followup ext4 under centos 5.x may be an
>> issue. I'll drop back to ext3 and see if I can replicate the
>> corruption or crashes one I rule out some potential hardware issues.
>
> Also do the standard memtest86+ run to ensure your memory isn't bad.
> Also do a simple dd if=/dev/sda of=/dev/null to make sure the drive
> has no errors.

If possible, also/instead use smartctl from smartmontools to ask the
drive to do an internal self-test and surface scan. This doesn't help
you with RAID volumes, but is often much more informative with plain
physical drives.
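
A sketch of the self-test sequence follows. It only prints the smartctl commands by default (DRY_RUN=1). The optional second argument is passed to smartctl's -d option; smartmontools documents a `megaraid,N` device type for reaching individual disks behind LSI MegaRAID controllers, which may apply to the 9260-4i in this thread, though nobody here has verified that on this hardware.

```shell
#!/bin/sh
# Start an extended (long) SMART self-test and then show the self-test log.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

smart() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "smartctl $*"
    else
        smartctl "$@"
    fi
}

# Usage: smart_selftest <device> [device-type]
# device-type is passed to smartctl -d, e.g. "megaraid,0" for the first
# physical disk behind an LSI MegaRAID controller.
smart_selftest() {
    dev=$1
    dtype=$2
    if [ -n "$dtype" ]; then
        smart -d "$dtype" -t long "$dev" || return 1
        smart -d "$dtype" -l selftest "$dev"
    else
        smart -t long "$dev" || return 1
        smart -l selftest "$dev"
    fi
}
```

The long test runs inside the drive and can take several hours; the self-test log only shows the final result once it finishes, so the `-l selftest` command usually needs to be repeated later.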

--
Craig Ringer