Discussion: "error with invalid page header" while vacuuming pgbench data
Hi all:

Not sure if this is a performance question or a generic admin question. I have the following script running on a host different from the database to use pgbench to test the database:

  pgbench -i (initial mode)
  pgsql vacuum analyze; (and some other code to dump table sizes)
  pgbench (multiple connections, jobs etc ....)

with a loop for setting different scales ....

I seem to be able to provoke this error:

  vacuum...ERROR: invalid page header in block 2128910 of relation base/16385/21476

on a pgbench database created with a scale factor of 1000 relatively reliably (2 for 2). I am not seeing any disk errors from the RAID controller or the operating system.

Running pg_dumpall to check for errors reports:

  pg_dump: Error message from server: ERROR: invalid page header in block 401585 of relation base/16385/21476

which is different from the originally reported block.

Does anybody have any suggestions?

Configuration details.

OS: centos 5.5
Filesystem: data - ext4 (note 4 not 3); 6.6T formatted
            wal - ext4; 1.5T formatted
Raid: data - level 10, 8 disk wd2003; controller LSI MegaRAID SAS 9260-4i
      wal - level 1, 2 disk wd2003; controller LSI MegaRAID SAS 9260-4i

Could it be an ext4 issue? It seems that ext4 may still be at the bleeding edge for postgres use.

Thanks for any thoughts, even if it's "go to the admin list".

--
-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111
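The block numbers in the two errors can be turned into a segment file and byte offset, which helps when inspecting the damaged pages directly on disk. The following is a minimal sketch, assuming the stock build defaults of 8 kB pages and 1 GB segment files (the thread does not confirm either); `locate_block` is a hypothetical helper name, not a PostgreSQL function.

```python
from typing import Tuple

# Assumed build-time defaults (not confirmed in the thread):
BLOCK_SIZE = 8192                 # bytes per page (BLCKSZ)
SEGMENT_BYTES = 1024 ** 3         # bytes per segment file (1 GB)
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE  # 131072


def locate_block(relpath: str, block: int) -> Tuple[str, int]:
    """Return (segment file path, byte offset inside that file) for a block."""
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    # The first segment has no suffix; later ones are named <relpath>.1, .2, ...
    path = relpath if segment == 0 else "{}.{}".format(relpath, segment)
    return path, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    # Block numbers taken from the two error messages in the thread.
    for blk in (2128910, 401585):
        print(blk, locate_block("base/16385/21476", blk))
```

Under those assumptions, block 2128910 falls in segment file base/16385/21476.16 and block 401585 in base/16385/21476.3, so the two errors point at different 1 GB files of the same relation.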
John Rouillard <rouilj@renesys.com> wrote:

> I seem to be able to provoke this error:
>
> vacuum...ERROR: invalid page header in
> block 2128910 of relation base/16385/21476

What version of PostgreSQL?

-Kevin
On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
> John Rouillard <rouilj@renesys.com> wrote:
>
> > I seem to be able to provoke this error:
> >
> > vacuum...ERROR: invalid page header in
> > block 2128910 of relation base/16385/21476
>
> What version of PostgreSQL?

Hmm, I thought I replied to this, but I haven't seen it come back to me on list. It's postgres version: 8.4.5.

rpm -q shows

  postgresql84-server-8.4.5-1.el5_5.1

--
-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111
John Rouillard <rouilj@renesys.com> wrote:
> On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
>> John Rouillard <rouilj@renesys.com> wrote:
>>
>> > I seem to be able to provoke this error:
>> >
>> > vacuum...ERROR: invalid page header in
>> > block 2128910 of relation base/16385/21476
>>
>> What version of PostgreSQL?
>
> Hmm, I thought I replied to this, but I haven't seen it come back
> to me on list. It's postgres version: 8.4.5.
>
> rpm -q shows
>
> postgresql84-server-8.4.5-1.el5_5.1

I was hoping someone else would jump in, but I see that your previous post didn't copy the list, which solves *that* mystery.

I'm curious whether you might have enabled one of the "it's OK to trash my database integrity to boost performance" options. (People with enough replication often feel that this *is* OK.) Please run the query on this page and post the results:

http://wiki.postgresql.org/wiki/Server_Configuration

Basically, if fsync or full_page_writes is turned off and there was a crash, that explains it. If not, it provides more information to proceed.

You might want to re-start the thread on pgsql-general, though. Not everybody who might be able to help with a problem like this follows the performance list. Or, if you didn't set any of the dangerous configuration options, this sounds like a bug -- so pgsql-bugs might be even better.

-Kevin
On 05/23/2011 06:16 PM, John Rouillard wrote:
> OS: centos 5.5
> Filesystem: data - ext4 (note 4 not 3); 6.6T formatted
>             wal - ext4; 1.5T formatted
> Raid: data - level 10, 8 disk wd2003; controller LSI MegaRAID SAS 9260-4i
>       wal - level 1, 2 disk wd2003; controller LSI MegaRAID SAS 9260-4i
>
> Could it be an ext4 issue? It seems that ext4 may still be at the
> bleeding edge for postgres use.
>

I would not trust ext4 on CentOS 5.5 at all. ext4 support in 5.5 is labeled by RedHat as being in "Technology Preview" state. I believe that if you had a real RedHat system instead of a CentOS kernel, you'd discover it's hard to even get it installed: you basically need to say "yes, I know it's not for production, I want it anyway" to get preview packages. It's not really intended for production use.

What I'm hearing from people is that they run into the occasional ext4 bug with PostgreSQL, but the serious ones aren't happening very often now, on systems running RHEL6 or Debian Squeeze. Those kernels are way, way ahead of the ext4 backport in RHEL5 based systems, and they're just barely stable.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, May 25, 2011 at 03:19:59PM -0500, Kevin Grittner wrote:
> John Rouillard <rouilj@renesys.com> wrote:
> > On Mon, May 23, 2011 at 05:21:04PM -0500, Kevin Grittner wrote:
> >> John Rouillard <rouilj@renesys.com> wrote:
> >>
> >> > I seem to be able to provoke this error:
> >> >
> >> > vacuum...ERROR: invalid page header in
> >> > block 2128910 of relation base/16385/21476
> >>
> >> What version of PostgreSQL?
> >
> > Hmm, I thought I replied to this, but I haven't seen it come back
> > to me on list. It's postgres version: 8.4.5.
> >
> > rpm -q shows
> >
> > postgresql84-server-8.4.5-1.el5_5.1
>
> I was hoping someone else would jump in, but I see that your
> previous post didn't copy the list, which solves *that* mystery.
>
> I'm curious whether you might have enabled one of the "it's OK to
> trash my database integrity to boost performance" options. (People
> with enough replication often feel that this *is* OK.) Please run
> the query on this page and post the results:
>
> http://wiki.postgresql.org/wiki/Server_Configuration
>
> Basically, if fsync or full_page_writes is turned off and there was
> a crash, that explains it. If not, it provides more information to
> proceed.

Nope. Neither is turned off. I can't run the query at the moment since the system is in the middle of a memtest86+ check of 96GB of memory. The relevant parts from the config file from the Configuration Management system are:

  #fsync = on                  # turns forced synchronization on or off
  #synchronous_commit = on     # immediate fsync at commit
  #wal_sync_method = fsync     # the default is the first option
  #full_page_writes = on       # recover from partial page writes

This is the same setup I use on all my data warehouse systems (with minor pgtune type changes based on amount of memory).

Running the query on another system (using ext3, centos 5.5) shows:

  version                        | PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
  archive_command                | if test ! -e /var/lib/pgsql/data/ARCHIVE_ENABLED; then exit 0; fi; test ! -f /var/bak/pgsql/%f && cp %p /var/bak/pgsql/%f
  archive_mode                   | on
  checkpoint_completion_target   | 0.9
  checkpoint_segments            | 64
  constraint_exclusion           | on
  custom_variable_classes        | pg_stat_statements
  default_statistics_target      | 100
  effective_cache_size           | 8GB
  lc_collate                     | en_US.UTF-8
  lc_ctype                       | en_US.UTF-8
  listen_addresses               | *
  log_checkpoints                | on
  log_connections                | on
  log_destination                | stderr,syslog
  log_directory                  | pg_log
  log_filename                   | postgresql-%a.log
  log_line_prefix                | %t %u@%d(%p)i:
  log_lock_waits                 | on
  log_min_duration_statement     | 2s
  log_min_error_statement        | warning
  log_min_messages               | notice
  log_rotation_age               | 1d
  log_rotation_size              | 0
  log_temp_files                 | 0
  log_truncate_on_rotation       | on
  logging_collector              | on
  maintenance_work_mem           | 1GB
  max_connections                | 300
  max_locks_per_transaction      | 128
  max_stack_depth                | 2MB
  port                           | 5432
  server_encoding                | UTF8
  shared_buffers                 | 4GB
  shared_preload_libraries       | pg_stat_statements
  superuser_reserved_connections | 3
  tcp_keepalives_count           | 0
  tcp_keepalives_idle            | 0
  tcp_keepalives_interval        | 0
  TimeZone                       | UTC
  wal_buffers                    | 32MB
  work_mem                       | 16MB

> You might want to re-start the thread on pgsql-general, though. Not
> everybody who might be able to help with a problem like this follows
> the performance list. Or, if you didn't set any of the dangerous
> configuration options, this sounds like a bug -- so pgsql-bugs might
> be even better.

Well, I am also managing to panic the kernel on some runs. So my guess is this is not only a postgres bug (if it's a postgres issue at all).

As Greg mentioned in another followup, ext4 under centos 5.x may be an issue. I'll drop back to ext3 and see if I can replicate the corruption or crashes once I rule out some potential hardware issues. If I can replicate with ext3, then I'll follow up on -general or -bugs.

Ext4 pgbench results complete faster, but if it's not reliable ....

Thanks for your help.

--
-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111
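The config excerpt in this message shows the durability settings commented out, which in postgresql.conf means the built-in defaults apply (on for both fsync and full_page_writes). As a rough illustration of that reading, here is a minimal sketch that scans config text and reports which of the two settings Kevin named are effectively off. The helper names `effective_settings` and `risky_settings` are hypothetical, the sketch handles only the `name = value` form, and the running server (for example `SHOW fsync;`) remains the authoritative source.

```python
from typing import Dict, List

# The two settings named in the thread, with their built-in defaults.
DEFAULTS = {"fsync": "on", "full_page_writes": "on"}
OFF_VALUES = ("off", "false", "no", "0")


def effective_settings(conf_text: str) -> Dict[str, str]:
    """Return the value each watched setting would take from this config text."""
    settings = dict(DEFAULTS)
    for raw in conf_text.splitlines():
        # Everything after '#' is a comment, so a commented-out
        # setting leaves the default in place.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        if name in settings:
            # A later assignment overrides an earlier one.
            settings[name] = value.strip("'\"").lower()
    return settings


def risky_settings(conf_text: str) -> List[str]:
    """List watched settings that are explicitly turned off."""
    return [name for name, value in effective_settings(conf_text).items()
            if value in OFF_VALUES]


if __name__ == "__main__":
    excerpt = (
        "#fsync = on # turns forced synchronization on or off\n"
        "#full_page_writes = on # recover from partial page writes\n"
    )
    print(risky_settings(excerpt))
```

For the excerpt quoted in the thread, nothing is flagged, which matches John's statement that neither option is turned off.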
On Wed, May 25, 2011 at 4:07 PM, John Rouillard <rouilj@renesys.com> wrote:
> Well I am also managing to panic the kernel on some runs as well. So
> my guess is this is not only a postgres bug (if it's a postgres issue
> at all).
>
> As Greg mentioned in another followup ext4 under centos 5.x may be an
> issue. I'll drop back to ext3 and see if I can replicate the
> corruption or crashes once I rule out some potential hardware issues.

Also do the standard memtest86+ run to ensure your memory isn't bad. Also do a simple dd if=/dev/sda of=/dev/null to make sure the drive has no errors. It might be the drives. Look in your logs again to make sure.
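The dd check suggested here amounts to reading every byte of the device once and watching for an I/O error. A minimal sketch of the same idea follows; `read_through` is a hypothetical helper name, and the device path passed to it is whatever applies on the system in question. On a hardware RAID volume this reads the logical volume the controller presents, not each member disk.

```python
def read_through(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read a file or block device from start to end, discarding the data.

    Returns the number of bytes read. A failing read surfaces as OSError,
    much as dd would stop with an input/output error.
    """
    total = 0
    # buffering=0 gives unbuffered reads straight from the OS.
    with open(path, "rb", buffering=0) as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty result means end of file or device
                break
            total += len(chunk)
    return total
```

Calling `read_through("/dev/sda")` as root would correspond to the command quoted above. A clean pass only shows that every sector could be read at that moment; it does not verify that the contents are correct.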
On 05/26/2011 06:18 AM, Scott Marlowe wrote:
> On Wed, May 25, 2011 at 4:07 PM, John Rouillard <rouilj@renesys.com> wrote:
>> Well I am also managing to panic the kernel on some runs as well. So
>> my guess is this is not only a postgres bug (if it's a postgres issue
>> at all).
>>
>> As Greg mentioned in another followup ext4 under centos 5.x may be an
>> issue. I'll drop back to ext3 and see if I can replicate the
>> corruption or crashes once I rule out some potential hardware issues.
>
> Also do the standard memtest86+ run to ensure your memory isn't bad.
> Also do a simple dd if=/dev/sda of=/dev/null to make sure the drive
> has no errors.

If possible, also/instead use smartctl from smartmontools to ask the drive to do an internal self-test and surface scan. This doesn't help you with RAID volumes, but is often much more informative with plain physical drives.

--
Craig Ringer