Обсуждение: help troubleshooting invalid page header error
- Have you ever set
fsync=off
in the postgresql config file?
- Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
- Have you run a file system check? (
chkdsk
/fsck
)
- Are there any error messages in the system logs? (unix/linux:
dmesg
,/var/log/syslog
;
Hi all,Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.---A description of what you are trying to achieve and what results you expect.:Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"PostgreSQL version number you are running:PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bitHow you installed PostgreSQL:from standard package installerChanges made to the settings in the postgresql.conf file:name | current_setting | source------------------------------+-----------------------------+----------------------checkpoint_completion_target | 0.9 | configuration filecheckpoint_segments | 32 | configuration filecheckpoint_timeout | 15min | configuration fileDateStyle | ISO, MDY | configuration filedefault_text_search_config | pg_catalog.english | configuration fileeffective_cache_size | 1GB | configuration filelc_messages | en_US.UTF-8 | configuration filelc_monetary | en_US.UTF-8 | configuration filelc_numeric | en_US.UTF-8 | configuration filelc_time | en_US.UTF-8 | configuration filelog_checkpoints | on | configuration filelog_connections | off | configuration filelog_destination | csvlog | configuration filelog_directory | /opt/data/pgsql/data/pg_log | configuration filelog_disconnections | off | configuration filelog_duration | on | configuration filelog_filename | postgres-%Y-%m-%d_%H%M%S | configuration filelog_lock_waits | on | configuration filelog_min_duration_statement | 250ms | configuration filelog_rotation_age | 1d | configuration filelog_rotation_size | 1GB | configuration filelog_temp_files | 0 | configuration filelog_timezone | Asia/Kolkata | command linelog_truncate_on_rotation | on | configuration filelogging_collector | on | configuration filemaintenance_work_mem | 768MB | configuration filemax_connections | 500 | configuration filemax_stack_depth | 2MB | environment variableport | 5432 | command lineshared_buffers | 4GB | configuration filessl | on | configuration fileTimeZone | Asia/Kolkata | command linetimezone_abbreviations | Default | command linewal_buffers | 16MB | configuration filework_mem | 48MB | configuration fileIt's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.Operating system and version:RedHatEnterpriseServer, version 6.6What program you're using to connect to PostgreSQL:Python (django)Is there anything relevant or unusual in the PostgreSQL server logs?:I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.What you were doing when the error happened / how to cause the error:I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.The EXACT TEXT of the error message you're getting, if there is one:DatabaseError: invalid page header in block 1 of relation base/16384/76623(block and relation numbers change)Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
- Have you ever set
fsync=off
in the postgresql config file?No
- Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
Not recently, though the system did reboot normally as described above.
- Have you run a file system check? (
chkdsk
/fsck
)No.
- Are there any error messages in the system logs? (unix/linux:
dmesg
,/var/log/syslog
;I haven't seen anything obvious but I wasn't sure what to look for.thanks,Cory
On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>
I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-
uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
Hi all,Thanks for the responses. Chiru, I'm looking into your suggestion.Sameer, here is the kernel version info:Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT 2014 x86_64 x86_64 x86_64 GNU/LinuxDoes that seem like it could be a problematic version?More generally - I'm still wondering whether I should chalk this failure up to a transient/random issue, or whether I should be more worried about the hardware on the machine. According to our diagnostic tools, disk and memory are fine, but it's still not clear to me how it got into this state. Any general bits of information regarding the potential causes of these types of issues would be much appreciated.thanks,CoryOn Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
Hi Cory,After recovering table turn off zero_damaged_pages parameter.On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:Hi all,Thanks for the responses. Chiru, I'm looking into your suggestion.Sameer, here is the kernel version info:Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT 2014 x86_64 x86_64 x86_64 GNU/LinuxDoes that seem like it could be a problematic version?More generally - I'm still wondering whether I should chalk this failure up to a transient/random issue, or whether I should be more worried about the hardware on the machine. According to our diagnostic tools, disk and memory are fine, but it's still not clear to me how it got into this state. Any general bits of information regarding the potential causes of these types of issues would be much appreciated.thanks,CoryOn Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
Hi Chiru,I am trying to pg_dump the database to have a snapshot of the current state. I've turned on 'zero_damaged_pages' but pg_dump is still failing with an "invalid page header" error - this time from what looks like a sequence object that is auto-setting IDs on a table. Any advice on how to remove this error?Here is the full query that's failing:SELECT sequence_name, start_value, last_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled, is_called from unfinishedsubmissionstub_id_seqOn Fri, Dec 26, 2014 at 2:35 PM, chiru r <chirupg@gmail.com> wrote:Hi Cory,After recovering table turn off zero_damaged_pages parameter.On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:Hi all,Thanks for the responses. Chiru, I'm looking into your suggestion.Sameer, here is the kernel version info:Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT 2014 x86_64 x86_64 x86_64 GNU/LinuxDoes that seem like it could be a problematic version?More generally - I'm still wondering whether I should chalk this failure up to a transient/random issue, or whether I should be more worried about the hardware on the machine. According to our diagnostic tools, disk and memory are fine, but it's still not clear to me how it got into this state. Any general bits of information regarding the potential causes of these types of issues would be much appreciated.thanks,CoryOn Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
Hi Chiru,I am trying to pg_dump the database to have a snapshot of the current state. I've turned on 'zero_damaged_pages' but pg_dump is still failing with an "invalid page header" error - this time from what looks like a sequence object that is auto-setting IDs on a table. Any advice on how to remove this error?Here is the full query that's failing:SELECT sequence_name, start_value, last_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled, is_called from unfinishedsubmissionstub_id_seqOn Fri, Dec 26, 2014 at 2:35 PM, chiru r <chirupg@gmail.com> wrote:Hi Cory,After recovering table turn off zero_damaged_pages parameter.On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:Hi all,Thanks for the responses. Chiru, I'm looking into your suggestion.Sameer, here is the kernel version info:Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT 2014 x86_64 x86_64 x86_64 GNU/LinuxDoes that seem like it could be a problematic version?More generally - I'm still wondering whether I should chalk this failure up to a transient/random issue, or whether I should be more worried about the hardware on the machine. According to our diagnostic tools, disk and memory are fine, but it's still not clear to me how it got into this state. Any general bits of information regarding the potential causes of these types of issues would be much appreciated.thanks,CoryOn Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:
>
> Hi all,
>
> Our postgres instance on one of our production machines has recently been returning errors of the form "DatabaseError: invalid page header in block 1 of relation base/16384/76623" from normal queries. I've been reading that these are often linked to hardware errors, but I would like to better understand what else it could be or how to determine that for sure. I've filled out the standard issue reporting template below. Any feedback or troubleshooting instructions would be much appreciated.
>
> ---
> A description of what you are trying to achieve and what results you expect.:
>
> Intermittent queries are failing with the error "DatabaseError: invalid page header in block 1 of relation base/16384/76623"
>
> PostgreSQL version number you are running:
>
> PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
>
> How you installed PostgreSQL:
>
> from standard package installer
>
> Changes made to the settings in the postgresql.conf file:
>
>
> name | current_setting | source
> ------------------------------+-----------------------------+----------------------
> checkpoint_completion_target | 0.9 | configuration file
> checkpoint_segments | 32 | configuration file
> checkpoint_timeout | 15min | configuration file
> DateStyle | ISO, MDY | configuration file
> default_text_search_config | pg_catalog.english | configuration file
> effective_cache_size | 1GB | configuration file
> lc_messages | en_US.UTF-8 | configuration file
> lc_monetary | en_US.UTF-8 | configuration file
> lc_numeric | en_US.UTF-8 | configuration file
> lc_time | en_US.UTF-8 | configuration file
> log_checkpoints | on | configuration file
> log_connections | off | configuration file
> log_destination | csvlog | configuration file
> log_directory | /opt/data/pgsql/data/pg_log | configuration file
> log_disconnections | off | configuration file
> log_duration | on | configuration file
> log_filename | postgres-%Y-%m-%d_%H%M%S | configuration file
> log_lock_waits | on | configuration file
> log_min_duration_statement | 250ms | configuration file
> log_rotation_age | 1d | configuration file
> log_rotation_size | 1GB | configuration file
> log_temp_files | 0 | configuration file
> log_timezone | Asia/Kolkata | command line
> log_truncate_on_rotation | on | configuration file
> logging_collector | on | configuration file
> maintenance_work_mem | 768MB | configuration file
> max_connections | 500 | configuration file
> max_stack_depth | 2MB | environment variable
> port | 5432 | command line
> shared_buffers | 4GB | configuration file
> ssl | on | configuration file
> TimeZone | Asia/Kolkata | command line
> timezone_abbreviations | Default | command line
> wal_buffers | 16MB | configuration file
> work_mem | 48MB | configuration file
>
> It's also probably worth noting that postgres is installed on an encrypted volume which is mounted using ecryptfs.
>
> Operating system and version:
>
> RedHatEnterpriseServer, version 6.6
>
> What program you're using to connect to PostgreSQL:
>
> Python (django)
>
> Is there anything relevant or unusual in the PostgreSQL server logs?:
>
> I see lots of instances of this error (and similar). I'm not sure what else I should be looking for.
>
> What you were doing when the error happened / how to cause the error:
>
> I haven't explicitly tried to reproduce it, but it seems to consistently happen with certain queries. However, the system was rebooted shortly before the errors started occuring. The system was rebooted because another database (elasticsearch) was having problems on the same machine and the reboot was to attempt to resolve things.
>
> The EXACT TEXT of the error message you're getting, if there is one:
>
> DatabaseError: invalid page header in block 1 of relation base/16384/76623
>
> (block and relation numbers change)
>
> Unfortunately, I'm not completely familiar with the CPU and disk/RAID configurations used on the server. However it is storing to a (software) encrypted volume as mentioned above.
>
> Have you ever set fsync=off in the postgresql config file?
> No
> Have you had any unexpected power loss lately? Replaced a failed RAID disk? Had an operating system crash?
> Not recently, though the system did reboot normally as described above.
> Have you run a file system check? (chkdsk / fsck)
> No.
> Are there any error messages in the system logs? (unix/linux: dmesg, /var/log/syslog ;
> I haven't seen anything obvious but I wasn't sure what to look for.
>I guess you missed to provide the details and kernel version (rhel version and kernel level).
This will give you kernel patch level-uname -a
I had once faced this issue and I was on a buggy patch of Linux kernel. I just had to update to latest patch. That worked for me.
Cory Zue <czue@dimagi.com> wrote: > I was able to get the database back to a normal functional state > using the zero_damaged_pages flag. However, after getting > everything working and starting to use the database again, I am > again getting "invalid page header" errors on a certain table. > > Does this imply there is a hardware issue on my machine? Is there > anything else that could be causing this to come back? In my personal experience bad hardware is the most common cause, followed by buggy device drivers (where an OS software upgrade prevented further corruption), followed by using incorrect procedures for backup, restore, replication setup, or node promotion. For example, not excluding files under pg_xlog from a base backup or deleting (or moving) the backup_label file can cause corruption. For a more complete discussion, see this blog page: http://rhaas.blogspot.com/2012/03/why-is-my-database-corrupted.html -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company