Discussion: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Abdul Qoyyuum
Date:
Hi all,

Knowing that it's a data corruption issue, the only way to fix this is to vacuum and reindex the database. What was suggested was the following:

SET zero_damaged_pages = 0; # This is so that we can have the application to continue to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.

We're on Postgresql 12. This has worked before it happened (almost exactly a year ago) and I think this needs a more permanent solution. I've looked at routine vacuuming and checked the autovacuum is set to on and the following configurations:

core=> select name, setting from pg_settings where name like 'autovacuum%';
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 2
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
(12 rows)

Can anyone advise if there's anything else we can do? We have no clue what causes the invalid page block and we are running a High Availability cluster set up but we are hoping that there may be a way to mitigate it.

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Stephen Frost
Date:
Greetings,

* Abdul Qoyyuum (aqoyyuum@cardaccess.com.bn) wrote:
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that we can have the application
> to continue to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.

This is only going to help if the issue is in an index, which isn't
clear from what's been shared.

> We're on Postgresql 12. This has worked before it happened (almost exactly
> a year ago) and I think this needs a more permanent solution. I've looked
> at routine vacuuming and checked the autovacuum is set to on and the
> following configurations:

This isn't something that should ever happen ...

This also doesn't have anything to do with autovacuum, changing settings
there won't make any difference.

> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block and we are running a High Availability
> cluster set up but we are hoping that there may be a way to mitigate it.

Was there some kind of hardware fault?  Did you do a failover?  Restore
from a backup?  Do you have checksums enabled?  How many times has this
happened before, and how many pages were impacted?  What is the design
of your HA solution, are you using PG replication or something else?
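
On the checksums question, a quick way to check from any session is the read-only data_checksums parameter; the "off" below is only an assumed example result:

core=> SHOW data_checksums;
 data_checksums
----------------
 off
(1 row)

If it reports off, checksums were never enabled at initdb time; on PostgreSQL 12 the pg_checksums utility can verify or enable them, though only while the cluster is cleanly shut down.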

Thanks,

Stephen


Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Chris Travers
Date:


On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
Hi all,

Knowing that it's a data corruption issue, the only way to fix this is to vacuum and reindex the database. What was suggested was the following:

SET zero_damaged_pages = 0; # This is so that we can have the application to continue to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.

So first, to clear up some confusion on my part here:

This procedure doesn't make a lot of sense to me.  But did it clear up the issue?   

In any of these cases, it is extremely important to diagnose the system properly.  If you have a fault in your storage device or RAID controller, for example, you are asking for more corruption and data loss later.

At first I thought maybe you had mistyped something, but then I realized there were a few issues with the process, so it actually didn't make sense.

First, zero_damaged_pages defaults to 0, and I can think of no reason to set it explicitly.
Secondly, a VACUUM FULL rewrites the table and rebuilds its indexes anyway, so there is no reason to run a REINDEX afterwards. Your whole procedure amounts to a VACUUM FULL, and the index rebuild is the only part that could affect this; if the procedure did work, that is the part that helped.

On to the question of what to do next....

We're on Postgresql 12. This has worked before it happened (almost exactly a year ago) and I think this needs a more permanent solution. I've looked at routine vacuuming and checked the autovacuum is set to on and the following configurations:

core=> select name, setting from pg_settings where name like 'autovacuum%';
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 2
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
(12 rows)

Can anyone advise if there's anything else we can do? We have no clue what causes the invalid page block and we are running a High Availability cluster set up but we are hoping that there may be a way to mitigate it.


You need to figure out why the corruption is happening.  This is most likely, in my experience, not a PostgreSQL bug, but usually something that happens on the hardware layer or an environmental factor.  It could be failing storage or CPU.  Or it could be something like bad electrical input or insufficient cooling (I have seen index and even table corruption issues from both of these).

If this is a server you run, the first things I would check are:
1.  Is there a good-quality UPS that the server is plugged into?  Are the batteries in good working order?
2.  Is the server somewhere that may be sitting in a pocket of hot air?

Once you have ruled these out, the next things to check are CPU, memory, and storage health.  Unfortunately checking these is harder but you can check SMART indications, and other diagnostic indicators.
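
For example, for disks the OS sees directly, something along these lines (a rough sketch; /dev/sda is a placeholder, and SAN LUNs or virtual disks usually need the storage vendor's own tooling instead):

smartctl -H /dev/sda    # overall health self-assessment
smartctl -a /dev/sda    # full SMART attributes, error log and self-test history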

However, once these errors start happening, you are in danger territory and need to find out why (and correct the underlying problem) before you get data loss.

--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Abdul Qoyyuum
Date:
Hi Stephen,

On Wed, Nov 29, 2023 at 5:53 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Abdul Qoyyuum (aqoyyuum@cardaccess.com.bn) wrote:
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that we can have the application
> to continue to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.

This is only going to help if the issue is in an index, which isn't
clear from what's been shared.

That is a good point. The problem is I can't really find out, as the logs aren't verbose enough to tell me more. Part of the logs shows something like this:

2023-11-29 04:27:17.486 [ERROR] [dispatcher-1095] [<redacted>] - ERROR: invalid page in block 35217 of relation base/16421/3192429
  Where: parallel worker
org.postgresql.util.PSQLException: ERROR: invalid page in block 35217 of relation base/16421/3192429
  Where: parallel worker
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2565) ~[postgresql-42.2.24.jre7.jar:42.2.24.jre7]
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2297) ~[postgresql-42.2.24.jre7.jar:42.2.24.jre7]

> We're on Postgresql 12. This has worked before it happened (almost exactly
> a year ago) and I think this needs a more permanent solution. I've looked
> at routine vacuuming and checked the autovacuum is set to on and the
> following configurations:

This isn't something that should ever happen ...

This also doesn't have anything to do with autovacuum, changing settings
there won't make any difference.

Noted, but it has been running cleanly since I ran the vacuum and reindex commands a year ago. 

> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block and we are running a High Availability
> cluster set up but we are hoping that there may be a way to mitigate it.

Was there some kind of hardware fault?  Did you do a failover?  Restore
from a backup?  Do you have checksums enabled?  How many times has this
happened before, and how many pages were impacted?  What is the design
of your HA solution, are you using PG replication or something else? 

There have been a few maintenance operations earlier this year, but nothing critical and no failures that would have caused the database to become corrupt. The HA solution we're using is Pacemaker with an active-passive setup.

Unsure if sharing the relevant WAL settings from postgresql.conf may be useful but here they are:

max_connections = 300
shared_buffers = 128MB

archive_mode = on
archive_command = 'test ! -f /opt/databases/postgres12/wal_archive/%f && cp %p /opt/databases/postgres12/wal_archive/%f'
hot_standby = on
wal_level = hot_standby
full_page_writes = on
max_wal_senders = 10
wal_keep_segments = 100                 # 16MB per segment = 1.6GB
hot_standby = on
restart_after_crash = off
wal_receiver_status_interval = 2   # seconds
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
synchronous_commit = on
hot_standby_feedback = on
wal_sender_timeout = 10000
wal_receiver_timeout = 10000

Thanks,

Stephen

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Abdul Qoyyuum
Date:
Hi Chris,

On Wed, Nov 29, 2023 at 7:38 PM Chris Travers <chris.travers@gmail.com> wrote:


On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
Hi all,

Knowing that it's a data corruption issue, the only way to fix this is to vacuum and reindex the database. What was suggested was the following:

SET zero_damaged_pages = 0; # This is so that we can have the application to continue to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.

So first, to clear up some confusion on my part here:

This procedure doesn't make a lot of sense to me.  But did it clear up the issue?   
Yeah, it did fix the issue before (same issue as last year), and it has fixed the problem that just happened a few days ago (almost exactly a year later). 

In any of these cases, it is extremely important to diagnose the system properly.  If you have a fault in your storage device or RAID controller, for example, you are asking for more corruption and data loss later.

At first I thought maybe you had mistyped something, but then I realized there were a few issues with the process, so it actually didn't make sense.

First, zero_damaged_pages defaults to 0, and I can think of no reason to set it explicitly.
Secondly, a VACUUM FULL rewrites the table and rebuilds its indexes anyway, so there is no reason to run a REINDEX afterwards. Your whole procedure amounts to a VACUUM FULL, and the index rebuild is the only part that could affect this; if the procedure did work, that is the part that helped.
Oh that makes sense actually. Thanks. 

On to the question of what to do next....

We're on Postgresql 12. This has worked before it happened (almost exactly a year ago) and I think this needs a more permanent solution. I've looked at routine vacuuming and checked the autovacuum is set to on and the following configurations:

core=> select name, setting from pg_settings where name like 'autovacuum%';
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 2
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
(12 rows)

Can anyone advise if there's anything else we can do? We have no clue what causes the invalid page block and we are running a High Availability cluster set up but we are hoping that there may be a way to mitigate it.


You need to figure out why the corruption is happening.  This is most likely, in my experience, not a PostgreSQL bug, but usually something that happens on the hardware layer or an environmental factor.  It could be failing storage or CPU.  Or it could be something like bad electrical input or insufficient cooling (I have seen index and even table corruption issues from both of these).

If this is a server you run, the first things I would check are:
1.  Is there a good-quality UPS that the server is plugged into?  Are the batteries in good working order?
The servers are dual-powered and hooked up to mains electricity with a backup generator; if that fails, they switch over to the UPS. All of these are supplied and maintained by the data centre that the servers are in. There have been no electrical problems so far. 
2.  Is the server somewhere that may be sitting in a pocket of hot air?
As you can imagine, the data centre has air-conditioning and floor fans blowing hot air up and out, keeping all the servers cooled. Checking the blade servers that the database VM sits on shows the temperature sitting at 65 degrees Celsius. So I don't think it's a temperature problem either. 

Once you have ruled these out, the next things to check are CPU, memory, and storage health.  Unfortunately checking these is harder but you can check SMART indications, and other diagnostic indicators.
Would these top stats be useful? The database does get busy, especially at peak hours.

top - 10:04:25 up 782 days, 43 min,  2 users,  load average: 1.20, 1.14, 1.10
Tasks: 415 total,   3 running, 412 sleeping,   0 stopped,   0 zombie
Cpu(s): 22.7%us, 13.8%sy,  0.0%ni, 63.2%id,  0.2%wa,  0.0%hi,  0.1%si,  0.1%st
Mem:   8160156k total,  7832828k used,   327328k free,    34584k buffers
Swap:  1048572k total,   304396k used,   744176k free,  6674428k cached

We are running on SAN group disk storage. I guess I should take a closer look at the disks. Thanks for the pointer. 

However, once these errors start happening, you are in danger territory and need to find out why (and correct the underlying problem) before you get data loss.
Thanks for the warning. We'll see what we can find. 

--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Chris Travers
Date:


On Thu, Nov 30, 2023 at 9:03 AM Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
Hi Chris,

On Wed, Nov 29, 2023 at 7:38 PM Chris Travers <chris.travers@gmail.com> wrote:


On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
Hi all,

Knowing that it's a data corruption issue, the only way to fix this is to vacuum and reindex the database. What was suggested was the following:

SET zero_damaged_pages = 0; # This is so that we can have the application to continue to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.

So first, to clear up some confusion on my part here:

This procedure doesn't make a lot of sense to me.  But did it clear up the issue?   
Yeah, it did fix the issue before (same issue as last year), and it has fixed the problem that just happened a few days ago (almost exactly a year later). 

In any of these cases, it is extremely important to diagnose the system properly.  If you have a fault in your storage device or RAID controller, for example, you are asking for more corruption and data loss later.

At first I thought maybe you had mistyped something, but then I realized there were a few issues with the process, so it actually didn't make sense.

First, zero_damaged_pages defaults to 0, and I can think of no reason to set it explicitly.
Secondly, a VACUUM FULL rewrites the table and rebuilds its indexes anyway, so there is no reason to run a REINDEX afterwards. Your whole procedure amounts to a VACUUM FULL, and the index rebuild is the only part that could affect this; if the procedure did work, that is the part that helped.
Oh that makes sense actually. Thanks. 

So for a temporary workaround, it sounds like reindexing helps for now, but yeah this really needs deeper investigation. 
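
If the relfilenode in the error message does map to an index, a narrower workaround than a database-wide VACUUM FULL would be something like the following (the index name is a placeholder; REINDEX CONCURRENTLY is available from PostgreSQL 12):

core=> REINDEX INDEX CONCURRENTLY some_index_name;

That rebuilds just the affected index without the long exclusive locks that VACUUM FULL takes.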

On to the question of what to do next....

We're on Postgresql 12. This has worked before it happened (almost exactly a year ago) and I think this needs a more permanent solution. I've looked at routine vacuuming and checked the autovacuum is set to on and the following configurations:

core=> select name, setting from pg_settings where name like 'autovacuum%';
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 2
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
(12 rows)

Can anyone advise if there's anything else we can do? We have no clue what causes the invalid page block and we are running a High Availability cluster set up but we are hoping that there may be a way to mitigate it.


You need to figure out why the corruption is happening.  This is most likely, in my experience, not a PostgreSQL bug, but usually something that happens on the hardware layer or an environmental factor.  It could be failing storage or CPU.  Or it could be something like bad electrical input or insufficient cooling (I have seen index and even table corruption issues from both of these).

If this is a server you run, the first things I would check are:
1.  Is there a good-quality UPS that the server is plugged into?  Are the batteries in good working order?
The servers are dual-powered and hooked up to mains electricity with a backup generator; if that fails, they switch over to the UPS. All of these are supplied and maintained by the data centre that the servers are in. There have been no electrical problems so far. 
2.  Is the server somewhere that may be sitting in a pocket of hot air?
As you can imagine, the data centre has air-conditioning and floor fans blowing hot air up and out, keeping all the servers cooled. Checking the blade servers that the database VM sits on shows the temperature sitting at 65 degrees Celsius. So I don't think it's a temperature problem either. 

Ok so this is in a professionally run datacenter.  That does indeed eliminate at least some of these issues. 

Once you have ruled these out, the next things to check are CPU, memory, and storage health.  Unfortunately checking these is harder but you can check SMART indications, and other diagnostic indicators.
Would these top stats be useful? The database does get busy, especially at peak hours.

top - 10:04:25 up 782 days, 43 min,  2 users,  load average: 1.20, 1.14, 1.10
Tasks: 415 total,   3 running, 412 sleeping,   0 stopped,   0 zombie
Cpu(s): 22.7%us, 13.8%sy,  0.0%ni, 63.2%id,  0.2%wa,  0.0%hi,  0.1%si,  0.1%st
Mem:   8160156k total,  7832828k used,   327328k free,    34584k buffers
Swap:  1048572k total,   304396k used,   744176k free,  6674428k cached

We are running on SAN group disk storage. I guess I should take a closer look at the disks. Thanks for the pointer. 

anything in mcelog?  That would be a place to check for memory issues if you have ECC RAM (which I assume you do in a datacenter). 

As far as the SAN goes, another thing I would pay attention to is whether there are any indications of fsync errors in dmesg or other logs, in addition, of course, to a review of the disk errors themselves.
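
A rough sketch of the kind of checks I mean (log paths and message wording vary by distribution, so treat these as placeholders):

dmesg -T | grep -iE 'i/o error|blk_update_request|fsync'    # kernel-reported I/O errors
journalctl -k | grep -iE 'mce|machine check|memory'         # machine-check / memory events
grep -i 'hardware error' /var/log/mcelog                    # if the mcelog daemon logs to a file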

 

However, once these errors start happening, you are in danger territory and need to find out why (and correct the underlying problem) before you get data loss.
Thanks for the warning. We'll see what we can find. 

 

--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.


--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

From: Stephen Frost
Date:
Greetings,

On Thu, Nov 30, 2023 at 02:51 Abdul Qoyyuum <aqoyyuum@cardaccess.com.bn> wrote:
On Wed, Nov 29, 2023 at 5:53 PM Stephen Frost <sfrost@snowman.net> wrote:
* Abdul Qoyyuum (aqoyyuum@cardaccess.com.bn) wrote:
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that we can have the application
> to continue to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.

This is only going to help if the issue is in an index, which isn't
clear from what's been shared.

That is a good point. The problem is I can't really find out, as the logs aren't verbose enough to tell me more.

Look in pg_class for a relation with a relfilenode matching the filename, or for a relation with an OID matching it.
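
Something like this, as a sketch (3192429 is the relfilenode from the error message; run it while connected to the database whose OID is 16421):

core=> SELECT oid, relname, relkind FROM pg_class WHERE relfilenode = 3192429 OR oid = 3192429;
core=> SELECT pg_filenode_relation(0, 3192429);  -- same lookup; 0 means the database's default tablespace

If relkind comes back as 'i', the damaged block is in an index that a REINDEX can rebuild from the table; 'r' would mean it is in the table data itself.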

it happened (almost exactly
> a year ago) and I think this needs a more permanent solution. I've looked
> at routine vacuuming and checked the autovacuum is set to on and the
> following configurations:

This isn't something that should ever happen ...

This also doesn't have anything to do with autovacuum, changing settings
there won't make any difference.

Noted, but it has been running cleanly since I ran the vacuum and reindex commands a year ago. 

Great to hear, but that isn’t particularly relevant. Sure, a reindex could “fix” this, but it shouldn’t happen in the first place…

> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block and we are running a High Availability
> cluster set up but we are hoping that there may be a way to mitigate it.

Was there some kind of hardware fault?  Did you do a failover?  Restore
from a backup?  Do you have checksums enabled?  How many times has this
happened before, and how many pages were impacted?  What is the design
of your HA solution, are you using PG replication or something else? 

There have been a few maintenance operations earlier this year, but nothing critical and no failures that would have caused the database to become corrupt. The HA solution we're using is Pacemaker with an active-passive setup.

This really isn’t enough information to assess whether what you’re doing in your configuration is leading to corruption or not. Note that a lot of corruption can happen without invalid pages showing up; you should seriously look into amcheck and use it to check whether your indexes are valid.
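
As a rough sketch of what that looks like (amcheck ships as a contrib extension with PostgreSQL 12; this checks every valid btree index in the current database):

core=> CREATE EXTENSION IF NOT EXISTS amcheck;
core=> SELECT c.relname, bt_index_check(c.oid)
core->   FROM pg_class c
core->   JOIN pg_index i ON i.indexrelid = c.oid
core->  WHERE c.relkind = 'i' AND c.relam = 403 AND i.indisvalid;

bt_index_check needs only the same locks as a plain SELECT; the stricter bt_index_parent_check blocks writes on the table while it runs, so keep that one for a maintenance window.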

What exactly is your “Pacemaker with an active-passive setup”?  What’s the storage under that?  What other maintenance operations have been done...?

Unsure if sharing the relevant WAL settings from postgresql.conf may be useful but here they are:

max_connections = 300
shared_buffers = 128MB

archive_mode = on
archive_command = 'test ! -f /opt/databases/postgres12/wal_archive/%f && cp %p /opt/databases/postgres12/wal_archive/%f'

cp is absolutely not valid as an archive command; it doesn’t ensure that the WAL has been fully synced to disk and therefore could lead you to a situation where you lose necessary WAL. I strongly encourage you to look into a proper backup solution.
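
As a hedged illustration of the direction to look in (not drop-in configuration; "main" is a placeholder stanza name and the tool has to be installed and configured first), the dedicated backup tools provide archive commands that fsync the segment and verify the transfer before reporting success to PostgreSQL:

archive_command = 'pgbackrest --stanza=main archive-push %p'

Barman (barman-wal-archive) and WAL-G (wal-g wal-push) offer equivalents; whichever tool you pick, the key property is that the command only returns success once the WAL segment is durably stored.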

Thanks,

Stephen