"PANIC: could not open critical system index 2662" - twice

From: Evgeny Morozov
Subject: "PANIC: could not open critical system index 2662" - twice
Date:
Msg-id: 01020187577238cf-da8c0f4a-3ab9-445a-8c74-31ef51439f30-000000@eu-west-1.amazonses.com
Replies: Re: "PANIC: could not open critical system index 2662" - twice  (Laurenz Albe <laurenz.albe@cybertec.at>)
List: pgsql-general

Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:

2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 HINT:  Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 FATAL:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 HINT:  Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC:  could not open critical system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG:  server process (PID 17309) was terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG:  terminating any other active server processes
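For anyone triaging a similar log, here is a minimal sketch of how the affected databases could be tallied from lines like the ones above. It assumes the line layout seen in this excerpt (timestamp, "UTC", PID in brackets, an optional user@database, then the severity); the real layout depends on the server's log_line_prefix setting. The names LOG, LINE_RE, parse, by_severity and affected_dbs are made up for this illustration.

```python
import re
from collections import Counter

# Three sample lines copied from the log excerpt in this message.
LOG = """\
2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL:  index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC:  could not open critical system index 2662
"""

# Assumed layout: "<date> <time> UTC [<pid>] [<user>@<db> ]<SEVERITY>:  <message>".
# The user@db part is optional because the first sample line lacks it.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) UTC \[(?P<pid>\d+)\] "
    r"(?:(?P<user>[^@\s]+)@(?P<db>\S+) )?"
    r"(?P<severity>[A-Z]+):\s+(?P<message>.*)$"
)


def parse(log_text):
    """Return one dict per log line that matches the assumed layout."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append(match.groupdict())
    return events


events = parse(LOG)

# Count events per severity level.
by_severity = Counter(event["severity"] for event in events)

# Databases that logged a FATAL or PANIC event.
affected_dbs = sorted(
    {
        event["db"]
        for event in events
        if event["db"] and event["severity"] in ("FATAL", "PANIC")
    }
)

print(dict(by_severity))
print(affected_dbs)
```

Lines that do not match the assumed layout (such as the "..." elisions) are skipped rather than guessed at, so the counts only cover lines the pattern recognises.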

We had the same thing happen about a month ago on a different database in the same cluster. For a while PG actually ran OK as long as you didn't access that specific DB, but trying to back up that DB with pg_dump crashed it every time. At that time, one of the disks hosting the ZFS dataset that holds the PG data directory was reporting errors, so we thought that was the likely cause.

Unfortunately, before we could replace the disks, PG crashed completely and would not start again at all, so I had to rebuild the cluster from scratch and restore from pg_dump backups (still onto the old, bad disks). Once the disks were replaced (all of them) I just copied the data to them using zfs send | zfs receive and didn't bother restoring pg_dump backups again - which was perhaps foolish in hindsight.

Well, yesterday it happened again. The server still restarted OK, so I took fresh pg_dump backups of the databases we care about (which ran fine), rebuilt the cluster and restored the pg_dump backups again - now onto the new disks, which are not reporting any problems.

So while everything is up and running now, this error has me rather concerned. Could the error we're seeing now have been caused by corruption in the PG data that has been there for a month (so it could still be attributed to the bad disk), and which should now be fixed by having restored from backups onto good disks? Could this be a PG bug? What can I do to figure out why this is happening and prevent it from happening again? Advice appreciated!
