Discussion: BUG #17954: Postgres startup fails with `could not locate a valid checkpoint record`

BUG #17954: Postgres startup fails with `could not locate a valid checkpoint record`

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17954
Logged by:          Utkarsh Srivastava
Email address:      srivastavautkarsh8097@gmail.com
PostgreSQL version: 12.12
Operating system:   RHEL/Linux
Description:

Hi everyone,

Thank you for your time. We are running PostgreSQL 12.12 in a CRI-O
container on top of CephFS. A few days ago we noticed that DB startup was
failing with the following error:
```
2023-05-14 05:13:13.678 UTC [1] LOG:  received smart shutdown request
2023-05-14 05:13:36.692 UTC [1] LOG:  could not open file "postmaster.pid":
No such file or directory
2023-05-14 05:13:36.692 UTC [1] LOG:  performing immediate shutdown because
data directory lock file is invalid
2023-05-14 05:13:36.692 UTC [1] LOG:  received immediate shutdown request
2023-05-14 05:13:36.692 UTC [1] LOG:  could not open file "postmaster.pid":
No such file or directory
2023-05-14 05:13:36.692 UTC [261282] WARNING:  terminating connection
because of crash of another server process
2023-05-14 05:13:36.692 UTC [261282] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2023-05-14 05:13:36.692 UTC [261282] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
< --- Trimmed repetition ---> 
2023-05-14 05:13:36.739 UTC [1] LOG:  database system is shut down
2023-05-14 05:13:37.723 UTC [24] LOG:  database system was shut down at
2023-05-14 05:13:17 UTC
2023-05-14 05:13:37.723 UTC [24] LOG:  invalid resource manager ID 101 at
9/8BF289E8
2023-05-14 05:13:37.723 UTC [24] LOG:  invalid primary checkpoint record
2023-05-14 05:13:37.723 UTC [24] PANIC:  could not locate a valid checkpoint
record
2023-05-14 05:13:39.961 UTC [22] LOG:  startup process (PID 24) was
terminated by signal 6: Aborted
2023-05-14 05:13:39.961 UTC [22] LOG:  aborting startup due to startup
process failure
2023-05-14 05:13:40.117 UTC [22] LOG:  database system is shut down
2023-05-14 05:14:06.726 UTC [24] LOG:  database system was shut down at
2023-05-14 05:13:17 UTC
2023-05-14 05:14:06.726 UTC [24] LOG:  invalid resource manager ID 101 at
9/8BF289E8
```
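
For anyone trying to locate the record named in the log, the LSN `9/8BF289E8` can be mapped to a WAL segment file name. The following is a minimal sketch, not server code: the function name `wal_segment_for_lsn` is hypothetical, timeline 1 is an assumption (the log does not state the timeline), and the 16 MB segment size is the default, which may have been changed at initdb time.

```python
def wal_segment_for_lsn(lsn: str, timeline: int = 1,
                        segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the name of the WAL segment file that contains the given LSN.

    timeline and segment_size are assumptions; check them against the
    output of pg_controldata for the cluster in question.
    """
    # An LSN is printed as two hexadecimal halves: high 32 bits / low 32 bits.
    high_text, low_text = lsn.split("/")
    position = (int(high_text, 16) << 32) | int(low_text, 16)

    # Segments are numbered consecutively from the start of the WAL stream.
    segment_number = position // segment_size
    segments_per_xlog_id = 0x100000000 // segment_size

    # File name layout: timeline, then the segment number split in two parts,
    # each printed as eight uppercase hex digits.
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlog_id,
        segment_number % segments_per_xlog_id,
    )


# Under the assumptions above this prints 00000001000000090000008B.
print(wal_segment_for_lsn("9/8BF289E8"))
```

If that segment is still present under `pg_wal`, `pg_waldump` may help show whether the bytes at that position are a readable record or garbage.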

- What could be the root cause of this issue? 
- Is this a known issue? (I searched the archives but could not find it.)
If so, is it fixed in PG 13, 14, or 15?

Thank you


Re: BUG #17954: Postgres startup fails with `could not locate a valid checkpoint record`

From
Michael Paquier
Date:
On Thu, Jun 01, 2023 at 01:11:20PM +0000, PG Bug reporting form wrote:
> - What could be the root cause of this issue?
> - Is this a known issue? (I searched the archives but could not find it.)
> If so, is it fixed in PG 13, 14, or 15?

Hard to say for sure, but it looks like your host has a few problems.
This part from your logs refers to something that should not happen,
to begin with:

> 2023-05-14 05:13:13.678 UTC [1] LOG:  received smart shutdown request
> 2023-05-14 05:13:36.692 UTC [1] LOG:  could not open file "postmaster.pid":
> No such file or directory
> 2023-05-14 05:13:36.692 UTC [1] LOG:  performing immediate shutdown because
> data directory lock file is invalid
> 2023-05-14 05:13:36.692 UTC [1] LOG:  received immediate shutdown request
> 2023-05-14 05:13:36.692 UTC [1] LOG:  could not open file "postmaster.pid":
> No such file or directory

This LOG would come from either AddToDataDirLockFile() or
RecheckDataDirLockFile().  Still, the third entry I am quoting refers
to a recheck of the PID file, meaning that the postmaster has bumped
into what looks like a corrupted PID file.
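
To make the recheck concrete, here is a minimal sketch of the kind of test such a periodic lock-file check performs. It is an illustration only, loosely modeled on the behavior described above and not the server's actual code: the function name `recheck_lock_file` is hypothetical, and the choice of which errors count as fatal is an assumption.

```python
import os


def recheck_lock_file(path: str, expected_pid: int) -> bool:
    """Return True if the lock file still looks like ours, False otherwise.

    Assumed policy: a missing file or a PID mismatch is treated as fatal,
    while other I/O errors are tolerated to avoid false alarms.
    """
    try:
        with open(path, "rb") as handle:
            contents = handle.read(64)
    except (FileNotFoundError, NotADirectoryError):
        # The file (or its directory) has vanished: clearly something is wrong.
        return False
    except OSError:
        # Transient or unrelated failure: do not shut down because of it.
        return True

    # The owning process ID is expected on the first line of the file.
    first_line = contents.split(b"\n", 1)[0].strip()
    try:
        file_pid = int(first_line)
    except ValueError:
        # Unparseable contents cannot match our PID.
        return False
    return file_pid == expected_pid


# Example call; the result depends on what is in the current directory.
still_valid = recheck_lock_file("postmaster.pid", os.getpid())
```

In the log above, the failure mode is the first branch: the file could not be opened at all ("No such file or directory"), which on a network filesystem could point at the storage layer or at something external removing the file.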
--
Michael
