Обсуждение: how to check that recovery is complete

Поиск
Список
Период
Сортировка

how to check that recovery is complete

От
Dmitry O Litvintsev
Дата:
Hi,

I have a workflow where I recover from PITR backup and run a query on it. The program that runs query
checks that it can connect to database in a loop, until it can, and then runs the query.
This has worked fine far. Recently I upgraded to 11 and I see that I can connect to DB while recovery is
not complete yet. See this:

< 2020-11-05 03:34:36.114 CST  >LOG:  starting archive recovery
< 2020-11-05 03:34:36.590 CST  >LOG:  restored log file "00000001000002EF000000F9" from archive
< 2020-11-05 03:34:36.641 CST  >LOG:  redo starts at 2EF/F9000028
...
< 2020-11-05 03:34:46.392 CST  >LOG:  restored log file "00000001000002F000000008" from archive
< 2020-11-05 03:34:46.658 CST 127.0.0.1 >FATAL:  the database system is starting up
< 2020-11-05 03:34:47.028 CST  >LOG:  restored log file "00000001000002F000000009" from archive

You can see above fail to connect , but sometime into recover I see;

< 2020-11-05 04:07:51.987 CST  >LOG:  restored log file "00000001000002F200000029" from archive
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >ERROR:  canceling statement due to conflict with recovery
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.232 CST  >LOG:  background worker "parallel worker" (PID 13577) exited with exit code 1
< 2020-11-05 04:08:23.244 CST  >LOG:  background worker "parallel worker" (PID 13578) exited with exit code 1
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >FATAL:  terminating connection due to conflict with recovery
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >HINT:  In a moment you should be able to reconnect to the database and repeat
yourcommand. 
< 2020-11-05 04:08:25.354 CST  >LOG:  restored log file "00000001000002F20000002A" from archive
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >ERROR:  canceling statement due to conflict with recovery
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.556 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:55.556 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.561 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:55.561 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.640 CST  >LOG:  background worker "parallel worker" (PID 13683) exited with exit code 1
< 2020-11-05 04:08:55.653 CST  >LOG:  background worker "parallel worker" (PID 13684) exited with exit code 1
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >FATAL:  terminating connection due to conflict with recovery
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >HINT:  In a moment you should be able to reconnect to the database and repeat
yourcommand. 
< 2020-11-05 04:09:00.307 CST  >LOG:  restored log file "00000001000002F20000002B" from archive

As you can see a query "select count(*) from file" failed due to table not being restored yet. BUT connection was
allowed 
before DB was ready.... Only few hours after ;

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept connections

After which the query runs fine without errors. This is bad because I see that select count(*) takes progressively
longer to execute and the count(*) is not what I am interested in. I run the "real" query after that. As a result I add
hours to program execution time.

Is there a more robust method to it? Ideally I do not want to be able to connect to db until :

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept connections


And I believe this was the behavior before upgrade. If connection can't be disabled, how can I detect
condition "database system is ready to accept connections"

I believe "pg_isready" utility would succeed once it can connect. And as can see I could
connect way before DB is really ready.

Thanks!
Dmitry



Re: how to check that recovery is complete

От
Laurenz Albe
Дата:
On Thu, 2020-11-05 at 19:03 +0000, Dmitry O Litvintsev wrote:
> I have a workflow where I recover from PITR backup and run a query on it. The program that runs query 
> checks that it can connect to database in a loop, until it can, and then runs the query. 
> This has worked fine far. Recently I upgraded to 11 and I see that I can connect to DB while recovery is 
> not complete yet.

That's because "hot_standby = on" by default from v10 on.

You could change the configuration and set it to "off",
then PostgreSQL will continue to work as before.

The alternative way is running this after you connect:

  SELECT pg_is_in_recovery();

If that returns TRUE, recovery is not done yet.
Back out, wait a while, then try again.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com