BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown

Поиск
Список
Период
Сортировка
От PG Bug reporting form
Тема BUG #16174: PostgreSQL 9.6 service can not start after unexpected shutdown
Дата
Msg-id 16174-8bd292aa647c8f3c@postgresql.org
обсуждение исходный текст
Список pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16174
Logged by:          Enrico La Torre
Email address:      pg.dba.iit.team@gmail.com
PostgreSQL version: 9.6.12
Operating system:   Windows Server 2016 Standard 10.0.14393
Description:

Hi,

in my company we run several thousands of PostgreSQL clusters currently with
version 9.6.12 (one cluster per server). We run PostgreSQL on a virtual
machine in Hyper-V Manager. 
A minor upgrade to the latest minor version is planned but I did not find
any fix for this incident in any minor upgrade from 9.6.12 to 9.6.16.

The same incident has been reported in
https://www.postgresql.org/message-id/CAJtpbKR2NZKmoxveMSdJh%3Dzng1%2B6NwqK3EEbwQs0ac6oHE%3D4eg%40mail.gmail.com
for PostgresSQL 11 but it has not been responded to yet. However, we
experience the same incident in with PostgreSQL 9.6.

We can only correlate this incident to a reboot after an unexpected shutdown
of Windows. During restart it can(!) happen that the PostgreSQL service can
not be started by the 
Service Control Manager because the PostgreSQL cluster is already running in
the background. A pg_ctl status will return a running database server. This
behavior must be related
to a race condition, because the incident does not occur every time or can
not be reproduced easily. 

The unexpected shutdowns originate from either power outages or more likely
due to restarts of the server that hosts the Hyper-V Manager. If this server
gets patched and nees to restart it basically kills all running VMs in that
process, if they don't shutdown fast enough. Of course you can argue that we
must get rid of these unexpected shutdowns. But isn't the boot
process always the same after a shutdown nor matter if it was shutdown
cleanly or not? Why is the PostgreSQL cluster started in the background and
not by the Service Control Manager?
Is there a bug in the interplay between pg_ctl and the Windows Service
Control Manager?

At the bottom I attached an excerpt of the relevant log files. 

Kind regards

PostgreSQL version number you are running: 
9.6.12

How you installed PostgreSQL:
We initally started with PostgreSQL 9.6.6 and the EnterpriseDB installer
postgresql-9.6.6-2-windows-x64.exe
The minor upgrade to 9.6.12 was an installation with EnterpriseDB installer
postgresql-9.6.12-1-windows-x64.exe and then simply a distribution of all
9.6.12 binary files to all other servers.

Changes made to the settings in the postgresql.conf file: 

                           version
--------------------------------------------------------------
 PostgreSQL 9.6.12, compiled by Visual C++ build 1800, 64-bit
 
             name             |                 current_setting
   |        source
------------------------------+-------------------------------------------------+----------------------
 application_name             | psql
   | client
 archive_command              | copy %p I:\Backup\PostgreSQL-WAL-Archive\%f
   | configuration file
 archive_mode                 | on
   | configuration file
 archive_timeout              | 1h
   | configuration file
 checkpoint_completion_target | 0.9
   | configuration file
 client_encoding              | WIN1252
   | client
 cluster_name                 | xxxxxxxxxxxx
   | configuration file
 DateStyle                    | ISO, DMY
   | configuration file
 default_text_search_config   | simple
   | configuration file
 dynamic_shared_memory_type   | windows
   | configuration file
 effective_cache_size         | 3GB
   | configuration file
 lc_messages                  | C
   | configuration file
 lc_monetary                  | C
   | configuration file
 lc_numeric                   | C
   | configuration file
 lc_time                      | C
   | configuration file
 listen_addresses             | *
   | configuration file
 log_checkpoints              | on
   | configuration file
 log_connections              | on
   | configuration file
 log_destination              | stderr
   | configuration file
 log_disconnections           | on
   | configuration file
 log_hostname                 | on
   | configuration file
 log_line_prefix              | %t [%p]: [%l]
%quser=%u,db=%d,app=%a,client=%h  | configuration file
 log_temp_files               | 100kB
   | configuration file
 log_timezone                 | America/Chicago
   | configuration file
 logging_collector            | on
   | configuration file
 maintenance_work_mem         | 256MB
   | configuration file
 max_connections              | 100
   | configuration file
 max_stack_depth              | 2MB
   | environment variable
 max_wal_senders              | 10
   | configuration file
 max_wal_size                 | 4GB
   | configuration file
 min_wal_size                 | 2GB
   | configuration file
 port                         | 5432
   | configuration file
 search_path                  | "$user", public
   | database
 shared_buffers               | 512MB
   | configuration file
 ssl                          | on
   | configuration file
 ssl_cert_file                | ssl-cert-snakeoil.pem
   | configuration file
 ssl_key_file                 | ssl-cert-snakeoil.key
   | configuration file
 TimeZone                     | America/Chicago
   | configuration file
 track_activity_query_size    | 8192
   | configuration file
 wal_level                    | replica
   | configuration file
 wal_sync_method              | fsync
   | configuration file
 work_mem                     | 61166kB
   | configuration file

Operating system and version:
Windows Server 2016 Standard 10.0.14393

What program you're using to connect to PostgreSQL:
psql, pgadmin4, ODBC driver PostgreSQL Unicode(x64) 9.06.04.00
 
Is there anything relevant or unusual in the PostgreSQL server logs?:

in pg_log

2019-12-19 03:35:06 CST [4756]: [1] LOG:  database system was interrupted;
last known up at 2019-12-19 03:21:56 CST
2019-12-19 03:35:06 CST [3924]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62037
2019-12-19 03:35:06 CST [3788]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62042
2019-12-19 03:35:06 CST [5844]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62044
2019-12-19 03:35:07 CST [3780]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62043
2019-12-19 03:35:07 CST [3924]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:07 CST [3788]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:07 CST [3780]: [2]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  could not accept SSL connection: An established connection was aborted
by the software in your host machine.
2019-12-19 03:35:08 CST [5844]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL:  the database system is starting up
2019-12-19 03:35:09 CST [4144]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62046
2019-12-19 03:35:09 CST [4144]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL:  the database system is starting up
2019-12-19 03:35:10 CST [4548]: [1]
user=[unknown],db=[unknown],app=[unknown],client=446-091STL01.aldi-499.loc
LOG:  connection received: host=446-091STL01.aldi-499.loc port=62048
[...] These messages continue until

2019-12-19 03:36:05 CST [5236]: [2]
user=SVCACC_446_PSQL,db=postgres,app=[unknown],client=446-091STL01.aldi-499.loc
FATAL:  the database system is starting up
2019-12-19 03:37:29 CST [4756]: [2] LOG:  database system was not properly
shut down; automatic recovery in progress
2019-12-19 03:37:29 CST [4756]: [3] LOG:  redo starts at 26/23141D00
2019-12-19 03:37:29 CST [4756]: [4] LOG:  invalid record length at
26/23151158: wanted 24, got 0
2019-12-19 03:37:29 CST [4756]: [5] LOG:  redo done at 26/23151120
2019-12-19 03:37:30 CST [4756]: [6] LOG:  last completed transaction was at
log time 2019-12-19 03:23:00.369744-06
2019-12-19 03:37:30 CST [4756]: [7] LOG:  checkpoint starting:
end-of-recovery immediate
2019-12-19 03:37:30 CST [4756]: [8] LOG:  checkpoint complete: wrote 10
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=0.006 s, sync=0.414 s, total=0.657 s; sync files=5, longest=0.098 s,
average=0.082 s; distance=61 kB, estimate=61 kB
2019-12-19 03:37:30 CST [4756]: [9] LOG:  MultiXact member wraparound
protections are now enabled
2019-12-19 03:37:30 CST [5380]: [3] LOG:  database system is ready to accept
connections

in Event Viewer / Application 

Level    Date and Time    Source    Event ID    Task Category
Information    12/19/2019 3:34:23 AM    PostgreSQL    0    None    Waiting for server
startup...
Information    12/19/2019 3:35:05 AM    PostgreSQL    0    None    "2019-12-19 03:35:05 CST
[5380]: [1] LOG:  redirecting log output to logging collector
process2019-12-19 03:35:05 CST [5380]: [2] HINT:  Future log output will
appear in directory ""pg_log""."
Error    12/19/2019 3:36:06 AM    PostgreSQL    0    None    Timed out waiting for server
startup
Information    12/19/2019 3:39:29 AM    PostgreSQL    0    None    Waiting for server
startup...
Error    12/19/2019 3:39:29 AM    PostgreSQL    0    None    "2019-12-19 03:39:29 CST
[5216]: [1] FATAL:  lock file ""postmaster.pid"" already exists2019-12-19
03:39:29 CST [5216]: [2] HINT:  Is another postmaster (PID 5380) running in
data directory ""G:/databases/PostgreSQL/9.6""?"
Error    12/19/2019 3:39:30 AM    PostgreSQL    0    None    Timed out waiting for server
startup

in Event Viewer / System

Level    Date and Time    Source    Event ID    Task Category
Information    12/19/2019 3:31:58
AM    Microsoft-Windows-Kernel-General    12    (1)    The operating system started at
system time ‎2019‎-‎12‎-‎19T09:31:58.494227200Z.
Information    12/19/2019 3:31:58 AM    Microsoft-Windows-Kernel-Boot    20    (31)    The
last shutdown's success status was false. The last boot's success status was
true.
Critical    12/19/2019 3:32:08 AM    Microsoft-Windows-Kernel-Power    41    (63)    The
system has rebooted without cleanly shutting down first. This error could be
caused if the system stopped responding, crashed, or lost power
unexpectedly.
Error    12/19/2019 3:39:30 AM    Service Control Manager    7001    None    "The
xxx-Server service depends on the postgresql - PostgreSQL Server 9.6 service
which failed to start because of the following error: The operation
completed successfully."
Information    12/19/2019 3:39:30 AM    Service Control Manager    7036    None    The
postgresql - PostgreSQL Server 9.6 service entered the stopped state.


What you were doing when the error happened / how to cause the error:
with PowerShell:

# stop the cluster 
pg_ctl stop -D G:\databases\PostgreSQL\9.6 -m fast
# start the cluster via Service Control Manager
Start-Service -Name postgresql


В списке pgsql-bugs по дате отправления:

Предыдущее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16173: PostgreSQL 9.6 service can not start after unexpected shutdown
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16175: To migrate MySQL To PostgreSQL