Discussion: psql: FATAL: the database system is starting up

psql: FATAL: the database system is starting up

From:
Tom K
Date:

Hey Guys,


I'm running Patroni with PostgreSQL 10, etcd, HAProxy and Keepalived on three RHEL 7.6 VMs.  Every now and then the underlying storage crashes, taking out the cluster.  On recovery, PostgreSQL has always tended to come back up while other databases just blow up.  That is, until now.

After the most recent crash 2-3 weeks ago, the cluster keeps running into the messages below, and I'm not able to make heads or tails of why it's throwing them:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.

and when connecting, I get this cryptic message:

[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL:  the database system is starting up
[root@psql01 log]#

There's no space issue here.  The cluster stays in recovery. This state doesn't change and the DB is inaccessible.  

So I ran the above in debug mode ( -d 5 ) and got this:


DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

So I freed memory on the underlying physical host this VM sits on, but no luck.  Same memory error and same result as above.
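
( That DEBUG line is usually harmless on its own: with the default huge_pages = try, Postgres simply falls back to regular pages when the huge-page mmap fails.  A rough sketch of how to either silence it or actually provision huge pages; the page count here is just a made-up example:

  # postgresql.conf: don't even attempt huge pages
  huge_pages = off

  # or reserve enough 2 MB pages on the host so the ~149 MB mmap can succeed
  sysctl -w vm.nr_hugepages=128
)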

When Patroni is started on just this one node (for troubleshooting purposes), I see this:

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

The system prints "postgres: postgres: startup process waiting for 000000010000000000000008"

Seems it's trying to replay the timeline history but can't?  Running strace against the startup process reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)            = 3

But this file doesn't exist in the pg_wal directory on this particular cluster node (psql01).  If I copy the missing history files over from another cluster node, it does move forward, but then just gets stuck waiting on another history file.  Same error message as above on psql01.
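
( A quick way to see which timeline this node thinks it is on, and which history files it actually has, is roughly the following; the paths assume the layout shown above:

  sudo -u postgres /usr/pgsql-10/bin/pg_controldata /data/patroni | grep -i timeline
  ls -l /data/patroni/pg_wal/*.history
)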

So I give up on psql01 and try to recover the secondary node (psql02).  But that yields this error:

PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.  

So again I give up after a few hours of trying to recover psql02 and move on to the third node.  On the third cluster node (psql03) I get this error, which I'm also not able to find much info on:

FATAL:  syntax error in history file: f2W

Hoping I'm able to recover the data from at least one of the nodes, make it primary, and get it to replicate to the others.

Thx, TK

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 5/27/19 9:59 PM, Tom K wrote:
> Hey Guy's,
> 
> 
> I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3 
> RHEL 7.6 VM's.  Every now and then the underlying storage crashes taking 
> out the cluster.  On recovery, PostgreSQL tends to come up while other 
> databases just blow up.  That is until now.

The above is one master and two standby servers connected via streaming 
replication, correct?

Is there WAL archiving set up?


> 
> After the most recent crash 2-3 weeks ago, the cluster is now running 
> into this message but I'm not able to make heads or tails out of why 
> it's throwing this:

So you have not been able to run the cluster for the past 2-3 weeks, or is 
that more recent?

When you refer to history files below are you talking about WAL files or 
something else?

Is this:

"recovery command file "recovery.conf" specified neither 
primary_conninfo nor restore_command"

true?



> [rest of the original message quoted in full; trimmed]


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/27/19 9:59 PM, Tom K wrote:
> Hey Guy's,
>
>
> I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> RHEL 7.6 VM's.  Every now and then the underlying storage crashes taking
> out the cluster.  On recovery, PostgreSQL tends to come up while other
> databases just blow up.  That is until now.

The above is one master and two standby servers connected via streaming
replication, correct?

Is there WAL archiving set up?

Correct.  Master election occurs through Patroni.  WAL level is set to:

wal_level = 'replica'

So no archiving.  
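
( For reference, a minimal sketch of what enabling WAL archiving in postgresql.conf would have looked like; the archive destination path is hypothetical:

  wal_level = replica
  archive_mode = on
  archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
)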
 


>
> After the most recent crash 2-3 weeks ago, the cluster is now running
> into this message but I'm not able to make heads or tails out of why
> it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that  more recent?

Haven't been able to bring this PostgreSQL cluster up ( run the cluster ) since the crash 2-3 weeks ago.  Tried quite a few combinations of options to recover it.  No luck.  Had storage failures earlier, even with corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to come up each time without any recovery effort on my part.
 

When you refer to history files below are you talking about WAL files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

True. recovery.conf is controlled by Patroni.  The contents of this file are the same on all the cluster nodes, with the exception of primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#
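
( For illustration, a recovery.conf that did specify primary_conninfo, which is what the WARNING in the logs is about, would look roughly like this; the connection string is hypothetical:

  standby_mode = 'on'
  primary_conninfo = 'host=192.168.0.108 port=5432 user=replicator'
  primary_slot_name = 'postgresql1'
  recovery_target_timeline = 'latest'
)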

I've made a copy of the postgres data directory to another location, so while troubleshooting I can always revert to the state the cluster was in when it first failed.

Thx,
TK
 



> [rest of the quoted thread trimmed]


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 5/28/19 6:59 PM, Tom K wrote:
> 
> 
> On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 

> 
> Correct.  Master election occurs through Patroni.  WAL level is set to:
> 
> wal_level = 'replica'
> 
> So no archiving.
> 
> 
> 
>      >
>      > After the most recent crash 2-3 weeks ago, the cluster is now
>     running
>      > into this message but I'm not able to make heads or tails out of why
>      > it's throwing this:
> 
>     So you have not been able to run the cluster the past 2-3 weeks or is
>     that  more recent?
> 
> 
> Haven't been able to bring this PostgresSQL cluster up ( run the cluster 
> ) since 2-3 weeks ago.  Tried quite a few combinations of options to 
> recover this.  No luck.  Had storage failures earlier, even with 
> corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to 
> come up each time without any recovery effort on my part.
> 
> 
>     When you refer to history files below are you talking about WAL
>     files or
>     something else?
> 
>     Is this:
> 
>     "recovery command file "recovery.conf" specified neither
>     primary_conninfo nor restore_command"
> 
>     true?
> 
> 
> True. recovery.conf is controlled by Patroni.  Contents of this file 
> remained the same for all the cluster nodes with the exception of the 
> primary_slot_name:
> 
> [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> primary_slot_name = 'postgresql0'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql01 postgresql-patroni-etcd]#
> 
> [root@psql02 postgres-backup]# cat recovery.conf
> primary_slot_name = 'postgresql1'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql02 postgres-backup]#
> 
> [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> primary_slot_name = 'postgresql2'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql03 postgresql-patroni-backup]#
> 
> I've made a copy of the root postgres directory over to another location 
> so when troubleshooting, I can always revert to the first state the 
> cluster was in when it failed.

I have no experience with Patroni so I will be of no help there. You 
might get more useful information from:

https://github.com/zalando/patroni
Community

There are two places to connect with the Patroni community: on github, 
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If 
you're using Patroni, or just interested, please join us.

That being said, can you start the copied Postgres instance without 
using the Patroni instrumentation?


> 
> Thx,
> TK
> 



-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:



On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/28/19 6:59 PM, Tom K wrote:
>
>
> On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>

>
> Correct.  Master election occurs through Patroni.  WAL level is set to:
>
> wal_level = 'replica'
>
> So no archiving.
>
>
>
>      >
>      > After the most recent crash 2-3 weeks ago, the cluster is now
>     running
>      > into this message but I'm not able to make heads or tails out of why
>      > it's throwing this:
>
>     So you have not been able to run the cluster the past 2-3 weeks or is
>     that  more recent?
>
>
> Haven't been able to bring this PostgresSQL cluster up ( run the cluster
> ) since 2-3 weeks ago.  Tried quite a few combinations of options to
> recover this.  No luck.  Had storage failures earlier, even with
> corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> come up each time without any recovery effort on my part.
>
>
>     When you refer to history files below are you talking about WAL
>     files or
>     something else?
>
>     Is this:
>
>     "recovery command file "recovery.conf" specified neither
>     primary_conninfo nor restore_command"
>
>     true?
>
>
> True. recovery.conf is controlled by Patroni.  Contents of this file
> remained the same for all the cluster nodes with the exception of the
> primary_slot_name:
>
> [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> primary_slot_name = 'postgresql0'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql01 postgresql-patroni-etcd]#
>
> [root@psql02 postgres-backup]# cat recovery.conf
> primary_slot_name = 'postgresql1'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql02 postgres-backup]#
>
> [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> primary_slot_name = 'postgresql2'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql03 postgresql-patroni-backup]#
>
> I've made a copy of the root postgres directory over to another location
> so when troubleshooting, I can always revert to the first state the
> cluster was in when it failed.

I have no experience with Patroni so I will be of no help there. You
might get more useful information from:

https://github.com/zalando/patroni
Community

There are two places to connect with the Patroni community: on github,
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
you're using Patroni, or just interested, please join us.

Will post there as well.  Thank you.  My thinking was to post here first since I suspect the Patroni community will simply refer me back here given that the PostgreSQL errors are originating directly from PostgreSQL.
 

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually.  But I hit a dead end with the three errors above.  

So what I did is copy a single node's backed-up data files back to /data/patroni on the same node ( this is the PostgreSQL data directory as defined through Patroni ), then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above.  Hence the post here.  If I can start a single instance, I should be fine since I could then 1) replicate over to the other two or 2) simply take a dump, reinitialize all the databases then restore the dump.  
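
( A rough sketch of option 2, assuming one instance can be brought up read-write; the backup path is hypothetical:

  $ /usr/pgsql-10/bin/pg_dumpall -h 192.168.0.118 -U postgres > /backup/all-databases.sql
  $ # after re-initializing a fresh data directory with initdb and starting it:
  $ /usr/pgsql-10/bin/psql -U postgres -f /backup/all-databases.sql
)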

Using the above procedure I get one of three error messages when using the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL:  syntax error in history file: f2W

And I can't start any one of them.   



>
> Thx,
> TK
>



--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 5/31/19 7:53 PM, Tom K wrote:
> 

>     There are two places to connect with the Patroni community: on github,
>     via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
>     you're using Patroni, or just interested, please join us.
> 
> 
> Will post there as well.  Thank you.  My thinking was to post here first 
> since I suspect the Patroni community will simply refer me back here 
> given that the PostgreSQL errors are originating directly from PostgreSQL.
> 
> 
>     That being said, can you start the copied Postgres instance without
>     using the Patroni instrumentation?
> 
> 
> Yes, that is something I have been trying to do actually.  But I hit a 
> dead end with the three errors above.
> 
> So what I did is to copy a single node's backed up copy of the data 
> files to */data/patroni* of the same node ( this is the psql data 
> directory as defined through patroni ) of the same node then ran this ( 
> psql03 = 192.168.0.118 ):
> 
> # sudo su - postgres
> $ /usr/pgsql-10/bin/postgres -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf 
> --listen_addresses=192.168.0.118 --max_worker_processes=8 
> --max_locks_per_transaction=64 --wal_level=replica 
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 
> --max_replication_slots=10 --max_connections=100 --hot_standby=on 
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

> 
> This resulted in one of the 3 messages above.  Hence the post here.  If 
> I can start a single instance, I should be fine since I could then 1) 
> replicate over to the other two or 2) simply take a dump, reinitialize 
> all the databases then restore the dump.
> 

What if you move the recovery.conf file out?

The below looks like missing/corrupted/incorrect files. Hard to tell 
without knowing what Patroni did?

> Using the above procedure I get one of three error messages when using 
> the data files of each node:
> 
> [ PSQL01 ]
> postgres: postgres: startup process waiting for 000000010000000000000008
> 
> [ PSQL02 ]
> PANIC:replicationcheckpointhas wrong magic 0 instead of  307747550
> 
> [ PSQL03 }
> FATAL:syntax error inhistory file:f2W
> 
> And I can't start any one of them.
> 
> 
> 
>      >
>      > Thx,
>      > TK
>      >
> 
> 
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/31/19 7:53 PM, Tom K wrote:
>

>     There are two places to connect with the Patroni community: on github,
>     via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
>     you're using Patroni, or just interested, please join us.
>
>
> Will post there as well.  Thank you.  My thinking was to post here first
> since I suspect the Patroni community will simply refer me back here
> given that the PostgreSQL errors are originating directly from PostgreSQL.
>
>
>     That being said, can you start the copied Postgres instance without
>     using the Patroni instrumentation?
>
>
> Yes, that is something I have been trying to do actually.  But I hit a
> dead end with the three errors above.
>
> So what I did is to copy a single node's backed up copy of the data
> files to */data/patroni* of the same node ( this is the psql data
> directory as defined through patroni ) of the same node then ran this (
> psql03 = 192.168.0.118 ):
>
> # sudo su - postgres
> $ /usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf
> --listen_addresses=192.168.0.118 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10 --max_connections=100 --hot_standby=on
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

>
> This resulted in one of the 3 messages above.  Hence the post here.  If
> I can start a single instance, I should be fine since I could then 1)
> replicate over to the other two or 2) simply take a dump, reinitialize
> all the databases then restore the dump.
>

What if you move the recovery.conf file out?

Will try.



The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters.  The OS was of course still in memory, making futile attempts to write to disk that would never complete.

My best guess is that Patroni or Postgres was in the middle of some writes across the clusters when the failure occurred.



> Using the above procedure I get one of three error messages when using
> the data files of each node:
>
> [ PSQL01 ]
> postgres: postgres: startup process waiting for 000000010000000000000008
>
> [ PSQL02 ]
> PANIC:replicationcheckpointhas wrong magic 0 instead of  307747550
>
> [ PSQL03 }
> FATAL:syntax error inhistory file:f2W
>
> And I can't start any one of them.
>
>
>
>      >
>      > Thx,
>      > TK
>      >
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sat, Jun 1, 2019 at 3:32 PM Tom K <tomkcpr@gmail.com> wrote:


On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/31/19 7:53 PM, Tom K wrote:
>

>     There are two places to connect with the Patroni community: on github,
>     via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
>     you're using Patroni, or just interested, please join us.
>
>
> Will post there as well.  Thank you.  My thinking was to post here first
> since I suspect the Patroni community will simply refer me back here
> given that the PostgreSQL errors are originating directly from PostgreSQL.
>
>
>     That being said, can you start the copied Postgres instance without
>     using the Patroni instrumentation?
>
>
> Yes, that is something I have been trying to do actually.  But I hit a
> dead end with the three errors above.
>
> So what I did is to copy a single node's backed up copy of the data
> files to */data/patroni* of the same node ( this is the psql data
> directory as defined through patroni ) of the same node then ran this (
> psql03 = 192.168.0.118 ):
>
> # sudo su - postgres
> $ /usr/pgsql-10/bin/postgres -D /data/patroni
> --config-file=/data/patroni/postgresql.conf
> --listen_addresses=192.168.0.118 --max_worker_processes=8
> --max_locks_per_transaction=64 --wal_level=replica
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
> --max_replication_slots=10 --max_connections=100 --hot_standby=on
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

>
> This resulted in one of the 3 messages above.  Hence the post here.  If
> I can start a single instance, I should be fine since I could then 1)
> replicate over to the other two or 2) simply take a dump, reinitialize
> all the databases then restore the dump.
>

What if you move the recovery.conf file out?

Will try.



The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters.  The OS was of course still in memory making futile attempts to write to disk, which would never complete.

My best guess is that Patroni or postgress was in the middle of some writes across the clusters when the failure occurred.  

Of note are the characters f2W below.  I see nothing in the postgres source code to indicate this is any recognizable postgres message.  A part of me suspects that the postgres binaries got corrupted.  I had this happen once with glibc-common, and a reinstall fixed it.  However, the postgres binaries' checksums match a standalone install perfectly, so that should not be an issue.
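
( One quick way to cross-check that on RHEL is to let rpm verify the installed files against its database; no output means every file still matches what the packages shipped:

  rpm -V postgresql10-server postgresql10-libs postgresql10
)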




> Using the above procedure I get one of three error messages when using
> the data files of each node:
>
> [ PSQL01 ]
> postgres: postgres: startup process waiting for 000000010000000000000008
>
> [ PSQL02 ]
> PANIC:replicationcheckpointhas wrong magic 0 instead of  307747550
>
> [ PSQL03 }
> FATAL:syntax error inhistory file:f2W
>
> And I can't start any one of them.
>
>
>
>      >
>      > Thx,
>      > TK
>      >
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 12:32 PM, Tom K wrote:
> 
> 

> 
>     What if you move the recovery.conf file out?
> 
> 
> Will try.
> 
> 
> 
>     The below looks like missing/corrupted/incorrect files. Hard to tell
>     without knowing what Patroni did?
> 
> 
> Storage disappeared from underneath these clusters.  The OS was of 
> course still in memory making futile attempts to write to disk, which 
> would never complete.
> 
> My best guess is that Patroni or postgress was in the middle of some 
> writes across the clusters when the failure occurred.

So to be clear, all three clusters were writing to the same storage 
medium and there was no WAL archiving to some other storage?

> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 12:42 PM, Tom K wrote:
> 
> 

> 
> Of note are the characters f2W below.  I see nothing in the postgres 
> source code to indicate this is any recognizable postgres message.  A 
> part of me suspects that the postgres binaries got corrupted.   Had this 
> case occur with glib-common and a reinstall fixed it.  However the 
> postgres binaries csum matches a standalone install perfectly so that 
> should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sat, Jun 1, 2019 at 4:11 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 12:32 PM, Tom K wrote:
>
>

>
>     What if you move the recovery.conf file out?
>
>
> Will try.
>
>
>
>     The below looks like missing/corrupted/incorrect files. Hard to tell
>     without knowing what Patroni did?
>
>
> Storage disappeared from underneath these clusters.  The OS was of
> course still in memory making futile attempts to write to disk, which
> would never complete.
>
> My best guess is that Patroni or postgress was in the middle of some
> writes across the clusters when the failure occurred.

So to be clear all three clusters where writing to the same storage
medium and there was no WAL archiving to some other storage?

Yep, cheap lab hardware with no power redundancy ( yet ).



>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 2:08 PM, Tom K wrote:
> 

> 
> 
> Yep, cheap LAB hardware with no power redundancy ( yet ) .

I don't suppose a pg_dump was done anytime recently?


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 12:42 PM, Tom K wrote:
>
>

>
> Of note are the characters f2W below.  I see nothing in the postgres
> source code to indicate this is any recognizable postgres message.  A
> part of me suspects that the postgres binaries got corrupted.   Had this
> case occur with glib-common and a reinstall fixed it.  However the
> postgres binaries csum matches a standalone install perfectly so that
> should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.

Nope.  Here's the full set of lines in the postgres logs when running the above line:

2019-06-01 17:13:03.261 EDT [14909] LOG:  database system was shut down at 2019-05-22 01:55:13 EDT
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  primary_slot_name = 'postgresql2'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  standby_mode = 'on'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  recovery_target_timeline = latest
2019-06-01 17:13:03.261 EDT [14909] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 17:13:03.261 EDT [14909] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 17:13:03.263 EDT [14909] LOG:  entering standby mode
2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history file: f2W
2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  exit(1)
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  reaping dead processes
2019-06-01 17:13:03.265 EDT [14907] LOG:  startup process (PID 14909) exited with exit code 1
2019-06-01 17:13:03.265 EDT [14907] LOG:  aborting startup due to startup process failure
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  cleaning up dynamic shared memory control segment with ID 1669075945
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.268 EDT [14907] LOG:  database system is shut down
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  exit(1)
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  logger shutting down
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  exit(0)
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$

What's interesting is that f2W isn't a string you'd expect to be printed as part of the code logic ( I mean, what is f2W? ).
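
( For comparison, a healthy timeline history file is plain text: each line holds a parent timeline ID, the LSN where the timeline switch happened, and a free-form reason.  A hypothetical pg_wal/00000002.history would contain something like the line below, so "f2W" strongly suggests the contents of the file itself got corrupted:

  1    0/9D000098    no recovery target specified
)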

postgresql10-server-10.8-1PGDG.rhel7.x86_64
postgresql10-libs-10.8-1PGDG.rhel7.x86_64
postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
postgresql10-10.8-1PGDG.rhel7.x86_64

Without the recovery.conf file it actually started up.  Going to try to get a dump, then continue trying to recover in place.

2019-06-01 17:17:29.777 EDT [14936] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.777 EDT [14936] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  received inquiry for database 0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:17:29.779 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.790 EDT [15013] DEBUG:  InitPostgres
2019-06-01 17:17:29.790 EDT [15013] DEBUG:  my backend ID is 3
2019-06-01 17:17:29.791 EDT [15013] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG:  autovacuum: processing database "postgres"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  received inquiry for database 13806
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_13806.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  pg_statistic: vac: 0 (threshold 129), anl: 0 (threshold 89)
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  pg_type: vac: 0 (threshold 125), anl: 0 (threshold 88)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_authid: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_attribute: vac: 0 (threshold 566), anl: 0 (threshold 308)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_proc: vac: 0 (threshold 629), anl: 0 (threshold 339)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_class: vac: 0 (threshold 118), anl: 0 (threshold 84)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_index: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_operator: vac: 0 (threshold 207), anl: 0 (threshold 129)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_opclass: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_am: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_amop: vac: 0 (threshold 192), anl: 0 (threshold 121)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_amproc: vac: 0 (threshold 132), anl: 0 (threshold 91)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_aggregate: vac: 0 (threshold 78), anl: 0 (threshold 64)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_rewrite: vac: 0 (threshold 74), anl: 0 (threshold 62)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_cast: vac: 0 (threshold 94), anl: 0 (threshold 72)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_namespace: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_database: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_tablespace: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_toast_2618: vac: 0 (threshold 95), anl: 0 (threshold 73)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  proc_exit(0): 2 callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  exit(0)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:17:29.814 EDT [14930] DEBUG:  reaping dead processes
2019-06-01 17:17:29.814 EDT [14930] DEBUG:  server process (PID 15013) exited with exit code 0
2019-06-01 17:18:29.785 EDT [14936] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14936] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14937] DEBUG:  received inquiry for database 0
2019-06-01 17:18:29.785 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:18:29.786 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.797 EDT [15020] DEBUG:  InitPostgres
2019-06-01 17:18:29.797 EDT [15020] DEBUG:  my backend ID is 3
2019-06-01 17:18:29.798 EDT [15020] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG:  autovacuum: processing database "postgres"
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  received inquiry for database 13806
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_13806.stat"
2019-06-01 17:18:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_statistic: vac: 0 (threshold 129), anl: 0 (threshold 89)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_type: vac: 0 (threshold 125), anl: 0 (threshold 88)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_authid: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_attribute: vac: 0 (threshold 566), anl: 0 (threshold 308)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_proc: vac: 0 (threshold 629), anl: 0 (threshold 339)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_class: vac: 0 (threshold 118), anl: 0 (threshold 84)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_index: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_operator: vac: 0 (threshold 207), anl: 0 (threshold 129)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_opclass: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_am: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_amop: vac: 0 (threshold 192), anl: 0 (threshold 121)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_amproc: vac: 0 (threshold 132), anl: 0 (threshold 91)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_aggregate: vac: 0 (threshold 78), anl: 0 (threshold 64)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_rewrite: vac: 0 (threshold 74), anl: 0 (threshold 62)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_cast: vac: 0 (threshold 94), anl: 0 (threshold 72)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_namespace: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_database: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_tablespace: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_toast_2618: vac: 0 (threshold 95), anl: 0 (threshold 73)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  proc_exit(0): 2 callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  exit(0)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:18:29.814 EDT [14930] DEBUG:  reaping dead processes
2019-06-01 17:18:29.814 EDT [14930] DEBUG:  server process (PID 15020) exited with exit code 0
^C
[root@psql03 log]#

The point of the POC and the lab is to test these things across failures as well as various configurations.  To that end, I'm just as curious about how to recover from these error conditions as I am about just getting things to work.  


 


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 5:30 PM Tom K <tomkcpr@gmail.com> wrote:


On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 12:42 PM, Tom K wrote:
>
>

>
> Of note are the characters f2W below.  I see nothing in the postgres
> source code to indicate this is any recognizable postgres message.  A
> part of me suspects that the postgres binaries got corrupted.   Had this
> case occur with glib-common and a reinstall fixed it.  However the
> postgres binaries csum matches a standalone install perfectly so that
> should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.

Nope.  Here's the full set of lines in the postgres logs when running the above line:

2019-06-01 17:13:03.261 EDT [14909] LOG:  database system was shut down at 2019-05-22 01:55:13 EDT
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  primary_slot_name = 'postgresql2'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  standby_mode = 'on'
2019-06-01 17:13:03.261 EDT [14909] DEBUG:  recovery_target_timeline = latest
2019-06-01 17:13:03.261 EDT [14909] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 17:13:03.261 EDT [14909] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 17:13:03.263 EDT [14909] LOG:  entering standby mode
2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history file: f2W
2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG:  exit(1)
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  reaping dead processes
2019-06-01 17:13:03.265 EDT [14907] LOG:  startup process (PID 14909) exited with exit code 1
2019-06-01 17:13:03.265 EDT [14907] LOG:  aborting startup due to startup process failure
2019-06-01 17:13:03.265 EDT [14907] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG:  cleaning up dynamic shared memory control segment with ID 1669075945
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 17:13:03.268 EDT [14907] LOG:  database system is shut down
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  exit(1)
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  logger shutting down
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  exit(0)
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG:  proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$

What's interesting is that "f2W" isn't a string you'd expect to be printed as part of the code logic (I mean, what is f2W?).  
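
(Side note, not something from this run: the timeline history files under pg_wal/ are plain text, so they can be inspected directly.  Each entry is supposed to start with a numeric timeline ID, roughly like this -- the file name and switch point below are illustrative only:

-bash-4.2$ cat /data/patroni/pg_wal/00000002.history
1       0/4C000000      no recovery target specified

If the file instead begins with garbage such as "f2W", the history file itself is corrupt; restoring a clean copy from another node, or moving the damaged file aside, is the usual way past this particular FATAL.)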

postgresql10-server-10.8-1PGDG.rhel7.x86_64
postgresql10-libs-10.8-1PGDG.rhel7.x86_64
postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
postgresql10-10.8-1PGDG.rhel7.x86_64

Without the recovery.conf file it actually started up.  Going to try to get a dump then continue to try and recover in place.

2019-06-01 17:17:29.777 EDT [14936] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.777 EDT [14936] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  received inquiry for database 0
2019-06-01 17:17:29.778 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:17:29.779 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.790 EDT [15013] DEBUG:  InitPostgres
2019-06-01 17:17:29.790 EDT [15013] DEBUG:  my backend ID is 3
2019-06-01 17:17:29.791 EDT [15013] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG:  autovacuum: processing database "postgres"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  received inquiry for database 13806
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_13806.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  pg_statistic: vac: 0 (threshold 129), anl: 0 (threshold 89)
2019-06-01 17:17:29.811 EDT [15013] DEBUG:  pg_type: vac: 0 (threshold 125), anl: 0 (threshold 88)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_authid: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_attribute: vac: 0 (threshold 566), anl: 0 (threshold 308)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_proc: vac: 0 (threshold 629), anl: 0 (threshold 339)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_class: vac: 0 (threshold 118), anl: 0 (threshold 84)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_index: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_operator: vac: 0 (threshold 207), anl: 0 (threshold 129)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_opclass: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_am: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_amop: vac: 0 (threshold 192), anl: 0 (threshold 121)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_amproc: vac: 0 (threshold 132), anl: 0 (threshold 91)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_aggregate: vac: 0 (threshold 78), anl: 0 (threshold 64)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_rewrite: vac: 0 (threshold 74), anl: 0 (threshold 62)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_cast: vac: 0 (threshold 94), anl: 0 (threshold 72)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_namespace: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_database: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_tablespace: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  pg_toast_2618: vac: 0 (threshold 95), anl: 0 (threshold 73)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  proc_exit(0): 2 callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  exit(0)
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:17:29.814 EDT [14930] DEBUG:  reaping dead processes
2019-06-01 17:17:29.814 EDT [14930] DEBUG:  server process (PID 15013) exited with exit code 0
2019-06-01 17:18:29.785 EDT [14936] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14936] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14937] DEBUG:  received inquiry for database 0
2019-06-01 17:18:29.785 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:18:29.786 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.797 EDT [15020] DEBUG:  InitPostgres
2019-06-01 17:18:29.797 EDT [15020] DEBUG:  my backend ID is 3
2019-06-01 17:18:29.798 EDT [15020] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG:  autovacuum: processing database "postgres"
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  received inquiry for database 13806
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/global.stat"
2019-06-01 17:18:29.800 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_13806.stat"
2019-06-01 17:18:29.801 EDT [14937] DEBUG:  writing stats file "pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_statistic: vac: 0 (threshold 129), anl: 0 (threshold 89)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_type: vac: 0 (threshold 125), anl: 0 (threshold 88)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_authid: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_attribute: vac: 0 (threshold 566), anl: 0 (threshold 308)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_proc: vac: 0 (threshold 629), anl: 0 (threshold 339)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_class: vac: 0 (threshold 118), anl: 0 (threshold 84)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_index: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_operator: vac: 0 (threshold 207), anl: 0 (threshold 129)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_opclass: vac: 0 (threshold 77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_am: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_amop: vac: 0 (threshold 192), anl: 0 (threshold 121)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_amproc: vac: 0 (threshold 132), anl: 0 (threshold 91)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_aggregate: vac: 0 (threshold 78), anl: 0 (threshold 64)
2019-06-01 17:18:29.811 EDT [15020] DEBUG:  pg_rewrite: vac: 0 (threshold 74), anl: 0 (threshold 62)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_cast: vac: 0 (threshold 94), anl: 0 (threshold 72)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_namespace: vac: 0 (threshold 51), anl: 0 (threshold 51)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_database: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_tablespace: vac: 0 (threshold 50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  pg_toast_2618: vac: 0 (threshold 95), anl: 0 (threshold 73)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  proc_exit(0): 2 callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  exit(0)
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 17:18:29.814 EDT [14930] DEBUG:  reaping dead processes
2019-06-01 17:18:29.814 EDT [14930] DEBUG:  server process (PID 15020) exited with exit code 0
^C
[root@psql03 log]#

The point of the POC and the LAB is to test these things across failures as well as various configurations.  To that end, I'm just as curious how to recover from these error conditions as I am just getting things to work.  



Spoke too soon.  There are no tables when it's started without the recovery.conf file.  
 
 
 


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 2:30 PM, Tom K wrote:
> 
> 
> On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 6/1/19 12:42 PM, Tom K wrote:
>      >
>      >
> 
>      >
>      > Of note are the characters f2W below.  I see nothing in the postgres
>      > source code to indicate this is any recognizable postgres
>     message.  A
>      > part of me suspects that the postgres binaries got corrupted.  
>     Had this
>      > case occur with glib-common and a reinstall fixed it.  However the
>      > postgres binaries csum matches a standalone install perfectly so
>     that
>      > should not be an issue.
> 
>     It comes from timeline.c:
> 
>     https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html
> 
>     pg_log_error("syntax error in history file: %s", fline);
> 
>     ...
> 
>     There should be another error message after the above.
> 
> 
> Nope.  Here's the full set of lines in the postgres logs when running 
> the above line:
> 

> 2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history 
> file: f2W
> 2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.

Actually the above HINT is what I was looking for.

> ^C
> -bash-4.2$
> 
> What's interesting is that f2W isn't a string you'd expect to be printed 
> as part of the code logic   ( I mean, what is f2W? ).

As the HINT said, neither was Postgres. That is probably down to file 
corruption.


> 
> The point of the POC and the LAB is to test these things across failures 
> as well as various configurations.  To that end, I'm just as curious how 
> to recover from these error conditions as I am just getting things to work.

I think what it proved was that a single point of failure is not good 
and that steps need to be taken to prevent or deal with it, e.g. a 
second-location backup of some sort.
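
For instance (just a sketch; the host and target path here are assumptions, not something from this setup), even a nightly logical dump shipped off the cluster would have left something to restore from:

# run from a machine outside the cluster, e.g. via cron
pg_dumpall -h psql02 -U postgres | gzip > /backup/pg_dumpall_$(date +%F).sql.gz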

> 
> 
> 
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 2:31 PM, Tom K wrote:
> 
> 

> 
> Spoke too soon.  There's no tables when it's started without the 
> recovery.conf file.

Were there any errors in the startup?

Are there databases in the cluster, either system (template1, postgres, etc.) or 
user?

Did you start against the correct PG_DATA directory?





-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 2:31 PM, Tom K wrote:
>
>

>
> Spoke too soon.  There's no tables when it's started without the
> recovery.conf file.

Were there any errors in the startup?

Nothing I would discern as a startup error:

[root@psql03 patroni]# sudo su - postgres
Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  -----------------------------------------
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      XDG_SESSION_ID=219
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HOSTNAME=psql03.nix.mds.xyz
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      SHELL=/bin/bash
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      TERM=xterm
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HISTSIZE=1000
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      USER=postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      MAIL=/var/spool/mail/postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      PWD=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LANG=en_US.UTF-8
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HISTCONTROL=ignoredups
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      SHLVL=1
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HOME=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LOGNAME=postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGDATA=/var/lib/pgsql/10/data
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_MONETARY=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_NUMERIC=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_TIME=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  -----------------------------------------
2019-06-01 17:41:27.494 EDT [15209] DEBUG:  registering background worker "logical replication launcher"
2019-06-01 17:41:27.495 EDT [15209] LOG:  listening on IPv4 address "192.168.0.118", port 5432
2019-06-01 17:41:27.498 EDT [15209] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  invoking IpcMemoryCreate(size=148545536)
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 17:41:27.514 EDT [15209] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-01 17:41:27.514 EDT [15209] DEBUG:  removing file "pg_notify/0000"
2019-06-01 17:41:27.515 EDT [15209] DEBUG:  dynamic shared memory system will support 288 segments
2019-06-01 17:41:27.515 EDT [15209] DEBUG:  created dynamic shared memory control segment 834455595 (6928 bytes)
2019-06-01 17:41:27.518 EDT [15209] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-06-01 17:41:27.519 EDT [15209] LOG:  redirecting log output to logging collector process
2019-06-01 17:41:27.519 EDT [15209] HINT:  Future log output will appear in directory "log".
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  logger shutting down
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
-bash-4.2$ 2019-06-01 17:41:27.550 EDT [15210] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  exit(0)
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  proc_exit(-1): 0 callbacks to make

-bash-4.2$
-bash-4.2$

This message:

mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
 
doesn't seem to have been a problem.  I doubled the memory on the VM but the message remained.  
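
For what it's worth, that DEBUG line only means the default huge_pages = try fell back to normal pages, which is harmless; adding RAM does not change it because huge pages have to be reserved separately.  To make it go away, something along these lines should work (the page count below is illustrative only):

# in postgresql.conf: stop attempting huge pages
huge_pages = off

# or instead reserve enough 2 MB huge pages for the ~142 MB shared memory segment
sysctl -w vm.nr_hugepages=80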


Are there databases in the cluster, either system (template1, postgres, etc.) or
user?

No.
 

Did you start against the correct PG_DATA directory?

Yes.
 





--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 6:05 PM Tom K <tomkcpr@gmail.com> wrote:


On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 2:31 PM, Tom K wrote:
>
>

>
> Spoke too soon.  There's no tables when it's started without the
> recovery.conf file.

Were there any errors in the startup?

Nothing I would discern as a startup error:

[root@psql03 patroni]# sudo su - postgres
Last login: Sat Jun  1 17:30:31 EDT 2019 on pts/1
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  postgres: PostmasterMain: initial environment dump:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:  -----------------------------------------
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      XDG_SESSION_ID=219
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HOSTNAME=psql03.nix.mds.xyz
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      SHELL=/bin/bash
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      TERM=xterm
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HISTSIZE=1000
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      USER=postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      MAIL=/var/spool/mail/postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      PWD=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LANG=en_US.UTF-8
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HISTCONTROL=ignoredups
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      SHLVL=1
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      HOME=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG:      LOGNAME=postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGDATA=/var/lib/pgsql/10/data
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      _=/usr/pgsql-10/bin/postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_COLLATE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_CTYPE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_MESSAGES=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_MONETARY=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_NUMERIC=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:      LC_TIME=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:  -----------------------------------------
2019-06-01 17:41:27.494 EDT [15209] DEBUG:  registering background worker "logical replication launcher"
2019-06-01 17:41:27.495 EDT [15209] LOG:  listening on IPv4 address "192.168.0.118", port 5432
2019-06-01 17:41:27.498 EDT [15209] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  invoking IpcMemoryCreate(size=148545536)
2019-06-01 17:41:27.499 EDT [15209] DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 17:41:27.514 EDT [15209] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-01 17:41:27.514 EDT [15209] DEBUG:  removing file "pg_notify/0000"
2019-06-01 17:41:27.515 EDT [15209] DEBUG:  dynamic shared memory system will support 288 segments
2019-06-01 17:41:27.515 EDT [15209] DEBUG:  created dynamic shared memory control segment 834455595 (6928 bytes)
2019-06-01 17:41:27.518 EDT [15209] DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
2019-06-01 17:41:27.519 EDT [15209] LOG:  redirecting log output to logging collector process
2019-06-01 17:41:27.519 EDT [15209] HINT:  Future log output will appear in directory "log".
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  logger shutting down
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
-bash-4.2$ 2019-06-01 17:41:27.550 EDT [15210] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  exit(0)
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG:  proc_exit(-1): 0 callbacks to make

-bash-4.2$
-bash-4.2$

This  message:

mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
 
doesn't seem to have been a problem.  I doubled the memory on the VM but the message remained.  


Are there databases in the cluster, either system (template1, postgres, etc.) or
user?

No. 

** Correction.  There are postgres, template1 and template2, but none of the other databases we had.   
 

Did you start against the correct PG_DATA directory?

Yes.
 





--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 3:14 PM, Tom K wrote:
> 
> 

> ** Correction.  There is postgres, template1 and template2 but none of 
> the other databases we had.

In a psql session do:

select oid, datname from pg_database ;

Then go to /data/patroni and drill down to the base directory. In that 
directory there should be sub-directories with numbers equal to the oid 
values from the query. If not, then something is wrong. That could be:

1) Wrong data directory.

2) Copied over incorrect data directory.

Basically if the only sub-directories are for the system databases then 
the user databases do not exist. You might want to look at some of the 
other clusters to see what is there.
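
For example, on an intact copy of this cluster you would expect base/ to hold one numbered directory per row of that query, plus directories for the user databases (the extra OIDs below are illustrative only):

-bash-4.2$ ls /data/patroni/base
1  13805  13806  16395  24586  pgsql_tmp

If the only numbered directories are the three system ones, the user databases simply are not in that data directory.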



-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 3:14 PM, Tom K wrote:

> 
> ** Correction.  There is postgres, template1 and template2 but none of 
> the other databases we had.

Just noticed, is that really template2 or is it actually template0?


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 6:39 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 3:14 PM, Tom K wrote:

>
> ** Correction.  There is postgres, template1 and template2 but none of
> the other databases we had.

Just noticed, is that really template2 or is it actually template0?

Apologies.  You're right, that was a typo:

-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=#

 

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 6:36 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 3:14 PM, Tom K wrote:
>
>

> ** Correction.  There is postgres, template1 and template2 but none of
> the other databases we had.

In a psql session do:

select oid, datname from pg_database ;

Then go to /data/patroni and drill down to the base directory. In that
directory there should be sub-directories with numbers equal to the oid
values from the query. If not, then something is wrong. That could be:

1) Wrong data directory.

2) Copied over incorrect data directory.

Basically if the only sub-directories are for the system databases then
the user databases do not exist. You might want to look at some of the
other clusters to see what is there.




-bash-4.2$ pwd
/var/lib/pgsql
-bash-4.2$ psql -h psql03 -U postgres
Password for user postgres:
psql (10.8)
Type "help" for help.

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres=# select oid, datname from pg_database ;
  oid  |  datname
-------+-----------
 13806 | postgres
     1 | template1
 13805 | template0
(3 rows)

postgres=#
postgres=# cd /data/patroni
postgres-# ls -altri
postgres-# pwd
postgres-# pwd
postgres-# ls -altrid
postgres-# ;
ERROR:  syntax error at or near "cd"
LINE 1: cd /data/patroni
        ^
postgres=# \q
-bash-4.2$ ls -altri
total 652
134299846 drwxr-xr-x. 39 root     root       4096 Oct 24  2018 ..
 67178653 -rw-r--r--.  1 postgres postgres     30 Oct 24  2018 .pgsql_profile
134734937 drwx------.  4 postgres postgres     31 May  8 06:25 10
 67178637 drwx------.  3 postgres postgres   4096 May  8 06:25 .
 68994450 -rw-r--r--.  1 postgres postgres 204336 May 26 22:37 s1.log
 67677594 -rwx------.  1 postgres postgres    266 May 26 23:06 .bash_profile
 68994451 -rw-r--r--.  1 postgres postgres 433920 May 26 23:44 s2.log
 68391913 -rw-------.  1 postgres postgres   4716 Jun  1 17:34 .bash_history
 67178638 -rw-------.  1 postgres postgres    325 Jun  1 18:54 .psql_history
-bash-4.2$ cd /data/patroni/
-bash-4.2$ ls -altri
total 144
 69085037 drwxr-xr-x.  3 root     root        20 Oct 23  2018 ..
135316997 -rw-------.  1 postgres postgres   206 Oct 29  2018 backup_label.old
201708781 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_commit_ts
  1502746 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_dynshmem
 68994449 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_twophase
  1502749 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_snapshots
201708785 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_serial
  1502747 drwx------.  4 postgres postgres    34 Oct 29  2018 pg_multixact
 67677559 drwx------.  5 postgres postgres    38 Oct 29  2018 base
  1502751 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_xact
  1502750 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_tblspc
134668882 -rw-------.  1 postgres postgres    88 Oct 29  2018 postgresql.auto.conf
134500181 -rw-------.  1 postgres postgres     3 Oct 29  2018 PG_VERSION
134500173 -rw-------.  1 postgres postgres  1636 Oct 29  2018 pg_ident.conf
134500171 -rw-------.  1 postgres postgres  4730 Oct 29  2018 pg_hba.conf
134668883 -rw-------.  1 postgres postgres 22783 Oct 29  2018 postgresql.base.conf
134500170 -rw-------.  1 postgres postgres   124 Oct 29  2018 patroni.dynamic.json
134668889 -rw-r--r--.  1 postgres postgres    90 Oct 29  2018 recovery.done
201708786 drwx------.  2 postgres postgres    17 Oct 29  2018 pg_subtrans
134500162 drwx------.  2 postgres postgres  4096 Nov  4  2018 log
134500182 drwx------.  3 postgres postgres  8192 Apr 28 06:06 pg_wal
  1502748 drwx------.  4 postgres postgres    42 May 22 01:55 pg_replslot
134668893 -rw-r--r--.  1 postgres postgres    90 May 26 22:34 t.txt
134668885 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf
134668886 -rw-r--r--.  1 postgres postgres   575 May 26 22:34 postgresql.conf.backup
134668884 -rw-------.  1 postgres postgres 22783 May 26 22:34 postgresql.base.conf.backup
134500172 -rw-------.  1 postgres postgres  4730 May 26 22:34 pg_hba.conf.backup
134500174 -rw-------.  1 postgres postgres  1636 May 26 22:34 pg_ident.conf.backup
 68994437 drwx------.  4 postgres postgres    65 Jun  1 17:34 pg_logical
134734900 -rw-------.  1 postgres postgres    38 Jun  1 18:53 .s.PGSQL.5432.lock
134734941 srwxrwxrwx.  1 postgres postgres     0 Jun  1 18:53 .s.PGSQL.5432
201708783 drwx------.  2 postgres postgres    17 Jun  1 18:53 pg_notify
134668887 -rw-------.  1 postgres postgres   439 Jun  1 18:53 postmaster.opts
134734940 -rw-------.  1 postgres postgres    30 Jun  1 18:53 current_logfiles
134812989 drwx------. 20 postgres postgres  4096 Jun  1 18:53 .
134734899 -rw-------.  1 postgres postgres    81 Jun  1 18:53 postmaster.pid
 68994444 drwx------.  2 postgres postgres     6 Jun  1 18:53 pg_stat
 67677567 drwx------.  2 postgres postgres  4096 Jun  1 18:53 global
134500180 drwx------.  2 postgres postgres    60 Jun  1 18:54 pg_stat_tmp
-bash-4.2$ cd
-bash-4.2$ cd 10
-bash-4.2$ ls
backups  data
-bash-4.2$ pwd
/var/lib/pgsql/10
-bash-4.2$ cd data/
-bash-4.2$ ls -altri
total 0
134734937 drwx------. 4 postgres postgres 31 May  8 06:25 ..
   245519 drwx------. 2 postgres postgres  6 May  8 06:25 .
-bash-4.2$ cd ..
-bash-4.2$ pwd
/var/lib/pgsql/10
-bash-4.2$ cd ..
-bash-4.2$


 

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 3:56 PM, Tom K wrote:
> 
> 

> 
> postgres=# select oid, datname from pg_database ;
>    oid  |  datname
> -------+-----------
>   13806 | postgres
>       1 | template1
>   13805 | template0
> (3 rows)
> 

So there are only the system databases available.

> -bash-4.2$ cd /data/patroni/
> -bash-4.2$ ls -altri
> total 144
>   69085037 drwxr-xr-x.  3 root     root        20 Oct 23  2018 ..
> 135316997 -rw-------.  1 postgres postgres   206 Oct 29  2018 
> backup_label.old
> 201708781 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_commit_ts
>    1502746 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_dynshmem
>   68994449 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_twophase
>    1502749 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_snapshots
> 201708785 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_serial
>    1502747 drwx------.  4 postgres postgres    34 Oct 29  2018 pg_multixact
>   67677559 drwx------.  5 postgres postgres    38 Oct 29  2018 base

base/ is the directory you need to look in. I'm guessing it is only 
going to show the oid/ directories for the three DBs above, plus pgsql_tmp/.

For more info on this see:
https://www.postgresql.org/docs/10/storage-file-layout.html

The below looks like the RH package-installed data directory. It also looks 
like it either has never had initdb run against it, or the files were removed.

I thought you said you had copied in data directories from the other 
nodes; did I remember correctly?

> -bash-4.2$ cd
> -bash-4.2$ cd 10
> -bash-4.2$ ls
> backups  data
> -bash-4.2$ pwd
> /var/lib/pgsql/10
> -bash-4.2$ cd data/
> -bash-4.2$ ls -altri
> total 0
> 134734937 drwx------. 4 postgres postgres 31 May  8 06:25 ..
>     245519 drwx------. 2 postgres postgres  6 May  8 06:25 .
> -bash-4.2$ cd ..
> -bash-4.2$ pwd
> /var/lib/pgsql/10
> -bash-4.2$ cd ..
> -bash-4.2$


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 7:12 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 3:56 PM, Tom K wrote:
>
>

>
> postgres=# select oid, datname from pg_database ;
>    oid  |  datname
> -------+-----------
>   13806 | postgres
>       1 | template1
>   13805 | template0
> (3 rows)
>

So there are only the system databases available

> -bash-4.2$ cd /data/patroni/
> -bash-4.2$ ls -altri
> total 144
>   69085037 drwxr-xr-x.  3 root     root        20 Oct 23  2018 ..
> 135316997 -rw-------.  1 postgres postgres   206 Oct 29  2018
> backup_label.old
> 201708781 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_commit_ts
>    1502746 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_dynshmem
>   68994449 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_twophase
>    1502749 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_snapshots
> 201708785 drwx------.  2 postgres postgres     6 Oct 29  2018 pg_serial
>    1502747 drwx------.  4 postgres postgres    34 Oct 29  2018 pg_multixact
>   67677559 drwx------.  5 postgres postgres    38 Oct 29  2018 base

base/ is the directory you need to look in. I'm guessing it is only
going to show the oid/ for the three db's above and  pgsql_tmp/

For more info on this see:
https://www.postgresql.org/docs/10/storage-file-layout.html

The below looks like the RH package-installed data directory. It also looks
like it either has never had initdb run against it, or the files were removed.

I thought you said you had copied in data directories from the other
nodes; did I remember correctly?

> -bash-4.2$ cd
> -bash-4.2$ cd 10
> -bash-4.2$ ls
> backups  data
> -bash-4.2$ pwd
> /var/lib/pgsql/10
> -bash-4.2$ cd data/
> -bash-4.2$ ls -altri
> total 0
> 134734937 drwx------. 4 postgres postgres 31 May  8 06:25 ..
>     245519 drwx------. 2 postgres postgres  6 May  8 06:25 .
> -bash-4.2$ cd ..
> -bash-4.2$ pwd
> /var/lib/pgsql/10
> -bash-4.2$ cd ..
> -bash-4.2$


Yep, you remembered correctly.  

I copied the files as they were out to a temporary folder under root on each node, but never dug into base/ etc. any further to check things.  So here's the state of things in the base/ folder of the backup on each node. 

[ PSQL03 ]
[root@psql03 base]# ls -altri
total 40
    42424 drwx------.  2 postgres postgres 8192 Oct 29  2018 1
 67714749 drwx------.  2 postgres postgres 8192 Oct 29  2018 13805
202037206 drwx------.  5 postgres postgres   38 Oct 29  2018 .
134312175 drwx------.  2 postgres postgres 8192 May 22 01:55 13806
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 base]#




[ PSQL02 ]
 [root@psql02 base]# ls -altri
total 412
201426668 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
   743426 drwx------.  2 postgres postgres  8192 Mar 24 03:47 13805
135326327 drwx------.  2 postgres postgres 16384 Mar 24 20:15 40970
   451699 drwx------.  2 postgres postgres 40960 Mar 25 19:47 16395
  1441696 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131137
 68396137 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131138
135671065 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131139
204353100 drwx------.  2 postgres postgres  8192 Mar 31 15:09 131140
135326320 drwx------. 17 postgres postgres  4096 Apr 14 10:08 .
 68574415 drwx------.  2 postgres postgres 12288 Apr 28 06:06 131142
   288896 drwx------.  2 postgres postgres 16384 Apr 28 06:06 131141
203015232 drwx------.  2 postgres postgres  8192 Apr 28 06:06 131136
135326328 drwx------.  2 postgres postgres 40960 May  5 22:09 24586
 67282461 drwx------.  2 postgres postgres  8192 May  5 22:09 13806
 67640961 drwx------.  2 postgres postgres 20480 May  5 22:09 131134
203500274 drwx------.  2 postgres postgres 16384 May  5 22:09 155710
134438257 drwxr-xr-x. 20 root     root      4096 May 22 01:44 ..
[root@psql02 base]# pwd
/root/postgres-patroni-backup/base
[root@psql02 base]#



[ PSQL01 ]
[root@psql01 base]# ls -altri
total 148
134704615 drwx------.  2 postgres postgres  8192 Oct 29  2018 1
201547700 drwx------.  2 postgres postgres  8192 Oct 29  2018 13805
   160398 drwx------.  2 postgres postgres  8192 Feb 24 23:53 13806
 67482137 drwx------.  7 postgres postgres    62 Feb 24 23:54 .
135909671 drwx------.  2 postgres postgres 24576 Feb 24 23:54 24586
134444555 drwx------.  2 postgres postgres 24576 Feb 24 23:54 16395
 67178716 drwxr-xr-x. 20 root     root      4096 May 22 01:53 ..
[root@psql01 base]# pwd
/root/postgresql-patroni-etcd/base
[root@psql01 base]#

Looks like this crash was far more catastrophic than I thought.  By the looks of things, I'm thinking psql02 would be my best bet.  



--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 4:22 PM, Tom K wrote:
> 

>     I thought you said you had copied in data directories from the other
>     nodes, did I remember correctly?

> Yep, you remembered correctly.
> 
> I copied the files as they were, out to a temporary folder under root 
> for each node but never dug into base/ etc any further to check things.  
> So here's the state of things in the base/ folder of the backup of each 
> node.
> 

> 
> Looks like this crash was far more catastrophic then I thought.  By the 
> looks of things, thinking on psql02 would be my best bet.
> 
> 

Assuming the other files in the cluster are all there and not corrupted.


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Adrian Klaver
Дата:
On 6/1/19 4:22 PM, Tom K wrote:
> 
> 

> 
> Looks like this crash was far more catastrophic then I thought.  By the 
> looks of things, thinking on psql02 would be my best bet.
> 

The more I look at it, the more I think the replication was not doing 
what you thought it was doing: that psql02 was the primary, and that 
psql01 and psql03 were out-of-sync and/or defunct standbys.
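
One way to check that (not something that was run in this thread) is pg_controldata against each copy of the data directory; the cluster state, latest checkpoint location and timeline show which copy was the primary and which is most advanced:

/usr/pgsql-10/bin/pg_controldata /data/patroni | egrep 'cluster state|checkpoint location|TimeLineID'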



-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 4:22 PM, Tom K wrote:
>
>

>
> Looks like this crash was far more catastrophic than I thought.  By the
> looks of things, I'm thinking psql02 would be my best bet.
>

The more I look at it, the more I think the replication was not doing
what you thought it was doing: that psql02 was the primary, and that
psql01 and psql03 were out-of-sync and/or defunct standbys.

Now that I look at the files myself, that's the conclusion I was coming to as well.  Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
    listen: 192.168.0.124:8008
    connect_address: 192.168.0.124:8008

etcd:
    host: 192.168.0.124:2379

bootstrap:
    dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
            use_pg_rewind: true

    initdb:
    - encoding: UTF8
    - data-checksums

    pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator 192.168.0.108/0 md5
    - host replication replicator 192.168.0.124/0 md5
    - host replication replicator 192.168.0.118/0 md5
    - host all all 0.0.0.0/0 md5

    users:
        admin:
            password: admin
            options:
                - createrole
                - createdb

postgresql:
    listen: 192.168.0.124:5432
    bin_dir: /usr/pgsql-10/bin
    connect_address: 192.168.0.124:5432
    data_dir: /data/patroni
    pgpass: /tmp/pgpass
    unix_socket_directories: /data/patroni
    authentication:
        replication:
            username: replicator
            password: rep-pass
        superuser:
            username: postgres
            password: <SECRET>
    parameters:
        unix_socket_directories: '.'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
[root@psql02 base]#


Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.  
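
If that is the suspicion, it may be worth a quick look in lost+found on the filesystem holding the data directory (the mount point below is assumed); fsck moves orphaned files and directories there after a crash, so missing base/ subdirectories could conceivably turn up under numeric names:

[root@psql02 ~]# ls -l /data/lost+found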
 



--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:
So the best bet will be trying to get through this error then:

[ PSQL02 ]
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
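
That PANIC is raised while reading the replication-origin state file pg_logical/replorigin_checkpoint, whose first bytes are zero instead of the expected magic number, rather than from the WAL itself.  A commonly suggested workaround -- only after taking a full copy of the data directory, and accepting that it discards logical-replication origin progress (not used by Patroni's physical streaming) -- is to move that file aside and start again:

-bash-4.2$ mv /data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/replorigin_checkpoint.bad

If the file is simply absent, startup treats it as having no origins and continues past this point.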


 

On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomkcpr@gmail.com> wrote:


On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 4:22 PM, Tom K wrote:
>
>

>
> Looks like this crash was far more catastrophic then I thought.  By the
> looks of things, thinking on psql02 would be my best bet.
>

The more I look at it, the more I think the replication was not doing
what you thought it was doing: that psql02 was the primary, and that
psql01 and psql03 were out-of-sync and/or defunct standbys.

Now that I look at the files myself, that's the conclusion I was coming to myself.  Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
    listen: 192.168.0.124:8008
    connect_address: 192.168.0.124:8008

etcd:
    host: 192.168.0.124:2379

bootstrap:
    dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
            use_pg_rewind: true

    initdb:
    - encoding: UTF8
    - data-checksums

    pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator 192.168.0.108/0 md5
    - host replication replicator 192.168.0.124/0 md5
    - host replication replicator 192.168.0.118/0 md5
    - host all all 0.0.0.0/0 md5

    users:
        admin:
            password: admin
            options:
                - createrole
                - createdb

postgresql:
    listen: 192.168.0.124:5432
    bin_dir: /usr/pgsql-10/bin
    connect_address: 192.168.0.124:5432
    data_dir: /data/patroni
    pgpass: /tmp/pgpass
    unix_socket_directories: /data/patroni
    authentication:
        replication:
            username: replicator
            password: rep-pass
        superuser:
            username: postgres
            password: <SECRET>
    parameters:
        unix_socket_directories: '.'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
[root@psql02 base]#


Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.  
 



--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

От
Tom K
Дата:


On Sat, Jun 1, 2019 at 8:25 PM Tom K <tomkcpr@gmail.com> wrote:
So the best bet will be trying to get through this error then:

[ PSQL02 ]
PANIC:  replication checkpoint has wrong magic 0 instead of 307747550


 

On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomkcpr@gmail.com> wrote:


On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 4:22 PM, Tom K wrote:
>
>

>
> Looks like this crash was far more catastrophic then I thought.  By the
> looks of things, thinking on psql02 would be my best bet.
>

The more I look at it, the more I think the replication was not doing
what you thought it was doing: that psql02 was the primary, and that
psql01 and psql03 were out-of-sync and/or defunct standbys.

Now that I look at the files myself, that's the conclusion I was coming to myself.  Sample config:

[root@psql02 base]# cat /etc/patroni.yml
scope: postgres
namespace: /db/
name: postgresql1

restapi:
    listen: 192.168.0.124:8008
    connect_address: 192.168.0.124:8008

etcd:
    host: 192.168.0.124:2379

bootstrap:
    dcs:
        ttl: 30
        loop_wait: 10
        retry_timeout: 10
        maximum_lag_on_failover: 1048576
        postgresql:
            use_pg_rewind: true

    initdb:
    - encoding: UTF8
    - data-checksums

    pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator 192.168.0.108/0 md5
    - host replication replicator 192.168.0.124/0 md5
    - host replication replicator 192.168.0.118/0 md5
    - host all all 0.0.0.0/0 md5

    users:
        admin:
            password: admin
            options:
                - createrole
                - createdb

postgresql:
    listen: 192.168.0.124:5432
    bin_dir: /usr/pgsql-10/bin
    connect_address: 192.168.0.124:5432
    data_dir: /data/patroni
    pgpass: /tmp/pgpass
    unix_socket_directories: /data/patroni
    authentication:
        replication:
            username: replicator
            password: rep-pass
        superuser:
            username: postgres
            password: <SECRET>
    parameters:
        unix_socket_directories: '.'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
[root@psql02 base]#


Or perhaps when the system crashed, the filesystem check simply moved the folders out due to corruption.  
 


Trying what we did above but on the second node:

2019-06-01 20:28:45.305 EDT [21745] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG:  entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG:  invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG:  starting up replication origin progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG:  reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG:  startup process (PID 21745) was terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG:  aborting startup due to startup process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG:  cleaning up dynamic shared memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 20:28:46.146 EDT [21743] LOG:  database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG:  proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$


Removing the recovery.conf and restarting yields the following messages:  

2019-06-01 20:31:12.231 EDT [21910] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication origin progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG:  startup process (PID 21910) was terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG:  aborting startup due to startup process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG:  cleaning up dynamic shared memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(1): 2 callbacks to make
2019-06-01 20:31:12.508 EDT [21908] LOG:  database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(0): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(-1): 0 callbacks to make


 
--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 5:21 PM, Tom K wrote:
> 
> 
> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> 
>     On 6/1/19 4:22 PM, Tom K wrote:
>      >
>      >
> 
>      >
>      > Looks like this crash was far more catastrophic then I thought. 
>     By the
>      > looks of things, thinking on psql02 would be my best bet.
>      >
> 
>     The more I look at it the more I think the replication was not doing
>     what you thought it was doing. That psql02 was the primary and that
>     psql01 and psql03 were out of sync and/or defunct standbys.
> 
> 
> Now that I look at the files myself, that's the conclusion I was coming 
> to myself.  Sample config:

The below would be for someone that uses and understands Patroni. That 
would not be me:)

> 
> [root@psql02 base]# cat /etc/patroni.yml
> scope: postgres
> namespace: /db/
> name: postgresql1
> 
> restapi:
>      listen: 192.168.0.124:8008
>      connect_address: 192.168.0.124:8008

> 
> Or perhaps when the system crashed, the filesystem check simply moved 
> the folders out due to corruption.

That would leave the cluster in an inconsistent state and you would not 
have been able to start the one you got going.

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:
That file just generates the postgres configs.   Here is what is generated:

-bash-4.2$ cat postgresql.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'

cluster_name = 'postgres'
hot_standby = 'on'
listen_addresses = '192.168.0.124'
max_connections = '100'
max_locks_per_transaction = '64'
max_prepared_transactions = '0'
max_replication_slots = '10'
max_wal_senders = '10'
max_worker_processes = '8'
port = '5432'
track_commit_timestamp = 'off'
unix_socket_directories = '.'
wal_keep_segments = '8'
wal_level = 'replica'
wal_log_hints = 'on'
hba_file = '/data/patroni/pg_hba.conf'
ident_file = '/data/patroni/pg_ident.conf'
-bash-4.2$ cat /data/patroni/pg_hba.conf
# PostgreSQL Client Authentication Configuration File
# ===================================================
#
# Refer to the "Client Authentication" section in the PostgreSQL
# documentation for a complete description of this file.  A short
# synopsis follows.
#
# This file controls: which hosts are allowed to connect, how clients
# are authenticated, which PostgreSQL user names they can use, which
# databases they can access.  Records take one of these forms:
#
# local      DATABASE  USER  METHOD  [OPTIONS]
# host       DATABASE  USER  ADDRESS  METHOD  [OPTIONS]
# hostssl    DATABASE  USER  ADDRESS  METHOD  [OPTIONS]
# hostnossl  DATABASE  USER  ADDRESS  METHOD  [OPTIONS]
#
# (The uppercase items must be replaced by actual values.)
#
# The first field is the connection type: "local" is a Unix-domain
# socket, "host" is either a plain or SSL-encrypted TCP/IP socket,
# "hostssl" is an SSL-encrypted TCP/IP socket, and "hostnossl" is a
# plain TCP/IP socket.
#
# DATABASE can be "all", "sameuser", "samerole", "replication", a
# database name, or a comma-separated list thereof. The "all"
# keyword does not match "replication". Access to replication
# must be enabled in a separate record (see example below).
#
# USER can be "all", a user name, a group name prefixed with "+", or a
# comma-separated list thereof.  In both the DATABASE and USER fields
# you can also write a file name prefixed with "@" to include names
# from a separate file.
#
# ADDRESS specifies the set of hosts the record matches.  It can be a
# host name, or it is made up of an IP address and a CIDR mask that is
# an integer (between 0 and 32 (IPv4) or 128 (IPv6) inclusive) that
# specifies the number of significant bits in the mask.  A host name
# that starts with a dot (.) matches a suffix of the actual host name.
# Alternatively, you can write an IP address and netmask in separate
# columns to specify the set of hosts.  Instead of a CIDR-address, you
# can write "samehost" to match any of the server's own IP addresses,
# or "samenet" to match any address in any subnet that the server is
# directly connected to.
#
# METHOD can be "trust", "reject", "md5", "password", "scram-sha-256",
# "gss", "sspi", "ident", "peer", "pam", "ldap", "radius" or "cert".
# Note that "password" sends passwords in clear text; "md5" or
# "scram-sha-256" are preferred since they send encrypted passwords.
#
# OPTIONS are a set of options for the authentication in the format
# NAME=VALUE.  The available options depend on the different
# authentication methods -- refer to the "Client Authentication"
# section in the documentation for a list of which options are
# available for which authentication methods.
#
# Database and user names containing spaces, commas, quotes and other
# special characters must be quoted.  Quoting one of the keywords
# "all", "sameuser", "samerole" or "replication" makes the name lose
# its special character, and just match a database or username with
# that name.
#
# This file is read on server startup and when the server receives a
# SIGHUP signal.  If you edit the file on a running system, you have to
# SIGHUP the server for the changes to take effect, run "pg_ctl reload",
# or execute "SELECT pg_reload_conf()".
#
# Put your actual configuration here
# ----------------------------------
#
# If you want to allow non-local connections, you need to add more
# "host" records.  In that case you will also need to make PostgreSQL
# listen on a non-local interface via the listen_addresses
# configuration parameter, or via the -i or -h command line switches.

# CAUTION: Configuring the system for local "trust" authentication
# allows any local user to connect as any PostgreSQL user, including
# the database superuser.  If you do not trust all your local users,
# use another authentication method.


# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust

host replication replicator 127.0.0.1/32 md5
host replication replicator 192.168.0.108/0 md5
host replication replicator 192.168.0.124/0 md5
host replication replicator 192.168.0.118/0 md5
host all all 0.0.0.0/0 md5
-bash-4.2$ pwd
/data/patroni
-bash-4.2$


-bash-4.2$ cat postgresql.base.conf|grep -v "#"|sed -e "/^$/d"
log_timezone = 'Canada/Eastern'
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
default_text_search_config = 'pg_catalog.english'
-bash-4.2$


On Sat, Jun 1, 2019 at 8:36 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 5:21 PM, Tom K wrote:
>
>
> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
>     On 6/1/19 4:22 PM, Tom K wrote:
>      >
>      >
>
>      >
>      > Looks like this crash was far more catastrophic then I thought.
>     By the
>      > looks of things, thinking on psql02 would be my best bet.
>      >
>
>     The more I look at it the more I think the replication was not doing
>     what you thought it was doing. That psql02 was the primary and that
>     psql01 and psql03 were out of sync and/or defunct standbys.
>
>
> Now that I look at the files myself, that's the conclusion I was coming
> to myself.  Sample config:

The below would be for someone that uses and understands Patroni. That
would not be me:)

>
> [root@psql02 base]# cat /etc/patroni.yml
> scope: postgres
> namespace: /db/
> name: postgresql1
>
> restapi:
>      listen: 192.168.0.124:8008
>      connect_address: 192.168.0.124:8008

>
> Or perhaps when the system crashed, the filesystem check simply moved
> the folders out due to corruption.

That would leave the cluster in an inconsistent state and you would not
have been able to start the one you got going.

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 5:32 PM, Tom K wrote:

> 
> 
> Trying what we did above but on the second node:
> 

Was this node the primary?

To me the below looks like there are replication slots set up that are 
failing. Not sure how to deal with this at the moment. You might try 
single-user mode:

https://www.postgresql.org/docs/10/app-postgres.html
Single-User Mode
...

and see if that at least gets the server started. This is a highly
restricted mode, so do not expect much usability.

> Removing the recovery.conf and restarting yields the following messages:
> 
> 2019-06-01 20:31:12.231 EDT [21910] LOG:  database system was 
> interrupted; last known up at 2019-04-28 06:06:24 EDT
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid record length at 
> 0/4C35CDF8: wanted 24, got 0
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid primary checkpoint record
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  using previous checkpoint 
> record at 0/4C34EDA8
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  redo record is at 
> 0/4C34ED70; shutdown FALSE
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next transaction ID: 
> 0:1409831; next OID: 237578
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next MultiXactId: 48; next 
> MultiXactOffset: 174
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest unfrozen transaction 
> ID: 549, in database 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest MultiXactId: 1, in 
> database 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  commit timestamp Xid 
> oldest/newest: 0/0
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  transaction ID wrap limit is 
> 2147484196, limited by database with OID 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  MultiXactId wrap limit is 
> 2147483648, limited by database with OID 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication slots
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication 
> origin progress state
> 2019-06-01 20:31:12.505 EDT [21910] PANIC:  replication checkpoint has 
> wrong magic 0 instead of 307747550
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  reaping dead processes
> 2019-06-01 20:31:12.506 EDT [21908] LOG:  startup process (PID 21910) 
> was terminated by signal 6: Aborted
> 2019-06-01 20:31:12.506 EDT [21908] LOG:  aborting startup due to 
> startup process failure
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 0 
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 5 
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  cleaning up dynamic shared 
> memory control segment with ID 976986759
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(1): 2 callbacks to 
> make
> 2019-06-01 20:31:12.508 EDT [21908] LOG:  database system is shut down
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  exit(1)
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0 
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(-1): 0 callbacks 
> to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  logger shutting down
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0 
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(0): 0 callbacks to 
> make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  exit(0)
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0 
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(-1): 0 callbacks 
> to make
> 
>             -- 
>             Adrian Klaver
>             adrian.klaver@aklaver.com
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sat, Jun 1, 2019 at 8:53 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 5:32 PM, Tom K wrote:

>
>
> Trying what we did above but on the second node:
>

Was this node the primary?

To me the below looks like there are replication slots set up that are
failing. Not sure how to deal with this at the moment. You might try
single-user mode:

https://www.postgresql.org/docs/10/app-postgres.html
Single-User Mode
...

and see if that at least gets the server started. This is a highly
restricted mode, so do not expect much usability.

These servers did crash before; however, I didn't notice anything particularly wrong at the time. Perhaps a better check was in order.

Here is the result from attempting to start the DB in single user mode. 

-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=off --listen_addresses=192.168.0.124 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=20 -d 5 2>&1
2019-06-01 23:04:20.876 EDT [31831] DEBUG:  invoking IpcMemoryCreate(size=144687104)
2019-06-01 23:04:20.877 EDT [31831] DEBUG:  mmap(144703488) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  SlruScanDirectory invoking callback on pg_notify/0000
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  removing file "pg_notify/0000"
2019-06-01 23:04:20.890 EDT [31831] DEBUG:  dynamic shared memory system will support 128 segments
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  created dynamic shared memory control segment 846930886 (3088 bytes)
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  InitPostgres
2019-06-01 23:04:20.891 EDT [31831] DEBUG:  my backend ID is 1
2019-06-01 23:04:20.891 EDT [31831] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:04:21.239 EDT [31831] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 23:04:21.239 EDT [31831] LOG:  invalid primary checkpoint record
2019-06-01 23:04:21.239 EDT [31831] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  next transaction ID: 0:1409831; next OID: 237578
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  oldest unfrozen transaction ID: 549, in database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  oldest MultiXactId: 1, in database 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  commit timestamp Xid oldest/newest: 0/0
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  starting up replication slots
2019-06-01 23:04:21.239 EDT [31831] DEBUG:  starting up replication origin progress state
2019-06-01 23:04:21.239 EDT [31831] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
Aborted
-bash-4.2$

-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint record
2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint record at 0/4C34EDA8
2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
Aborted
-bash-4.2$

 

> Removing the recovery.conf and restarting yields the following messages:
>
> 2019-06-01 20:31:12.231 EDT [21910] LOG:  database system was
> interrupted; last known up at 2019-04-28 06:06:24 EDT
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid record length at
> 0/4C35CDF8: wanted 24, got 0
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  invalid primary checkpoint record
> 2019-06-01 20:31:12.505 EDT [21910] LOG:  using previous checkpoint
> record at 0/4C34EDA8
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  redo record is at
> 0/4C34ED70; shutdown FALSE
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next transaction ID:
> 0:1409831; next OID: 237578
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  next MultiXactId: 48; next
> MultiXactOffset: 174
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest unfrozen transaction
> ID: 549, in database 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  oldest MultiXactId: 1, in
> database 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  commit timestamp Xid
> oldest/newest: 0/0
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  transaction ID wrap limit is
> 2147484196, limited by database with OID 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  MultiXactId wrap limit is
> 2147483648, limited by database with OID 1
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication slots
> 2019-06-01 20:31:12.505 EDT [21910] DEBUG:  starting up replication
> origin progress state
> 2019-06-01 20:31:12.505 EDT [21910] PANIC:  replication checkpoint has
> wrong magic 0 instead of 307747550
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  reaping dead processes
> 2019-06-01 20:31:12.506 EDT [21908] LOG:  startup process (PID 21910)
> was terminated by signal 6: Aborted
> 2019-06-01 20:31:12.506 EDT [21908] LOG:  aborting startup due to
> startup process failure
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  shmem_exit(1): 5
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.506 EDT [21908] DEBUG:  cleaning up dynamic shared
> memory control segment with ID 976986759
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(1): 2 callbacks to
> make
> 2019-06-01 20:31:12.508 EDT [21908] LOG:  database system is shut down
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  exit(1)
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  shmem_exit(-1): 0
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.508 EDT [21908] DEBUG:  proc_exit(-1): 0 callbacks
> to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  logger shutting down
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(0): 0
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(0): 0 callbacks to
> make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  exit(0)
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0
> before_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  shmem_exit(-1): 0
> on_shmem_exit callbacks to make
> 2019-06-01 20:31:12.510 EDT [21909] DEBUG:  proc_exit(-1): 0 callbacks
> to make
>
>             --
>             Adrian Klaver
>             adrian.klaver@aklaver.com
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/1/19 8:07 PM, Tom K wrote:

> 
>     https://www.postgresql.org/docs/10/app-postgres.html
>     Single-User Mode
>     ...
> 
>     and see if that at least gets the server started. This is a highly
>     restricted so do not expect much usability.
> 
> 
> These servers did crash before; however, I didn't notice anything
> particularly wrong at the time. Perhaps a better check was in order.
> 
> Here is the result from attempting to start the DB in single user mode.

> -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf
> 2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was 
> interrupted; last known up at 2019-04-28 06:06:24 EDT
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at 
> 0/4C35CDF8: wanted 24, got 0
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint record
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint 
> record at 0/4C34EDA8
> 2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has 
> wrong magic 0 instead of 307747550

That comes from origin.c in the logical replication section of the code.

Did you have logical replication set up?

What was the role of this cluster in the original setup?

> Aborted
> -bash-4.2$
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:


On Sun, Jun 2, 2019 at 11:47 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/1/19 8:07 PM, Tom K wrote:

>
>     https://www.postgresql.org/docs/10/app-postgres.html
>     Single-User Mode
>     ...
>
>     and see if that at least gets the server started. This is a highly
>     restricted so do not expect much usability.
>
>
> These servers did crash before; however, I didn't notice anything
> particularly wrong at the time. Perhaps a better check was in order.
>
> Here is the result from attempting to start the DB in single user mode.

> -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
> --config-file=/data/patroni/postgresql.conf
> 2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was
> interrupted; last known up at 2019-04-28 06:06:24 EDT
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at
> 0/4C35CDF8: wanted 24, got 0
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint record
> 2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint
> record at 0/4C34EDA8
> 2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has
> wrong magic 0 instead of 307747550

That comes from origin.c in the logical replication section of the code.

Did you have logical replication set up?

Nope. wal_level was set to replica, not logical.  Unless you mean 


What was the role of this cluster in the original setup?

The cluster was the backend database for a number of applications.  The aim was to point applications to a single large cluster instead of a number of small instances of postgres running all over the LAB.

If I can get through the above error and get to the DB's and tables, I could just dump them out and reinit the entire DB from the backup.
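Roughly what I have in mind, once a server will stay up long enough to accept connections (the paths, the postgres superuser and the scratch port 5433 are placeholders, not anything from this setup):

# dump every database plus the globals out of the ailing cluster
/usr/pgsql-10/bin/pg_dumpall -p 5432 -U postgres -f /tmp/cluster_dump.sql

# re-initialize a clean data directory and load the dump back in
/usr/pgsql-10/bin/initdb -D /data/patroni_fresh
/usr/pgsql-10/bin/pg_ctl -D /data/patroni_fresh -o "-p 5433" -l /tmp/pg_fresh.log start
/usr/pgsql-10/bin/psql -p 5433 -U postgres -d postgres -f /tmp/cluster_dump.sql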
 

> Aborted
> -bash-4.2$
>

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: psql: FATAL: the database system is starting up

From:
Adrian Klaver
Date:
On 6/2/19 11:14 AM, Tom K wrote:

> 
> Nope. wal_level was set to replica, not logical.  Unless you mean
> 
> 
>     What was the role of this cluster in the original setup?
> 
> 
> The cluster was the backend database for a number of applications.  The 
> aim was to point applications to a single large cluster instead of a 
> number of small instances of postgres running all over the LAB.
> 
> If I can get through the above error and get to the DB's and tables, I 
> could just dump them out and reinit the entire DB from the backup.

The only thing I could find is the below:

http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/

If that does not work, I would suggest sending a new post (new subject) to
the mailing list based on:

replication checkpoint has wrong magic 0 instead of 307747550


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: psql: FATAL: the database system is starting up

From:
Tom K
Date:
Hey Adrian,

Fixed it.

I saw the post from jebriggs but that didn't work for me, so I posted here.  Anyway, here's how I resolved it:

When I ran an strace on the postgres startup line, I got this:

open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 1062019-06-02 14:50:34.777 EDT [28362] PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
-bash-4.2$
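Something along these lines will reproduce that (the grep is only there to trim the trace down to the interesting calls):

strace -f /usr/pgsql-10/bin/postgres --single -D /data/patroni \
    --config-file=/data/patroni/postgresql.conf 2>&1 | grep -E 'replorigin|PANIC'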


Notice how it's reading pg_logical/replorigin_checkpoint even though wal_level = 'replica'.  It sure looks like it's doing logical replication.  Anyway, I checked the file and saw this:

-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw-------.  1 postgres postgres 16384 Oct 29  2018 replorigin_checkpoint
136946383 drwx------.  2 postgres postgres     6 Oct 29  2018 snapshots
204367784 drwx------.  2 postgres postgres     6 Oct 29  2018 mappings
 67894870 drwx------.  4 postgres postgres    65 Apr 28 06:06 .
135326272 drwx------. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$


Notice the dates and the size: they're all way off.  Now I checked the same file on the other nodes:

[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx------.  2 postgres postgres    6 Oct 29  2018 snapshots
134984156 drwx------.  2 postgres postgres    6 Oct 29  2018 mappings
   566745 -rw-------.  1 postgres postgres    8 May 22 01:55 replorigin_checkpoint
   566731 drwx------.  4 postgres postgres   65 May 22 01:55 .
    89714 drwxr-xr-x. 20 root     root     4096 May 22 22:43 ..
[root@psql03 pg_logical]#


The other hosts' replorigin_checkpoint files have much more recent dates and are much smaller.
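A quick way to double-check which copy is sane: the first 32-bit word of that file should be the magic number the PANIC message mentions, 307747550 decimal, i.e. 0x1257DADE (run from the data directory):

-bash-4.2$ od -A d -t x4 pg_logical/replorigin_checkpoint | head -n 1

On the healthy node the first word comes back as 1257dade; the broken copy starts with zeros, which is exactly what the error complains about.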

So I take the replorigin_checkpoint and copy it over to the former primary node (from psql03 to psql02, the primary) and try a restart.

Everything started up.  Replication to the other postgres nodes started without issues.  I verified the base/ folder on all nodes to ensure replication is working.  Every table is back to the way it was before this entire outage.
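For anyone else landing on this thread, the whole fix boiled down to roughly the following (hostnames and data directory as above; how you stop and start things depends on whether Patroni is managing the instance, so treat the systemctl lines as placeholders):

# on psql02, the node that PANICs -- make sure nothing restarts postgres underneath you
systemctl stop patroni

# keep the corrupt file around, just in case
mv /data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/replorigin_checkpoint.bad

# pull the healthy 8-byte checkpoint file over from psql03 and fix ownership
scp psql03:/data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/
chown postgres:postgres /data/patroni/pg_logical/replorigin_checkpoint
chmod 600 /data/patroni/pg_logical/replorigin_checkpoint

systemctl start patroni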

Adrian, thank you very much for all the help here.  It was much appreciated.  I've written this up and posted it at the following two links so others can benefit and I don't forget.  


And of course a handy way to do backups, which I should have implemented before this happened:  

Thx,
TK

On Sun, Jun 2, 2019 at 4:48 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 6/2/19 11:14 AM, Tom K wrote:

>
> Nope. wal_level was set to replica, not logical.  Unless you mean
>
>
>     What was the role of this cluster in the original setup?
>
>
> The cluster was the backend database for a number of applications.  The
> aim was to point applications to a single large cluster instead of a
> number of small instances of postgres running all over the LAB.
>
> If I can get through the above error and get to the DB's and tables, I
> could just dump them out and reinit the entire DB from the backup.

The only thing I could find is the below:

http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/

If that does not work, I would suggest sending a new post (new subject) to
the mailing list based on:

replication checkpoint has wrong magic 0 instead of 307747550


--
Adrian Klaver
adrian.klaver@aklaver.com