Re: psql: FATAL: the database system is starting up

From: Tom K
Subject: Re: psql: FATAL: the database system is starting up
Date:
Msg-id: CAE3EmBAfMa=eA1LqBB0zf-kZwBPTbRhzY5wUxcPYNQ8hathDQA@mail.gmail.com
In reply to: Re: psql: FATAL: the database system is starting up  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses: Re: psql: FATAL: the database system is starting up  (Adrian Klaver <adrian.klaver@aklaver.com>)
List: pgsql-general



On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/28/19 6:59 PM, Tom K wrote:
>
>
> On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>

>
> Correct.  Master election occurs through Patroni.  WAL level is set to:
>
> wal_level = 'replica'
>
> So no archiving.
>
>
>
>      >
>      > After the most recent crash 2-3 weeks ago, the cluster is now
>     running
>      > into this message but I'm not able to make heads or tails out of why
>      > it's throwing this:
>
>     So you have not been able to run the cluster the past 2-3 weeks or is
>     that  more recent?
>
>
> Haven't been able to bring this PostgreSQL cluster up ( run the cluster
> ) for the past 2-3 weeks.  Tried quite a few combinations of options to
> recover it, with no luck.  We had storage failures earlier, even with
> corrupted OS files, and this PostgreSQL cluster w/ Patroni was able to
> come up each time without any recovery effort on my part.
>
>
>     When you refer to history files below are you talking about WAL
>     files or
>     something else?
>
>     Is this:
>
>     "recovery command file "recovery.conf" specified neither
>     primary_conninfo nor restore_command"
>
>     true?
>
>
> True. recovery.conf is controlled by Patroni.  Contents of this file
> remained the same for all the cluster nodes with the exception of the
> primary_slot_name:
>
> [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> primary_slot_name = 'postgresql0'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql01 postgresql-patroni-etcd]#
>
> [root@psql02 postgres-backup]# cat recovery.conf
> primary_slot_name = 'postgresql1'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql02 postgres-backup]#
>
> [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> primary_slot_name = 'postgresql2'
> standby_mode = 'on'
> recovery_target_timeline = 'latest'
> [root@psql03 postgresql-patroni-backup]#
>
> I've made a copy of the root postgres directory over to another location
> so when troubleshooting, I can always revert to the first state the
> cluster was in when it failed.
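On the recovery.conf point quoted above: with neither primary_conninfo nor restore_command set, a standby can only replay WAL segments already present in its own pg_wal, which matters here since there is no archive. Purely for reference, a sketch of what those lines would look like if they were set (the /archive path and connection details are hypothetical, not from this cluster; Patroni normally generates this file itself):

```conf
# Sketch only -- not from this cluster; Patroni manages recovery.conf.
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
# Hypothetical additions: fetch WAL from an archive, or stream from a primary.
restore_command = 'cp /archive/%f %p'
primary_conninfo = 'host=192.168.0.118 port=5432 user=replicator'
```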

I have no experience with Patroni so I will be of no help there. You
might get more useful information from:

https://github.com/zalando/patroni
Community

There are two places to connect with the Patroni community: on github,
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
you're using Patroni, or just interested, please join us.

Will post there as well.  Thank you.  My thinking was to post here first: since the errors are coming directly from PostgreSQL, I suspect the Patroni community would simply refer me back here.
 

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually.  But I hit a dead end with the three errors above.  

So what I did was copy a single node's backed-up data files to /data/patroni on that same node ( this is the psql data directory as defined through Patroni ) and then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --listen_addresses=192.168.0.118 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100 --hot_standby=on --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above.  Hence the post here.  If I can start a single instance, I should be fine since I could then 1) replicate over to the other two or 2) simply take a dump, reinitialize all the databases then restore the dump.  
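Option 2 above (dump, reinitialize, restore) could be sketched roughly as below. This is a dry run only: the `run` helper just prints each command, and the paths, port, and new data directory are assumptions, not taken from this cluster's config.

```shell
# Dry-run sketch of the dump / re-init / restore path, assuming one
# instance can be started first. 'run' only echoes; replace the echo
# with "$@" to execute for real.
run() { echo "+ $*"; }

PGBIN=/usr/pgsql-10/bin   # assumed install path, matching the command above

run sudo -u postgres $PGBIN/pg_dumpall -p 5432 -f /tmp/cluster_dump.sql
run sudo -u postgres $PGBIN/initdb -D /data/patroni.new
run sudo -u postgres $PGBIN/pg_ctl -D /data/patroni.new -o "-p 5432" start
run sudo -u postgres $PGBIN/psql -p 5432 -d postgres -f /tmp/cluster_dump.sql
```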

Using the above procedure I get one of three error messages when using the data files of each node:

[ PSQL01 ]
postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.
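Two of those messages can be unpacked without touching the cluster at all. A pure-shell sketch, runnable anywhere; only the segment name and the magic number are taken from the messages above, the paths mentioned in comments are assumptions:

```shell
# psql01: the 24-hex-digit name the startup process is waiting for is a
# WAL segment file name: timeline (8 hex digits) + log (8) + segment (8).
wal=000000010000000000000008
tli=$(printf '%d' "0x$(printf '%s' "$wal" | cut -c1-8)")
log=$(printf '%d' "0x$(printf '%s' "$wal" | cut -c9-16)")
seg=$(printf '%d' "0x$(printf '%s' "$wal" | cut -c17-24)")
echo "timeline=$tli log=$log seg=$seg"   # → timeline=1 log=0 seg=8

# psql02: 307747550 is 0x1257DADE, the magic number at the start of
# pg_logical/replorigin_checkpoint; "wrong magic 0" suggests that file
# was zeroed by the storage failure. Simulated here on a scratch file
# (on the real node it would live under the /data/patroni directory):
f=$(mktemp)
printf '\000\000\000\000' > "$f"         # fake a zeroed checkpoint file
magic=$(od -An -tu4 -N4 "$f" | tr -d ' ')
echo "magic=$magic expected=307747550"   # → magic=0 expected=307747550
rm -f "$f"
```

If that reading is right: psql01 needs segment 000000010000000000000008 from another node's pg_wal, since there is no archive to restore it from; for psql02, some reports suggest a zeroed replorigin_checkpoint can be deleted so the server rebuilds it at the next checkpoint, at the cost of any replication-origin progress (the backed-up copy makes that safe to try); and the psql03 "syntax error in history file" likely points at a corrupt timeline history file (*.history) in pg_wal.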



>
> Thx,
> TK
>



--
Adrian Klaver
adrian.klaver@aklaver.com
