standby replication server throws invalid memory alloc request size ,does not start up

Поиск
Список
Период
Сортировка
От Vijaykumar Jain
Тема standby replication server throws invalid memory alloc request size ,does not start up
Дата
Msg-id 1E4C5445-F307-4AE9-B1B9-C85A38B0D930@opentable.com
обсуждение исходный текст
Ответы Re: standby replication server throws invalid memory alloc requestsize , does not start up  (Scott Ribe <scott_ribe@elevated-dev.com>)
Re: standby replication server throws invalid memory alloc requestsize , does not start up  (Andres Freund <andres@anarazel.de>)
Список pgsql-admin

Hi All,

 

This is my first postgres query to admin list, so if I am not following the right standards for asking the question, pls let me know 😊

 

 

The problem:

 

I have a postgres cluster as

 

A (primary)-> streaming replication -> B(hot_standby=on)

 

We had a power outage in one of the data centers, and when we got back, one of the databases servers (B the standby node) seem to show weird errors and is not starting up. A recovered fine, and it running fine.

 

--------- logs

 

2018-06-25 10:57:04 UTC HINT:  In a moment you should be able to reconnect to the database and repeat your command.

2018-06-25 10:57:04 UTC WARNING:  terminating connection because of crash of another server process

2018-06-25 10:57:04 UTC DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2018-06-25 10:57:04 UTC HINT:  In a moment you should be able to reconnect to the database and repeat your command.

2018-06-25 10:57:04 UTC LOG:  database system is shut down

2018-06-27 16:59:28 UTC LOG:  listening on IPv4 address "0.0.0.0", port 5432

2018-06-27 16:59:28 UTC LOG:  listening on IPv6 address "::", port 5432

2018-06-27 16:59:28 UTC LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

2018-06-27 16:59:28 UTC LOG:  database system was interrupted while in recovery at log time 2018-06-25 10:52:21 UTC

2018-06-27 16:59:28 UTC HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

2018-06-27 16:59:28 UTC LOG:  entering standby mode

2018-06-27 16:59:28 UTC LOG:  recovered replication state of node 1 to 63/89B52A98

2018-06-27 16:59:28 UTC LOG:  redo starts at 9A/6F1B3888

2018-06-27 16:59:28 UTC LOG:  consistent recovery state reached at 9A/6F25BFF0

2018-06-27 16:59:28 UTC FATAL:  invalid memory alloc request size 1768185856

2018-06-27 16:59:28 UTC LOG:  database system is ready to accept read only connections

2018-06-27 16:59:28 UTC LOG:  startup process (PID 11829) exited with exit code 1

2018-06-27 16:59:28 UTC LOG:  terminating any other active server processes

2018-06-27 16:59:28 UTC LOG:  database system is shut down

pg_ctl: could not start server

Examine the log output.

 

 

-------what I tried

I did a lot of searching around “invalid memory alloc request” it mostly points to some data error,

https://www.postgresql.org/message-id/AANLkTik88wGQ6h-xFHJBU_Fj_CnXK5LX7aBws_2oANKP%40mail.gmail.com

 

mem checks and disks are ok.

 

I took  a filesystem snapshot, synced it onto another server and tried to reload it, it gave the same error.

 

Now to ensure there was not real data causing this issue, I took a full dump of the primary db (A) using pg_dumpall and restored in onto the test server. I was able to load the db and was able to query the tables without issues.

So I am confused why the standby is behaving weird after the power outage.

 

Just FYI, I spun up a new node (  C  ) as a replacement for the problem standby(B) with pg_baseback and it is working fine. Can query all tables without problems. My only concern was do I need to worry about this error showing up again?

 

 

Most the googling points to some operation and finding out errors in some operation performed etc,

https://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/         -   1

https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html   -   2

 

I ran that function (in 1), could not find bad data rows.

 

 

Also, I tried to bump up (memory to 32GB, cpu to 8 cpu and shmmax to  18446744073692774399 to figure out is there was any resource constraint that I could temp bump up and then debug what went wrong, but it did not start up.

 

 

 

$ /usr/lib/postgresql/10/bin/postgres -V

postgres (PostgreSQL) 10.4 (Ubuntu 10.4-2.pgdg16.04+1)

 

 

$ grep -v "#" /etc/postgresql/10/main/postgresql.conf | sed -e '/^$/d'

ssl = on

ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'

ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'

log_timezone = UTC

stats_temp_directory = '/var/run/postgresql/10-main.pg_stat_tmp'

datestyle = 'iso, mdy'

timezone = UTC

default_text_search_config = 'pg_catalog.english'

log_directory = pg_log

archive_mode = on

maintenance_work_mem = 16MB

max_wal_size = 1GB

max_stack_depth = 2MB

autovacuum_analyze_scale_factor = 0.05

autovacuum_vacuum_scale_factor = 0.1

shared_preload_libraries = pg_stat_statements

work_mem = 4MB

max_wal_senders = 10

temp_buffers = 16MB

archive_command = '/bin/true'

effective_cache_size = 256MB

max_replication_slots = 10

hot_standby = on

wal_keep_segments = 64

wal_level = hot_standby

listen_addresses = '*'

log_min_duration_statement = 10000

max_worker_processes = 18

 

 

 

 

sysctl:

kernel.shmall = 18446744073692774399

kernel.shmmax = 18446744073692774399

 

 

 

$ free -m

              total        used        free      shared  buff/cache   available

Mem:          16047         134       14469           8        1442       15620

Swap:           975           0         975

$ nproc

4

 

 

Thanks,

Vijay

В списке pgsql-admin по дате отправления:

Предыдущее
От: Ganesh Korde
Дата:
Сообщение: Re: Streaming replication connection break - unexpected EOF onstandby connection
Следующее
От: Scott Ribe
Дата:
Сообщение: Re: standby replication server throws invalid memory alloc requestsize , does not start up