Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers

From: Chris Snook
Subject: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers
Date:
Msg-id: CAONUJSM5X259vAnnwSpqu=VnRECfGSJ-CgRHyS4P5YyRVwkXsQ@mail.gmail.com
Replies: Re: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-bugs
I'll start by saying that I don't have as much information as I'd like to make a proper bug report, but I want to throw this out there in case anyone else has seen something similar and can point me in the right direction for reproducing it.

I have a Debian Stretch system running the 9.6.10 build from the Debian APT repos. It has been running flawlessly, lightly loaded, for about a year and a half, only being restarted for security updates. It recently crashed twice, about an hour apart, due to the postmaster receiving a signal from a worker task, both times with this complaint in the log:

FATAL:  no free slots in PMChildFlags array

For more context, here are the messages tightly packed around the first crash, with the first and last messages repeated hundreds of times:

FATAL:  sorry, too many clients already
FATAL:  sorry, too many clients already
FATAL:  sorry, too many clients already
FATAL:  no free slots in PMChildFlags array
WARNING:  could not remove shared memory segment "/PostgreSQL.1407760088": No such file or directory
FATAL:  semop(id=2293786) failed: Invalid argument
FATAL:  semop(id=2293786) failed: Invalid argument
FATAL:  semctl(2064403, 7, SETVAL, 0) failed: Invalid argument
FATAL:  semop(id=2621476) failed: Invalid argument
FATAL:  semop(id=2621476) failed: Invalid argument
FATAL:  semctl(2293786, 1, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(2621476, 10, SETVAL, 0) failed: Invalid argument
WARNING:  could not remove shared memory segment "/PostgreSQL.1621779631": No such file or directory
LOG:  database system is shut down
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit

The second time, I got this:

FATAL:  no free slots in PMChildFlags array
FATAL:  semctl(3604480, 14, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 13, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 12, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 11, SETVAL, 0) failed: Invalid argument
FATAL:  semctl(3604480, 10, SETVAL, 0) failed: Invalid argument
LOG:  database system is shut down
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit

followed, hundreds of "terminating connection" lines later, by this:

FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to unexpected postmaster exit
FATAL:  terminating connection due to administrator command
STATEMENT:  SELECT column1 FROM table1 WHERE (column2='value1') AND (column3->>'key1'='value2') ORDER BY created ASC LIMIT 1
WARNING:  could not remove shared memory segment "/PostgreSQL.1032869518": No such file or directory
FATAL:  postmaster exited during a parallel transaction
LOG:  database system was interrupted; last known up at 2019-02-12 20:13:55 GMT
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  incomplete startup packet
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up

and, hundreds of "database system is starting up" lines later, by this:

FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  redo starts at E/D1504EA0
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  invalid record length at E/D15A21B0: wanted 24, got 0
LOG:  redo done at E/D15A2188
LOG:  last completed transaction was at log time 2019-02-12 20:15:58.259264+00
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up

Possibly of interest: the last completed transaction was about 2 minutes after the "last known up" time.

I asked the application developers using the cluster if they had recently made any changes, and they said they had not. A Google search for the PMChildFlags error message found nothing recent. Since the error seemed to be related to parallel queries, I looked up that behavior and found a post indicating that 9.6 will use 3 workers if the table is larger than 72 MB and the configuration allows it. We had max_parallel_workers_per_gather set to 10 for performance testing (the workload in no way needed that), so that was a possibility. I checked the sizes of the tables in the cluster and found that the table the crashing query was accessing is 80 MB, and no other table in the cluster is that large.
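For reference, this is roughly the check I did. table1 is the same stand-in name as in the logged query above, and the 72 MB figure comes from 9.6's default min_parallel_relation_size of 8 MB, which, if I've read the planner behavior right, adds a worker each time the heap size triples (8 MB -> 1, 24 MB -> 2, 72 MB -> 3):

-- Heap size of the table the crashing query touches (table1 is a stand-in name).
SELECT pg_size_pretty(pg_relation_size('table1'));

-- The 9.6 settings that gate parallel plans.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('min_parallel_relation_size', 'max_parallel_workers_per_gather');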

My suspicion is that there's some sort of race condition that requires at least 3 parallel gather workers to hit, that most people aren't hitting it because they don't set max_parallel_workers_per_gather that high, and that something specific about my workload triggers it. I think we started hitting it shortly after an autovacuum updated the query planner stats to reflect that the table had grown past 72 MB, which would explain why we didn't hit it the moment the table crossed 72 MB.
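If it happens again, something like this should show whether an (auto)vacuum or (auto)analyze really did bump the recorded size past 72 MB shortly before the crashes (table1 is still the stand-in name, and the 8192 multiplier assumes the default block size):

-- When did autovacuum / autoanalyze last touch the table?
SELECT relname, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'table1';

-- relpages is the page count recorded by the last vacuum/analyze.
SELECT relname, relpages, pg_size_pretty(relpages::bigint * 8192) AS recorded_size
FROM pg_class
WHERE relname = 'table1';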

Unfortunately, we didn't have more detailed timestamped logs at the time to confirm this theory, and it hasn't happened a third time. We've lowered max_parallel_workers_per_gather to 2, since this is a production system, but we may be able to rig up a reproducer with some guidance.
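As a starting point, here's the sort of reproducer I have in mind, untested so far: build a table just over 72 MB so that 9.6 plans 3 workers, re-raise max_parallel_workers_per_gather, and then run the query from enough concurrent sessions to get near max_connections, since the "sorry, too many clients already" lines suggest that's the state we were in. repro_t and the row count are just my guesses at sizing:

-- Untested sketch; ~600k rows of ~100-byte payload should give roughly
-- 80 MB of heap by my math, just over the 72 MB mark.
CREATE TABLE repro_t (id serial PRIMARY KEY, payload text);
INSERT INTO repro_t (payload)
SELECT repeat('x', 100) FROM generate_series(1, 600000);
ANALYZE repro_t;

SET max_parallel_workers_per_gather = 10;
EXPLAIN (COSTS OFF) SELECT count(*) FROM repro_t;  -- hoping for "Workers Planned: 3"

The idea would then be to run that SELECT from many sessions at once (pgbench with -c somewhere near max_connections, say) while watching for the PMChildFlags error.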

- Chris
