Re: "ERROR: latch already owned" on gharial

Поиск
Список
Период
Сортировка
От Soumyadeep Chakraborty
Тема Re: "ERROR: latch already owned" on gharial
Дата
Msg-id CAE-ML+_CL3TfhLo6MjCSufinyugqSJWr8qEoWL8oAc-oT+P67g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: "ERROR: latch already owned" on gharial  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Ответы Re: "ERROR: latch already owned" on gharial  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few
times very sporadically on Greenplum 6X (based on 9.4), with relatively newer
versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the
existing thread, in case the info can help fix/reveal something in supported
versions.

Unfortunately, we don't have a core to share as we don't have the benefit of
commit [1] in Greenplum 6X, but we do possess commit [2] which gives us an elog
ERROR as opposed to PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY,
..."LOG","00000","disconnection: session time: ..."
Event 2: 2023-11-13 10:01:32.049135
CET...,pX,,,,,"FATAL","XX000","latch already owned by pid Y (is_set:
0) (pg_latch.c:159)",,,,,,,0,,
"pg_latch.c",159,"Stack trace:
1    0xbde8b8 postgres errstart (elog.c:567)
2    0xbe0768 postgres elog_finish (discriminator 7)
3    0xa08924 postgres <symbol not found> (pg_latch.c:158) <---------- OwnLatch
4    0xa7f179 postgres InitProcess (proc.c:523)
5    0xa94ac3 postgres PostgresMain (postgres.c:4874)
6    0xa1e2ed postgres <symbol not found> (postmaster.c:2860)
7    0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","00000","server process (PID Y) exited with exit code
1",,,,,,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1 with identical log, ordering of events and stack
trace, except this time (is_set: 1) when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does sanity check inside OwnLatch during InitProcess and
still sees the
old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.

The above sequence of operations should apply to PG HEAD as well.

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
ResetLatch(), like the one introduced in [3]? This would ensure that the write
latch->owner_pid = 0; is flushed to shmem. The attached patch does this.

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3].
I didn't find anything in the associated hackers thread [4] either. Was it the
performance impact, or was it just because SetLatch and ResetLatch
were more racy
and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process'
lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -fno-aggressive-loop-optimizations
-Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3
-Wno-format-truncation -Wno-stringop-truncation -m64 -O3
-fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99
-Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: glibc qsort() vulnerability
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: glibc qsort() vulnerability