[HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
Дата
Msg-id CAM3SWZRmGTKcO3yB1gwuGte9JLogxy53YQUjj5JegYe8=QXivQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] Possible issue with expanded object infrastructure onPostgres 9.6.1  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
A customer is on 9.6.1, and complains of a segfault observed at least
3 times. In all cases, the logs look like this:

Jan 11 16:11:07 ip-10-0-118-82 kernel: [41913.530453] traps:
postgres[6561] general protection ip:55fcf08b0491 sp:7ffc17dfa650
error:0 in postgres[55fcf0557000+638000]
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-1] LOG:  server
process (PID 643) was terminated by signal 11: Segmentation fault
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-2] DETAIL:  Failed
process was running: SELECT COUNT(count_column) FROM (SELECT  1 AS
count_column FROM "albums" INNER JOIN "album_photo_assignments" ON
"albums"."id" = "album_photo_assignments"."album_id" WHERE
"albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id"
= $1 LIMIT 1000 OFFSET 0) subquery_for_count
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [5-1] LOG:  terminating
any other active server processes
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-1] WARNING:
terminating connection because of crash of another server process
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-3] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.

The "general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in
postgres[55fcf0557000+638000]" details are identical in all 3
segfaults observed. Note that the kernel in question is
"4.4.0-57-generic #78~14.04.1-Ubuntu". These crashes were days apart,
but I have yet to isolate the problem. The customer tells me that this
is a representative query plan:

explain SELECT COUNT(count_column)
FROM (SELECT  1 AS count_column   FROM "albums"   INNER JOIN "album_photo_assignments" ON "albums"."id" =
"album_photo_assignments"."album_id"   WHERE "albums"."deleted_at" IS NULL AND
"album_photo_assignments"."photo_id" =
'fd0bcbb3-9c4c-4036-954a-cf01935d004c'   LIMIT 1000 OFFSET 0)
subquery_for_count ;                                                QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────Aggregate
(cost=16419.94..16419.94rows=1 width=8)  ->  Limit  (cost=1000.06..16419.94 rows=1 width=4)        ->  Nested Loop
(cost=1000.06..16419.94rows=1 width=4)              ->  Gather  (cost=1000.00..16415.87 rows=1 width=16)
   Workers Planned: 1                    ->  Parallel Seq Scan on album_photo_assignments 
(cost=0.00..15415.77 rows=1 width=16)                          Filter: (photo_id =
'fd0bcbb3-9c4c-4036-954a-cf01935d004c'::uuid)              ->  Index Scan using albums_pkey on albums
(cost=0.06..4.06 rows=1 width=16)                    Index Cond: (id = album_photo_assignments.album_id)
   Filter: (deleted_at IS NULL) 
(10 rows)

Note that this isn't a prepared statement, unlike the problematic
query we see from the logs. And, I've not yet managed to verify that
the problem case involved execution of this particular "$1 substitute
constant", since that isn't displayed in the logs. So, while this
query plan might well be representative of the problem, it may also
not be.

I can use GDB to get details of the instruction pointer that appeared
in the kernel trap error, which shows a function from the expanded
value representation infrastructure:

(gdb) info symbol 0x55fcf08b0491
EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) info symbol 0x55fcf08b0490
EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) disassemble 0x55fcf08b0490
Dump of assembler code for function EOH_get_flat_size:  0x000055fcf08b0490 <+0>: push   %rbp  0x000055fcf08b0491 <+1>:
mov   0x8(%rdi),%rax  0x000055fcf08b0495 <+5>: mov    %rsp,%rbp  0x000055fcf08b0498 <+8>: pop    %rbp
0x000055fcf08b0499<+9>: mov    (%rax),%rax  0x000055fcf08b049c <+12>: jmpq   *%rax 
End of assembler dump.

(I believe that Linux outputs VMA-wise addresses in the kernel trap
error quoted above, an address from the userland program (Postgres)
address space, so GDB does in fact display interesting disassembly --
the documentation on this seems extremely limited.)

Note that there is a text array column in the table "albums". I note
that there have been fixes in this area already, that are not in 9.6.1
(commit 77cd0dc, for one). I know next to nothing about this code, and
would appreciate some guidance.

--
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait
Следующее
От: Stephen Frost
Дата:
Сообщение: Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait