Discussion: FW: query pg_stat_ssl hang 100%cpu
Looks like an abnormal case.
From: James Pang (chaolpan)
Sent: Wednesday, September 6, 2023 9:41 AM
To: pgsql-performance@lists.postgresql.org
Subject: query pg_stat_ssl hang 100%cpu
Hi,
PG v14.8 on RHEL8, no SSL enabled in this database. We have a lot of client sessions that check their SSL state with the query below. All other sessions finish very quickly, but one session has been hanging there at 100% CPU for tens of hours, and even pg_terminate_backend does not stop it. It looks abnormal.
select ssl from pg_stat_ssl where pid=pg_backend_pid();
testdb=# select pid,usename,application_name,query_start,xact_start,state_change,wait_event_type,state,query from pg_stat_activity where pid=1245344;
   pid   | usename |    application_name    |         query_start          |          xact_start           |         state_change         | wait_event_type | state  |                          query
---------+---------+------------------------+------------------------------+-------------------------------+------------------------------+-----------------+--------+---------------------------------------------------------
 1245344 | test    | PostgreSQL JDBC Driver | 2023-09-03 02:36:23.40238+00 | 2023-09-03 02:36:23.402331+00 | 2023-09-03 02:36:23.40238+00 |                 | active | select ssl from pg_stat_ssl where pid=pg_backend_pid()
(1 row)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ nMaj nMin WCHAN COMMAND
1245344 postgres 20 0 32.5g 12468 12164 R 99.5 0.0 4219:12 0 1343 - postgres: test testdb 10.250.193.40(48282) BIND
#0 ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
#1 0x000000000091185b in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x27bc738) at typcache.c:2001
#2 0x000000000091df03 in internal_get_result_type (funcid=<optimized out>, call_expr=<optimized out>, rsinfo=<optimized out>,
resultTypeId=<optimized out>, resultTupleDesc=0x7ffc9dff8cd0) at funcapi.c:393
#3 0x000000000091e263 in get_expr_result_type (expr=expr@entry=0x2792798, resultTypeId=resultTypeId@entry=0x7ffc9dff8ccc,
resultTupleDesc=resultTupleDesc@entry=0x7ffc9dff8cd0) at funcapi.c:230
#4 0x00000000006a2fa5 in ExecInitFunctionScan (node=node@entry=0x273afa8, estate=estate@entry=0x269e948, eflags=eflags@entry=16) at nodeFunctionscan.c:370
#5 0x000000000069084e in ExecInitNode (node=node@entry=0x273afa8, estate=estate@entry=0x269e948, eflags=eflags@entry=16) at execProcnode.c:255
#6 0x000000000068a96d in InitPlan (eflags=16, queryDesc=0x273b2d8) at execMain.c:936
#7 standard_ExecutorStart (queryDesc=0x273b2d8, eflags=16) at execMain.c:263
#8 0x00007f67c2821d5d in pgss_ExecutorStart (queryDesc=0x273b2d8, eflags=<optimized out>) at pg_stat_statements.c:965
#9 0x00000000007fc226 in PortalStart (portal=portal@entry=0x26848b8, params=params@entry=0x0, eflags=eflags@entry=0, snapshot=snapshot@entry=0x0)
at pquery.c:514
#10 0x00000000007fa27f in exec_bind_message (input_message=0x7ffc9dff90d0) at postgres.c:1995
#11 PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc9dff9370, dbname=<optimized out>, username=<optimized out>) at postgres.c:4552
#12 0x000000000077a4ea in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4537
#13 BackendStartup (port=<optimized out>) at postmaster.c:4259
#14 ServerLoop () at postmaster.c:1745
#15 0x000000000077b363 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x256abc0) at postmaster.c:1417
#16 0x00000000004fec63 in main (argc=5, argv=0x256abc0) at main.c:209
Thanks,
James
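For reference, the stack above can be captured from a stuck backend by attaching gdb to its PID. This is a hypothetical invocation (the thread does not show the exact commands used, and debug symbols must be installed):

    gdb -p 1245344 -batch -ex bt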
On Thu, Sep 07, 2023 at 01:35:00AM +0000, James Pang (chaolpan) wrote:
> PG v14.8 on RHEL8, no SSL enabled in this database. We have a lot of
> client sessions that check their SSL state with the query below. All
> other sessions finish very quickly, but one session has been hanging
> there at 100% CPU for tens of hours, and even pg_terminate_backend
> does not stop it. It looks abnormal.
>
> select ssl from pg_stat_ssl where pid=pg_backend_pid();

This is hard to act on without more details or even a reproducible and
self-contained test case. Even a small Java program based on the JDBC
driver would be OK for me, for example, if it helps digging into what
you are seeing.

> #0 ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
> [backtrace trimmed; identical to the one above]

This stack is referring to a code path where we are checking that some
of the type-related data associated with a record is around, but it does
not say exactly where the loop happens, so... Are we looking at a loop
in the execution of the function from which the information of
pg_stat_ssl is retrieved (aka pg_stat_get_activity())? Or is the type
cache somewhat broken because of the extended query protocol? It's not
really possible to see any evidence based on the information provided,
though it provides a few hints that can help. FWIW, I've not heard
about an issue like that in the field.

The first thing I would do is update to 14.9, which is the latest
version of Postgres available for this major version.
--
Michael
Yes, this backend has always been on the same call stack for hours, no
change, and at 100% CPU.

-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Thursday, September 7, 2023 12:05 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> [quoted message trimmed; identical to Michael's reply above]
Yes, this backend has always been on the same call stack for tens of
hours, at 100% CPU. It's still hanging there now, but I can not
reproduce that in another, similar environment. I found that this query
started a transaction ("xact_start" in pg_stat_activity) at
2023-09-03 02:36:23, no idea why.

-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Thursday, September 7, 2023 12:05 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> [quoted message trimmed; identical to Michael's reply above]
> #0 ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714

Are you able to print out the value of global variable
RecordCacheArrayLen? I wonder if this loop in
ensure_record_cache_typmod_slot_exists() is not terminating:

        int32       newlen = RecordCacheArrayLen * 2;

        while (typmod >= newlen)
            newlen *= 2;
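To see why that loop cannot make progress here, a minimal standalone
sketch of the arithmetic (not PostgreSQL source) with RecordCacheArrayLen
stuck at 0, as later confirmed in gdb:

    #include <stdint.h>

    int main(void)
    {
        int32_t RecordCacheArrayLen = 0;  /* the broken state seen in gdb */
        int32_t typmod = 0;               /* as in the reported backtrace */
        int32_t newlen = RecordCacheArrayLen * 2;   /* 0 * 2 == 0 */

        while (typmod >= newlen)   /* 0 >= 0 is always true... */
            newlen *= 2;           /* ...and 0 * 2 stays 0: never terminates */

        return 0;                  /* never reached */
    }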
(gdb) bt
#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
[... rest of backtrace identical to the one above ...]
(gdb) p RecordCacheArrayLen
$1 = 0
(gdb) p RecordCacheArrayLen
$2 = 0
(gdb) p RecordCacheArrayLen
$3 = 0

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Thursday, September 7, 2023 5:31 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> [quoted message trimmed; identical to Thomas's question above]
On Thu, Sep 7, 2023 at 9:38 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> (gdb) p RecordCacheArrayLen
> $3 = 0

Clearly this system lacks a check against wrapping around, but it must
be hard work to allocate billions of typmods... Or maybe if in an
earlier call we assigned RecordCacheArray but the allocation of
RecordIdentifierArray failed (a clue would be that it is still NULL),
we never manage to assign RecordCacheArrayLen a non-zero value? But it
must be unlikely for a small allocation to fail...
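For readers following along, that theory maps onto the start of
ensure_record_cache_typmod_slot_exists() roughly like this. The sketch
below is a self-contained model, simplified and paraphrased from
PostgreSQL 14's typcache.c, not the exact source:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct TupleDescData *TupleDesc;  /* opaque stand-in */

    static TupleDesc *RecordCacheArray = NULL;
    static uint64_t  *RecordIdentifierArray = NULL;
    static int32_t    RecordCacheArrayLen = 0;

    static void
    ensure_record_cache_typmod_slot_exists(int32_t typmod)
    {
        if (RecordCacheArray == NULL)
        {
            RecordCacheArray = calloc(64, sizeof(TupleDesc));
            RecordIdentifierArray = calloc(64, sizeof(uint64_t));
            if (RecordIdentifierArray == NULL)
                return;             /* models the out-of-memory error thrown
                                     * here in the real code, which skips the
                                     * length assignment below */
            RecordCacheArrayLen = 64;
        }

        if (typmod >= RecordCacheArrayLen)
        {
            int32_t newlen = RecordCacheArrayLen * 2;

            /* With RecordCacheArray != NULL but RecordCacheArrayLen == 0,
             * newlen is 0 and this loop spins forever. */
            while (typmod >= newlen)
                newlen *= 2;
            /* (reallocations of both arrays follow in the real code) */
        }
    }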
Yes, checking the server history logs, we found that when the backend
started hanging there, the operating system had run out of memory, which
may have caused the allocation of RecordIdentifierArray to fail.

Thanks,
James

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Thursday, September 7, 2023 6:01 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> [quoted message trimmed; identical to Thomas's reply above]
On Thu, Sep 7, 2023 at 10:31 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> Yes, checking the server history logs, we found that when the backend
> started hanging there, the operating system had run out of memory,
> which may have caused the allocation of RecordIdentifierArray to fail.

Can you please print out RecordCacheArray and RecordIdentifierArray?
For that theory, the first one must be non-NULL and the second one
NULL. That would lead to the infinite loop, I think.
#0  ensure_record_cache_typmod_slot_exists (typmod=0) at typcache.c:1714
[... rest of backtrace identical to the one above ...]
(gdb) p RecordCacheArray
$1 = (TupleDesc *) 0x7f5fac365d90
(gdb) p RecordIdentifierArray
$2 = (uint64 *) 0x0

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Thursday, September 7, 2023 6:36 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: Michael Paquier <michael@paquier.xyz>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Subject: Re: FW: query pg_stat_ssl hang 100%cpu

> [quoted message trimmed; identical to Thomas's question above]
On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> (gdb) p RecordCacheArray
> $1 = (TupleDesc *) 0x7f5fac365d90
> (gdb) p RecordIdentifierArray
> $2 = (uint64 *) 0x0

Hah, yeah that's it, and you've been extremely unlucky to hit it.
ensure_record_cache_typmod_slot_exists() should be more careful about
cleaning up on allocation failure, to avoid this state.
On Thu, Sep 7, 2023 at 10:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan)
> <chaolpan@cisco.com> wrote:
> > (gdb) p RecordCacheArray
> > $1 = (TupleDesc *) 0x7f5fac365d90
> > (gdb) p RecordIdentifierArray
> > $2 = (uint64 *) 0x0
>
> Hah, yeah that's it, and you've been extremely unlucky to hit it.
> ensure_record_cache_typmod_slot_exists() should be more careful about
> cleaning up on allocation failure, to avoid this state.

I think the lazy fix would be to re-order those allocations. A
marginally more elegant fix would be to merge the arrays, as in the
attached. Thoughts?
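A sketch of what merging the arrays looks like. The entry struct name
matches the patch line quoted in the review below, but the rest is a
simplified guess at the shape, not the committed code:

    typedef struct RecordCacheArrayEntry
    {
        uint64      id;
        TupleDesc   tupdesc;
    } RecordCacheArrayEntry;

    /* One array instead of two parallel ones: a single allocation either
     * fully succeeds or throws, leaving no half-initialized state for the
     * next call to trip over. */
    static RecordCacheArrayEntry *RecordCacheArray = NULL;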
On Thu, Sep 7, 2023 at 10:39 PM James Pang (chaolpan) <chaolpan@cisco.com> wrote:
> (gdb) p RecordCacheArray
> $1 = (TupleDesc *) 0x7f5fac365d90
> (gdb) p RecordIdentifierArray
> $2 = (uint64 *) 0x0

Oh, yeah, this stack is broken. You have been really unlucky to hit
that. This can randomly cause any session to get stuck, and no need for
the extended query protocol here. (I am not sure how, but my email
server has somehow not been able to get the previous messages from
James. Anyway.)

On Fri, Sep 08, 2023 at 11:45:51AM +1200, Thomas Munro wrote:
> I think the lazy fix would be to re-order those allocations. A
> marginally more elegant fix would be to merge the arrays, as in the
> attached. Thoughts?

So, ensure_record_cache_typmod_slot_exists() would allocate the initial
RecordCacheArray, and if it fails on the second allocation it keeps
RecordCacheArrayLen at 0. When coming back again to this code path, the
second part of the routine causes an infinite loop because the first
allocation has been done, but the tracked length is 0. Fun. This is
broken since 4b93f57, where the second array was introduced. Getting
that fixed before 11 is EOL is nice, as the bug was introduced there,
so good timing.

There is a repalloc_extended(), but I cannot get excited about using
MCXT_ALLOC_NO_OOM in this code path if there is a simpler method to
avoid this issue with a single allocation for the whole information
set.

+static RecordCacheArrayEntry * RecordCacheArray = NULL;

pgindent is annoyed by that.. typedefs.list has been updated in your
patch, so I guess that you missed one extra indentation pass after it
was refreshed.

Note that RememberToFreeTupleDescAtEOX() does something similar to the
type cache, and it uses the same approach as your patch. +1 to your
proposal of using a struct for the entries in the cache.
--
Michael
On Fri, Sep 8, 2023 at 2:48 PM Michael Paquier <michael@paquier.xyz> wrote:
> +1 to your proposal of using a struct for the entries in the cache.

Cool, I'll push/back-patch after 16.0. Even though this seems simple
enough, it's an extremely low probability failure and I'd rather keep
out of REL_16_STABLE's way...
On Mon, Sep 11, 2023 at 10:37:34AM +1200, Thomas Munro wrote:
> Cool, I'll push/back-patch after 16.0. Even though this seems
> simple enough, it's an extremely low probability failure and I'd
> rather keep out of REL_16_STABLE's way...

+1.
--
Michael
On Mon, Sep 11, 2023 at 10:47 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 11, 2023 at 10:37:34AM +1200, Thomas Munro wrote:
> > Cool, I'll push/back-patch after 16.0. Even though this seems
> > simple enough, it's an extremely low probability failure and I'd
> > rather keep out of REL_16_STABLE's way...
>
> +1.

Done.