Discussion: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
Hi,
The first time, the segmentation fault happened on my laptop:

cat /etc/redhat-release 
Fedora release 32 (Thirty Two)

SELECT version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 13.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20200723 (Red Hat 10.2.1-1), 64-bit


while executing SQL function:
SELECT * FROM vacuum_dead_size(now());
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


Please ignore the timestamps, as I executed the function many times.
If you want the exact times, I can provide those as well.

Postgres logs:

2020-10-18 21:26:08.871 CEST [1423] LOG:  server process (PID 435031) was terminated by signal 11: Segmentation fault
2020-10-18 21:26:08.871 CEST [1423] DETAIL:  Failed process was running: SELECT * FROM vacuum_dead_size(now());
2020-10-18 21:26:08.871 CEST [1423] LOG:  terminating any other active server processes
2020-10-18 21:26:08.871 CEST [429052] WARNING:  terminating connection because of crash of another server process
2020-10-18 21:26:08.871 CEST [429052] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-10-18 21:26:08.871 CEST [429052] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-10-18 21:26:08.872 CEST [428785] WARNING:  terminating connection because of crash of another server process
2020-10-18 21:26:08.872 CEST [428785] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-10-18 21:26:08.872 CEST [428785] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-10-18 21:26:08.878 CEST [435117] FATAL:  the database system is in recovery mode
2020-10-18 21:26:08.879 CEST [1423] LOG:  all server processes terminated; reinitializing
2020-10-18 21:26:08.911 CEST [435119] LOG:  database system was interrupted; last known up at 2020-10-18 21:23:52 CEST
2020-10-18 21:26:08.977 CEST [435119] LOG:  database system was not properly shut down; automatic recovery in progress
2020-10-18 21:26:08.980 CEST [435119] LOG:  redo starts at 0/E5080E8
2020-10-18 21:26:08.981 CEST [435119] LOG:  invalid record length at 0/E540EA8: wanted 24, got 0
2020-10-18 21:26:08.981 CEST [435119] LOG:  redo done at 0/E540E80
2020-10-18 21:26:09.000 CEST [1423] LOG:  database system is ready to accept connections


journalctl:

Oct 18 20:48:48 localhost.localdomain systemd-coredump[428779]: Process 428774 (postmaster) of user 26 dumped core.

Stack trace of thread 428774:
#0  0x00007fcbbbdd06e8 __memmove_avx_unaligned_erms (libc.so.6 + 0x1656e8)
#1  0x00000000004f8362 fill_val (postgres + 0xf8362)
#2  0x00000000004f908d heap_fill_tuple (postgres + 0xf908d)
#3  0x00000000004fa27b heap_form_minimal_tuple (postgres + 0xfa27b)
#4  0x000000000067c509 tts_minimal_materialize (postgres + 0x27c509)
#5  0x000000000067c558 tts_minimal_copy_minimal_tuple (postgres + 0x27c558)
#6  0x0000000000918ba7 tuplestore_puttupleslot (postgres + 0x518ba7)
#7  0x000000000067f5c3 sqlfunction_receive (postgres + 0x27f5c3)
#8  0x0000000000671f4f standard_ExecutorRun (postgres + 0x271f4f)
#9  0x00000000006805b4 fmgr_sql (postgres + 0x2805b4)
#10 0x000000000067a4bf ExecMakeTableFunctionResult (postgres + 0x27a4bf)
#11 0x000000000068a2b1 FunctionNext (postgres + 0x28a2b1)
#12 0x0000000000671f22 standard_ExecutorRun (postgres + 0x271f22)
#13 0x00000000007d439c PortalRunSelect (postgres + 0x3d439c)
#14 0x00000000007d56ee PortalRun (postgres + 0x3d56ee)
#15 0x00000000007d127c exec_simple_query (postgres + 0x3d127c)
#16 0x00000000007d32e9 PostgresMain (postgres + 0x3d32e9)
#17 0x000000000075d569 ServerLoop (postgres + 0x35d569)
#18 0x000000000075e4b3 PostmasterMain (postgres + 0x35e4b3)
#19 0x00000000004f03f3 main (postgres + 0xf03f3)
#20 0x00007fcbbbc92042 __libc_start_main (libc.so.6 + 0x27042)
#21 0x00000000004f049e _start (postgres + 0xf049e)








--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html



Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
Then I installed a fresh PostgreSQL 13 on a VM on Azure running Red Hat,
restored the logical dump, ran the SQL function, and the segmentation fault
happened again.

Do you need any further details?






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: Tom Lane
Date:
pinker <pinker@onet.eu> writes:
> Do you need any further details?

Yes.  The stack trace is interesting, but far from enough to solve
the problem.  Since it evidently is reproducible for you, maybe
you could extract a self-contained test case?

            regards, tom lane



Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
In the core dump I see only this one frame:

[New LWP 441004]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: postgres metrics [local] SELECT        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fdf4471d6e8 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6







Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
Tom Lane-2 wrote
> pinker writes:
>> Do you need any further details?
> 
> Yes.  The stack trace is interesting, but far from enough to solve
> the problem.  Since it evidently is reproducible for you, maybe
> you could extract a self-contained test case?
> 
>             regards, tom lane

Hi Tom,
I'm absolutely OK with providing you with all the DDL, but I'm not allowed to
provide any data. Would that be OK with you?






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
DDL:

CREATE FUNCTION public.abs(interval) RETURNS interval
    LANGUAGE sql IMMUTABLE
    AS $_$ select case when ($1<interval '0') then -$1 else $1 end; $_$;

CREATE FUNCTION public.vacuum_dead_size(i_now timestamp with time zone, OUT
schemaname text, OUT relname text, OUT total_bytes numeric, OUT
dead_tup_size numeric) RETURNS SETOF record
    LANGUAGE sql
    AS $_$

    WITH closest_metric_stat_user_tables AS (
        SELECT now FROM stat_user_tables ORDER BY abs(now-$1) LIMIT 1
    ), closest_metric_table_sizes AS (
        SELECT now FROM table_sizes ORDER BY abs(now - $1) LIMIT 1
    )
SELECT sut.schemaname, sut.relname, ts.total_bytes, 1::numeric
FROM stat_user_tables sut
         LEFT JOIN table_sizes ts ON ts.table_name = sut.relname AND
ts.table_schema = sut.schemaname
    WHERE ts.now = (SELECT now FROM closest_metric_table_sizes) AND sut.now
= (SELECT now FROM closest_metric_stat_user_tables)
ORDER BY 1;
$_$;

CREATE TABLE public.stat_user_tables (
    now timestamp with time zone,
    relid oid,
    schemaname name,
    relname name,
    seq_scan bigint,
    seq_tup_read bigint,
    idx_scan bigint,
    idx_tup_fetch bigint,
    n_tup_ins bigint,
    n_tup_upd bigint,
    n_tup_del bigint,
    n_tup_hot_upd bigint,
    n_live_tup bigint,
    n_dead_tup bigint,
    n_mod_since_analyze bigint,
    last_vacuum timestamp with time zone,
    last_autovacuum timestamp with time zone,
    last_analyze timestamp with time zone,
    last_autoanalyze timestamp with time zone,
    vacuum_count bigint,
    autovacuum_count bigint,
    analyze_count bigint,
    autoanalyze_count bigint
);

CREATE TABLE public.table_sizes (
    now timestamp with time zone,
    oid oid,
    table_schema name,
    table_name name,
    row_estimate real,
    total_bytes bigint,
    index_bytes bigint,
    toast_bytes bigint,
    table_bytes bigint,
    total text,
    index text,
    toast text,
    "table" text
);






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: Tom Lane
Date:
pinker <pinker@onet.eu> writes:
> I'm absolutely OK with providing you with all the DDL, but I'm not allowed to
> provide any data. Would that be OK with you?

It would be if it were sufficient to reproduce the crash ... but
I tried, and I don't see any crash.

Maybe you could make up some dummy data that is enough to make
a self-contained test?

            regards, tom lane



Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
Tom Lane-2 wrote
> pinker writes:
>> I'm absolutely OK with providing you with all the DDL, but I'm not allowed to
>> provide any data. Would that be OK with you?
> 
> It would be if it were sufficient to reproduce the crash ... but
> I tried, and I don't see any crash.
> 
> Maybe you could make up some dummy data that is enough to make
> a self-contained test?
> 
>             regards, tom lane

I'm trying this tool: https://github.com/pivotal-gss/mock-data
Unfortunately it gives me an error about unsupported data types. Do you know
of any other tool that I could use?






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
OK, this sounds so crazy that I can hardly believe it...
It crashes when the value in column relname is 'promotion', and also with
'promotion_[something]'. It does not crash when other strings are used!

COPY public.stat_user_tables (now, relid, schemaname, relname, seq_scan,
seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del,
n_tup_hot_upd, n_live_tup, n_dead_tup, n_mod_since_analyze, last_vacuum,
last_autovacuum, last_analyze, last_autoanalyze, vacuum_count,
autovacuum_count, analyze_count, autoanalyze_count) FROM stdin;
2020-10-17 01:35:01.935309+02    16802    public    promotion    311    7775    0    0    0    0    00    25
0    0    \N    2020-10-07 09:08:41.061339+02    2020-10-16 04:47:58.421258+02    \N    01    1    0
2020-10-17 01:35:01.935309+02    123040    public    promotion    2    0    1    0    0    0    00    0    0    0
\N    2020-10-07 09:08:41.264478+02    2020-10-16 04:47:31.591375+02    \N    01    1    0
\.

COPY public.table_sizes (now, oid, table_schema, table_name, row_estimate,
total_bytes, index_bytes, toast_bytes, table_bytes, total, index, toast,
"table") FROM stdin;
2020-10-16 18:13:57.089157+02    123045    public    promotion    1910    23142400    5496832
139264    1750630422 MB    5368 kB    136 kB    17 MB
2020-10-16 18:13:57.089157+02    16802    public    promotion    25    57344    16384    \N    40960
56 kB    16 kB    \N    40 kB
\.







Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
And then please run:

SELECT * FROM vacuum_dead_size(now());






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
The same function behaves totally differently on 12:

postgres=# \c test
psql (13.0, server 12.4)
You are now connected to database "test" as user "postgres".
test=# SELECT * FROM public.vacuum_dead_size(now());
ERROR:  return type mismatch in function declared to return record
DETAIL:  Final statement returns name instead of text at column 1.
CONTEXT:  SQL function "vacuum_dead_size" during startup

There was no mismatch error on 13.
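
The error on 12 names the mismatch directly (name vs. text). A minimal sketch for confirming the column types involved, assuming the two tables from the DDL posted earlier in this thread exist in schema public:

```sql
-- List the declared types of the columns that feed the function's
-- first two OUT parameters (declared as text in vacuum_dead_size).
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name IN ('stat_user_tables', 'table_sizes')
  AND column_name IN ('schemaname', 'relname', 'table_schema', 'table_name')
ORDER BY table_name, column_name;
```

With the DDL as posted, all four columns should be reported as name, so the function result has to be coerced from name to text.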






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
ala=# SELECT * FROM public.vacuum_dead_size(now());
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?> \q
[postgres@localhost ~]$ psql ala
psql (13.0)
Type "help" for help.

ala=# update table_sizes set table_name = 'promotionuuu';
UPDATE 2
ala=# SELECT * FROM public.vacuum_dead_size(now());
 schemaname | relname | total_bytes | dead_tup_size 
------------+---------+-------------+---------------
(0 rows)

ala=# SELECT * FROM public.vacuum_dead_size(now());
 schemaname | relname | total_bytes | dead_tup_size 
------------+---------+-------------+---------------
(0 rows)

ala=# update table_sizes set table_name = 'promotion';
UPDATE 2
ala=# SELECT * FROM public.vacuum_dead_size(now());
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?> \q







Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
Casting those 2 columns to text helped:

sut.schemaname::TEXT, sut.relname::TEXT

So this function doesn't cause a segmentation fault:

CREATE OR REPLACE FUNCTION vacuum_dead_size(i_now timestamp with time zone,
OUT schemaname TEXT, OUT relname TEXT, OUT total_bytes NUMERIC, OUT
dead_tup_size NUMERIC)
RETURNS SETOF RECORD
AS $$

    WITH closest_metric_stat_user_tables AS (
        SELECT now FROM stat_user_tables ORDER BY abs(now-$1) LIMIT 1
    ), closest_metric_table_sizes AS (
        SELECT now FROM table_sizes ORDER BY abs(now - $1) LIMIT 1
    )
SELECT sut.schemaname::TEXT, sut.relname::TEXT, ts.total_bytes, 1::numeric
FROM stat_user_tables sut
         LEFT JOIN table_sizes ts ON ts.table_name = sut.relname AND
ts.table_schema = sut.schemaname
    WHERE ts.now = (SELECT now FROM closest_metric_table_sizes) AND sut.now
= (SELECT now FROM closest_metric_stat_user_tables)
ORDER BY 1;
$$ LANGUAGE SQL;






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: Tom Lane
Date:
pinker <pinker@onet.eu> writes:
> Casting those 2 columns to text helped:
> sut.schemaname::TEXT, sut.relname::TEXT

Yeah, I was just concluding that something is failing to handle
the required implicit coercions from name to text.  Haven't found
where yet, though I'm betting 913bbd88d overlooked something.

For the archives' sake, attached is an actual reproduction script,
hopefully without the multiple whitespace problems in the previous
messages.

            regards, tom lane


CREATE TABLE public.stat_user_tables (
    now timestamp with time zone,
    relid oid,
    schemaname name,
    relname name,
    seq_scan bigint,
    seq_tup_read bigint,
    idx_scan bigint,
    idx_tup_fetch bigint,
    n_tup_ins bigint,
    n_tup_upd bigint,
    n_tup_del bigint,
    n_tup_hot_upd bigint,
    n_live_tup bigint,
    n_dead_tup bigint,
    n_mod_since_analyze bigint,
    last_vacuum timestamp with time zone,
    last_autovacuum timestamp with time zone,
    last_analyze timestamp with time zone,
    last_autoanalyze timestamp with time zone,
    vacuum_count bigint,
    autovacuum_count bigint,
    analyze_count bigint,
    autoanalyze_count bigint
);

CREATE TABLE public.table_sizes (
    now timestamp with time zone,
    oid oid,
    table_schema name,
    table_name name,
    row_estimate real,
    total_bytes bigint,
    index_bytes bigint,
    toast_bytes bigint,
    table_bytes bigint,
    total text,
    index text,
    toast text,
    "table" text
);

COPY public.stat_user_tables (now, relid, schemaname, relname, seq_scan,
seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del,
n_tup_hot_upd, n_live_tup, n_dead_tup, n_mod_since_analyze, last_vacuum,
last_autovacuum, last_analyze, last_autoanalyze, vacuum_count,
autovacuum_count, analyze_count, autoanalyze_count) FROM stdin;
2020-10-17 01:35:01.935309+02    16802    public    promotion    311    7775    0    0    0    0    0    0    25    0    0    \N    2020-10-07 09:08:41.061339+02    2020-10-16 04:47:58.421258+02    \N    0    1    1    0
2020-10-17 01:35:01.935309+02    123040    public    promotion    2    0    1    0    0    0    0    0    0    0    0    \N    2020-10-07 09:08:41.264478+02    2020-10-16 04:47:31.591375+02    \N    0    1    1    0
\.

COPY public.table_sizes (now, oid, table_schema, table_name, row_estimate,
total_bytes, index_bytes, toast_bytes, table_bytes, total, index, toast,
"table") FROM stdin;
2020-10-16 18:13:57.089157+02    123045    public    promotion    1910    23142400    5496832    139264    17506304    22MB    5368 kB    136 kB    17 MB
2020-10-16 18:13:57.089157+02    16802    public    promotion    25    57344    16384    \N    40960    56 kB    16 kB    \N    40 kB
\.

CREATE FUNCTION public.abs(interval) RETURNS interval
    LANGUAGE sql IMMUTABLE
    AS $_$ select case when ($1<interval '0') then -$1 else $1 end; $_$;

CREATE FUNCTION public.vacuum_dead_size(i_now timestamp with time zone, OUT
schemaname text, OUT relname text, OUT total_bytes numeric, OUT
dead_tup_size numeric) RETURNS SETOF record
    LANGUAGE sql
    AS $_$

    WITH closest_metric_stat_user_tables AS (
        SELECT now FROM stat_user_tables ORDER BY abs(now-$1) LIMIT 1
    ), closest_metric_table_sizes AS (
        SELECT now FROM table_sizes ORDER BY abs(now - $1) LIMIT 1
    )
SELECT sut.schemaname, sut.relname, ts.total_bytes, 1::numeric
FROM stat_user_tables sut
         LEFT JOIN table_sizes ts ON ts.table_name = sut.relname AND
ts.table_schema = sut.schemaname
    WHERE ts.now = (SELECT now FROM closest_metric_table_sizes) AND sut.now
= (SELECT now FROM closest_metric_stat_user_tables)
ORDER BY 1;
$_$;

SELECT * FROM vacuum_dead_size(now());

Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: Tom Lane
Date:
Oh, here's a simpler reproducer:

create or replace function foo (out schemaname text, out relname text)
returns setof record language sql
as $$
  select nspname, relname from pg_class c join pg_namespace n
  on (n.oid = relnamespace)
  order by 1
$$;

select * from foo();

It doesn't fail without the ORDER BY, suggesting that the problem
is localized in failing to handle the case where a sort key
column needs to be coerced.

            regards, tom lane
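
For comparison, a sketch of the same reproducer with the explicit casts that pinker reported as a workaround earlier in the thread. The name foo_cast is hypothetical, and the expectation that it avoids the crash on an unpatched 13.0 is an assumption based on that report, not something verified here:

```sql
-- Hypothetical variant of foo(): both output columns are cast from
-- name to text explicitly, so the sorted result already matches the
-- declared OUT parameter types and no implicit coercion is needed.
create or replace function foo_cast (out schemaname text, out relname text)
returns setof record language sql
as $$
  select n.nspname::text, c.relname::text
  from pg_class c join pg_namespace n
  on (n.oid = c.relnamespace)
  order by 1
$$;

select * from foo_cast();
```

The ORDER BY is kept on purpose, since the crash was only seen when a sort key column needed coercion.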



Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: pinker
Date:
I've changed all name data types to text, and when I run a query on a simple
table with another function it is still crashing :/
Yeah, removing ORDER BY helps here too...






Re: Postgres 13 signal 11: Segmentation fault tested on 2 independent machines

From: Tom Lane
Date:
pinker <pinker@onet.eu> writes:
> I've changed all name data types to text, and when I run a query on a simple
> table with another function it is still crashing :/
> Yeah, removing ORDER BY helps here too...

Hard to comment on that when you haven't shown an example.

Anyway, I've identified the issue with the presented example and pushed
a fix.  If you're in a position to rebuild Postgres locally you could
try applying

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=25378db74fd97f2b10ad44d1f0b2e1f8b0a651f2

and see whether it takes care of all the cases you noticed.

            regards, tom lane