Обсуждение: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

Поиск
Список
Период
Сортировка

Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
Hi,
I know it's going to be most likely due to glibc and locales, but I found
interesting case that I can't figure out how to fix.

We have pg 12.6 on bionic. Works. Added focal replica (binary).

Replicates OK, but then fails when I try to pg_dump -s.

Error is:

pg_dump: error: query failed: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
pg_dump: error: query was: SELECT p.tableoid, p.oid, p.proname, p.prolang, p.pronargs, p.proargtypes, p.prorettype,
(SELECTpg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM
pg_catalog.unnest(coalesce(p.proacl,pg_catalog.acldefault('f',p.proowner)))WITH ORDINALITY AS perm(acl,row_n) WHERE NOT
EXISTS( SELECT 1 FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('f',p.proowner))) AS
init(init_acl)WHERE acl = init_acl)) as foo) AS proacl, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT
acl,row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('f',p.proowner))) WITH ORDINALITY AS
initp(acl,row_n)WHERE NOT EXISTS ( SELECT 1 FROM
pg_catalog.unnest(coalesce(p.proacl,pg_catalog.acldefault('f',p.proowner)))AS permp(orig_acl) WHERE acl = orig_acl)) as
foo)AS rproacl, NULL AS initproacl, NULL AS initrproacl, p.pronamespace, (SELECT rolname FROM pg_catalog.pg_roles WHERE
oid= p.proowner) AS rolname FROM pg_proc p LEFT JOIN pg_init_privs pip ON (p.oid = pip.objoid AND pip.classoid =
'pg_proc'::regclassAND pip.objsubid = 0) WHERE p.prokind <> 'a' 
  AND NOT EXISTS (SELECT 1 FROM pg_depend WHERE classid = 'pg_proc'::regclass AND objid = p.oid AND deptype = 'i')
  AND (
  pronamespace != (SELECT oid FROM pg_namespace WHERE nspname = 'pg_catalog')
  OR EXISTS (SELECT 1 FROM pg_cast
  WHERE pg_cast.oid > 16383
  AND p.oid = pg_cast.castfunc)
  OR EXISTS (SELECT 1 FROM pg_transform
  WHERE pg_transform.oid > 16383 AND
  (p.oid = pg_transform.trffromsql
  OR p.oid = pg_transform.trftosql))
  OR p.proacl IS DISTINCT FROM pip.initprivs)


Based on https://wiki.postgresql.org/wiki/Locale_data_changes
I wrote:
SELECT 'reindex index ' || indexrelid::regclass::text
FROM (SELECT indexrelid, indrelid, indcollation[i] coll FROM pg_index, generate_subscripts(indcollation, 1) g(i)) s
  JOIN pg_collation c ON coll=c.oid
WHERE collprovider IN ('d', 'c') AND collname NOT IN ('C', 'POSIX') \gexec

and let it run (80k+ indexes).

Afterwards - same problem.

So I did "reindex system".

And the query still fails.

I tried running simple "select * from table", for each of:

- pg_catalog.pg_roles
- pg_catalog.pg_proc
- pg_catalog.pg_init_privs
- pg_catalog.pg_depend
- pg_catalog.pg_namespace
- pg_catalog.pg_cast
- pg_catalog.pg_transform

and it worked, so I'm kinda at loss here.

I have test system, can test anything. Any idea on what could be the reason,
and if the system is fixable afterwards?

Ah, one more thing - straced backend when it was doing its thing. Last 50 lines:

15:01:31.229198 futex(0xfffe49b8cb3c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229233 futex(0xfffe49b8c41c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229268 futex(0xfffe49b8cdfc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229303 futex(0xfffe49b8ce00, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229338 futex(0xfffe49b8ce08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229379 futex(0xfffe49b8dde4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229415 futex(0xfffe49b91524, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229450 futex(0xfffe49b8998c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229485 futex(0xfffe49b8cf9c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229521 futex(0xfffe49b882e4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229556 futex(0xfffe49b8e548, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229596 futex(0xfffe49b8d75c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229634 futex(0xfffe49b8cfe8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229670 futex(0xfffe49b8dfe8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229705 futex(0xfffe49b8dff0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229740 futex(0xfffe49b8dff8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229775 futex(0xfffe49b8ddf0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229810 futex(0xfffe49b8e0a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229845 futex(0xfffe49b8e200, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229881 futex(0xfffe49b8e534, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229925 futex(0xfffe49b8e734, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229961 futex(0xfffe49b8e7dc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.229998 futex(0xfffe49b8e924, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230034 futex(0xfffe49b8ea6c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230070 futex(0xfffe49b8ea74, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230105 futex(0xfffe49b91514, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230141 futex(0xfffe49b8ad84, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230177 futex(0xfffe49b91518, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230212 futex(0xfffe49b9152c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230247 futex(0xfffe49b91530, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230283 futex(0xfffe49b91718, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230323 futex(0xfffe49b91720, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230384 futex(0xfffe49b8e488, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000013>
15:01:31.230423 futex(0xfffe49b946cc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230464 futex(0xfffe49b884d4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230515 futex(0xfffe49b955c0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.230624 futex(0xfffe49ba8738, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000013>
15:01:31.230678 futex(0xfffe49b91288, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012>
15:01:31.231031 brk(0xaaaae513e000)     = 0xaaaae513e000 <0.000016>
15:01:31.232480 brk(0xaaaae515f000)     = 0xaaaae515f000 <0.000018>
15:01:31.239133 brk(0xaaaae5180000)     = 0xaaaae5180000 <0.000019>
15:01:31.245463 brk(0xaaaae51a1000)     = 0xaaaae51a1000 <0.000020>
15:01:31.245847 brk(0xaaaae51c2000)     = 0xaaaae51c2000 <0.000016>
15:01:31.259644 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff80da8000 <0.000025>
15:01:31.259705 mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfffe4a36a000 <0.000014>
15:01:31.260012 futex(0xaaaae4f7fcf8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000014>
15:01:31.260074 mprotect(0xfffe4a36a000, 36864, PROT_READ|PROT_EXEC) = 0 <0.000017>
15:01:31.260120 mprotect(0xffff80da8000, 4096, PROT_READ|PROT_EXEC) = 0 <0.000013>
15:01:31.260165 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xfffd80da80a8} ---
15:01:31.451341 +++ killed by SIGSEGV (core dumped) +++

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote:
> and it worked, so I'm kinda at loss here.

based on some talk on IRC, I was able to get stack trace from fail:

(gdb) bt
#0  0x0000fffe4a36e4d8 in ?? ()
#1  0x0000aaaabe03ffb8 in ExecProcNode (node=0xaaaae4f87cf8) at ./build/../src/include/executor/executor.h:242
#2  ExecMergeJoin (pstate=0xaaaae4f87da8) at ./build/../src/backend/executor/nodeMergejoin.c:656
#3  0x0000aaaabe03ffb8 in ExecProcNode (node=0xaaaae4f87ae8) at ./build/../src/include/executor/executor.h:242
#4  ExecMergeJoin (pstate=0xaaaae4f876c8) at ./build/../src/backend/executor/nodeMergejoin.c:656
#5  0x0000aaaabe039b1c in ExecProcNode (node=0xaaaae4f876c8) at ./build/../src/include/executor/executor.h:242
#6  ExecHashJoinImpl (parallel=false, pstate=0xaaaae4f87408) at ./build/../src/backend/executor/nodeHashjoin.c:262
#7  ExecHashJoin (pstate=0xaaaae4f87408) at ./build/../src/backend/executor/nodeHashjoin.c:591
#8  0x0000aaaabe01ed5c in ExecProcNode (node=0xaaaae4f87408) at ./build/../src/include/executor/executor.h:242
#9  ExecutePlan (execute_once=<optimized out>, dest=0xfffe49be0828, direction=<optimized out>, numberTuples=0,
sendTuples=<optimizedout>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0xaaaae4f87408,
estate=0xaaaae4ce4180)
    at ./build/../src/backend/executor/execMain.c:1632
#10 standard_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=<optimized out>, count=0, execute_once=<optimized out>)
at./build/../src/backend/executor/execMain.c:350
 
#11 0x0000ffff7d1981fc in pgss_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=ForwardScanDirection, count=0,
execute_once=true)at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
 
#12 0x0000ffff7d182688 in explain_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=ForwardScanDirection, count=0,
execute_once=true)at ./build/../contrib/auto_explain/auto_explain.c:320
 
#13 0x0000aaaabe1754c8 in PortalRunSelect (portal=portal@entry=0xaaaae4d476e0, forward=forward@entry=true, count=0,
count@entry=9223372036854775807,dest=dest@entry=0xfffe49be0828) at ./build/../src/backend/tcop/pquery.c:938
 
#14 0x0000aaaabe176a64 in PortalRun (portal=portal@entry=0xaaaae4d476e0, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0xfffe49be0828,
altdest=altdest@entry=0xfffe49be0828,
 
    completionTag=completionTag@entry=0xffffc89001f8 "") at ./build/../src/backend/tcop/pquery.c:779
#15 0x0000aaaabe172a18 in exec_simple_query (
    query_string=query_string@entry=0xaaaae4c6c7b0 "SELECT\n    p.tableoid,\n    p.oid,\n    p.proname,\n
p.prolang,\n   p.pronargs,\n    p.proargtypes,\n    p.prorettype,\n    (\n        SELECT\n", ' ' <repeats 12 times>,
"pg_catalog.array_agg(aclORDER BY row_n)\n        F"...) at ./build/../src/backend/tcop/postgres.c:1215
 
#16 0x0000aaaabe1736ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaae4cc9018, dbname=<optimized out>,
username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4271
 
#17 0x0000aaaabe0fa768 in BackendRun (port=0xaaaae4cb88e0, port=0xaaaae4cb88e0) at
./build/../src/backend/postmaster/postmaster.c:4510
#18 BackendStartup (port=0xaaaae4cb88e0) at ./build/../src/backend/postmaster/postmaster.c:4193
#19 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#20 0x0000aaaabe0fb74c in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#21 0x0000aaaabde8c8a8 in main (argc=5, argv=0xaaaae4c65ea0) at ./build/../src/backend/main/main.c:228

And one more thing - if it matters - it's on aarch64 architecture
(graviton ec2 in AWS)

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Thu, Mar 03, 2022 at 04:11:56PM +0100, hubert depesz lubaczewski wrote:
> On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote:
> > and it worked, so I'm kinda at loss here.
> 
> based on some talk on IRC, I was able to get stack trace from fail:

Based on the stack trace I was able to get it to break using simple
query:
select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;

I took a simple look at ranges of oid/prowner, and they look fine:

=# select min(proowner), max(proowner), count(*) from pg_proc;
 min │ max │ count 
─────┼─────┼───────
  10 │  10 │  2970
(1 row)

16:38:34 db: postgres@postgres, pid:1991057
=# select min(oid), max(oid), count(*) from pg_roles;
 min │    max    │ count 
─────┼───────────┼───────
  10 │ 310235824 │   244
(1 row)


Also, as I didn't mention it before: it's Pg 12.9.

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Thu, Mar 03, 2022 at 05:39:21PM +0100, hubert depesz lubaczewski wrote:
> On Thu, Mar 03, 2022 at 04:11:56PM +0100, hubert depesz lubaczewski wrote:
> > On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote:
> > > and it worked, so I'm kinda at loss here.
> > 
> > based on some talk on IRC, I was able to get stack trace from fail:
> 
> Based on the stack trace I was able to get it to break using simple
> query:
> select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;
> 
> I took a simple look at ranges of oid/prowner, and they look fine:
> 
> =# select min(proowner), max(proowner), count(*) from pg_proc;
>  min │ max │ count 
> ─────┼─────┼───────
>   10 │  10 │  2970
> (1 row)
> 
> 16:38:34 db: postgres@postgres, pid:1991057
> =# select min(oid), max(oid), count(*) from pg_roles;
>  min │    max    │ count 
> ─────┼───────────┼───────
>   10 │ 310235824 │   244
> (1 row)
> 
> 
> Also, as I didn't mention it before: it's Pg 12.9.

OK. Traced it back to JIT. With JIT enabled:

=# show jit;
 jit 
─────
 on
(1 row)

=# explain select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;
                                          QUERY PLAN                                          
──────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on pg_proc p  (cost=0.00..156507.84 rows=63264 width=128)
   SubPlan 1
     ->  Index Scan using pg_authid_oid_index on pg_authid  (cost=0.14..2.36 rows=1 width=64)
           Index Cond: (oid = p.proowner)
 JIT:
   Functions: 8
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)

=# select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
16:42:44 canvas test, cluster 999, standalone db: @, pid:2014255
?!> 


But when I disable jit - query works fine.

versions of things that I think are relevant:

=$ dpkg -l | grep -E 'llvm|clang|gcc|glibc'
ii  gcc                                4:9.3.0-1ubuntu2                      arm64        GNU C compiler
ii  gcc-10-base:arm64                  10.3.0-1ubuntu1~20.04                 arm64        GCC, the GNU Compiler
Collection(base package)
 
ii  gcc-9                              9.3.0-17ubuntu1~20.04                 arm64        GNU C compiler
ii  gcc-9-base:arm64                   9.3.0-17ubuntu1~20.04                 arm64        GCC, the GNU Compiler
Collection(base package)
 
ii  libgcc-9-dev:arm64                 9.3.0-17ubuntu1~20.04                 arm64        GCC support library
(developmentfiles)
 
ii  libgcc-s1:arm64                    10.3.0-1ubuntu1~20.04                 arm64        GCC support library
ii  libllvm9:arm64                     1:9.0.1-12                            arm64        Modular compiler and
toolchaintechnologies, runtime library
 

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> OK. Traced it back to JIT. With JIT enabled:

Hah, that's useful info.  Seems like it must be incorrect code
generated by JIT.

> versions of things that I think are relevant:

> =$ dpkg -l | grep -E 'llvm|clang|gcc|glibc'
> ii  gcc                                4:9.3.0-1ubuntu2                      arm64        GNU C compiler
> ii  gcc-10-base:arm64                  10.3.0-1ubuntu1~20.04                 arm64        GCC, the GNU Compiler
Collection(base package) 
> ii  gcc-9                              9.3.0-17ubuntu1~20.04                 arm64        GNU C compiler
> ii  gcc-9-base:arm64                   9.3.0-17ubuntu1~20.04                 arm64        GCC, the GNU Compiler
Collection(base package) 
> ii  libgcc-9-dev:arm64                 9.3.0-17ubuntu1~20.04                 arm64        GCC support library
(developmentfiles) 
> ii  libgcc-s1:arm64                    10.3.0-1ubuntu1~20.04                 arm64        GCC support library
> ii  libllvm9:arm64                     1:9.0.1-12                            arm64        Modular compiler and
toolchaintechnologies, runtime library 

arm64, eh?  I wonder if that's buggier than the Intel code paths.

I tried and failed to reproduce this on Fedora 35 on aarch64,
but that has what I think is a newer LLVM version:

clang-13.0.0-3.fc35.aarch64
clang-libs-13.0.0-3.fc35.aarch64
clang-resource-filesystem-13.0.0-3.fc35.aarch64
gcc-11.2.1-9.fc35.aarch64
gcc-c++-11.2.1-9.fc35.aarch64
llvm-13.0.0-4.fc35.aarch64
llvm-devel-13.0.0-4.fc35.aarch64
llvm-libs-13.0.0-4.fc35.aarch64
llvm-static-13.0.0-4.fc35.aarch64
llvm-test-13.0.0-4.fc35.aarch64

Don't think I can readily install anything as old as LLVM 9 ...

            regards, tom lane



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote:
> arm64, eh?  I wonder if that's buggier than the Intel code paths.
> 
> I tried and failed to reproduce this on Fedora 35 on aarch64,
> but that has what I think is a newer LLVM version:

I have suspicion that it also kinda depends on number of rows in there.

When I deleted some schemas, which included some functions, the problem
disappeared.

I wasn't able to pinpoint specific thing, though, and when I called the
pg_proc + pg_roles query for each separate row - it worked flawlessly.

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote:
>> I tried and failed to reproduce this on Fedora 35 on aarch64,
>> but that has what I think is a newer LLVM version:

> I have suspicion that it also kinda depends on number of rows in there.
> When I deleted some schemas, which included some functions, the problem
> disappeared.
> I wasn't able to pinpoint specific thing, though, and when I called the
> pg_proc + pg_roles query for each separate row - it worked flawlessly.

Mmm ... it might have just been that the planner chose not to use
JIT when it thought there were fewer rows involved.  Did you check
with EXPLAIN that these cut-down cases still used JIT?

            regards, tom lane



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Mladen Gogala
Дата:
On 3/4/22 17:03, Tom Lane wrote:
Mmm ... it might have just been that the planner chose not to use
JIT when it thought there were fewer rows involved.  Did you check
with EXPLAIN that these cut-down cases still used JIT?

This is interesting and informative answer. How do I check whether JIT is used in the explain plan? Can you give me an example?

TIA

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Tom Lane
Дата:
Mladen Gogala <gogala.mladen@gmail.com> writes:
> On 3/4/22 17:03, Tom Lane wrote:
>> Mmm ... it might have just been that the planner chose not to use
>> JIT when it thought there were fewer rows involved.  Did you check
>> with EXPLAIN that these cut-down cases still used JIT?

> This is interesting and informative answer. How do I check whether JIT
> is used in the explain plan? Can you give me an example?

If EXPLAIN prints some stuff about JIT, then JIT is going to be
used, otherwise not.  Here's an example from my (entirely
unsuccessful) attempts to duplicate depesz's problem:

=# explain select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on pg_proc p  (cost=0.00..393587.22 rows=47243 width=128)
   SubPlan 1
     ->  Index Scan using pg_authid_oid_index on pg_authid  (cost=0.28..8.30 rows=1 width=64)
           Index Cond: (oid = p.proowner)
 JIT:
   Functions: 8
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(7 rows)

            regards, tom lane



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Fri, Mar 04, 2022 at 05:03:14PM -0500, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote:
> >> I tried and failed to reproduce this on Fedora 35 on aarch64,
> >> but that has what I think is a newer LLVM version:
> 
> > I have suspicion that it also kinda depends on number of rows in there.
> > When I deleted some schemas, which included some functions, the problem
> > disappeared.
> > I wasn't able to pinpoint specific thing, though, and when I called the
> > pg_proc + pg_roles query for each separate row - it worked flawlessly.
> 
> Mmm ... it might have just been that the planner chose not to use
> JIT when it thought there were fewer rows involved.  Did you check
> with EXPLAIN that these cut-down cases still used JIT?

Hi,
I tore these boxes down, so can't check immediately, but I think
I remember that you're right - single-row queries didn't use JIT.

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Mar 04, 2022 at 05:03:14PM -0500, Tom Lane wrote:
>> Mmm ... it might have just been that the planner chose not to use
>> JIT when it thought there were fewer rows involved.  Did you check
>> with EXPLAIN that these cut-down cases still used JIT?

> I tore these boxes down, so can't check immediately, but I think
> I remember that you're right - single-row queries didn't use JIT.

FWIW, I went to the trouble of installing Ubuntu Focal on my
raspberry pi to see if I could duplicate this, and I couldn't.
However, what you get from a fresh install now is

$ dpkg -l | grep libllvm
ii  libllvm10:arm64                1:10.0.0-4ubuntu1                     arm64        Modular compiler and toolchain
technologies,runtime library 

not 9.0.1.  I also found that Fedora 31/aarch64 is still downloadable, and
that does contain LLVM 9 ... and I could not reproduce it there either.

So I've run out of things to try.  It still seems like a bug in a specific
point release of LLVM is a possible explanation, especially given that
Debian went so far as to replace that release in an LTS distro.  (The
gcc-related packages are all exactly the same versions as you show.)
I'd believe that data corruption could be a contributing factor too,
except that you were able to read all the rows without JIT.

            regards, tom lane



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Sun, Mar 06, 2022 at 11:10:00AM -0500, Tom Lane wrote:
> > I tore these boxes down, so can't check immediately, but I think
> > I remember that you're right - single-row queries didn't use JIT.

Got focal box up. Loaded schema for Pg.

Initially select didn't break anything, but when I tuned down
jit_above_cost so that it will kick in - got fails immediately.

> FWIW, I went to the trouble of installing Ubuntu Focal on my
> raspberry pi to see if I could duplicate this, and I couldn't.
> However, what you get from a fresh install now is
> $ dpkg -l | grep libllvm                  
> ii  libllvm10:arm64                1:10.0.0-4ubuntu1                     arm64        Modular compiler and toolchain
technologies,runtime library
 
> not 9.0.1.  I also found that Fedora 31/aarch64 is still downloadable, and
> that does contain LLVM 9 ... and I could not reproduce it there either.

Soo... plot thickens.

Looks that pg 12 supplied by pgdg required libllvm9:

=$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):'
Package: postgresql-12
Version: 12.9-2.pgdg20.04+1
Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ...

Package: postgresql-12
Version: 12.9-0ubuntu0.20.04.1
Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...

Package: postgresql-12
Version: 12.2-4
Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...


Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to
use/depend-on llvm10 would solve the problem, for now I'm not sure how to do
it. Reached to Christoph about it.

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Tom Lane
Дата:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> Got focal box up. Loaded schema for Pg.
> Initially select didn't break anything, but when I tuned down
> jit_above_cost so that it will kick in - got fails immediately.

Hmph.  I tried two more builds:

* downgraded llvm to llvm-9, compiled PG 12 from source

* installed postgresql-12.10-1.pgdg20.04+1 from pgdg repo

Neither of those configurations fail for me, so either
it's been fixed since 12.9, or (more likely) there is
something to your test case beyond what you've mentioned.

(I guess a long-shot possibility is that my raspberry pi 3B+
is sufficiently different hardware from your box as to not
see the problem.  Doubt that though.)

Can you create a self-contained test case?

            regards, tom lane



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
Andres Freund
Дата:
Hi,

On 2022-03-07 16:11:28 +0100, hubert depesz lubaczewski wrote:
> On Sun, Mar 06, 2022 at 11:10:00AM -0500, Tom Lane wrote:
> > > I tore these boxes down, so can't check immediately, but I think
> > > I remember that you're right - single-row queries didn't use JIT.
> 
> Got focal box up. Loaded schema for Pg.
> 
> Initially select didn't break anything, but when I tuned down
> jit_above_cost so that it will kick in - got fails immediately.

Could you set jit_debugging_support=on and show a backtrace with that?


> > FWIW, I went to the trouble of installing Ubuntu Focal on my
> > raspberry pi to see if I could duplicate this, and I couldn't.
> > However, what you get from a fresh install now is
> > $ dpkg -l | grep libllvm                  
> > ii  libllvm10:arm64                1:10.0.0-4ubuntu1                     arm64        Modular compiler and
toolchaintechnologies, runtime library
 
> > not 9.0.1.  I also found that Fedora 31/aarch64 is still downloadable, and
> > that does contain LLVM 9 ... and I could not reproduce it there either.
> 
> Soo... plot thickens.
> 
> Looks that pg 12 supplied by pgdg required libllvm9:

What are the libllvm package versions on which you can reproduce the crash?


> Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to
> use/depend-on llvm10 would solve the problem, for now I'm not sure how to do
> it. Reached to Christoph about it.

It'd require rebuilding.


It's possible that the problem is one of the patches applied by ubuntu to llvm
and that's why others can't reproduce so far. There's quite a few...

Greetings,

Andres Freund



Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Mon, Mar 07, 2022 at 12:22:26PM -0500, Tom Lane wrote:
> Neither of those configurations fail for me, so either
> it's been fixed since 12.9, or (more likely) there is
> something to your test case beyond what you've mentioned.

Upgraded to 12.10 from pgdg, same problem.

> (I guess a long-shot possibility is that my raspberry pi 3B+
> is sufficiently different hardware from your box as to not
> see the problem.  Doubt that though.)

My "hardware" is AWS EC2 graviton instance (c6g.2xlarge).

> Can you create a self-contained test case?

I tried, but it seems that number of functions/users is a factor, and
I wasn't able to make simple test that would replicate the issue.

I guess I will just disable JIT on arm for now, and be done with it.

Best regards,

depesz




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

От
hubert depesz lubaczewski
Дата:
On Mon, Mar 07, 2022 at 09:54:22AM -0800, Andres Freund wrote:
> > Initially select didn't break anything, but when I tuned down
> > jit_above_cost so that it will kick in - got fails immediately.
> Could you set jit_debugging_support=on and show a backtrace with that?

Here you go:
Program received signal SIGSEGV, Segmentation fault.
0x0000fffe7587b6c8 in deform_0_8 ()
(gdb) bt
#0  0x0000fffe7587b6c8 in deform_0_8 ()
#1  0x0000fffe7587b5f4 in evalexpr_0_7 ()
#2  0x0000aaaab67e6f7c in ExecEvalExprSwitchContext (isNull=0xfffff7956217, econtext=0xaaaaf6014608,
state=0xaaaaf6014ab0)at ./build/../src/include/executor/executor.h:316
 
#3  ExecProject (projInfo=0xaaaaf6014aa8) at ./build/../src/include/executor/executor.h:350
#4  ExecScan (node=<optimized out>, accessMtd=0xaaaab68051b8 <SeqNext>, recheckMtd=0xaaaab6805248 <SeqRecheck>) at
./build/../src/backend/executor/execScan.c:239
#5  0x0000aaaab67ded5c in ExecProcNode (node=0xaaaaf60143b8) at ./build/../src/include/executor/executor.h:242
#6  ExecutePlan (execute_once=<optimized out>, dest=0xaaaaf5fc15e8, direction=<optimized out>, numberTuples=0,
sendTuples=<optimizedout>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0xaaaaf60143b8,
estate=0xaaaaf5e01360)
    at ./build/../src/backend/executor/execMain.c:1632
#7  standard_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=<optimized out>, count=0, execute_once=<optimized out>)
at./build/../src/backend/executor/execMain.c:350
 
#8  0x0000ffffa87751fc in pgss_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=ForwardScanDirection, count=0,
execute_once=true)at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893
 
#9  0x0000ffffa875f688 in explain_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=ForwardScanDirection, count=0,
execute_once=true)at ./build/../contrib/auto_explain/auto_explain.c:320
 
#10 0x0000aaaab69354c8 in PortalRunSelect (portal=portal@entry=0xaaaaf5e83ef0, forward=forward@entry=true, count=0,
count@entry=9223372036854775807,dest=dest@entry=0xaaaaf5fc15e8) at ./build/../src/backend/tcop/pquery.c:938
 
#11 0x0000aaaab6936a64 in PortalRun (portal=portal@entry=0xaaaaf5e83ef0, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0xaaaaf5fc15e8,
altdest=altdest@entry=0xaaaaf5fc15e8,
 
    completionTag=completionTag@entry=0xfffff7956848 "") at ./build/../src/backend/tcop/pquery.c:779
#12 0x0000aaaab6932a18 in exec_simple_query (query_string=query_string@entry=0xaaaaf5d9af90 "select p.proname, (SELECT
rolnamefrom pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;") at
./build/../src/backend/tcop/postgres.c:1215
#13 0x0000aaaab69336ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5df6f18, dbname=<optimized out>,
username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4271
 
#14 0x0000aaaab68ba768 in BackendRun (port=0xaaaaf5df0910, port=0xaaaaf5df0910) at
./build/../src/backend/postmaster/postmaster.c:4510
#15 BackendStartup (port=0xaaaaf5df0910) at ./build/../src/backend/postmaster/postmaster.c:4193
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#17 0x0000aaaab68bb74c in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
./build/../src/backend/postmaster/postmaster.c:1398
#18 0x0000aaaab664c8a8 in main (argc=5, argv=0xaaaaf5d94ea0) at ./build/../src/backend/main/main.c:228

> > > FWIW, I went to the trouble of installing Ubuntu Focal on my
> > > raspberry pi to see if I could duplicate this, and I couldn't.
> > > However, what you get from a fresh install now is
> > > $ dpkg -l | grep libllvm                  
> > > ii  libllvm10:arm64                1:10.0.0-4ubuntu1                     arm64        Modular compiler and
toolchaintechnologies, runtime library
 
> > > not 9.0.1.  I also found that Fedora 31/aarch64 is still downloadable, and
> > > that does contain LLVM 9 ... and I could not reproduce it there either.
> > 
> > Soo... plot thickens.
> > 
> > Looks that pg 12 supplied by pgdg required libllvm9:
> What are the libllvm package versions on which you can reproduce the crash?

Hmm .. not sure I fully understand. Here are all packages that I think are relevant:

root@c999-pg1302:~# dpkg -l | grep -iE 'postgresql-12|llvm|clang'
ii  libllvm9:arm64                     1:9.0.1-12                            arm64        Modular compiler and
toolchaintechnologies, runtime library
 
ii  postgresql-12                      12.9-2.pgdg20.04+1                    arm64        The World's Most Advanced
OpenSource Relational Database
 
ii  postgresql-12-dbgsym               12.9-2.pgdg20.04+1                    arm64        debug symbols for
postgresql-12
ii  postgresql-12-pg-collkey           0.5.1-1insops1                        arm64        ICU collation function
wrapperfor PostgreSQL 12
 
ii  postgresql-12-postgis-2.5          2.5.5+dfsg-1.pgdg20.04+2              arm64        Geographic objects support
forPostgreSQL 12
 
ii  postgresql-12-postgis-2.5-scripts  2.5.5+dfsg-1.pgdg20.04+2              all          Geographic objects support
forPostgreSQL 12 -- SQL scripts
 
ii  postgresql-12-repack               1.4.7-2.pgdg20.04+1                   arm64        reorganize tables in
PostgreSQLdatabases with minimal locks
 
ii  postgresql-12-system-stats         1.0.0-1                               arm64        A packaged version of
EnterpriseDB'ssystem_stats postgres extension
 

Hope it helps.

Best regards,

depesz




PostgreSQL on focal and llvm version

От
Christoph Berg
Дата:
Re: hubert depesz lubaczewski
> Soo... plot thickens.
> 
> Looks that pg 12 supplied by pgdg required libllvm9:
> 
> =$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):'
> Package: postgresql-12
> Version: 12.9-2.pgdg20.04+1
> Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ...
> 
> Package: postgresql-12
> Version: 12.9-0ubuntu0.20.04.1
> Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...
> 
> Package: postgresql-12
> Version: 12.2-4
> Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...
> 
> 
> Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to
> use/depend-on llvm10 would solve the problem, for now I'm not sure how to do
> it. Reached to Christoph about it.

Thanks for spotting that. The problem turned out me being not smart
enough to determine the newest llvm version installed; the
not-quite-correct Makefile code failed on "9" < "10".

I have now fixed that problem. Updated postgresql-{11,12,13,14}
packages using libllvm10 instead of libllvm9 are available in
focal-pgdg-testing.

Since this is potentially a breaking change, I'll not push these live
immediately but will give people a chance to test these until next
week.

Christoph



PostgreSQL on focal and llvm version

От
Christoph Berg
Дата:
Re: hubert depesz lubaczewski
> Soo... plot thickens.
> 
> Looks that pg 12 supplied by pgdg required libllvm9:
> 
> =$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):'
> Package: postgresql-12
> Version: 12.9-2.pgdg20.04+1
> Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ...
> 
> Package: postgresql-12
> Version: 12.9-0ubuntu0.20.04.1
> Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...
> 
> Package: postgresql-12
> Version: 12.2-4
> Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ...
> 
> 
> Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to
> use/depend-on llvm10 would solve the problem, for now I'm not sure how to do
> it. Reached to Christoph about it.

Thanks for spotting that. The problem turned out me being not smart
enough to determine the newest llvm version installed; the
not-quite-correct Makefile code failed on "9" < "10".

I have now fixed that problem. Updated postgresql-{11,12,13,14}
packages using libllvm10 instead of libllvm9 are available in
focal-pgdg-testing.

Since this is potentially a breaking change, I'll not push these live
immediately but will give people a chance to test these until next
week.

Christoph