Discussion: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query


BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16122
Logged by:          ascott
Email address:      adam.c.scott@gmail.com
PostgreSQL version: 10.11
Operating system:   Ubuntu & CentOS
Description:

Seg fault can be repeated by running this query:

select count(*) from natica_hdu where boundary is not null
and
"dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
'[45.0,45.1]';

It crashes on this line:

if (VARATT_IS_EXTENDED(datum)) in fmgr.c:1833

GDB stacktrace is below.

The table definition for natica_hdu is as follows:
CREATE TABLE public.natica_hdu
(
  id integer NOT NULL,
  updated timestamp with time zone NOT NULL,
  hdu_idx smallint NOT NULL,
  ra double precision,
  "dec" double precision,
  boundary double precision[],
  extras jsonb NOT NULL,
  fitsfile_id character varying(32) NOT NULL,
  dec_range numrange,
  ra_range numrange
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.natica_hdu
  OWNER TO postgres;

CREATE INDEX natica_hdu_dec_range_56c7d92d
  ON public.natica_hdu
  USING btree
  (dec_range);

CREATE INDEX natica_hdu_fitsfile_id_3a3363fe
  ON public.natica_hdu
  USING btree
  (fitsfile_id COLLATE pg_catalog."default");

CREATE INDEX natica_hdu_fitsfile_id_3a3363fe_like
  ON public.natica_hdu
  USING btree
  (fitsfile_id COLLATE pg_catalog."default" varchar_pattern_ops);


CREATE INDEX natica_hdu_q3c_ang2ipix_idx
  ON public.natica_hdu
  USING btree
  (q3c_ang2ipix(ra, "dec"));
ALTER TABLE public.natica_hdu CLUSTER ON natica_hdu_q3c_ang2ipix_idx;

CREATE INDEX natica_hdu_ra_range_b9f4d3ac
  ON public.natica_hdu
  USING btree
  (ra_range);


#0  pg_detoast_datum (datum=0xffffffff) at fmgr.c:1833
#1  0x0000557a18c19545 in numeric_sub (fcinfo=fcinfo@entry=0x7ffff5795e30)
    at numeric.c:2288
#2  0x0000557a18c9bb44 in DirectFunctionCall2Coll (
    func=0x557a18c19520 <numeric_sub>, collation=collation@entry=0, 
    arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:739
#3  0x0000557a18c2752a in numrange_subdiff (fcinfo=<optimized out>)
    at rangetypes.c:1424
#4  0x0000557a18c9c4af in FunctionCall2Coll (
    flinfo=flinfo@entry=0x557a1ad85098, collation=<optimized out>, 
    arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1059
#5  0x0000557a18c2c1ce in get_position (
    typcache=typcache@entry=0x557a1ad84f58,
value=value@entry=0x7ffff5796690, 
    hist2=hist2@entry=0x557a1ad76700, hist1=<optimized out>, 
    hist1=<optimized out>) at rangetypes_selfuncs.c:696
#6  0x0000557a18c2d679 in calc_hist_selectivity_contained (
    length_hist_nvalues=101, length_hist_values=0x557a1ad88568, 
    hist_nvalues=<optimized out>, hist_lower=<optimized out>, 
    upper=0x7ffff57966a0, lower=0x7ffff5796690, typcache=0x557a1ad84f58)
    at rangetypes_selfuncs.c:1065
#7  calc_hist_selectivity (operator=<optimized out>, constval=<optimized out>, 
    vardata=0x7ffff57966b0, typcache=0x557a1ad84f58)
    at rangetypes_selfuncs.c:561
#8  calc_rangesel (operator=<optimized out>, constval=<optimized out>, 
    vardata=0x7ffff57966b0, typcache=0x557a1ad84f58)
    at rangetypes_selfuncs.c:332
#9  rangesel (fcinfo=<optimized out>) at rangetypes_selfuncs.c:216
#10 0x0000557a18c9cfa9 in OidFunctionCall4Coll (functionId=<optimized out>, 
    collation=collation@entry=0, arg1=arg1@entry=93982924614952, 
    arg2=arg2@entry=3892, arg3=arg3@entry=93982924616184, arg4=arg4@entry=0)
    at fmgr.c:1415
#11 0x0000557a18b0f9e4 in restriction_selectivity (
    root=root@entry=0x557a1ad62d28, operatorid=3892, args=0x557a1ad631f8, 
    inputcollid=0, varRelid=varRelid@entry=0) at plancat.c:1670
#12 0x0000557a18ad56db in clause_selectivity (root=root@entry=0x557a1ad62d28, 
    clause=0x557a1ad63298, clause@entry=0x557a1ad74960, 
    varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, 
    sjinfo=sjinfo@entry=0x0) at clausesel.c:746
#13 0x0000557a18ad594d in clauselist_selectivity (
    root=root@entry=0x557a1ad62d28, clauses=<optimized out>, 
    varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, 
    sjinfo=sjinfo@entry=0x0) at clausesel.c:168
#14 0x0000557a18adb28d in set_baserel_size_estimates (
    root=root@entry=0x557a1ad62d28, rel=rel@entry=0x557a1ad73568)
    at costsize.c:4028
#15 0x0000557a18ad3c85 in set_plain_rel_size (rte=<optimized out>, 
    rel=0x557a1ad73568, root=0x557a1ad62d28) at allpaths.c:526
#16 set_rel_size (root=0x557a1ad62d28, rel=0x557a1ad73568, rti=1, 
    rte=<optimized out>) at allpaths.c:368
#17 0x0000557a18ad5015 in set_base_rel_sizes (root=<optimized out>)
    at allpaths.c:280
#18 make_one_rel (root=root@entry=0x557a1ad62d28, 
    joinlist=joinlist@entry=0x557a1ad73798) at allpaths.c:178
#19 0x0000557a18af22a0 in query_planner (root=root@entry=0x557a1ad62d28, 
    tlist=tlist@entry=0x557a1ad63038, 
    qp_callback=qp_callback@entry=0x557a18af3140 <standard_qp_callback>, 
    qp_extra=qp_extra@entry=0x7ffff5796f60) at planmain.c:259
#20 0x0000557a18af4900 in grouping_planner (root=root@entry=0x557a1ad62d28, 
    inheritance_update=inheritance_update@entry=0 '\000', 
    tuple_fraction=<optimized out>, tuple_fraction@entry=0) at planner.c:1733
#21 0x0000557a18af7a2f in subquery_planner (glob=glob@entry=0x557a1ad62c98, 
    parse=parse@entry=0x557a1ad43ef8, parent_root=parent_root@entry=0x0, 
    hasRecursion=hasRecursion@entry=0 '\000', 
    tuple_fraction=tuple_fraction@entry=0) at planner.c:862
#22 0x0000557a18af88b2 in standard_planner (parse=0x557a1ad43ef8, 
    cursorOptions=256, boundParams=0x0) at planner.c:334
#23 0x0000557a18b907dc in pg_plan_query (
    querytree=querytree@entry=0x557a1ad43ef8, 
    cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0)
    at postgres.c:819
#24 0x0000557a18b908b6 in pg_plan_queries (querytrees=<optimized out>, 
    cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0)
    at postgres.c:885
#25 0x0000557a18b90cf9 in exec_simple_query (
    query_string=0x557a1ad429c8 "select count(*) from natica_hdu where
boundary is not null\nand\n\"dec_range\" <@
'[89.9999998611111,90.0000001388889)' AND \"ra_range\" <@ '[45.0,45.1]';")
at postgres.c:1050
#26 0x0000557a18b929f1 in PostgresMain (argc=<optimized out>, 
    argv=argv@entry=0x557a1aceee90, dbname=<optimized out>, 
    username=<optimized out>) at postgres.c:4117
#27 0x0000557a188ea6ac in BackendRun (port=0x557a1ace77a0) at postmaster.c:4408
#28 BackendStartup (port=0x557a1ace77a0) at postmaster.c:4080
#29 ServerLoop () at postmaster.c:1756
#30 0x0000557a18b24d35 in PostmasterMain (argc=5, argv=0x557a1acbf790)
    at postmaster.c:1364
#31 0x0000557a188ebea9 in main (argc=5, argv=0x557a1acbf790) at main.c:228


Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
Hi Adam,

On Mon, Nov 18, 2019 at 01:27:22AM +0000, PG Bug reporting form wrote:
> Email address:      adam.c.scott@gmail.com
> PostgreSQL version: 10.11
> Operating system:   Ubuntu & CentOS
> Description:

Did you see that after updating to 10.11?  If you used 10.10 or
an older version, did the problem happen?

> Seg fault can be repeated by running this query:
>
> select count(*) from natica_hdu where boundary is not null
> and
> "dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
> '[45.0,45.1]';

It would help if we had a sample of data able to reproduce the
problem.  Something looks to be wrong with this range operator when
working on numeric ranges, but I still cannot reproduce the problem
with, for example, something like this:
CREATE TABLE aa (a numrange);
CREATE INDEX aai ON aa(a);
INSERT INTO aa
  SELECT ('['|| (90.0 - a::numeric/10000000000) || ',' ||
                (90.0 + a::numeric/10000000000) || ')')::numrange
  FROM generate_series(1,10000) as a;
SELECT count(*) FROM aa
  WHERE a <@ '[89.9999998611111,90.0000001388889)';

Seeing the plan of your query may help as well.  Could you run EXPLAIN
on it or does it crash before?  Perhaps a parallel plan is involved
here?

> #0  pg_detoast_datum (datum=0xffffffff) at fmgr.c:1833
> #1  0x0000557a18c19545 in numeric_sub (fcinfo=fcinfo@entry=0x7ffff5795e30)
>     at numeric.c:2288

Hmm.  Per the shape of this backtrace, you are indeed processing the
range operator, and this portion of memory looks freed.  My simplified
example above basically goes through the same path when planning the
query.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tomas Vondra
Date:
Hi Adam,

thanks for the report.

On Mon, Nov 18, 2019 at 01:27:22AM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference:      16122
>Logged by:          ascott
>Email address:      adam.c.scott@gmail.com
>PostgreSQL version: 10.11
>Operating system:   Ubuntu & CentOS
>Description:
>
>Seg fault can be repeated by running this query:
>
>select count(*) from natica_hdu where boundary is not null
>and
>"dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
>'[45.0,45.1]';
>
>it crashes on this line:
>
>if (VARATT_IS_EXTENDED(datum)) in fmgr.c:1833
>
>GDB stacktrace is below.
>

Hmmm, this does seem like some sort of memory/data corruption, probably.
But I'm unable to reproduce the issue; it's likely data-dependent. Can
you try extracting a data set that reproduces the issue? Not sure how large
or sensitive the data is, but if needed you can share it off-list.

>The table definition for natica_hdu is as follows:
>CREATE TABLE public.natica_hdu
>(
>  id integer NOT NULL,
>  updated timestamp with time zone NOT NULL,
>  hdu_idx smallint NOT NULL,
>  ra double precision,
>  "dec" double precision,
>  boundary double precision[],
>  extras jsonb NOT NULL,
>  fitsfile_id character varying(32) NOT NULL,
>  dec_range numrange,
>  ra_range numrange
>)
>WITH (
>  OIDS=FALSE
>);

Hm, the query only really touches dec_range and ra_range. If you try
just 

   select count(*) from natica_hdu
   where "dec_range" <@ '[89.9999998611111,90.0000001388889)'

and

   select count(*) from natica_hdu
   where "ra_range" <@ '[45.0,45.1]'

does either of those queries crash too? That would reduce the amount of
data for the reproducer.

If you try just

   COPY (SELECT ra_range, dec_range FROM public.natica_hdu) TO '/dev/null';

does that fail too? If yes, it's likely due to on-disk data corruption.
If not, it seems more like a memory corruption during processing.

I think the grants and indexes are mostly irrelevant for the issue,
particularly those not related to columns in the query.  Did you include
them just because they are defined, or because you think they're actually
related?

>ALTER TABLE public.natica_hdu
>  OWNER TO postgres;
>
>CREATE INDEX natica_hdu_dec_range_56c7d92d
>  ON public.natica_hdu
>  USING btree
>  (dec_range);
>
>CREATE INDEX natica_hdu_fitsfile_id_3a3363fe
>  ON public.natica_hdu
>  USING btree
>  (fitsfile_id COLLATE pg_catalog."default");
>
>CREATE INDEX natica_hdu_fitsfile_id_3a3363fe_like
>  ON public.natica_hdu
>  USING btree
>  (fitsfile_id COLLATE pg_catalog."default" varchar_pattern_ops);
>
>
>CREATE INDEX natica_hdu_q3c_ang2ipix_idx
>  ON public.natica_hdu
>  USING btree
>  (q3c_ang2ipix(ra, "dec"));
>ALTER TABLE public.natica_hdu CLUSTER ON natica_hdu_q3c_ang2ipix_idx;
>
>CREATE INDEX natica_hdu_ra_range_b9f4d3ac
>  ON public.natica_hdu
>  USING btree
>  (ra_range);

...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Adam Scott
Date:
> Did you see that after updating to 10.11.  If you used 10.10 or
> an older version, did the problem happen?
This was originally discovered on CentOS 7 with PG 10.10; the 40 GB table was then loaded into 10.11 on Ubuntu.

>Seeing the plan of your query may help as well.  Could you run EXPLAIN
>on it or does it crash before?  Perhaps a parallel plan is involved
>here?
EXPLAIN works fine with no crash.

I've done a binary search to find out where the error occurs in the data, but no luck.  It seems intermittent now.  Finally, I was able to reproduce the error repeatably with a blank table:

Stop and start postgres from fresh, and then run this query (note that I removed a sarg from the originally supplied query):
select id from natica_hdu_test
WHERE
"dec_range" <@ '[88.9999998611111,90.0000001388889)';

Thank you for your help in tracking this down!


Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Adam Scott
Date:
Hi Tomas, thank you for your help!  @Michael, it looks like 15 repeats is not enough to say it is "reliably" reproduced; hope this does it.

To reproduce, be sure to start postgres from scratch.   Load the dump file (below).  And then run the following to reliably reproduce the error:
select id from natica_hdu_test
WHERE
"dec_range" <@ '[88.9999998611111,90.0000001388889)';

It was bugging out with an empty table but it seems intermittent. A successful query followed by the one above will not fail sometimes; hence the need for a from-scratch startup instance of postgres.



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 02:51:25PM -0800, Adam Scott wrote:
>> Did you see that after updating to 10.11.  If you used 10.10 or
>> an older version, did the problem happen?
>
> Originally this first was discovered on Centos 7, PG 10.10, then loaded the
> 40gb table on 10.11 on Ubuntu.

Which means that on 10.9 or earlier you used this query, but did not notice
anything?  Or is 10.10 the first version in the 10.X series you used?

>> Seeing the plan of your query may help as well.  Could you run EXPLAIN
>> on it or does it crash before?  Perhaps a parallel plan is involved
>> here?
> Explain plan works fine with no crash

I was also wondering about the shape of the plan selected here.  The
crash happens while computing the selectivity of the clause during
planning, but seeing the plan could still help.

> I've done a binary search to find out where the error occurs in the data,
> but no luck.  It seems intermittent now.  Finally, I was able to reproduce
> the error repeatably with a blank table:
>
> Stop and start postgres from fresh, and then run this query (notice, I
> removed a sarg from the originally supplied query):
> select id from natica_hdu_test
> WHERE
> "dec_range" <@ '[88.9999998611111,90.0000001388889)';

Thanks.  I can see that you have provided a dump in the other part of
the thread.  Let's continue from that.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 03:59:14PM -0800, Adam Scott wrote:
> It was bugging out with an empty table but it seems intermittent. A
> successful query followed by the one above will not fail sometimes; hence
> the need for a from-scratch startup instance of postgres.

Okay, that has proved to help a lot.  I have restored the dump, and
after an analyze on natica_hdu_test I can see the failure
immediately.

I have also extracted a small test case able to reproduce the failure.
That's actually pretty close to the scenarios I was playing with
upthread, but I still missed it:
CREATE TABLE rangetab (a numrange);
INSERT INTO rangetab
SELECT ('['|| (45.0 - a::numeric/10000000) || ',' ||
              (45.1 + a::numeric/10000000) || ')')::numrange
FROM generate_series(1,1000) as a;
ANALYZE rangetab;
SELECT a FROM rangetab
  WHERE a <@ '[89.9999998611111,90.0000001388889)';

I can easily reproduce the failure on HEAD, REL_12_STABLE and
REL_11_STABLE.  On 10, I could not.  But now that we have a test case
that's only a matter of time.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Adam Scott
Date:
Great!  Thank you Michael and Tomas!  

Maybe a clue to the issue, why does a numrange need to be detoasted since that is for large strings?

Let me know how I can further help.

Cheers,
Adam



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 06:22:11PM -0800, Adam Scott wrote:
> Maybe a clue to the issue, why does a numrange need to be detoasted since
> that is for large strings?

I looked at the code a bit but could not really figure out what could
go wrong.  Then I switched my strategy and bisected for the first
commit leading to the failure, and this is pointing me to the
integration of stdbool.h for my environment (Debian SID with kernel at
5.3):
commit: 9a95a77d9d5d3003d2d67121f2731b6e5fc37336
author: Peter Eisentraut <peter_e@gmx.net>
date: Thu, 22 Mar 2018 20:42:25 -0400
Use stdbool.h if suitable

Using the standard bool type provided by C allows some recent compilers
and debuggers to give better diagnostics.  Also, some extension code and
third-party headers are increasingly pulling in stdbool.h, so it's
probably saner if everyone uses the same definition.

I have also checked that manually and I am getting a match.  And the
thing is that this commit did not make it through 10...
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tomas Vondra
Date:
On Tue, Nov 19, 2019 at 12:04:28PM +0900, Michael Paquier wrote:
>On Mon, Nov 18, 2019 at 06:22:11PM -0800, Adam Scott wrote:
>> Maybe a clue to the issue, why does a numrange need to be detoasted since
>> that is for large strings?
>
>I looked at the code a bit but could not really figure out what could
>go wrong.  Then I switched my strategy and bisected for the first
>commit leading to the failure, and this is pointing me to the
>integration of stdbool.h for my environment (Debian SID with kernel at
>5.3):
>commit: 9a95a77d9d5d3003d2d67121f2731b6e5fc37336
>author: Peter Eisentraut <peter_e@gmx.net>
>date: Thu, 22 Mar 2018 20:42:25 -0400
>Use stdbool.h if suitable
>
>Using the standard bool type provided by C allows some recent compilers
>and debuggers to give better diagnostics.  Also, some extension code and
>third-party headers are increasingly pulling in stdbool.h, so it's
>probably saner if everyone uses the same definition.
>
>I have also checked that manually and I am getting a match.  And the
>thing is that this commit did not make it through 10...

Not sure I understand. How could this be the root cause when it's not
present in the 10.x branch (which is what Adam is using)?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tomas Vondra
Date:
On Tue, Nov 19, 2019 at 10:48:27AM +0900, Michael Paquier wrote:
>On Mon, Nov 18, 2019 at 03:59:14PM -0800, Adam Scott wrote:
>> It was bugging out with an empty table but it seems intermittent. A
>> successful query followed by the one above will not fail sometimes; hence
>> the need for a from-scratch startup instance of postgres.
>
>Okay, that has proved to help a lot.  I have restored the dump, and
>after an analyze on natica_hdu_test I can see the failure
>immediately.
>
>I have also extracted a small test case able to reproduce the failure.
>That's actually pretty close to the scenarios I was playing with
>upthread, still I missed it:
>CREATE TABLE rangetab (a numrange);
>INSERT INTO rangetab
>SELECT ('['|| (45.0 - a::numeric/10000000) || ',' ||
>              (45.1 + a::numeric/10000000) || ')')::numrange
>FROM generate_series(1,1000) as a;
>ANALYZE rangetab;
>SELECT a FROM rangetab
>  WHERE a <@ '[89.9999998611111,90.0000001388889)';        
>
>I can easily reproduce the failure on HEAD, REL_12_STABLE and
>REL_11_STABLE.  On 10, I could not.  But now that we have a test case
>that's only a matter of time.

FWIW I've managed to reproduce this on 10, but I had to build without
--enable-cassert. So this does trigger the issue:

$ ./configure --enable-debug --prefix=/home/user/pg-10
$ make -s clean
$ make -s -j8 install
$ pg_ctl -D /tmp/data init
$ pg_ctl -D /tmp/data -l pg.log start
$ createdb test
$ psql test < bug.sql
$ pgbench -n -c 1 -f q.sql test

where q.sql is simply

select id from natica_hdu_test
WHERE
"dec_range" <@ '[88.9999998611111,90.0000001388889)';

Haven't investigated further yet.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 11:38:13AM +0100, Tomas Vondra wrote:
> FWIW I've managed to reproduce this on 10, but I had to build without
> --enable-cassert. So this does trigger the issue:
>
> Haven't investigated further yet.

If you add an ANALYZE on the table natica_hdu_test after restoring, I
am rather sure that you would reproduce the crash more quickly, because
the handling of that column's stats is busted here.  Anyway, taking my
example from upthread, I have also been able to reproduce the problem
on REL_10_STABLE even with assertions enabled: the trick is that you
need to exit the session once after the ANALYZE on the table.  Then a
SELECT within a new session is enough to crash the server.

The change with stdbool.h actually makes the crash easier to reproduce,
as there is no need to leave the session.  I am not sure how it
mattered, though.

[ ... And one bisect later ... ]

This looks more correct as the culprit than the previous one, because it
touches the area of the crash:
commit: 9aab83fc5039d83e84144b7bed3fb1d62a74ae78
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sat, 13 May 2017 15:14:39 -0400
Redesign get_attstatsslot()/free_attstatsslot() for more safety and speed.

It seems to me that we are simply freeing an area which still
needs to be accessed for the stats estimations.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 11:18:18AM +0100, Tomas Vondra wrote:
> Not sure I understand. How could this be the root cause when it's not
> present in the 10.x branch (which is what Adam is using)?

Sorry for the misunderstanding here.  The switch to stdbool.h visibly
makes it easier for the issue to appear, but with more retries of the
test case while bisecting, the error turns out to originate from 9aab83f.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 08:40:56PM +0900, Michael Paquier wrote:
> If you add an ANALYZE on the table natica_hdu_test after restoring, I
> am rather sure that you would reproduce the crash more quickly, because
> the handling of that column's stats is busted here.  Anyway, taking my
> example from upthread, I have also been able to reproduce the problem
> on REL_10_STABLE even with assertions enabled: the trick is that you
> need to exit the session once after the ANALYZE on the table.  Then a
> SELECT within a new session is enough to crash the server.

So...  I have looked more at this one, and from my previous example it
seems that we have an off-by-one error when looking up the array
holding the histograms for ranges (lower and upper bound).

In my previous example, we get to build 101 RangeBounds when beginning
to calculate the range operator selectivity in
calc_hist_selectivity().  However, when we get to the point of
calc_hist_selectivity_contained(), upper_index gets calculated as 100,
which is just at the limit of the indexed bounds, and the code would
happily look at the last bound as well as the one-after-the-last bound
as range_cmp_bounds() sees fit, but the latter just points to the
void.  The code has looked wrong since its introduction in 59d0bf9d, and
it seems that the changes done for free_attstatsslot() in 9aab83f make
the issue more easily reproducible.
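
For illustration only, here is a tiny standalone C sketch of the out-of-bounds
read described above; the RangeBound stand-in and the numbers are mock-ups, not
the real definitions from rangetypes.h and rangetypes_selfuncs.c:

#include <stdio.h>

/* Minimal stand-in for the real RangeBound struct, only to show the indexing. */
typedef struct
{
    double      val;        /* stands in for the Datum bound value */
    int         infinite;   /* stands in for the bool flag */
} RangeBound;

int
main(void)
{
    /* calc_hist_selectivity() builds hist_nvalues bounds: indexes 0..100. */
    enum { hist_nvalues = 101 };
    RangeBound  hist_lower[hist_nvalues];
    /* upper_index can legitimately come out as hist_nvalues - 1 = 100 ... */
    int         upper_index = hist_nvalues - 1;

    /* ... and the code then reads both hist_lower[upper_index] and
     * hist_lower[upper_index + 1]; the second index (101) is one past the
     * end of the array, so whatever garbage sits there gets treated as a
     * bound and is eventually dereferenced down in numeric_sub(). */
    printf("hist_lower has %d entries (indexes 0..%d), but index %d gets read\n",
           (int) (sizeof(hist_lower) / sizeof(hist_lower[0])),
           hist_nvalues - 1, upper_index + 1);
    return 0;
}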

A fix like the rough POC attached addresses the issue, but I think
that's too naive to not count for the first bin in the ranges
evaluated.  Tomas, you may be more familiar with this area of the code
than I am.  What do you think?
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Andrey Borodin
Date:

> On Dec 10, 2019, at 10:38, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 19, 2019 at 08:40:56PM +0900, Michael Paquier wrote:
>> If you add an ANALYZE on the table natica_hdu_test after restoring, I
>> am rather sure that you would reproduce the crash more quickly, because
>> the handling of that column's stats is busted here.  Anyway, taking my
>> example from upthread, I have also been able to reproduce the problem
>> on REL_10_STABLE even with assertions enabled: the trick is that you
>> need to exit the session once after the ANALYZE on the table.  Then a
>> SELECT within a new session is enough to crash the server.
>
> So...  I have looked more at this one, and from my previous example it
> seems that we have an off-by-one error when looking up the array
> holding the histograms for ranges (lower and upper bound).
>
> In my previous example, we get to build 101 RangeBounds when beginning
> to calculate the range operator selectivity in
> calc_hist_selectivity().  However, when we get to the point of
> calc_hist_selectivity_contained(), upper_index gets calculated as 100,
> which is just at the limit of the indexed bounds, and the code would
> happily look at the last bound as well as the one-after-the-last bound
> as range_cmp_bounds() sees fit, but the latter just points to the
> void.  The code has looked wrong since its introduction in 59d0bf9d, and
> it seems that the changes done for free_attstatsslot() in 9aab83f make
> the issue more easily reproducible.
>
> A fix like the rough POC attached addresses the issue, but I think
> that's too naive to not count for the first bin in the ranges
> evaluated.  Tomas, you may be more familiar with this area of the code
> than I am.  What do you think?

So... I was looking for some patches on the CF and found this one. It's a one-liner, what could go wrong?

1. The new tests do not fail on my machine for the added test case without that line (on the master branch; from this
thread I concluded that they should fail).
2. I believe the line should not be
+    for (i = upper_index - 1; i >= 0; i--)
but rather
+    for (i = min(upper_index, hist_nvalues - 2); i >= 0; i--)

I will dig into this during this CF. Currently, that's my 2 cents.


Best regards, Andrey Borodin.


Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Andrey Borodin
Date:

> On Jan 4, 2020, at 0:05, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I believe the line should not be
> +    for (i = upper_index - 1; i >= 0; i--)
> but rather
> +    for (i = min(upper_index, hist_nvalues - 2); i >= 0; i--)
>
> I will dig into this during this CF. Currently, that's my 2 cents.
I think I have a slightly more correct fix.

As far as I understand, we have an array of bin lower bounds, hist_lower, with size hist_nvalues.
There is a function get_position(.., value, lower, upper), which calculates the relative position of value between the
lower and upper bounds.
We call get_position(typcache, lower, &hist_lower[i], &hist_lower[i + 1]); when i is the last bin, i.e. i + 1 ==
hist_nvalues, we thus pass a bogus upper to get_position().
PFA a possible fix for this.

Elsewhere in the code, the same situation, where upper is undefined, is treated as if get_position() returned 0.
Also, while get_position() is not prepared for a bogus upper, it is well aware of infinite bounds. We could just add a
last infinite value to hist_lower and remove the upper_index < hist_nvalues - 1 and i + 1 != hist_nvalues checks entirely.
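
For illustration, a rough standalone sketch of that idea (hypothetical code, not the attached patch; get_position() is
replaced by a simplified linear interpolation just to show where the guard would go):

#include <stdio.h>

/* Hypothetical guard along the lines described above: when i is the last bin
 * there is no hist_lower[i + 1], so behave as if get_position() returned 0,
 * mirroring how the code already treats an undefined upper bound. */
static double
pos_or_zero(int i, int hist_nvalues, const double *hist_lower, double value)
{
    if (i + 1 == hist_nvalues)
        return 0.0;
    /* simplified stand-in for get_position() on finite, ordered bounds */
    return (value - hist_lower[i]) / (hist_lower[i + 1] - hist_lower[i]);
}

int
main(void)
{
    double      hist_lower[] = {0.0, 10.0, 20.0};
    int         hist_nvalues = 3;

    printf("%f\n", pos_or_zero(1, hist_nvalues, hist_lower, 15.0));    /* 0.500000 */
    printf("%f\n", pos_or_zero(2, hist_nvalues, hist_lower, 25.0));    /* 0.000000, last bin */
    return 0;
}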

Best regards, Andrey Borodin.

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tom Lane
Date:
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> PFA possible fix for this.

I looked into this a bit too.  I don't have a lot of faith in Michael's
example; it took me four tries to find a machine it would fail on.
Still, I did eventually get a crash, and I concur that the problem is
that calc_hist_selectivity_contained tries to access the entry after
the histogram end.

I believe that Michael's draft fix is actually about right, because
we don't want to treat the key space above the histogram upper bound
as being an actual bin: we should assume that there are no values there.
So just starting the loop at the last real bin is sufficient.  However,
I'd prefer to handle the two edge cases (i.e., probe value below histogram
lower bound or above histogram upper bound) explicitly, because that makes
it more apparent what's going on.

There are a couple of other ways in which this code seems insufficiently
paranoid to me:

* It seems worth spending a couple lines of code at the beginning to
verify that the histogram has at least one bin, as it already did for
the range-length histogram.  Otherwise we've got various divide-by-zero
hazards here, along with a need for extra tests in the looping functions.

* I think we'd better guard against NaNs coming back from the range
subdiff function, as well as the possibility of getting a NaN from
the division in get_position (compare [1]).

* I am not quite totally sure about this part, but it seems to me
that calc_hist_selectivity_contains probably ought to handle the
range-bound cases the same as calc_hist_selectivity_contained,
even though only the latter is trying to access an unsafe array
index.

That all leads me to the attached draft patch.  It lacks a test
case --- I wonder if we can find one that crashes more portably
than Michael's?  And more eyeballs on calc_hist_selectivity_contains
would be good too.

            regards, tom lane

[1] https://www.postgresql.org/message-id/12957.1577974305@sss.pgh.pa.us

diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index a4d7a7a..0a41016 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -400,6 +400,13 @@ calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
                            ATTSTATSSLOT_VALUES)))
         return -1.0;

+    /* check that it's a histogram, not just a dummy entry */
+    if (hslot.nvalues < 2)
+    {
+        free_attstatsslot(&hslot);
+        return -1.0;
+    }
+
     /*
      * Convert histogram of ranges into histograms of its lower and upper
      * bounds.
@@ -696,21 +703,22 @@ get_position(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound
             return 0.5;

         /* Calculate relative position using subdiff function. */
-        bin_width = DatumGetFloat8(FunctionCall2Coll(
-                                                     &typcache->rng_subdiff_finfo,
+        bin_width = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
                                                      typcache->rng_collation,
                                                      hist2->val,
                                                      hist1->val));
-        if (bin_width <= 0.0)
+        if (isnan(bin_width) || bin_width <= 0.0)
             return 0.5;            /* zero width bin */

-        position = DatumGetFloat8(FunctionCall2Coll(
-                                                    &typcache->rng_subdiff_finfo,
+        position = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
                                                     typcache->rng_collation,
                                                     value->val,
                                                     hist1->val))
             / bin_width;

+        if (isnan(position))
+            return 0.5;            /* punt if we have any NaNs or Infs */
+
         /* Relative position must be in [0,1] range */
         position = Max(position, 0.0);
         position = Min(position, 1.0);
@@ -806,11 +814,19 @@ get_distance(TypeCacheEntry *typcache, const RangeBound *bound1, const RangeBoun
          * value of 1.0 if no subdiff is available.
          */
         if (has_subdiff)
-            return
-                DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
-                                                 typcache->rng_collation,
-                                                 bound2->val,
-                                                 bound1->val));
+        {
+            float8        res;
+
+            res = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
+                                                   typcache->rng_collation,
+                                                   bound2->val,
+                                                   bound1->val));
+            /* Protect against possible NaN result */
+            if (isnan(res))
+                return 1.0;
+            else
+                return res;
+        }
         else
             return 1.0;
     }
@@ -1021,16 +1037,28 @@ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
                                  false);

     /*
+     * If the upper bound value is below the histogram's lower limit, there
+     * are no matches.
+     */
+    if (upper_index < 0)
+        return 0.0;
+
+    /*
+     * If the upper bound value is at or beyond the histogram's upper limit,
+     * start our loop at the last actual bin.  (This corresponds to assuming
+     * that the data population above the histogram's upper limit is empty,
+     * exactly like what we just assumed for the lower limit.)
+     */
+    upper_index = Min(upper_index, hist_nvalues - 2);
+
+    /*
      * Calculate upper_bin_width, ie. the fraction of the (upper_index,
      * upper_index + 1) bin which is greater than upper bound of query range
      * using linear interpolation of subdiff function.
      */
-    if (upper_index >= 0 && upper_index < hist_nvalues - 1)
-        upper_bin_width = get_position(typcache, upper,
-                                       &hist_lower[upper_index],
-                                       &hist_lower[upper_index + 1]);
-    else
-        upper_bin_width = 0.0;
+    upper_bin_width = get_position(typcache, upper,
+                                   &hist_lower[upper_index],
+                                   &hist_lower[upper_index + 1]);

     /*
      * In the loop, dist and prev_dist are the distance of the "current" bin's
@@ -1125,15 +1153,27 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
                                  true);

     /*
+     * If the lower bound value is below the histogram's lower limit, there
+     * are no matches.
+     */
+    if (lower_index < 0)
+        return 0.0;
+
+    /*
+     * If the lower bound value is at or beyond the histogram's upper limit,
+     * start our loop at the last actual bin.  (This corresponds to assuming
+     * that the data population above the histogram's upper limit is empty,
+     * exactly like what we just assumed for the lower limit.)
+     */
+    lower_index = Min(lower_index, hist_nvalues - 2);
+
+    /*
      * Calculate lower_bin_width, ie. the fraction of the of (lower_index,
      * lower_index + 1) bin which is greater than lower bound of query range
      * using linear interpolation of subdiff function.
      */
-    if (lower_index >= 0 && lower_index < hist_nvalues - 1)
-        lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
-                                       &hist_lower[lower_index + 1]);
-    else
-        lower_bin_width = 0.0;
+    lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
+                                   &hist_lower[lower_index + 1]);

     /*
      * Loop through all the lower bound bins, smaller than the query lower

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Mon, Jan 06, 2020 at 03:28:34PM -0500, Tom Lane wrote:
> I believe that Michael's draft fix is actually about right, because
> we don't want to treat the key space above the histogram upper bound
> as being an actual bin: we should assume that there are no values there.
> So just starting the loop at the last real bin is sufficient.  However,
> I'd prefer to handle the two edge cases (i.e., probe value below histogram
> lower bound or above histogram upper bound) explicitly, because that makes
> it more apparent what's going on.

Actually, I suspect that the original error comes from a copy-pasto
from calc_hist_selectivity_contains to calc_hist_selectivity_contained
when the code was originally written.

> There are a couple of other ways in which this code seems insufficiently
> paranoid to me:
>
> * It seems worth spending a couple lines of code at the beginning to
> verify that the histogram has at least one bin, as it already did for
> the range-length histogram.  Otherwise we've got various divide-by-zero
> hazards here, along with a need for extra tests in the looping functions.

Fine by me.

> * I think we'd better guard against NaNs coming back from the range
> subdiff function, as well as the possibility of getting a NaN from
> the division in get_position (compare [1]).

Good idea.

> * I am not quite totally sure about this part, but it seems to me
> that calc_hist_selectivity_contains probably ought to handle the
> range-bound cases the same as calc_hist_selectivity_contained,
> even though only the latter is trying to access an unsafe array
> index.

Both could actually be merged, no?  The assumptions behind the
upper/lower bound lookups are actually just mirrors of each other
based on which side of the operators <@ or @> the code looks at.
Perhaps the current code is cleaner as it is anyway...

> That all leads me to the attached draft patch.  It lacks a test
> case --- I wonder if we can find one that crashes more portably
> than Michael's?  And more eyeballs on calc_hist_selectivity_contains
> would be good too.

It would be nice to keep the test case, as it crashes immediately on
my own Debian laptop.  Anyway, if we add a test, we really need
something that pokes at the histogram bin holding the upper bound.
Just something like the attached, on HEAD, proves that we have
zero coverage for it:
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1050,6 +1050,7 @@ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
    double      dist;
    double      length_hist_frac;
    bool        final_bin = false;
+   Assert(i == upper_index);
So that's not good from the start..

-   if (bin_width <= 0.0)
+   if (isnan(bin_width) || bin_width <= 0.0)
     return 0.5;            /* zero width bin */
This comment is incorrect now?

+   if (isnan(position))
+       return 0.5;            /* punt if we have any NaNs or Infs */
+
This comment is incorrect as well, no?  In this case both histograms
have finite bounds, and you just check if that's a number.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jan 06, 2020 at 03:28:34PM -0500, Tom Lane wrote:
>> * I am not quite totally sure about this part, but it seems to me
>> that calc_hist_selectivity_contains probably ought to handle the
>> range-bound cases the same as calc_hist_selectivity_contained,
>> even though only the latter is trying to access an unsafe array
>> index.

> Both could actually be merged, no?  The assumptions behind the
> upper/lower bound lookups are actually just mirrors of each other
> based on which side of the operators <@ or @> the code looks at.

Hmm ... maybe, but I'm not entirely convinced that they should be
mirror images.  The endpoint handling is the same right now, but
should it be?  Anyway, merging them seems out of scope for a bug fix.

>> That all leads me to the attached draft patch.  It lacks a test
>> case --- I wonder if we can find one that crashes more portably
>> than Michael's?  And more eyeballs on calc_hist_selectivity_contains
>> would be good too.

> It would be nice to keep the test case as it crashes immediately on
> own Debian laptop.

I wouldn't mind having a test case if it crashes reliably, but it
doesn't for me.  I poked into why not, and realized that even if
get_position is passed a pointer to garbage data in hist2, it will
not crash unless the "!hist2->infinite" test succeeds, because
otherwise it won't call the rng_subdiff function.  Depending
on compiler and optimization level, that usually means that any
nonzero value in that byte will prevent a crash.  Given that the
whole point here is we're accessing memory that hasn't been
initialized by this code, we don't have any way to make it likely
that that won't be true.

(Of course, we could have calc_hist_selectivity() allocate one
extra array slot and zero it, but that's getting a bit silly.)

BTW, I noticed that at -O0, my compiler generates a code sequence
that effectively implements that test as "hist2->infinite != true",
ie it will succeed for any bit pattern that's not exactly 0x01.
This greatly raises the odds of seeing the crash of course (so maybe
your compiler is doing likewise?).  I find that rather disturbing
because my mental model has always been that any nonzero bit pattern
in a bool would be interpreted as true.  Apparently the gcc boys think
that they're entitled to consider _Bool values other than exactly 0
and 1 as undefined, which is scary because I'm not sure we are totally
consistent about that everywhere.  It definitely makes it harder to
reason about what will happen with garbage memory contents.  (This
might also explain why you saw a change in behavior at 9a95a77d9d.)
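
For what it's worth, a tiny standalone illustration of that hazard (the exact
output is compiler- and flag-dependent, which is precisely the point):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* Force a "garbage" bit pattern into a bool's storage.  A strict C99
     * _Bool is only supposed to hold 0 or 1, so how "b" and "!b" behave
     * below is not guaranteed and varies with compiler and optimization. */
    unsigned char raw = 0x02;
    bool        b;

    memcpy(&b, &raw, sizeof b);
    printf("b = %d, !b = %d\n", (int) b, (int) !b);
    return 0;
}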

So the bottom line here is that I don't think that we can build a
test case that will crash with any large probability, especially
not if it's just one part of a test script --- the longer the session
has been running, the greater the odds are that whatever is just past
the allocated histogram array contains nonzero garbage.  I'm not sure
it's worth spending test cycles forevermore on a test that most likely
will fail to reveal a bug even if it's there.

> -   if (bin_width <= 0.0)
> +   if (isnan(bin_width) || bin_width <= 0.0)
>      return 0.5;            /* zero width bin */
> This comment is incorrect now?

Yeah, it could be improved.  Maybe "punt for NaN or zero-width bin"?

> +   if (isnan(position))
> +       return 0.5;            /* punt if we have any NaNs or Infs */
> +
> This comment is incorrect as well, no?  In this case both histograms
> have finite bounds, and you just check if that's a number.

The point is that we could get a NaN from the division either from
the rng_subdiff function returning NaN, or from an undefined case
such as Inf/Inf.  Maybe write "punt for NaN input, Inf/Inf, etc"?
(We don't need to test for Inf explicitly, because the tests just
below can deal with that.)
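
(For illustration, a quick standalone check that such a division does come out
as NaN and is caught by isnan(); the values are made up:)

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      bin_width = INFINITY;               /* e.g. an extreme subdiff result */
    double      position = INFINITY / bin_width;    /* Inf/Inf is NaN */

    printf("position = %f, isnan(position) = %d\n", position, isnan(position));
    return 0;
}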

            regards, tom lane



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
On Tue, Jan 07, 2020 at 10:08:07AM -0500, Tom Lane wrote:
> Hmm ... maybe, but I'm not entirely convinced that they should be
> mirror images.  The endpoint handling is the same right now, but
> should it be?  Anyway, merging them seems out of scope for a bug fix.

Sure.  That was just a random thought while reading the code again.

> I wouldn't mind having a test case if it crashes reliably, but it
> doesn't for me.  I poked into why not, and realized that even if
> get_position is passed a pointer to garbage data in hist2, it will
> not crash unless the "!hist2->infinite" test succeeds, because
> otherwise it won't call the rng_subdiff function.  Depending
> on compiler and optimization level, that usually means that any
> nonzero value in that byte will prevent a crash.  Given that the
> whole point here is we're accessing memory that hasn't been
> initialized by this code, we don't have any way to make it likely
> that that won't be true.

Ah, OK.  That makes sense.  If you don't wish to keep the test around
in core, that's fine by me.  I may keep it in one of my own repos
just for history's sake.

> BTW, I noticed that at -O0, my compiler generates a code sequence
> that effectively implements that test as "hist2->infinite != true",
> ie it will succeed for any bit pattern that's not exactly 0x01.
> This greatly raises the odds of seeing the crash of course (so maybe
> your compiler is doing likewise?).  I find that rather disturbing
> because my mental model has always been that any nonzero bit pattern
> in a bool would be interpreted as true.  Apparently the gcc boys think
> that they're entitled to consider _Bool values other than exactly 0
> and 1 as undefined, which is scary because I'm not sure we are totally
> consistent about that everywhere.  It definitely makes it harder to
> reason about what will happen with garbage memory contents.  (This
> might also explain why you saw a change in behavior at 9a95a77d9d.)

That would not be the first time my environment behaves differently
from the others.  My build scripts have been pretty good at catching
issues with compiled code, particularly when it comes to ordering.

>> -   if (bin_width <= 0.0)
>> +   if (isnan(bin_width) || bin_width <= 0.0)
>>      return 0.5;            /* zero width bin */
>> This comment is incorrect now?
>
> Yeah, it could be improved.  Maybe "punt for NaN or zero-width bin"?

Sounds fine to me.

>> +   if (isnan(position))
>> +       return 0.5;            /* punt if we have any NaNs or Infs */
>> +
>> This comment is incorrect as well, no?  In this case both histograms
>> have finite bounds, and you just check if that's a number.
>
> The point is that we could get a NaN from the division either from
> the rng_subdiff function returning NaN, or from an undefined case
> such as Inf/Inf.  Maybe write "punt for NaN input, Inf/Inf, etc"?
> (We don't need to test for Inf explicitly, because the tests just
> below can deal with that.)

I would rewrite this comment for clarity.  But I am fine leaving that
to you in the end if you feel differently.
--
Michael

Attachments

Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jan 07, 2020 at 10:08:07AM -0500, Tom Lane wrote:
>> I wouldn't mind having a test case if it crashes reliably, but it
>> doesn't for me.

> Ah, OK.  That makes sense.  If you don't wish to keep the test around
> in core, that's fine to me.  I may keep that into one of my own repos
> just for history's sake.

OK.  Hearing no further comments, I made another pass over the code
(improved some comments, added one more guard against silly subdiff
results) and pushed this.

            regards, tom lane



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Adam Scott
Date:
Thank you, Michael!  I know this is a late response, but I didn't want to clutter the message stream until we had confirmation: I applied this patch to our server and could not replicate the bug.  So the patch is working for us presently.

Thanks again Michael,
Adam



Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query

From
Michael Paquier
Date:
Hi Adam,

On Mon, Jan 13, 2020 at 10:56:32AM -0800, Adam Scott wrote:
> Thank you Michael!  I know this is a late response but didn't want to
> clutter the message stream until we had confirmation, but I applied this
> patch to our server and could not replicate the bug.  So this patch is
> working for us presently.

This was broken for ages, so a lot has been done thanks to you taking
the time to report the bug in the first place.  Tom has committed a
much more paranoid (and better!) fix as of 652686a, so you will have
the fix in the next minor release, planned for the second week of
February (10.12 in your case).  The committed patch did not include a
regression test, but I am keeping track of it.
--
Michael

Attachments