Обсуждение: assertion failure at cost_memoize_rescan()

Поиск
Список
Период
Сортировка

assertion failure at cost_memoize_rescan()

От
Kohei KaiGai
Дата:
Hello,

I met an assertion failure, and identified the root of the problem, but no idea how to fix it.

The location of the problematic Assert() is at cost_memoize_rescan() to check 'hit_ratio' is between 0.0 and 1.0.
The 'calls' is provided by the caller, and 'ndistinct' is the result of estimate_num_groups().

#4  0x000000000084d583 in cost_memoize_rescan (root=0x2e95748, mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:2564
/home/kaigai/source/pgsql-16/src/backend/optimizer/path/costsize.c:2564:83932:beg:0x84d583
(gdb) l
2559             * how many of those scans we expect to get a cache hit.
2560             */
2561            hit_ratio = ((calls - ndistinct) / calls) *
2562                    (est_cache_entries / Max(ndistinct, est_cache_entries));
2563
2564            Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
2565
2566            /*
2567             * Set the total_cost accounting for the expected cache hit ratio.  We
2568             * also add on a cpu_operator_cost to account for a cache lookup. This

(gdb) bt
#0  0x00007f3a39aa154c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f3a39a54d06 in raise () from /lib64/libc.so.6
#2  0x00007f3a39a287f3 in abort () from /lib64/libc.so.6
#3  0x0000000000b6ff2c in ExceptionalCondition (conditionName=0xd28c28 "hit_ratio >= 0 && hit_ratio <= 1.0", fileName=0xd289a4 "costsize.c", lineNumber=2564) at assert.c:66
#4  0x000000000084d583 in cost_memoize_rescan (root=0x2e95748, mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:2564
#5  0x0000000000850831 in cost_rescan (root=0x2e95748, path=0x30aece8, rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at costsize.c:4350
#6  0x000000000084e333 in initial_cost_nestloop (root=0x2e95748, workspace=0x7ffd721412d0, jointype=JOIN_INNER, outer_path=0x3090058, inner_path=0x30aece8, extra=0x7ffd72141500) at costsize.c:2978
#7  0x0000000000860f58 in try_partial_nestloop_path (root=0x2e95748, joinrel=0x30ae158, outer_path=0x3090058, inner_path=0x30aece8, pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:887
#8  0x0000000000862a64 in consider_parallel_nestloop (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:2083
#9  0x000000000086273d in match_unsorted_outer (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:1940
#10 0x00000000008600f0 in add_paths_to_joinrel (root=0x2e95748, joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390, jointype=JOIN_INNER, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at joinpath.c:296
#11 0x0000000000864d10 in populate_joinrel_with_paths (root=0x2e95748, rel1=0x308f428, rel2=0x2eac390, joinrel=0x30ae158, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at joinrels.c:925
#12 0x00000000008649e1 in make_join_rel (root=0x2e95748, rel1=0x308f428, rel2=0x2eac390) at joinrels.c:776
#13 0x0000000000863ec1 in make_rels_by_clause_joins (root=0x2e95748, old_rel=0x308f428, other_rels_list=0x3088ed0, other_rels=0x3088ee8) at joinrels.c:312
#14 0x000000000086399a in join_search_one_level (root=0x2e95748, level=3) at joinrels.c:123
#15 0x00000000008463f8 in standard_join_search (root=0x2e95748, levels_needed=4, initial_rels=0x3088ed0) at allpaths.c:3454
#16 0x000000000084636d in make_rel_from_joinlist (root=0x2e95748, joinlist=0x306b4f8) at allpaths.c:3385
#17 0x0000000000841548 in make_one_rel (root=0x2e95748, joinlist=0x306b4f8) at allpaths.c:229
#18 0x00000000008806a9 in query_planner (root=0x2e95748, qp_callback=0x886bcb <standard_qp_callback>, qp_extra=0x7ffd72141960) at planmain.c:278
#19 0x0000000000882f5f in grouping_planner (root=0x2e95748, tuple_fraction=0) at planner.c:1495
#20 0x000000000088268c in subquery_planner (glob=0x2e95348, parse=0x2e90e98, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at planner.c:1064
#21 0x0000000000880cdb in standard_planner (parse=0x2e90e98,
    query_string=0x2e3a0e8 "explain\nselect sum(lo_revenue), d_year, p_brand1\n  from lineorder, date1, part, supplier\n  where lo_orderdate = d_datekey\n    and lo_partkey = p_partkey\n    and lo_suppkey = s_suppkey\n    and p_brand1"..., cursorOptions=2048,
    boundParams=0x0) at planner.c:413

I tracked the behavior of estimate_num_groups() using gdb line-by-line to observe how 'input_rows' is changed
and how it affects the result value.
According to the call trace, the problematic estimate_num_groups() invocation is called with "input_rows=3251872.916666667",
then it was rounded up to 3251873 by the clamp_row_est(). Eventually, its result value was calculated larger than the upper
limit, so the return value was suppressed by 3251873, but it is a tiny bit larger than the input value!

Back to the cost_memoize_rescan().
The hit_ratio is calculated as follows:

    hit_ratio = ((calls - ndistinct) / calls) *
        (est_cache_entries / Max(ndistinct, est_cache_entries));


The "calls" is the "input_rows" above, and "ndistinct"  is the return value of the estimate_num_groups().
What happen if "ndistinct" is a tiny bit larger than "calls"?
In the results, the "hit_ratio" is calculated as a very small negative value, then it was terminated by Assert().

How do we fix the logic? Please some ideas.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>

Re: assertion failure at cost_memoize_rescan()

От
Tomas Vondra
Дата:
On 6/14/24 14:54, Kohei KaiGai wrote:
> ...
>
> I tracked the behavior of estimate_num_groups() using gdb line-by-line to
> observe how 'input_rows' is changed
> and how it affects the result value.
> According to the call trace, the problematic estimate_num_groups()
> invocation is called with "input_rows=3251872.916666667",
> then it was rounded up to 3251873 by the clamp_row_est(). Eventually, its
> result value was calculated larger than the upper
> limit, so the return value was suppressed by 3251873, but it is a tiny bit
> larger than the input value!
> 
> Back to the cost_memoize_rescan().
> The hit_ratio is calculated as follows:
> 
>     hit_ratio = ((calls - ndistinct) / calls) *
>         (est_cache_entries / Max(ndistinct, est_cache_entries));
> 
> The "calls" is the "input_rows" above, and "ndistinct"  is the return value
> of the estimate_num_groups().
> What happen if "ndistinct" is a tiny bit larger than "calls"?
> In the results, the "hit_ratio" is calculated as a very small negative
> value, then it was terminated by Assert().
> 
> How do we fix the logic? Please some ideas.
> 

Interesting. Seems like a bug due to the two places clamping the values
inconsistently. It probably does not matter in other contexts because we
don't subtract the values like this, but here it triggers the assert.

I guess the simplest fix would be to clamp "calls" the same way before
calculating hit_ratio. That makes the ">= 0" part of the assert somewhat
pointless, though.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: assertion failure at cost_memoize_rescan()

От
David Rowley
Дата:
On Mon, 17 Jun 2024 at 10:23, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Interesting. Seems like a bug due to the two places clamping the values
> inconsistently. It probably does not matter in other contexts because we
> don't subtract the values like this, but here it triggers the assert.
>
> I guess the simplest fix would be to clamp "calls" the same way before
> calculating hit_ratio. That makes the ">= 0" part of the assert somewhat
> pointless, though.

"calls" comes from the value passed as the final parameter in
create_memoize_path().

There's really only one call to that function and that's in get_memoize_path().

return (Path *) create_memoize_path(root,
                                            innerrel,
                                            inner_path,
                                            param_exprs,
                                            hash_operators,
                                            extra->inner_unique,
                                            binary_mode,
                                            outer_path->rows);

It would be good to know what type of Path outer_path is.  Normally
we'll clamp_row_est() on that field. I suspect we must have some Path
type that isn't doing that.

KaiGai-san,  what type of Path is outer_path?

David

David



Re: assertion failure at cost_memoize_rescan()

От
Kohei KaiGai
Дата:
2024年6月17日(月) 8:27 David Rowley <dgrowleyml@gmail.com>:
>
> On Mon, 17 Jun 2024 at 10:23, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > Interesting. Seems like a bug due to the two places clamping the values
> > inconsistently. It probably does not matter in other contexts because we
> > don't subtract the values like this, but here it triggers the assert.
> >
> > I guess the simplest fix would be to clamp "calls" the same way before
> > calculating hit_ratio. That makes the ">= 0" part of the assert somewhat
> > pointless, though.
>
> "calls" comes from the value passed as the final parameter in
> create_memoize_path().
>
> There's really only one call to that function and that's in get_memoize_path().
>
> return (Path *) create_memoize_path(root,
>                                             innerrel,
>                                             inner_path,
>                                             param_exprs,
>                                             hash_operators,
>                                             extra->inner_unique,
>                                             binary_mode,
>                                             outer_path->rows);
>
> It would be good to know what type of Path outer_path is.  Normally
> we'll clamp_row_est() on that field. I suspect we must have some Path
> type that isn't doing that.
>
> KaiGai-san,  what type of Path is outer_path?
>
It is CustomPath with rows = 3251872.916666667.
(I'm not certain whether the non-integer value in the estimated rows
is legal or not.)

According to the crash dump, try_partial_nestloop_path() takes this two paths,

(gdb) up
#7  0x0000000000860fc5 in try_partial_nestloop_path (root=0x133a968,
joinrel=0x15258d8, outer_path=0x1513c98, inner_path=0x15264c8,
pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffc494cddf0) at
joinpath.c:887
/home/kaigai/source/pgsql-16/src/backend/optimizer/path/joinpath.c:887:30713:beg:0x860fc5

(gdb) p *(CustomPath *)outer_path
$13 = {path = {type = T_CustomPath, pathtype = T_CustomScan, parent =
0x1513058, pathtarget = 0x1513268, param_info = 0x0, parallel_aware =
true, parallel_safe = true, parallel_workers = 2, rows =
3251872.916666667, startup_cost = 41886.752500000002,
    total_cost = 12348693.488611111, pathkeys = 0x0}, flags = 4,
custom_paths = 0x1514788, custom_private = 0x1514ee8, methods =
0x7f45211feca0 <gpujoin_path_methods>}

(gdb) p *(MemoizePath *)inner_path
$14 = {path = {type = T_MemoizePath, pathtype = T_Memoize, parent =
0x14dc800, pathtarget = 0x14dca10, param_info = 0x150d8a8,
parallel_aware = false, parallel_safe = true, parallel_workers = 0,
rows = 1, startup_cost = 0.44500000000000001,
    total_cost = 8.4284913207446568, pathkeys = 0x150d5c8}, subpath =
0x150d148, hash_operators = 0x1526428, param_exprs = 0x1526478,
singlerow = true, binary_mode = false, calls = 3251872.916666667,
est_entries = 3251873}

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



Re: assertion failure at cost_memoize_rescan()

От
David Rowley
Дата:
On Tue, 18 Jun 2024 at 14:23, Kohei KaiGai <kaigai@heterodb.com> wrote:
>
> 2024年6月17日(月) 8:27 David Rowley <dgrowleyml@gmail.com>:
> > It would be good to know what type of Path outer_path is.  Normally
> > we'll clamp_row_est() on that field. I suspect we must have some Path
> > type that isn't doing that.
> >
> > KaiGai-san,  what type of Path is outer_path?
> >
> It is CustomPath with rows = 3251872.916666667.

I suspected this might have been a CustomPath.

> (I'm not certain whether the non-integer value in the estimated rows
> is legal or not.)

I guess since it's not documented that Path.rows is always clamped,
it's probably bad to assume that it is.

Since clamp_row_est() will ensure the value is clamped >= 1.0 && <=
MAXIMUM_ROWCOUNT (which ensures non-zero), I tried looking around the
codebase for anything that divides by Path.rows to see if we ever
assume that we can divide without first checking if Path.rows != 0.
Out of the places I saw, it seems we do tend to code things so that we
don't assume the value has been clamped.  E.g.
adjust_limit_rows_costs() does if (*rows < 1) *rows = 1;

I think the best solution is to apply the attached.  I didn't test,
but it should fix the issue you reported and also ensure that
MemoizePath.calls is never zero, which would also cause issues in the
hit_ratio calculation in cost_memoize_rescan().

David

Вложения

Re: assertion failure at cost_memoize_rescan()

От
Richard Guo
Дата:
On Tue, Jun 18, 2024 at 10:53 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Out of the places I saw, it seems we do tend to code things so that we
> don't assume the value has been clamped.  E.g.
> adjust_limit_rows_costs() does if (*rows < 1) *rows = 1;

Agreed.  In costsize.c I saw a few instances where we have

    /* Protect some assumptions below that rowcounts aren't zero */
    if (inner_path_rows <= 0)
        inner_path_rows = 1;

> I think the best solution is to apply the attached.  I didn't test,
> but it should fix the issue you reported and also ensure that
> MemoizePath.calls is never zero, which would also cause issues in the
> hit_ratio calculation in cost_memoize_rescan().

+1.

Thanks
Richard



Re: assertion failure at cost_memoize_rescan()

От
David Rowley
Дата:
On Tue, 18 Jun 2024 at 15:14, Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Tue, Jun 18, 2024 at 10:53 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > I think the best solution is to apply the attached.  I didn't test,
> > but it should fix the issue you reported and also ensure that
> > MemoizePath.calls is never zero, which would also cause issues in the
> > hit_ratio calculation in cost_memoize_rescan().
>
> +1.

Thanks for looking. Pushed.

David