Discussion: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq


[BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson"
Date:
Hi hackers,

I think I might have stumbled upon a planner bug in
src/backend/utils/adt/selfuncs.c, while working on something else.

I've traced it down to bd3e3e9 (Ensure sanity of hash-join costing when
there are no MCV statistics.), which added a fallback to compute
mcv_freq. That surfaced an existing bug in a branch that was previously
not taken when mcv_freq was zero.

    if (avgfreq > 0.0 && *mcv_freq > avgfreq)
        estfract *= *mcv_freq / avgfreq;

Here, since this expression is a ratio, it seems important that both the
numerator and denominator are measured on the same basis.

mcv_freq is [freq of most common values in filtered relation]

However, avgfreq is currently [avg freq of all distinct data values in
raw relation]:

    /* Compute avg freq of all distinct data values in raw relation */
    avgfreq = (1.0 - stanullfrac) / ndistinct;

After avgfreq has been assigned, ndistinct is then adjusted, to account
for restriction clauses:

    if (vardata.rel && vardata.rel->tuples > 0)
    {
        ndistinct *= vardata.rel->rows / vardata.rel->tuples;
        ndistinct = clamp_row_est(ndistinct);
    }

The expression further down

    estfract *= *mcv_freq / avgfreq;

therefore divides two frequencies measured on different bases, when
I think they should share the same basis.

If we simply move the computation of avgfreq to after the adjustment
of ndistinct, avgfreq will be of the same base as mcv_freq.

This bug seems to sometimes cause the wrong table, the larger table, to
be hashed in a Hash Join, and the smaller table to be used for probing.

Here is a demo:

CREATE TABLE orders (
    order_id    bigint      NOT NULL,
    tracking_code bigint,
    region      int         NOT NULL,
    status      text        NOT NULL,
    order_data  text
);

CREATE TABLE shipments (
    shipment_id   bigint  NOT NULL,
    tracking_code bigint  NOT NULL,
    carrier       text    NOT NULL,
    ship_data     text
);

INSERT INTO orders (order_id, tracking_code, region, status, order_data)
SELECT
    g,
    CASE WHEN g <= 150000 THEN g ELSE NULL END,
    (g % 1000) + 1,
    CASE WHEN g <= 150000 THEN 'shipped' ELSE 'pending' END,
    'order-' || g
FROM generate_series(1, 200000) g;

INSERT INTO shipments (shipment_id, tracking_code, carrier, ship_data)
SELECT
    g,
    g,
    CASE (g % 3) WHEN 0 THEN 'FedEx' WHEN 1 THEN 'UPS' ELSE 'DHL' END,
    repeat('x', 200) || g::text
FROM generate_series(1, 150000) g;

CREATE INDEX ON orders (region);

ANALYZE orders;
ANALYZE shipments;

EXPLAIN ANALYZE
SELECT o.order_id, o.tracking_code, s.carrier, s.ship_data
FROM orders o
JOIN shipments s ON s.tracking_code = o.tracking_code
WHERE o.region = 42;

master (65707ed):
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=9138.25..14237.55 rows=149 width=229) (actual time=70.984..78.977 rows=150.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=6557, temp read=3987 written=4068
   ->  Parallel Hash Join  (cost=8138.25..13222.65 rows=88 width=229) (actual time=68.237..75.806 rows=75.00 loops=2)
         Hash Cond: (o.tracking_code = s.tracking_code)
         Buffers: shared hit=6557, temp read=3987 written=4068
         ->  Parallel Seq Scan on orders o  (cost=0.00..3188.59 rows=117 width=16) (actual time=0.076..12.053 rows=100.00 loops=2)
               Filter: (region = 42)
               Rows Removed by Filter: 99900
               Buffers: shared hit=1718
         ->  Parallel Hash  (cost=5464.00..5464.00 rows=62500 width=221) (actual time=53.216..53.217 rows=75000.00 loops=2)
               Buckets: 32768  Batches: 8  Memory Usage: 5120kB
               Buffers: shared hit=4839, temp written=4004
               ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.009..16.114 rows=75000.00 loops=2)
                     Buffers: shared hit=4839
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.350 ms
 Execution Time: 79.011 ms

0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:

                                                                    QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1578.75..7292.64 rows=149 width=229) (actual time=0.521..15.592 rows=150.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=5447 read=3
   ->  Hash Join  (cost=578.75..6277.74 rows=62 width=229) (actual time=0.667..12.342 rows=50.00 loops=3)
         Hash Cond: (s.tracking_code = o.tracking_code)
         Buffers: shared hit=5447 read=3
         ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.006..5.581 rows=50000.00 loops=3)
               Buffers: shared hit=4839
         ->  Hash  (cost=576.26..576.26 rows=199 width=16) (actual time=0.429..0.429 rows=150.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               Buffers: shared hit=608 read=3
               ->  Bitmap Heap Scan on orders o  (cost=5.84..576.26 rows=199 width=16) (actual time=0.083..0.396 rows=200.00 loops=3)
                     Recheck Cond: (region = 42)
                     Heap Blocks: exact=200
                     Buffers: shared hit=608 read=3
                     ->  Bitmap Index Scan on orders_region_idx  (cost=0.00..5.79 rows=199 width=0) (actual time=0.039..0.039 rows=200.00 loops=3)
                           Index Cond: (region = 42)
                           Index Searches: 3
                           Buffers: shared hit=8 read=3
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.371 ms
 Execution Time: 15.634 ms

The fix causes quite a lot of plans in
src/test/regress/expected/partition_join.out to change, which makes me a
bit worried I might have misunderstood something here. I haven't
verified if all the new plans are improvements, I just copied the result
file to the expected dir.

/Joel

Attachments

Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson"
Date:
On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> This bug seems to sometimes cause the wrong table, the larger table, to
> be hashed in a Hash Join, and the smaller table to be used for probing.
...
> The fix causes quite a lot of plans in
> src/test/regress/expected/partition_join.out to change, which makes me a
> bit worried I might have misunderstood something here. I haven't
> verified if all the new plans are improvements, I just copied the result
> file to the expected dir.

I've now investigated all the plan changes in

   src/test/regress/expected/partition_join.out

due to this fix, and now feel confident this is a bug,
and that the bug fix is correct.

To benchmark, I scaled up the data populated in partition_join.sql
by 10000x, e.g.:

-CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
-CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
-CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
-INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 599) i WHERE i % 2 = 0;
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (2500000);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (5000000) TO (6000000);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (2500000) TO (5000000);
+INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 5999999) i WHERE i % 2 = 0;

I then measured all queries that produced a different plan,
using EXPLAIN ANALYZE, here are the results:

joel=# SELECT COUNT(*), COUNT(*) FILTER (WHERE execution_time_head > execution_time_patch) AS faster, COUNT(*) FILTER (WHERE execution_time_head < execution_time_patch) AS slower FROM partition_join_bench;
 count | faster | slower
-------+--------+--------
    27 |     16 |     11
(1 row)

joel=# SELECT SUM(execution_time_head) AS total_execution_time_master, SUM(execution_time_patch) AS total_execution_time_patch, 1 - SUM(execution_time_patch)/SUM(execution_time_head) AS improvement FROM partition_join_bench;
 total_execution_time_master | total_execution_time_patch |      improvement
-----------------------------+----------------------------+------------------------
                    3577.826 |                   2892.280 | 0.19160965345995026030
(1 row)

joel=# SELECT execution_time_head, execution_time_patch, execution_time_head - execution_time_patch AS diff FROM partition_join_bench ORDER BY 3;
 execution_time_head | execution_time_patch |  diff
---------------------+----------------------+---------
             128.481 |              170.469 | -41.988
              59.927 |               84.131 | -24.204
              63.928 |               87.188 | -23.260
              57.315 |               78.443 | -21.128
              65.779 |               84.669 | -18.890
              65.456 |               81.128 | -15.672
              57.349 |               72.832 | -15.483
              63.383 |               77.267 | -13.884
              60.248 |               73.359 | -13.111
              61.173 |               67.388 |  -6.215
              67.052 |               69.475 |  -2.423
              79.368 |               78.874 |   0.494
              61.533 |               56.617 |   4.916
             108.781 |               92.301 |  16.480
             124.661 |               98.540 |  26.121
             146.671 |              117.109 |  29.562
             112.973 |               79.949 |  33.024
             119.745 |               82.465 |  37.280
             145.449 |               99.523 |  45.926
             239.796 |              166.813 |  72.983
             228.056 |              154.956 |  73.100
             225.025 |              145.068 |  79.957
             261.493 |              173.595 |  87.898
             245.301 |              157.054 |  88.247
             239.626 |              149.158 |  90.468
             243.589 |              147.587 |  96.002
             245.668 |              146.322 |  99.346
(27 rows)

In total the improvement is about 20%.

The faster queries are due to swapping the build/probe side,
so that the planner hashes the smaller filtered side instead
of the larger unfiltered side.

The slower queries are due to fixing the hash join cost estimate,
which makes hash join look cheaper than nested loop.
But at this data scale, nested loop is still a win for such cases.

I benchmarked just in case I'd missed something.
These results make me confident we have a bug,
and that this fix is correct.

/Joel



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson"
Date:
On Wed, Feb 25, 2026, at 13:19, Joel Jacobson wrote:
> On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
>> This bug seems to sometimes cause the wrong table, the larger table, to
>> be hashed in a Hash Join, and the smaller table to be used for probing.

To help reviewers, instead of relying on benchmark results,
I realized it's much better if we can actually prove the calculation
of skew_ratio is incorrect, and that it becomes correct with the fix.

I therefore added debugging here:

        if (avgfreq > 0.0 && *mcv_freq > avgfreq)
+       {
+               elog(DEBUG1, "mcv_freq=%g, avgfreq=%g, skew_ratio=%g",
+                        *mcv_freq, avgfreq, *mcv_freq / avgfreq);
                estfract *= *mcv_freq / avgfreq;
+       }

And created this minimal schema to prove the incorrect calculation:

CREATE TABLE orders (
    order_id      bigint  NOT NULL,
    tracking_code bigint,
    region        int     NOT NULL
);

CREATE TABLE shipments (
    shipment_id   bigint  NOT NULL,
    tracking_code bigint  NOT NULL
);

-- Skewed tracking_code distribution:
--   10% of rows share a single "hot" tracking_code (value 1),
--   the rest get unique codes.
-- This ensures mcv_freq > avgfreq, triggering the skew adjustment
-- even in the fixed version of estimate_hash_bucket_stats.
INSERT INTO orders
SELECT
    g,
    CASE
        WHEN g > 150000 THEN NULL
        WHEN g <= 15000  THEN 1
        ELSE g
    END,
    (g % 1000) + 1
FROM generate_series(1, 200000) g;

INSERT INTO shipments
SELECT g, CASE WHEN g <= 15000 THEN 1 ELSE g END
FROM generate_series(1, 150000) g;

CREATE INDEX ON orders (region);

ANALYZE orders;
ANALYZE shipments;

-- Ground truth of the skew_ratio for orders WHERE region = 42:
WITH base AS (
    SELECT
        count(*)::numeric                  AS total,
        count(tracking_code)::numeric      AS nonnull,
        count(DISTINCT tracking_code)      AS ndistinct
    FROM orders
    WHERE region = 42
),
mcv AS (
    SELECT count(*)::numeric AS mcv_count
    FROM orders
    WHERE region = 42 AND tracking_code IS NOT NULL
    GROUP BY tracking_code
    ORDER BY count(*) DESC
    LIMIT 1
)
SELECT
    round(mcv_count / total, 6)                        AS mcv_freq,
    round(nonnull / total / ndistinct, 6)              AS avgfreq,
    round((mcv_count / total) / (nonnull / total / ndistinct), 6) AS skew_ratio
FROM base, mcv;

 mcv_freq | avgfreq  | skew_ratio
----------+----------+------------
 0.075000 | 0.005515 |  13.600000
(1 row)

SET client_min_messages = DEBUG1;

SELECT *
FROM orders o
JOIN shipments s ON s.tracking_code = o.tracking_code
WHERE o.region = 42;

-- HEAD (65707ed):
DEBUG:  mcv_freq=0.0748333, avgfreq=8.69725e-06, skew_ratio=8604.25

-- 0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:
DEBUG:  mcv_freq=0.0738667, avgfreq=0.00871124, skew_ratio=8.47947

In HEAD, the skew_ratio for orders.tracking_code is wrongly estimated
as 8604.25, when the ground truth is 13.6; the fixed estimate of
8.47947 is a good approximation of that ground truth.

The fix only moves the computation of avgfreq from before the
ndistinct adjustment to after it:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 29fec65559..d19c4b5d96 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -4456,9 +4456,6 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
        else
                stanullfrac = 0.0;

-       /* Compute avg freq of all distinct data values in raw relation */
-       avgfreq = (1.0 - stanullfrac) / ndistinct;
-
        /*
         * Adjust ndistinct to account for restriction clauses.  Observe we are
         * assuming that the data distribution is affected uniformly by the
@@ -4473,6 +4470,9 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
                ndistinct = clamp_row_est(ndistinct);
        }

+       /* Compute avg freq of all distinct data values in the filtered relation */
+       avgfreq = (1.0 - stanullfrac) / ndistinct;
+
        /*
         * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
         * number of buckets is less than the expected number of distinct values;

/Joel



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tender Wang
Date:
Hi Joel,

Joel Jacobson <joel@compiler.org> wrote on Thu, Feb 26, 2026 at 04:32:
>
> On Wed, Feb 25, 2026, at 13:19, Joel Jacobson wrote:
> > On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> >> This bug seems to sometimes cause the wrong table, the larger table, to
> >> be hashed in a Hash Join, and the smaller table to be used for probing.
>
> To help reviewers, instead of relying on benchmark results,
> I realized it's much better if we can actually prove the calculation
> of skew_ratio is incorrect, and that it becomes correct with the fix.
>
> I therefore added debugging here:
>
>         if (avgfreq > 0.0 && *mcv_freq > avgfreq)
> +       {
> +               elog(DEBUG1, "mcv_freq=%g, avgfreq=%g, skew_ratio=%g",
> +                        *mcv_freq, avgfreq, *mcv_freq / avgfreq);
>                 estfract *= *mcv_freq / avgfreq;
> +       }
>
> And created this minimal schema to prove the incorrect calculation:
>
> CREATE TABLE orders (
>     order_id      bigint  NOT NULL,
>     tracking_code bigint,
>     region        int     NOT NULL
> );
>
> CREATE TABLE shipments (
>     shipment_id   bigint  NOT NULL,
>     tracking_code bigint  NOT NULL
> );
>
> -- Skewed tracking_code distribution:
> --   10% of rows share a single "hot" tracking_code (value 1),
> --   the rest get unique codes.
> -- This ensures mcv_freq > avgfreq, triggering the skew adjustment
> -- even in the fixed version of estimate_hash_bucket_stats.
> INSERT INTO orders
> SELECT
>     g,
>     CASE
>         WHEN g > 150000 THEN NULL
>         WHEN g <= 15000  THEN 1
>         ELSE g
>     END,
>     (g % 1000) + 1
> FROM generate_series(1, 200000) g;
>
> INSERT INTO shipments
> SELECT g, CASE WHEN g <= 15000 THEN 1 ELSE g END
> FROM generate_series(1, 150000) g;
>
> CREATE INDEX ON orders (region);
>
> ANALYZE orders;
> ANALYZE shipments;
>
> -- Ground truth of the skew_ratio for orders WHERE region = 42:
> WITH base AS (
>     SELECT
>         count(*)::numeric                  AS total,
>         count(tracking_code)::numeric      AS nonnull,
>         count(DISTINCT tracking_code)      AS ndistinct
>     FROM orders
>     WHERE region = 42
> ),
> mcv AS (
>     SELECT count(*)::numeric AS mcv_count
>     FROM orders
>     WHERE region = 42 AND tracking_code IS NOT NULL
>     GROUP BY tracking_code
>     ORDER BY count(*) DESC
>     LIMIT 1
> )
> SELECT
>     round(mcv_count / total, 6)                        AS mcv_freq,
>     round(nonnull / total / ndistinct, 6)              AS avgfreq,
>     round((mcv_count / total) / (nonnull / total / ndistinct), 6) AS skew_ratio
> FROM base, mcv;
>
>  mcv_freq | avgfreq  | skew_ratio
> ----------+----------+------------
>  0.075000 | 0.005515 |  13.600000
> (1 row)
>
> SET client_min_messages = DEBUG1;
>
> SELECT *
> FROM orders o
> JOIN shipments s ON s.tracking_code = o.tracking_code
> WHERE o.region = 42;
>
> -- HEAD (65707ed):
> DEBUG:  mcv_freq=0.0748333, avgfreq=8.69725e-06, skew_ratio=8604.25
>
> -- 0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:
> DEBUG:  mcv_freq=0.0738667, avgfreq=0.00871124, skew_ratio=8.47947
>
> In HEAD, the skew_ratio for orders.tracking_code is wrongly estimated
> to 8604.25, when the ground truth is 13.6, which the fixed estimate
> 8.47947, is a good approximation of.
>
> The fix only moves the computation of avgfreq from before the
> ndistinct adjustment, to after it:
>
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index 29fec65559..d19c4b5d96 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -4456,9 +4456,6 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
>         else
>                 stanullfrac = 0.0;
>
> -       /* Compute avg freq of all distinct data values in raw relation */
> -       avgfreq = (1.0 - stanullfrac) / ndistinct;
> -
>         /*
>          * Adjust ndistinct to account for restriction clauses.  Observe we are
>          * assuming that the data distribution is affected uniformly by the
> @@ -4473,6 +4470,9 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
>                 ndistinct = clamp_row_est(ndistinct);
>         }
>
> +       /* Compute avg freq of all distinct data values in the filtered relation */
> +       avgfreq = (1.0 - stanullfrac) / ndistinct;
> +
>         /*
>          * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
>          * number of buckets is less than the expected number of distinct values;
>
> /Joel
>
>

I think your analysis is correct.
After bd3e3e9, the mcv_freq is calculated from RelOptInfo.rows, which
accounts for restriction clauses.
But avgfreq is computed for the raw relation.

I tried another way: I replaced the original vardata.rel->rows with
vardata.rel->tuples.
I got the correct plan, and partition_join.sql had a lot of plan
diffs, as your first email said.
But vardata.rel->tuples may be zero for an empty relation.
So I agree with your fix.
I added Tom to the cc list. He may know more about this.

--
Thanks,
Tender Wang



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson"
Date:
On Thu, Feb 26, 2026, at 14:56, Tender Wang wrote:
> I think your analysis is correct.
> After bd3e3e9, the mcv_freq is calculated by RelOptInfo.rows, which
> accounts for restriction clauses.
> But avgfreq is for the raw relation.
...
> So I agree with your fix.
> I added Tom to the cc list. He may know more about this.

Many thanks for testing and reviewing.

Here is the commitfest entry, if you want to register as Reviewer
and/or think it's Ready for Committer:

https://commitfest.postgresql.org/patch/6528/

/Joel



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tom Lane
Date:
Tender Wang <tndrwang@gmail.com> writes:
> I added Tom to the cc list. He may know more about this.

Hmm, git blame says I originated this function 25 years ago
(f905d65ee), but I don't claim to remember that.

Looking at it now, though, I think that bd3e3e9e5 is indeed
wrong but not in the way Joel suggests.  The longstanding
way to compute mcv_freq is

            /*
             * The first MCV stat is for the most common value.
             */
            if (sslot.nnumbers > 0)
                *mcv_freq = sslot.numbers[0];

*This number is a fraction measured on the raw relation.*
(Necessarily so, because it's just a number computed by ANALYZE.)
Then bd3e3e9e5 added

            /*
             * If there are no recorded MCVs, but we do have a histogram, then
             * assume that ANALYZE determined that the column is unique.
             */
            if (vardata.rel && vardata.rel->rows > 0)
                *mcv_freq = 1.0 / vardata.rel->rows;

This is a pure thinko.  rel->rows is the estimated number
of filtered rows.  What I should have used is rel->tuples,
which is the estimated raw relation size, so as to get a
number that is commensurate with the longstanding way
of calculating mcv_freq.  Then that also matches up with
computing avgfreq on the raw relation.

So I think the correct fix is basically

-            if (vardata.rel && vardata.rel->rows > 0)
-                *mcv_freq = 1.0 / vardata.rel->rows;
+            if (vardata.rel && vardata.rel->tuples > 0)
+                *mcv_freq = 1.0 / vardata.rel->tuples;

and I wonder if that will wind up in reverting a lot of the plan
choice changes seen in bd3e3e9e5.

Joel, do you want to run this to ground, and in particular
see if that way of fixing it passes your sanity tests?

            regards, tom lane



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tender Wang
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote on Sat, Feb 28, 2026 at 03:15:
>
> Tender Wang <tndrwang@gmail.com> writes:
> > I added Tom to the cc list. He may know more about this.
>
> Hmm, git blame says I originated this function 25 years ago
> (f905d65ee), but I don't claim to remember that.
>
> Looking at it now, though, I think that bd3e3e9e5 is indeed
> wrong but not in the way Joel suggests.  The longstanding
> way to compute mcv_freq is
>
>             /*
>              * The first MCV stat is for the most common value.
>              */
>             if (sslot.nnumbers > 0)
>                 *mcv_freq = sslot.numbers[0];
>
> *This number is a fraction measured on the raw relation.*
> (Necessarily so, because it's just a number computed by ANALYZE.)
> Then bd3e3e9e5 added
>
>             /*
>              * If there are no recorded MCVs, but we do have a histogram, then
>              * assume that ANALYZE determined that the column is unique.
>              */
>             if (vardata.rel && vardata.rel->rows > 0)
>                 *mcv_freq = 1.0 / vardata.rel->rows;
>
> This is a pure thinko.  rel->rows is the estimated number
> of filtered rows.  What I should have used is rel->tuples,
> which is the estimated raw relation size, so as to get a
> number that is commensurate with the longstanding way
> of calculating mcv_freq.  Then that also matches up with
> computing avgfreq on the raw relation.
>
> So I think the correct fix is basically
>
> -            if (vardata.rel && vardata.rel->rows > 0)
> -                *mcv_freq = 1.0 / vardata.rel->rows;
> +            if (vardata.rel && vardata.rel->tuples > 0)
> +                *mcv_freq = 1.0 / vardata.rel->tuples;
>

Yeah, in my last email, I said I tried this way. But I worried that
rel->tuples may be zero for an empty relation.

> and I wonder if that will wind up in reverting a lot of the plan
> choice changes seen in bd3e3e9e5.

Yes, there are a lot of plan diffs in partition_join.sql.



--
Thanks,
Tender Wang



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson"
Date:
On Fri, Feb 27, 2026, at 20:15, Tom Lane wrote:
> Hmm, git blame says I originated this function 25 years ago
> (f905d65ee), but I don't claim to remember that.
>
> Looking at it now, though, I think that bd3e3e9e5 is indeed
> wrong but not in the way Joel suggests.  The longstanding
> way to compute mcv_freq is
>
>             /*
>              * The first MCV stat is for the most common value.
>              */
>             if (sslot.nnumbers > 0)
>                 *mcv_freq = sslot.numbers[0];
>
> *This number is a fraction measured on the raw relation.*
> (Necessarily so, because it's just a number computed by ANALYZE.)
> Then bd3e3e9e5 added
>
>             /*
>              * If there are no recorded MCVs, but we do have a histogram, then
>              * assume that ANALYZE determined that the column is unique.
>              */
>             if (vardata.rel && vardata.rel->rows > 0)
>                 *mcv_freq = 1.0 / vardata.rel->rows;
>
> This is a pure thinko.  rel->rows is the estimated number
> of filtered rows.  What I should have used is rel->tuples,
> which is the estimated raw relation size, so as to get a
> number that is commensurate with the longstanding way
> of calculating mcv_freq.  Then that also matches up with
> computing avgfreq on the raw relation.
>
> So I think the correct fix is basically
>
> -            if (vardata.rel && vardata.rel->rows > 0)
> -                *mcv_freq = 1.0 / vardata.rel->rows;
> +            if (vardata.rel && vardata.rel->tuples > 0)
> +                *mcv_freq = 1.0 / vardata.rel->tuples;

Nice catch! It turns out there are actually two bugs.
This is one of them, and I agree with the fix.

> and I wonder if that will wind up in reverting a lot of the plan
> choice changes seen in bd3e3e9e5.
>
> Joel, do you want to run this to ground, and in particular
> see if that way of fixing it passes your sanity tests?

Challenge accepted!
[...hours later...]
My conclusion is that we still need to move the avgfreq
computation, as I suggested.
The reason for this is that estfract is calculated as:

    estfract = 1.0 / ndistinct;

where ndistinct has been adjusted to account for restriction clauses.
Therefore, we must also use the adjusted avgfreq when adjusting
estfract here:

        /*
         * Adjust estimated bucketsize upward to account for skewed distribution.
         */
        if (avgfreq > 0.0 && *mcv_freq > avgfreq)
                estfract *= *mcv_freq / avgfreq;

What I first didn't understand was that mcv_freq is of course an
approximation of the MCV's frequency not only in the whole table, but
also within a restriction, under the uniform-restriction assumption.
So we should not adjust mcv_freq here, but we must use the
restriction-adjusted avgfreq value, since estfract is calculated from
the restriction-adjusted ndistinct value. Otherwise estfract will be
garbage.

Feel free to skip looking at

    demo-estimate_hash_bucket_stats.txt

if the explanation above is satisfactory. It only shows demo
queries to prove the buggy calculations, and that both fixes are needed.

The queries demonstrate the separate bugs and how the fixes repair
both plans. Query 1 demonstrates the mcv_freq bug. Query 2 demonstrates
the avgfreq bug.

/Joel



Attachments

Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> On Fri, Feb 27, 2026, at 20:15, Tom Lane wrote:
>> Joel, do you want to run this to ground, and in particular
>> see if that way of fixing it passes your sanity tests?

> Challenge accepted!

Thanks!

> [...hours later...]
> My conclusion is that we still need to move avgfreq
> computation, like I suggested.

Hmm ... doesn't this contradict your argument that avgfreq and
mcv_freq need to be calculated on the same basis?  Admittedly
that was just a heuristic, but I'm not seeing why it's wrong.

> The reason for this is that estfract is calculated as:
>     estfract = 1.0 / ndistinct;
> where ndistinct has been adjusted to account for restriction clauses.
> Therefore, we must also use the adjusted avgfreq when adjusting
> estfract here:

It feels like that might end up double-counting the effects of
the restriction clauses.

Anyway, we all seem to agree that s/rel->rows/rel->tuples/ is the
correct fix for a newly-introduced bug.  I'm inclined to proceed
by committing that fix (along with any regression test fallout)
and then investigating the avgfreq change as an independent matter.

            regards, tom lane



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tender Wang
Date:
Hi all,
> Yeah, in my last email, I said I tried this way. But I worried that
> rel->tuples may be zero for an empty relation.
In my previous email, I worried rel->tuples may be zero for an empty relation.
But here it's safe, because an empty relation has no entry in pg_statistic,
so we will not enter the if (HeapTupleIsValid(vardata.statsTuple)) branch.
Sorry for the noise.

Tom Lane <tgl@sss.pgh.pa.us> wrote on Sun, Mar 1, 2026 at 08:08:


> Hmm ... doesn't this contradict your argument that avgfreq and
> mcv_freq need to be calculated on the same basis?  Admittedly
> that was just a heuristic, but I'm not seeing why it's wrong.
>

Agree

> > The reason for this is that estfract is calculated as:
> >     estfract = 1.0 / ndistinct;
> > where ndistinct has been adjusted to account for restriction clauses.
> > Therefore, we must also use the adjusted avgfreq when adjusting
> > estfract here:
>
> It feels like that might end up double-counting the effects of
> the restriction clauses.
>
> Anyway, we all seem to agree that s/rel->rows/rel->tuples/ is the
> correct fix for a newly-introduced bug.  I'm inclined to proceed
> by committing that fix (along with any regression test fallout)
> and then investigating the avgfreq change as an independent matter.

+1


--
Thanks,
Tender Wang



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tom Lane
Date:
Tender Wang <tndrwang@gmail.com> writes:
> In my previous email, I worried rel->tuples may be zero for an empty relation.
> But here it's safe, because an empty relation has no tuples in pg_statistic.

Not sure about that --- it seems possible that after a mass delete,
VACUUM could update pg_class.reltuples to zero without touching
pg_statistic.  And I also don't remember whether the planner clamps
rel->tuples to be at least 1.  But it doesn't matter.  If rel->tuples
is zero, the if-test will prevent us from dividing by zero, and then
we'll leave *mcv_freq as zero meaning "unknown", which seems fine.
It's the same thing that would have happened before bd3e3e9e5.

            regards, tom lane



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
Tender Wang
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote on Sun, Mar 1, 2026 at 11:53:
>
> Tender Wang <tndrwang@gmail.com> writes:
> > In my previous email, I worried rel->tuples may be zero for an empty relation.
> > But here it's safe, because an empty relation has no tuples in pg_statistic.
>
> Not sure about that --- it seems possible that after a mass delete,
> VACUUM could update pg_class.reltuples to zero without touching
> pg_statistic.

Yeah, possibly.

>And I also don't remember whether the planner clamps
> rel->tuples to be at least 1.

As far as I know, the planner only clamps rel->rows to be at least 1;
it does not clamp rel->tuples.

>But it doesn't matter.  If rel->tuples
> is zero, the if-test will prevent us from dividing by zero, and then
> we'll leave *mcv_freq as zero meaning "unknown", which seems fine.
> It's the same thing that would have happened before bd3e3e9e5.

In my first email, I only replaced the rel->rows in:
  *mcv_freq = 1.0 / vardata.rel->rows;
I forgot to replace the rel->rows in the if-test, hence my concern.
My mistake.

--
Thanks,
Tender Wang



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
"Joel Jacobson"
Date:
On Sun, Mar 1, 2026, at 01:08, Tom Lane wrote:
> Hmm ... doesn't this contradict your argument that avgfreq and
> mcv_freq need to be calculated on the same basis?  Admittedly
> that was just a heuristic, but I'm not seeing why it's wrong.

No, I didn't make that argument in my last email. But you're right that
I was wrong about that in my initial email.

Initially, I wrongly thought that the fallback mcv_freq value was a
correct calculation of the most-common-value-frequency in the
restriction. In master, without the mcv_freq fix, it's not - it's a
buggy calculation. Here's why.

The 1/rows value doesn't mean anything useful. For a unique column (no
value is more common than any other), the true MCV frequency is simply
1/tuples — a property of the data, not the query. But 1/rows varies with
the restriction: a more selective WHERE clause gives fewer rows, which
inflates the "frequency" of each unique value. That makes no sense.

Consider a unique column on a 100k-row table:

No restriction:   1/rows = 1/100000 = 0.00001  (correct)
10% selectivity:  1/rows = 1/10000  = 0.0001   (10x inflated)
1% selectivity:   1/rows = 1/1000   = 0.001    (100x inflated)
0.1% selectivity: 1/rows = 1/100    = 0.01     (1000x inflated)

The correct value is always 1/tuples = 0.00001 regardless of the
restriction.  The inflated mcv_freq then feeds into the skew ratio
(mcv_freq/avgfreq), making the planner wildly overestimate bucket skew
for selective queries.
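The inflation table above can be checked numerically. This is an illustrative sketch (plain Python, not PostgreSQL code) comparing the buggy 1/rows "frequency" against the true 1/tuples value for a unique column:

```python
import math

tuples = 100_000
true_mcv_freq = 1.0 / tuples  # property of the data: 0.00001

# Ratio of the buggy 1/rows "frequency" to the true 1/tuples value,
# for progressively more selective restrictions:
inflation = {rows: (1.0 / rows) / true_mcv_freq
             for rows in (100_000, 10_000, 1_000, 100)}

assert math.isclose(inflation[100_000], 1.0)  # no restriction: correct
assert math.isclose(inflation[10_000], 10.0)  # 10x inflated
assert math.isclose(inflation[1_000], 100.0)  # 100x inflated
assert math.isclose(inflation[100], 1000.0)   # 1000x inflated
```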

Under the uniform restriction assumption, a base-table fraction is also
the value's fraction in the restricted output, so mcv_freq serves both
purposes without modification (with the mcv_freq fix for the fallback).

Ideally we'd have an estimate for mcv_freq in the restricted output, but
we don't. Thanks to the uniform restriction assumption, however, we can
(and should) use (the fixed) mcv_freq as is.

>> The reason for this is that estfract is calculated as:
>>     estfract = 1.0 / ndistinct;
>> where ndistinct has been adjusted to account for restriction clauses.
>> Therefore, we must also use the adjusted avgfreq when adjusting
>> estfract here:
>
> It feels like that might end up double-counting the effects of
> the restriction clauses.

Admittedly this is complex, given how many variables are involved.

Algebra to the rescue!

Code lines copy/pasted verbatim (master with just the mcv_freq fix):

mcv_freq = 1.0 / vardata.rel->tuples;
ndistinct = get_variable_numdistinct(&vardata, &isdefault);
avgfreq = (1.0 - stanullfrac) / ndistinct;
ndistinct *= vardata.rel->rows / vardata.rel->tuples;
estfract = 1.0 / ndistinct;
estfract *= *mcv_freq / avgfreq;

Helper-variables added:

mcv_freq = 1.0 / vardata.rel->tuples;
ndistinct_raw = get_variable_numdistinct(&vardata, &isdefault);
avgfreq_raw = (1.0 - stanullfrac) / ndistinct_raw;
ndistinct_adj = ndistinct_raw * (vardata.rel->rows / vardata.rel->tuples);
estfract = (1.0 / ndistinct_adj) * (mcv_freq / avgfreq_raw);

Expanding by replacing helper-variables with expressions:

estfract = (1.0 / (ndistinct_raw * (vardata.rel->rows / vardata.rel->tuples)))
         * ((1.0 / vardata.rel->tuples) / ((1.0 - stanullfrac) / ndistinct_raw));

a = estfract
b = ndistinct_raw
c = vardata.rel->rows
d = vardata.rel->tuples
e = stanullfrac

a = (1 / (b * (c / d))) * ((1 / d) / ((1 - e) / b))

Simplified using WolframAlpha:

a = 1 / ((1 - e) * c)

Plugging back the actual variables:

estfract = 1 / ((1 - stanullfrac) * vardata.rel->rows)

This makes no sense at all.

When vardata.rel->rows goes down (more selective query), estfract goes
up, i.e. the planner thinks the column is more skewed. But nothing about
the column's data changed, only the WHERE clause. A unique column has no
skew regardless of how many rows you select.
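The WolframAlpha simplification can also be sanity-checked numerically. A sketch (plain Python with arbitrary test values, not PostgreSQL code) showing that the buggy expression collapses to 1/((1 - e) * c) and is independent of ndistinct_raw:

```python
import math

b = 5000.0     # ndistinct_raw (arbitrary test value)
c = 1000.0     # vardata.rel->rows
d = 100000.0   # vardata.rel->tuples
e = 0.25       # stanullfrac

# Buggy combination: adjusted ndistinct but raw avgfreq
a = (1 / (b * (c / d))) * ((1 / d) / ((1 - e) / b))
assert math.isclose(a, 1 / ((1 - e) * c))

# Changing ndistinct_raw does not change the result at all:
b2 = 123.0
a2 = (1 / (b2 * (c / d))) * ((1 / d) / ((1 - e) / b2))
assert math.isclose(a, a2)
```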

Now, let's do the same algebraic exercise, plugging in the fixed
avgfreq_adj instead:

mcv_freq = 1.0 / vardata.rel->tuples;
ndistinct_raw = get_variable_numdistinct(&vardata, &isdefault);
ndistinct_adj = ndistinct_raw * (vardata.rel->rows / vardata.rel->tuples);
avgfreq_adj = (1.0 - stanullfrac) / ndistinct_adj;
estfract = (1.0 / ndistinct_adj) * (mcv_freq / avgfreq_adj);

estfract = (1.0 / (ndistinct_raw * (vardata.rel->rows / vardata.rel->tuples)))
         * ((1.0 / vardata.rel->tuples) / ((1.0 - stanullfrac) / (ndistinct_raw * (vardata.rel->rows / vardata.rel->tuples))));

a = (1 / (b * (c / d))) * ((1 / d) / ((1 - e) / (b * (c / d))))

Simplified using WolframAlpha:

a = 1 / (d * (1 - e))

Notice how both ndistinct_raw (b) and rows (c) cancel — the estimate no
longer depends on the restriction selectivity.

Plugging back the variables:

estfract = 1 / (vardata.rel->tuples * (1 - stanullfrac))

Given vardata.rel->tuples != 0, this can be rewritten as:

estfract = 1 / (vardata.rel->tuples * (1 - stanullfrac))
         = (1 / vardata.rel->tuples) * (1 / (1 - stanullfrac)) -- 1/(a*b) = (1/a)*(1/b), a,b != 0
         = (1 / vardata.rel->tuples) / (1 - stanullfrac)       -- a * (1/b) = a/b, b != 0
         = mcv_freq / (1 - stanullfrac)                        -- mcv_freq = 1.0 / vardata.rel->tuples

This makes much more sense.

Now it's just the MCV's share of non-null rows, no restriction factor.
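The same numeric sanity check for the fixed variant (plain Python with arbitrary test values, not PostgreSQL code), confirming that both ndistinct_raw and rows cancel:

```python
import math

b, c, d, e = 5000.0, 1000.0, 100000.0, 0.25
mcv_freq = 1.0 / d

# Fixed combination: avgfreq computed from the adjusted ndistinct
a = (1 / (b * (c / d))) * ((1 / d) / ((1 - e) / (b * (c / d))))
assert math.isclose(a, 1 / (d * (1 - e)))
assert math.isclose(a, mcv_freq / (1 - e))

# Varying the restriction selectivity (c) leaves the estimate unchanged:
c2 = 10.0
a2 = (1 / (b * (c2 / d))) * ((1 / d) / ((1 - e) / (b * (c2 / d))))
assert math.isclose(a, a2)
```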

> Anyway, we all seem to agree that s/rel->rows/rel->tuples/ is the
> correct fix for a newly-introduced bug.  I'm inclined to proceed
> by committing that fix (along with any regression test fallout)
> and then investigating the avgfreq change as an independent matter.

Yes, it seems fine to do them as separate fixes. The argument against
would be if the two separate bugs would somehow compensate for each
other, but I think they compound.

Assuming unique column, 100k rows, no nulls, 1% selectivity (rows=1000):

Master (both bugs):  estfract = 100000/1000² = 0.1      (10000x too high)
Only mcv_freq fix:   estfract = 1/1000       = 0.001    (100x too high)
Both fixes:          estfract = 1/100000     = 0.00001  (correct)
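The three scenario numbers can be recomputed directly (an illustrative sketch in plain Python, not PostgreSQL code):

```python
import math

tuples, rows = 100_000.0, 1_000.0          # unique column, 1% selectivity
ndistinct_raw = tuples                     # unique: ndistinct == tuples
ndistinct_adj = ndistinct_raw * (rows / tuples)  # = 1000

# Master (both bugs): mcv_freq = 1/rows, avgfreq from raw ndistinct
est_master = (1 / ndistinct_adj) * ((1 / rows) / (1 / ndistinct_raw))
assert math.isclose(est_master, 0.1)

# Only the mcv_freq fix: mcv_freq = 1/tuples, avgfreq still raw
est_mcv_fix = (1 / ndistinct_adj) * ((1 / tuples) / (1 / ndistinct_raw))
assert math.isclose(est_mcv_fix, 0.001)

# Both fixes: avgfreq from adjusted ndistinct
est_both = (1 / ndistinct_adj) * ((1 / tuples) / (1 / ndistinct_adj))
assert math.isclose(est_both, 0.00001)
```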

/Joel



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> On Sun, Mar 1, 2026, at 01:08, Tom Lane wrote:
>> Anyway, we all seem to agree that s/rel->rows/rel->tuples/ is the
>> correct fix for a newly-introduced bug.  I'm inclined to proceed
>> by committing that fix (along with any regression test fallout)
>> and then investigating the avgfreq change as an independent matter.

> Yes, it seems fine to do them as separate fixes. The argument against
> would be if the two separate bugs would somehow compensate for each
> other, but I think they compound.

Did that.  I was interested to see that this resulted in reverting
every single one of the plan changes induced by bd3e3e9e5, except the
one that involved applying a hash join despite having no statistics.
(That can be blamed on final_cost_hashjoin failing to disregard a
zero mcv_freq value, and really has nothing to do with
estimate_hash_bucket_stats at all.)

> Assuming unique column, 100k rows, no nulls, 1% selectivity (rows=1000):
>
> Master (both bugs):  estfract = 100000/1000² = 0.1      (10000x too high)
> Only mcv_freq fix:   estfract = 1/1000       = 0.001    (100x too high)
> Both fixes:          estfract = 1/100000     = 0.00001  (correct)

I don't buy that that's correct.  For a unique column, the result
should be basically 1/nbuckets or 1/ndistinct, whichever is larger;
in this case it's probably bounded by ndistinct=1000, so that 0.001
is the right answer.

Nonetheless, it's inarguable that the code's doing the wrong thing
with the examples you presented upthread.  After staring at it for
a long while I noticed that with your proposed patch, supposing
that stanullfrac = 0 and ndistinct is small, we compute both avgfreq
and the initial estfract as 1 over (corrected) ndistinct.  So the
adjustment corresponds to just setting estfract equal to mcv_freq
in this case.  However, if nbuckets decreases below ndistinct,
estfract rises and then we're scaling it up by some amount.  That
doesn't make a lot of sense: decreasing the number of buckets
doesn't change the number of MCV values.  Also, increasing stanullfrac
results in decreasing avgfreq and therefore making the correction
larger, which doesn't seem to make sense either.

What is more sensible I think is just to clamp estfract to be at least
mcv_freq always, dispensing with a whole bunch of the complication
here.  We actually don't need avgfreq at all, and therefore not
stanullfrac, and also the ad-hoc range clamp at the bottom no longer
seems necessary.  Interestingly, this also makes the logic more
nearly like the "isdefault" early exit:

        *bucketsize_frac = (Selectivity) Max(0.1, *mcv_freq);

which IIRC was added a lot later than the original logic.

So I end with the attached draft patch.  Very interestingly,
neither this version nor your lets-calculate-avgfreq-later
patch change any regression tests at all compared to git HEAD.
Maybe we should try to add a case that does change.

Aside: you could argue that failing to consider stanullfrac is wrong,
and maybe it is.  But the more I looked at this code the more
convinced I got that it was only partially accounting for nulls
anyway.  That seems like perhaps something to look into later.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index bf6bb624942..9f5178620ef 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -4373,8 +4373,8 @@ estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner,
  * exactly those that will be probed most often.  Therefore, the "average"
  * bucket size for costing purposes should really be taken as something close
  * to the "worst case" bucket size.  We try to estimate this by adjusting the
- * fraction if there are too few distinct data values, and then scaling up
- * by the ratio of the most common value's frequency to the average frequency.
+ * fraction if there are too few distinct data values, and then clamping to
+ * at least the bucket size implied by the most common value's frequency.
  *
  * If no statistics are available, use a default estimate of 0.1.  This will
  * discourage use of a hash rather strongly if the inner relation is large,
@@ -4394,9 +4394,7 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
 {
     VariableStatData vardata;
     double        estfract,
-                ndistinct,
-                stanullfrac,
-                avgfreq;
+                ndistinct;
     bool        isdefault;
     AttStatsSlot sslot;

@@ -4446,20 +4444,6 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
         return;
     }

-    /* Get fraction that are null */
-    if (HeapTupleIsValid(vardata.statsTuple))
-    {
-        Form_pg_statistic stats;
-
-        stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
-        stanullfrac = stats->stanullfrac;
-    }
-    else
-        stanullfrac = 0.0;
-
-    /* Compute avg freq of all distinct data values in raw relation */
-    avgfreq = (1.0 - stanullfrac) / ndistinct;
-
     /*
      * Adjust ndistinct to account for restriction clauses.  Observe we are
      * assuming that the data distribution is affected uniformly by the
@@ -4485,20 +4469,11 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
         estfract = 1.0 / ndistinct;

     /*
-     * Adjust estimated bucketsize upward to account for skewed distribution.
-     */
-    if (avgfreq > 0.0 && *mcv_freq > avgfreq)
-        estfract *= *mcv_freq / avgfreq;
-
-    /*
-     * Clamp bucketsize to sane range (the above adjustment could easily
-     * produce an out-of-range result).  We set the lower bound a little above
-     * zero, since zero isn't a very sane result.
+     * Clamp the bucketsize fraction to be not less than the MCV frequency,
+     * since whichever bucket the MCV values end up in will have at least that
+     * size.  This has no effect if *mcv_freq is still zero.
      */
-    if (estfract < 1.0e-6)
-        estfract = 1.0e-6;
-    else if (estfract > 1.0)
-        estfract = 1.0;
+    estfract = Max(estfract, *mcv_freq);

     *bucketsize_frac = (Selectivity) estfract;
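For readability, the patched logic can be modeled as a minimal sketch (illustrative values in plain Python, not PostgreSQL code): start from 1/nbuckets or 1/ndistinct, then clamp to at least mcv_freq, with no skew multiplier and no 1e-6..1.0 range clamp:

```python
def bucketsize_frac(ndistinct, nbuckets, mcv_freq):
    # Initial estimate: 1/nbuckets while buckets are scarcer than
    # distinct values, else 1/ndistinct.
    estfract = 1.0 / nbuckets if ndistinct > nbuckets else 1.0 / ndistinct
    # Whichever bucket the MCV lands in holds at least mcv_freq of the
    # rows; a zero ("unknown") mcv_freq leaves estfract untouched.
    return max(estfract, mcv_freq)

assert bucketsize_frac(1000.0, 65536.0, 0.0) == 0.001       # ndistinct-limited
assert bucketsize_frac(1e6, 65536.0, 0.0) == 1.0 / 65536.0  # nbuckets-limited
assert bucketsize_frac(1e6, 65536.0, 0.30) == 0.30          # MCV-dominated
```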


Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
"Joel Jacobson"
Date:
On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
> What is more sensible I think is just to clamp estfract to be at least
> mcv_freq always, dispensing with a whole bunch of the complication
> here.  We actually don't need avgfreq at all, and therefore not
> stanullfrac, and also the ad-hoc range clamp at the bottom no longer
> seems necessary.  Interestingly, this also makes the logic more
> nearly like the "isdefault" early exit:
>
>         *bucketsize_frac = (Selectivity) Max(0.1, *mcv_freq);
>
> which IIRC was added a lot later than the original logic.

Nice simplification.

> So I end with the attached draft patch.  Very interestingly,
> neither this version nor your lets-calculate-avgfreq-later
> patch change any regression tests at all compared to git HEAD.

LGTM.

> Maybe we should try to add a case that does change.

Yes, I think that would be good.
Attached patch adds a new test that does change.

/Joel
Attachments

Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
"Joel Jacobson"
Date:
On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
> Aside: you could argue that failing to consider stanullfrac is wrong,
> and maybe it is.  But the more I looked at this code the more
> convinced I got that it was only partially accounting for nulls
> anyway.  That seems like perhaps something to look into later.

How about adjusting estfract for the null fraction before clamping?

```diff
@@ -4461,20 +4473,27 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
    /*
     * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
     * number of buckets is less than the expected number of distinct values;
     * otherwise it is 1/ndistinct.
     */
    if (ndistinct > nbuckets)
        estfract = 1.0 / nbuckets;
    else
        estfract = 1.0 / ndistinct;

+   /*
+    * Adjust for null fraction.  NULL keys are not inserted into the hash
+    * table, but inner_path_rows in final_cost_hashjoin includes them, so we
+    * must discount estfract to compensate.
+    */
+   estfract *= (1.0 - stanullfrac);
+
    /*
     * Clamp the bucketsize fraction to be not less than the MCV frequency,
     * since whichever bucket the MCV values end up in will have at least that
     * size.  This has no effect if *mcv_freq is still zero.
     */
    estfract = Max(estfract, *mcv_freq);

    *bucketsize_frac = (Selectivity) estfract;

    ReleaseVariableStats(vardata);
```

Here is an extreme example where 98% of all hj_nullfrac_inner.val are NULL:

CREATE TABLE hj_nullfrac_inner (id int, val int);
INSERT INTO hj_nullfrac_inner
  SELECT g, CASE WHEN g <= 1000 THEN g ELSE NULL END
  FROM generate_series(1, 50000) g;
CREATE INDEX ON hj_nullfrac_inner(val);

CREATE TABLE hj_nullfrac_outer (id int, val int);
INSERT INTO hj_nullfrac_outer
  SELECT g, ((g-1) % 1000)+1
  FROM generate_series(1, 500000) g;

ANALYZE hj_nullfrac_inner;
ANALYZE hj_nullfrac_outer;

EXPLAIN ANALYZE
SELECT * FROM hj_nullfrac_outer o JOIN hj_nullfrac_inner i ON o.val = i.val;

Both master (HEAD) and v2 results in the same plan:

                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..20022.95 rows=499834 width=16) (actual time=0.016..289.314 rows=500000.00 loops=1)
   Buffers: shared hit=5212 read=1
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.449 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Memoize  (cost=0.30..0.32 rows=1 width=8) (actual time=0.000..0.000 rows=1.00 loops=500000)
         Cache Key: o.val
         Cache Mode: logical
         Estimates: capacity=1000 distinct keys=1000 lookups=500000 hit percent=99.80%
         Hits: 499000  Misses: 1000  Evictions: 0  Overflows: 0  Memory Usage: 106kB
         Buffers: shared hit=2999 read=1
         ->  Index Scan using hj_nullfrac_inner_val_idx on hj_nullfrac_inner i  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1.00 loops=1000)
               Index Cond: (val = o.val)
               Index Searches: 1000
               Buffers: shared hit=2999 read=1
 Planning:
   Buffers: shared hit=87 read=7
 Planning Time: 0.419 ms
 Execution Time: 303.107 ms
(18 rows)

With v2+stanullfrac adjustment, we get a Hash Join, that is faster:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1347.00..15436.61 rows=500161 width=16) (actual time=6.054..153.595 rows=500000.00 loops=1)
   Hash Cond: (o.val = i.val)
   Buffers: shared hit=2435
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.597 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=6.041..6.042 rows=1000.00 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 552kB
         Buffers: shared hit=222
         ->  Seq Scan on hj_nullfrac_inner i  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.004..3.016 rows=50000.00 loops=1)
               Buffers: shared hit=222
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.130 ms
 Execution Time: 167.903 ms
(14 rows)

Here is elog debugging comparing v2 vs v2+stanullfrac adjustment, for
the above example:

v2:
ndistinct=1003.000000
nbuckets=65536.000000
stanullfrac=0.979933
mcv_freq=0.000020
estfract before clamping=0.000997
estfract after clamping=0.000997

v2+stanullfrac adjustment:
ndistinct=972.000000
nbuckets=65536.000000
stanullfrac=0.980567
mcv_freq=0.000020
estfract before stanullfrac adjustment=0.001029
estfract after stanullfrac adjustment=0.000020
estfract after clamping=0.000020
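The elog numbers for the v2+stanullfrac variant can be reproduced arithmetically (an illustrative sketch in plain Python, not PostgreSQL code):

```python
stanullfrac = 0.980567
mcv_freq = 0.000020

# ndistinct=972 is well below nbuckets=65536, so estfract = 1/ndistinct
est_before = 1.0 / 972.0
assert round(est_before, 6) == 0.001029

# Discount by the non-null fraction
est_adjusted = est_before * (1.0 - stanullfrac)
assert round(est_adjusted, 6) == 0.000020

# Clamp to at least mcv_freq
est_final = max(est_adjusted, mcv_freq)
assert round(est_final, 6) == 0.000020
```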

/Joel
Attachments

Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
>> Aside: you could argue that failing to consider stanullfrac is wrong,
>> and maybe it is.  But the more I looked at this code the more
>> convinced I got that it was only partially accounting for nulls
>> anyway.  That seems like perhaps something to look into later.

> How about adjusting estfract for the null fraction before clamping?

This reminds me of the unfinished business at [1].  We really ought
to make it true that nulls never get into the hash table before
we assume that's so in costing.  One of the things I was thinking
was being overlooked is the possibility of lots of nulls bloating
whichever hash bucket they get put in --- but if they aren't put
into a bucket then it's not wrong to ignore them here.

(Strictly speaking, that's still not so with non-strict hash operators,
but those are so rare that I don't mind not accounting for them.)

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/3061845.1746486714@sss.pgh.pa.us



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
"Joel Jacobson"
Date:
On Tue, Mar 3, 2026, at 16:31, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
>>> Aside: you could argue that failing to consider stanullfrac is wrong,
>>> and maybe it is.  But the more I looked at this code the more
>>> convinced I got that it was only partially accounting for nulls
>>> anyway.  That seems like perhaps something to look into later.
>
>> How about adjusting estfract for the null fraction before clamping?
>
> This reminds me of the unfinished business at [1].  We really ought
> to make it true that nulls never get into the hash table before
> we assume that's so in costing.  One of the things I was thinking
> was being overlooked is the possibility of lots of nulls bloating
> whichever hash bucket they get put in --- but if they aren't put
> into a bucket then it's not wrong to ignore them here.
>
> (Strictly speaking, that's still not so with non-strict hash operators,
> but those are so rare that I don't mind not accounting for them.)
>
>             regards, tom lane
>
> [1] https://www.postgresql.org/message-id/flat/3061845.1746486714@sss.pgh.pa.us

Hmm, OK, so there are cases when we don't discard NULLs when we should
be able to? I was reading these lines in nodeHash.c and thought we would
always be discarding them when possible:

        if (!isnull)
        {
...
        }
        else if (node->keep_null_tuples)
        {
            /* null join key, but we must save tuple to be emitted later */
...
        }
        /* else we can discard the tuple immediately */

Thanks for the pointer to [1], I will dig into that thread, exciting!

/Joel



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> On Tue, Mar 3, 2026, at 16:31, Tom Lane wrote:
>> This reminds me of the unfinished business at [1].  We really ought
>> to make it true that nulls never get into the hash table before
>> we assume that's so in costing.

> Hmm, OK, so there are cases when we don't discard NULLs when we should
> be able to? I was reading these lines in nodeHash.c and thought we would
> always be discarding them when possible:

>         if (!isnull)
>         {
> ...
>         }
>         else if (node->keep_null_tuples)
>         {
>             /* null join key, but we must save tuple to be emitted later */
> ...
>         }
>         /* else we can discard the tuple immediately */

I'm confused ... that keep_null_tuples bit appears nowhere in HEAD,
but it does appear in the patch at [1].

Anyway, the short answer is that we discard NULLs if possible, but
it's not possible when doing an outer join that requires returning
null-extended rows from the hashed side.

I've now pushed the patch we were discussing before, and all that's
left to worry about (AFAIK) in estimate_hash_bucket_stats is its
handling of null join keys.  I'd prefer to get the other patch
in before worrying more about that.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/3061845.1746486714%40sss.pgh.pa.us



Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From
"Joel Jacobson"
Date:
On Wed, Mar 4, 2026, at 21:50, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> On Tue, Mar 3, 2026, at 16:31, Tom Lane wrote:
>>> This reminds me of the unfinished business at [1].  We really ought
>>> to make it true that nulls never get into the hash table before
>>> we assume that's so in costing.
>
>> Hmm, OK, so there are cases when we don't discard NULLs when we should
>> be able to? I was reading these lines in nodeHash.c and thought we would
>> always be discarding them when possible:
>
>>         if (!isnull)
>>         {
>> ...
>>         }
>>         else if (node->keep_null_tuples)
>>         {
>>             /* null join key, but we must save tuple to be emitted later */
>> ...
>>         }
>>         /* else we can discard the tuple immediately */
>
> I'm confused ... that keep_null_tuples bit appears nowhere in HEAD,
> but it does appear in the patch at [1].

Oh, sorry, I was looking at nodeHash.c with [1] applied.
I recalled seeing some `if (!isnull)` code, must have been this code:

                if (!isnull)
                    ExecParallelHashTableInsert(hashtable, slot, hashvalue);

> Anyway, the short answer is that we discard NULLs if possible, but
> it's not possible when doing an outer join that requires returning
> null-extended rows from the hashed side.

Thanks for explaining.

> I've now pushed the patch we were discussing before, and all that's
> left to worry about (AFAIK) in estimate_hash_bucket_stats is its
> handling of null join keys.

Nice!

> I'd prefer to get the other patch
> in before worrying more about that.

Makes sense.

>
>             regards, tom lane
>
> [1] 
> https://www.postgresql.org/message-id/flat/3061845.1746486714%40sss.pgh.pa.us

/Joel