Discussion: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics


Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Quan Zongliang
Date:
We have a small table with only 23 rows and 21 distinct values.

The resulting MCV and histogram are as follows:
stanumbers1 | {0.08695652,0.08695652}
stavalues1  | {v1,v2}
stavalues2  | 
{v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
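
These slots can also be inspected through the pg_stats view, which shows the MCV and histogram slots of pg_statistic in a readable form (the table and column names below are placeholders, not the real ones):

SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'small_tbl' AND attname = 'col';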

An incorrect number of rows was estimated when a HashJoin was done with
another, much larger table (about 2 million rows).

Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual 
time=1.943..1528.983 rows=3902 loops=1)

The reason is that the MCV of the small table excludes values that occur
only once. Putting them into the MCV statistics gives the correct result.

Using the conservative condition samplerows <= attstattarget doesn't
completely solve this problem, but it does handle this case.

After the modification we get statistics without a histogram:
stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
stavalues1  | {v1,v2, ... }

And we have the right estimates:
Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual 
time=1.447..1268.385 rows=3902 loops=1)


Regards,

--
Quan Zongliang
Beijing Vastdata Co., LTD
Attachments

Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Tomas Vondra
Date:

On 6/16/23 11:25, Quan Zongliang wrote:
> 
> We have a small table with only 23 rows and 21 values.
> 
> The resulting MCV and histogram is as follows
> stanumbers1 | {0.08695652,0.08695652}
> stavalues1  | {v1,v2}
> stavalues2  |
> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
> 
> An incorrect number of rows was estimated when HashJoin was done with
> another large table (about 2 million rows).
> 
> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
> time=1.943..1528.983 rows=3902 loops=1)
> 

That's interesting. I wonder how come the estimate gets this bad simply
by skipping value entries with a single row in the sample, which means
we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets to the MCV).

It'd be interesting to know what's in the other MCV, and what are the
other statistics for the attributes (ndistinct etc.).

Or even better, a reproducer SQL script that builds two tables and then
joins them.

> The reason is that the MCV of the small table excludes values with rows
> of 1. Put them in the MCV in the statistics to get the correct result.
> 
> Using the conservative samplerows <= attstattarget doesn't completely
> solve this problem. It can solve this case.
> 
> After modification we get statistics without histogram:
> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
> stavalues1  | {v,v2, ... }
> 
> And we have the right estimates:
> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
> time=1.447..1268.385 rows=3902 loops=1)
> 

I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Quan Zongliang
Date:

On 2023/6/16 23:39, Tomas Vondra wrote:
> 
> 
> On 6/16/23 11:25, Quan Zongliang wrote:
>>
>> We have a small table with only 23 rows and 21 values.
>>
>> The resulting MCV and histogram is as follows
>> stanumbers1 | {0.08695652,0.08695652}
>> stavalues1  | {v1,v2}
>> stavalues2  |
>> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>>
>> An incorrect number of rows was estimated when HashJoin was done with
>> another large table (about 2 million rows).
>>
>> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
>> time=1.943..1528.983 rows=3902 loops=1)
>>
> 
> That's interesting. I wonder how come the estimate gets this bad simply
> by skipping values entries with a single row in the sample, which means
> we know the per-value selectivity pretty well.
> 
> I guess the explanation has to be something strange happening when
> estimating the join condition selectivity, where we combine MCVs from
> both sides of the join (which has to be happening here, otherwise it
> would not matter what gets to the MCV).
> 
> It'd be interesting to know what's in the other MCV, and what are the
> other statistics for the attributes (ndistinct etc.).
> 
> Or even better, a reproducer SQL script that builds two tables and then
> joins them.
> 
The other table is severely skewed: most of its rows have no join
partner in the small table. This special case causes the inaccurate
cost calculation.
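
A rough sketch of the kind of data involved (table, column and value names are hypothetical; only the shape follows the numbers quoted in this thread):

-- small table: 23 rows, 21 distinct values, two values appear twice
CREATE TABLE small_tbl (id int);
INSERT INTO small_tbl SELECT generate_series(1, 21);
INSERT INTO small_tbl VALUES (1), (2);

-- large table: ~2 million rows, heavily skewed towards two values
-- that have no match in the small table, plus a few matching rows
CREATE TABLE big_tbl (id int);
INSERT INTO big_tbl
  SELECT CASE WHEN random() < 0.5 THEN 1001 ELSE 1002 END
  FROM generate_series(1, 2000000);
INSERT INTO big_tbl SELECT (g % 18) + 1 FROM generate_series(1, 4000) AS g;

ANALYZE small_tbl;
ANALYZE big_tbl;

EXPLAIN ANALYZE SELECT 1 FROM big_tbl b JOIN small_tbl s USING (id);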

>> The reason is that the MCV of the small table excludes values with rows
>> of 1. Put them in the MCV in the statistics to get the correct result.
>>
>> Using the conservative samplerows <= attstattarget doesn't completely
>> solve this problem. It can solve this case.
>>
>> After modification we get statistics without histogram:
>> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
>> stavalues1  | {v,v2, ... }
>>
>> And we have the right estimates:
>> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
>> time=1.447..1268.385 rows=3902 loops=1)
>>
> 
> I'm not against building a "complete" MCV, but I guess the case where
> (samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
> complete whenever we decide (ndistinct <= num_mcv)?
> 
> That would need to happen later, because we don't have the ndistinct
> estimate yet at this point - we'd have to do the loop a bit later (or
> likely twice).
> 
> FWIW the patch breaks the calculation of nmultiple (and thus likely the
> ndistinct estimate).
> 
It's not just small tables. If a column's values are nearly unique, the
same problem occurs, because we exclude values that occur only once.
samplerows <= num_mcv only solves one scenario.
Perhaps we should discard this (dups cnt > 1) restriction?

> 
> regards
> 




Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Tom Lane
Date:

Quan Zongliang <quanzongliang@yeah.net> writes:
> Perhaps we should discard this (dups cnt > 1) restriction?

That's not going to happen on the basis of one test case that you
haven't even shown us.  The implications of doing it are very unclear.
In particular, I seem to recall that there are bits of logic that
depend on the assumption that MCV entries always represent more than
one row.  The nmultiple calculation Tomas referred to may be failing
because of that, but I'm worried about there being other places.

Basically, you're proposing a rather fundamental change in the rules
by which Postgres has gathered statistics for decades.  You need to
bring some pretty substantial evidence to support that.  The burden
of proof is on you, not on the status quo.

            regards, tom lane



Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Quan Zongliang
Date:

On 2023/6/17 06:46, Tom Lane wrote:
> Quan Zongliang <quanzongliang@yeah.net> writes:
>> Perhaps we should discard this (dups cnt > 1) restriction?
> 
> That's not going to happen on the basis of one test case that you
> haven't even shown us.  The implications of doing it are very unclear.
> In particular, I seem to recall that there are bits of logic that
> depend on the assumption that MCV entries always represent more than
> one row.  The nmultiple calculation Tomas referred to may be failing
> because of that, but I'm worried about there being other places.
> 

The statistics for the other table look like this:
stadistinct | 6
stanumbers1 | {0.50096667,0.49736667,0.0012}
stavalues1  | {v22,v23,v5}

The values that appear twice in the small table (v1 and v2) do not
appear here. The true value of stadistinct is 18 instead of 6 (three
values of the small table do not appear here).
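
The estimated vs. actual ndistinct can be compared like this (table and column names are hypothetical):

SELECT n_distinct FROM pg_stats WHERE tablename = 'big_tbl' AND attname = 'id';
SELECT count(DISTINCT id) FROM big_tbl;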

When calculating the join selectivity (in eqjoinsel_inner):
if (nd2 > sslot2->nvalues)
   totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);

totalsel1 = 0
nd2 = 21
sslot2->nvalues = 2
unmatchfreq1 = 0.99990002016420476
otherfreq2 = 0.82608695328235626

result: totalsel1 = 0.043473913749706022
rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
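
The same arithmetic written out as a quick sanity check (numbers copied from above):

SELECT 0.99990002016420476 * 0.82608695328235626 / (21 - 2) AS totalsel1;
-- totalsel1 ≈ 0.04347; 0.04347 * 23 * 2000000 ≈ 1999800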


> Basically, you're proposing a rather fundamental change in the rules
> by which Postgres has gathered statistics for decades.  You need to
> bring some pretty substantial evidence to support that.  The burden
> of proof is on you, not on the status quo.
> 
>             regards, tom lane




Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Tomas Vondra
Date:
On 6/17/23 00:32, Quan Zongliang wrote:
> ...
>
> It's not just a small table. If a column's value is nearly unique. It
> also causes the same problem because we exclude values that occur only
> once. samplerows <= num_mcv just solves one scenario.
> Perhaps we should discard this (dups cnt > 1) restriction?
> 

But for larger tables we'll be unable to keep all the values in the MCV.
So I think this can only change things for tiny tables.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From
Tomas Vondra
Date:
On 6/17/23 02:02, Quan Zongliang wrote:
> 
> 
> On 2023/6/17 06:46, Tom Lane wrote:
>> Quan Zongliang <quanzongliang@yeah.net> writes:
>>> Perhaps we should discard this (dups cnt > 1) restriction?
>>
>> That's not going to happen on the basis of one test case that you
>> haven't even shown us.  The implications of doing it are very unclear.
>> In particular, I seem to recall that there are bits of logic that
>> depend on the assumption that MCV entries always represent more than
>> one row.  The nmultiple calculation Tomas referred to may be failing
>> because of that, but I'm worried about there being other places.
>>

I don't recall any logic that'd outright fail with MCVs containing
single-row groups, and I haven't noticed anything obvious in analyze.c
during a cursory search. Maybe the paper analyze_mcv_list builds on
makes some assumptions? Not sure.

However, compute_distinct_stats() doesn't seem to have such protection
against single-row MCV groups, so if that's wrong we kinda already have
the issue I think (admittedly, compute_distinct_stats is much less used
than compute_scalar_stats).

> 
> The statistics for the other table look like this:
> stadistinct | 6
> stanumbers1 | {0.50096667,0.49736667,0.0012}
> stavalues1  | {v22,v23,v5}
> 
> The value that appears twice in the small table (v1 and v2) does not
> appear here. The stadistinct's true value is 18 instead of 6 (three
> values in the small table do not appear here).
> 
> When calculating the selectivity:
> if (nd2 > sslot2->nvalues)
>   totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);
> 
> totalsel1 = 0
> nd2 = 21
> sslot2->nvalues = 2
> unmatchfreq1 = 0.99990002016420476
> otherfreq2 = 0.82608695328235626
> 
> result: totalsel1 = 0.043473913749706022
> rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
> 

Attached is a script reproducing this.

I think the fundamental issue here is that the most common element of
the large table - v22 (~50%) - is not in the tiny one at all. IIRC the
join estimation assumes the domain of one table is a subset of the
other. The values v22 / v23 violate that assumption, unfortunately.

Including all values in the small MCV fixes this, because then

  otherfreq1 = 0.0

and that simply eliminates the impact of values that didn't have a match
between the two MCV lists, which mitigates the violated assumption.

But once the small table gets too large for the MCV, this won't work
that well - it probably still helps a bit, as it makes otherfreq1 smaller.

Which doesn't mean it's useless, but it's likely a rare combination for
a table to be (and remain) smaller than the MCV list while the large
table contains values without a match in the smaller one (think foreign keys).

> 
>> Basically, you're proposing a rather fundamental change in the rules
>> by which Postgres has gathered statistics for decades.  You need to
>> bring some pretty substantial evidence to support that.  The burden
>> of proof is on you, not on the status quo.
>>

Right. It's a good example of a "quick hack" fixing one particular case
without considering the consequences for other cases too much. Good as a
starting point, but there's plenty of legwork to do.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments