Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From Quan Zongliang
Subject Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date
Msg-id 83d056c0-c3eb-6ef7-bf3a-7e461fcde74d@yeah.net
Whole thread Raw message
In response to Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers

On 2023/6/16 23:39, Tomas Vondra wrote:
> 
> 
> On 6/16/23 11:25, Quan Zongliang wrote:
>>
>> We have a small table with only 23 rows and 21 values.
>>
>> The resulting MCV and histogram is as follows
>> stanumbers1 | {0.08695652,0.08695652}
>> stavalues1  | {v1,v2}
>> stavalues2  |
>> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>>
>> An incorrect number of rows was estimated when HashJoin was done with
>> another large table (about 2 million rows).
>>
>> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
>> time=1.943..1528.983 rows=3902 loops=1)
>>
> 
> That's interesting. I wonder how the estimate gets this bad simply
> by skipping value entries with a single row in the sample, which means
> we know the per-value selectivity pretty well.
> 
> I guess the explanation has to be something strange happening when
> estimating the join condition selectivity, where we combine MCVs from
> both sides of the join (which has to be happening here, otherwise it
> would not matter what gets to the MCV).
> 
> It'd be interesting to know what's in the other MCV, and what are the
> other statistics for the attributes (ndistinct etc.).
> 
> Or even better, a reproducer SQL script that builds two tables and then
> joins them.
> 
The other table is severely skewed: most of its rows have no match in 
the small table. This special case is what causes the inaccurate row 
estimate.
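
For reference, here is a minimal sketch of a reproducer along these 
lines (the table names and the exact data distribution are my 
assumptions, not the actual workload):

CREATE TABLE small_t (id int);
-- 23 rows, 21 distinct values: two of the values appear twice
INSERT INTO small_t SELECT g FROM generate_series(1, 21) g;
INSERT INTO small_t VALUES (1), (2);

CREATE TABLE big_t (id int);
-- ~2 million rows, heavily skewed: only a small fraction of the
-- rows match anything in small_t, the rest are unmatched
INSERT INTO big_t
  SELECT CASE WHEN g % 500 = 0 THEN (g / 500) % 21 + 1 ELSE -g END
  FROM generate_series(1, 2000000) g;

ANALYZE small_t;
ANALYZE big_t;

EXPLAIN ANALYZE SELECT * FROM big_t JOIN small_t USING (id);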

>> The reason is that the MCV list of the small table excludes values
>> that occur only once. Putting them into the MCV statistics gives the
>> correct result.
>>
>> Using the conservative condition samplerows <= attstattarget doesn't
>> completely solve this problem, but it does handle this case.
>>
>> After modification we get statistics without histogram:
>> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
>> stavalues1  | {v1,v2, ... }
>>
>> And we have the right estimates:
>> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
>> time=1.447..1268.385 rows=3902 loops=1)
>>
> 
> I'm not against building a "complete" MCV, but I guess the case where
> (samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
> complete whenever we decide (ndistinct <= num_mcv)?
> 
> That would need to happen later, because we don't have the ndistinct
> estimate yet at this point - we'd have to do the loop a bit later (or
> likely twice).
> 
> FWIW the patch breaks the calculation of nmultiple (and thus likely the
> ndistinct estimate).
> 
It's not just small tables. A column whose values are nearly unique 
causes the same problem, because we exclude values that occur only 
once in the sample. samplerows <= num_mcv only solves one scenario.
Perhaps we should discard the (dups_cnt > 1) restriction?
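
To illustrate with a hypothetical table (the name is made up):

CREATE TABLE nearly_unique (v int);
INSERT INTO nearly_unique SELECT g FROM generate_series(1, 100000) g;
ANALYZE nearly_unique;

-- most_common_vals comes back NULL here: no value occurs more than
-- once in the sample, so nothing passes the (dups_cnt > 1) check
SELECT most_common_vals, n_distinct
  FROM pg_stats
 WHERE tablename = 'nearly_unique' AND attname = 'v';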

> 
> regards
> 



