Discussion: O(N^2) when building multi-column MCV lists
Hi,
When experimenting with multi-column MCV lists with the statistics target set
to a high value (e.g. 10k), I've realized there's an O(N^2) issue in
statext_mcv_build() when computing base frequencies.
We do this:
for (i = 0; i < nitems; i++)
{
    ...
    item->base_frequency = 1.0;

    /* for each dimension, compute the frequency of this item's value */
    for (j = 0; j < numattrs; j++)
    {
        int     count = 0;
        int     k;

        /* scan all groups, adding up counts of matches in dimension j */
        for (k = 0; k < ngroups; k++)
        {
            if (multi_sort_compare_dim(j, &groups[i], &groups[k], mss) == 0)
                count += groups[k].count;
        }
        ...
    }
}
That is, for each item on the MCV list we walk through all the groups
(once per dimension) to determine the total frequency of the item's value
in that dimension.

With many groups (which can easily happen with a high statistics target),
this gets very expensive: the loops above perform roughly
nitems * numattrs * ngroups comparisons, and both nitems and ngroups grow
with the statistics target, so the behavior is effectively quadratic.
I think the best solution here is to pre-compute the frequencies of the
values in all dimensions once, and then simply look them up instead of
looping through the groups over and over.
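To illustrate the idea, here is a minimal self-contained sketch. The types
and names (Group, NDIMS, dim_count, precompute_dim_counts) are simplified
stand-ins for illustration, not the actual structures from extended_stats.c.
It sorts the groups once per dimension and sums counts over runs of equal
values, so the per-item work later becomes a simple lookup:

#include <stdlib.h>

#define NDIMS 2                     /* number of dimensions, for the sketch */

typedef struct Group
{
    int         values[NDIMS];      /* value in each dimension */
    int         count;              /* number of sample rows in this group */
    int         dim_count[NDIMS];   /* precomputed per-dimension totals */
} Group;

static int      sort_dim;           /* dimension used by compare_by_dim() */
static Group   *sort_groups;        /* groups array used by the comparator */

/* qsort comparator: order group indexes by the value in sort_dim */
static int
compare_by_dim(const void *a, const void *b)
{
    int         va = sort_groups[*(const int *) a].values[sort_dim];
    int         vb = sort_groups[*(const int *) b].values[sort_dim];

    return (va > vb) - (va < vb);
}

/*
 * For every group and dimension, compute the total count of sample rows
 * with the same value in that dimension.
 */
static void
precompute_dim_counts(Group *groups, int ngroups)
{
    int        *idx = malloc(sizeof(int) * ngroups);

    sort_groups = groups;

    for (sort_dim = 0; sort_dim < NDIMS; sort_dim++)
    {
        int         i;

        /* sort group indexes by the value in this dimension */
        for (i = 0; i < ngroups; i++)
            idx[i] = i;
        qsort(idx, ngroups, sizeof(int), compare_by_dim);

        /* walk runs of equal values, summing counts once per run */
        i = 0;
        while (i < ngroups)
        {
            int         j = i;
            int         total = 0;

            while (j < ngroups &&
                   groups[idx[j]].values[sort_dim] ==
                   groups[idx[i]].values[sort_dim])
                total += groups[idx[j++]].count;

            /* assign the run total back to every group in the run */
            while (i < j)
                groups[idx[i++]].dim_count[sort_dim] = total;
        }
    }

    free(idx);
}

With something like that in place, the base frequency computation no longer
scans the groups at all; for each item it would just multiply the
precomputed per-dimension frequencies, roughly:

item->base_frequency = 1.0;
for (j = 0; j < numattrs; j++)
    item->base_frequency *= (double) groups[i].dim_count[j] / numrows;

(where numrows stands for the sample size). That replaces the
O(nitems * numattrs * ngroups) scan with an
O(numattrs * ngroups * log(ngroups)) sort plus O(nitems * numattrs) lookups.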
IMHO this is something we should fix for PG12, so I'll put that on the
open items list, and produce a fix shortly.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
Attached is a WIP/PoC fix addressing the O(N^2) behavior in ANALYZE with
high statistics target values. It needs more work, but it's good enough to
show some measurements.
For the benchmark, I've created a simple 2-column table, with an MCV list on
those two columns:
CREATE TABLE t (a int, b int);
CREATE STATISTICS s (mcv) ON a,b FROM t;
and then loaded data sets with different numbers of random combinations,
determined by the number of distinct values in each column. For example,
with 10 distinct values per column you get ~100 combinations.
INSERT INTO t
SELECT 10*random(), 10*random() FROM generate_series(1,3e6);
The 3M rows is picked because that's the sample size ANALYZE uses with
statistics target 10000 (the sample is 300 * target rows).

The results with different statistics targets look like this (rows = number
of distinct values per column, columns = statistics target, cells = ANALYZE
duration in milliseconds):
1) master

   values      100     1000     5000    10000
   ===========================================
       10      103      586     2419     3041
      100      116      789     4778     8934
     1000      116      690     3162   499748

2) patched

   values      100     1000     5000    10000
   ===========================================
       10      113      606     2460     3716
      100      143      711     3371     5231
     1000      156      994     3836     6002

3) comparison (patched / master)

   values      100     1000     5000    10000
   ===========================================
       10     110%     103%     102%     122%
      100     123%      90%      71%      59%
     1000     134%     144%     121%       1%
So clearly, the issue with large statistics targets is resolved (duration
drops from ~500s to just 6s), but there is a measurable regression in the
other cases. That needs more investigation and a fix before commit.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services