Обсуждение: hist boundary duplicates bug in head and 8.3
For heavy tailed distributions, it is possible for analyze to
duplicate histogram boundaries.
Here is the output from a test against HEAD; I've attached the test data.
=# create table bug(f float);
COPY 100000
=# copy bug from '/tmp/test_data.txt';
COPY 100000
=# analyze bug;
ANALYZE
=# select histogram_bounds from pg_stats where tablename='bug';
histogram_bounds
------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------
--------
{34,34,34,34,34,34,34,36,36,36,36,36,37,37,37,37,37,38,38,38,38,39,39,39,39,40,40,40,41,41,41,41,42,42,42,43,43,44,44,44,45,45,45,46,46,46,47,47,48,48,48,4
9,49,50,50,51,51,52,53,53,54,55,55,56,57,58,58,59,60,61,62,63,64,65,67,68,70,72,74,76,79,81,83,86,89,91,94,98,101,105,110,117,122,131,139,152,171,202,236,30
8,1457}
(1 row)
Analyze assumes that if a value has a sample count less than
samplesize/num_buckets, ( maxmincount near line 2170 in
commands/analyze.c) then it is safe to include it in a histogram.
However, because the histogram only contains non mcv's, this is
incorrect.
As far as I can see, there are 4 solutions:
1) track all of the distinct values
This wouldn't be *too* expensive in analyze, especially considering we
are tracking all of the sampled values as it is. However, this opens
up the possibility of having huge mcv's lists in the worst case.To see
this, consider a distribution such that the most common value was 20%
of the table, the next mcv was 20% of the remaining entries, etc.
Clearly, for stats targets greater than 5, every value in the table
would overrun a histogram boundary, leading to an mcv list that
contained every distinct value in the sample.
2) reduce number_of_bins if values exist with frequency greater than 1/nbins
This would fix the bug, but at the cost of reducing the utility of the
histogram ( it would introduce a large skew to the ndistinct
distribution, which is assumed to be uniform over non-mcvs ).
3) use variable width histogram bins over all values.
This is probably the cleanest solution, but the most invasive.
4) Fix the binary search in ineqsel to correctly find the boundaries,
even with duplicates
This would also be relatively clean, but are the hist boundaries
assumption of being strictly increasing being satisfied anywhere else
besides ineqsel?
I've attached a patch that is a compromise between 1 and 2. It puts a
hard limit on the number of mcv's at 2x the stats target, and then, if
there are still values with too high a frequency, it reduces the
number of histogram buckets.
-Nathan
			
		Вложения
"Nathan Boley" <npboley@gmail.com> writes:
> For heavy tailed distributions, it is possible for analyze to
> duplicate histogram boundaries.
I don't think this is a bug.  You've got values that didn't make it into
the MCV list, but nonetheless occupy multiple buckets' worth of space in
the remainder of the distribution.  They *should* appear multiple times
in the histogram.  If they didn't, the histogram would be understating
their frequency.
        regards, tom lane
			
		>> For heavy tailed distributions, it is possible for analyze to >> duplicate histogram boundaries. > > I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values* that have a frequency of at least 1/K where K is theintended* number of histogram bins; such values might otherwise cause us to* emit duplicate histogram bin boundaries.*/ If this is expected, I'm also not sure what the use of maxmincount in analyze is... Thanks for the response, Nathan
"Nathan Boley" <npboley@gmail.com> writes:
>> I don't think this is a bug.
> hmmm... Well, I assumed it was a bug from a comment in analyze.
> From ( near ) line 2130 in analyze.c
>  * least 2 instances in the sample.  Also, we won't suppress values
>  * that have a frequency of at least 1/K where K is the intended
>  * number of histogram bins; such values might otherwise cause us to
>  * emit duplicate histogram bin boundaries.
That's talking about a case where we have a choice whether to include a
value in the MCV list or not.  Once the MCV list is maxed out, we can't
do anything to avoid duplicates.
        regards, tom lane
			
		On Tue, 2009-01-06 at 18:40 -0500, Tom Lane wrote: > "Nathan Boley" <npboley@gmail.com> writes: > >> I don't think this is a bug. > > > hmmm... Well, I assumed it was a bug from a comment in analyze. > > > From ( near ) line 2130 in analyze.c > > > * least 2 instances in the sample. Also, we won't suppress values > > * that have a frequency of at least 1/K where K is the intended > > * number of histogram bins; such values might otherwise cause us to > > * emit duplicate histogram bin boundaries. > > That's talking about a case where we have a choice whether to include a > value in the MCV list or not. Once the MCV list is maxed out, we can't > do anything to avoid duplicates. Surely the most important point in the OP was that ineqsel does not correctly binary search in the presence of duplicates. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
> Surely the most important point in the OP was that ineqsel does not > correctly binary search in the presence of duplicates. > It would have been if I were correct :-( . Looking at it again, that was from a bug in my code. Thanks for your time, and sorry about the noise. -Nathan