>> OK, I'll bite. How do we decide where to put the cutoff? If we make
>> it too high, it will have a negative effect on join selectivity
>> estimates; if it's too low, it won't really address the problem we're
>> trying to fix. I randomly propose p = 0.001, which should limit
>> eqjoinsel() to about a million equality tests in the worst case. In
>> the synthetic example we were just benchmarking, that causes the
>> entire MCV array to be tossed out the window (which feels about
>> right).
>
> Yeah. One idle thought I had was that maybe the cutoff needs to
> consider both probabilities: if the high-frequency MCVs on one side
> chance to match to not-so-high-frequency MCVs on the other, you
> would like to know about that. As long as we keep the lists in
> frequency order, it seems easy to implement this: for each MCV
> examined by the outer loop, you run the inner loop until the product of
> the outer and current inner frequency drops below whatever your
> threshold is. This doesn't immediately suggest what the threshold
I had this idle thought too, but I didn't write it down because...
> ought to be, but it seems like it ought to be possible to determine
> that given a desired maximum error in the overall estimate. I'm also
> not very clear on what the "total frequency" computations (matchfreq2
> and unmatchfreq2 in the current code) ought to look like if we are using
> a variable subset of the inner list.
...of this exact concern, which I think is an insurmountable problem.
If you don't consider some of the MCVs AT ALL, I think you can add
their frequencies back in to otherfreq{1,2} and go home, but if you
consider them for some elements of the other list but not all, I'm not
sure there's an appropriate way to proceed.
...Robert