Re: PATCH: adaptive ndistinct estimator v4

From: Jeff Janes
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date:
Msg-id: CAMkU=1yR=i5F1RnmJsB9C-iEbTE3J1sR+4mRbcPy6mwCotPxfQ@mail.gmail.com
In reply to: Re: PATCH: adaptive ndistinct estimator v4  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: PATCH: adaptive ndistinct estimator v4  (Robert Haas <robertmhaas@gmail.com>)
           Re: PATCH: adaptive ndistinct estimator v4  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Thu, Apr 30, 2015 at 6:20 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 04/30/15 22:57, Robert Haas wrote:
>> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>
>> So, I took a look at this today. It's interesting work, but it looks
>> more like a research project than something we can commit to 9.5. As
>> far as I can see, this still computes the estimate the way we do
>> today, but then also computes the estimate using this new method.
>> The estimate computed the new way isn't stored anywhere, so this
>> doesn't really change behavior, except for printing out a WARNING
>> comparing the values produced by the two estimators.
>
> I agree that this is not ready for 9.5 - it was meant as an experiment
> (hence printing the estimate in a WARNING, to make it easier to compare
> the value to the current estimator). Without that it'd be much more
> complicated to compare the old/new estimates, but you're right this is
> not suitable for commit.

With the warning it is very hard to correlate a discrepancy you do see with the column that is causing it, as the warnings don't include table or column names. (Assuming, of course, that you run it on a substantial database; if you just run it on a few toy cases, the warnings work well.)

If we want an explicitly experimental patch that we'd like people with interesting real-world databases to report back on, what kind of patch would it have to be to encourage that to happen?  Or are we never going to get such feedback no matter how friendly we make it?  Another problem is that you really need the gold standard (the true ndistinct) to compare the estimates to, and getting that is expensive (which is why we resort to sampling in the first place).  I don't think there is much to be done on that front other than to bite the bullet and just compute it--perhaps only for the tables that show discrepancies.

Some of the regressions I've seen are at least partly a bug:

+   /* find the 'm' value minimizing the difference */
+   for (m = 1; m <= total_rows; m += step)
+   {
+       double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows vigorously.  A simple cast to double before the multiplication fixes the first example I produced.  The estimate goes from 137,177 to 1,108,076.  The reality is 1,062,223.

Perhaps m should just be declared a double, as it is frequently used in double arithmetic.
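
To make that concrete, here is a minimal standalone sketch of the two possible fixes (my own illustration, not code from the patch; the values are made up) using the variable names from the excerpt above:

#include <stdio.h>

int
main(void)
{
    int     sample_rows = 30000;   /* e.g. a default ANALYZE sample */
    int     m = 500000;            /* a candidate 'm' large enough to overflow */
    double  k = 25000.0;           /* distinct values seen in the sample */

    /*
     * Buggy form from the patch: sample_rows * m is evaluated in 32-bit int
     * and overflows before the division ever happens:
     *
     *     double q = k / (sample_rows * m);
     */

    /* Fix 1: cast one operand to double, so the product is computed in double. */
    double  q1 = k / ((double) sample_rows * m);

    /* Fix 2: declare m as a double, so no cast is needed at each use. */
    double  m_d = (double) m;
    double  q2 = k / (sample_rows * m_d);

    printf("q1 = %g, q2 = %g\n", q1, q2);
    return 0;
}

Either way the quotient stays sensible for large m; declaring m a double from the start also avoids repeating the cast everywhere it is used.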

 
>> Leaving that aside, at some point, you'll say, "OK, there may be some
>> regressions left but overall I believe this is going to be a win in
>> most cases". It's going to be really hard for anyone, no matter how
>> smart, to figure out through code review whether that is true. So
>> committing this figures to be extremely frightening. It's just not
>> going to be reasonably possible to know what percentage of users are
>> going to be more happy after this change and what percentage are
>> going to be less happy.

> For every pair of estimators you can find cases where one of them is better than the other one. It's pretty much impossible to find an estimator that beats all other estimators on all possible inputs.
>
> There's no way to make this an improvement for everyone - it will produce worse estimates in some cases, and we have to admit that. If we think this is unacceptable, we're effectively stuck with the current estimator forever.

>> Therefore, I think that:
>>
>> 1. This should be committed near the beginning of a release cycle,
>> not near the end. That way, if there are problem cases, we'll have a
>> year or so of developer test to shake them out.

How effective will that be, though?  Will developers know or care whether ndistinct happened to get better or worse while they are working on other things?  I would think that problems will be found by focused testing, or during beta, and probably not by accidental discovery during the development cycle.  It can't hurt, but I don't know how much it will help.

 
>> 2. There should be a compatibility GUC to restore the old behavior.
>> The new behavior should be the default, because if we're not
>> confident that the new behavior will be better for most people, we
>> have no business installing it in the first place (plus few people
>> will try it). But just in case it turns out to suck for some people,
>> we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to perform worse only on a few data sets / columns. So making this a column-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would make comparing the estimators possible (just run the tests with the GUC set differently). Assuming we'll re-evaluate it at the end, and remove the GUC if possible.

I agree with the "experimental GUC".  That way, if hackers happen to see something suspicious, they can just turn it off and see what difference it makes.  If they have to back out a patch from 6 months ago in an area of the code they aren't particularly interested in, then recompile and juggle two different sets of binaries, they will likely just shrug it off without investigating.

Cheers,

Jeff
