Re: default_statistics_target WAS: max_wal_senders must die

From: Josh Berkus
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date:
Msg-id: 4CBF9169.4030408@agliodbs.com
In reply to: Re: default_statistics_target WAS: max_wal_senders must die  (Greg Stark <gsstark@mit.edu>)
Responses: Re: default_statistics_target WAS: max_wal_senders must die  (Greg Stark <gsstark@mit.edu>)
List: pgsql-hackers
> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket size given.

I also just realized that I confused myself ... we don't really want
more MCVs.  What we want is more *samples* from which to derive a small
number of MCVs.  Right now the number of samples and the number of MCVs
are inextricably bound, and they shouldn't be.  On larger tables, you're
correct that we don't necessarily want more MCVs; we just need more
samples to figure out those MCVs accurately.

> Can you explain when this would and wouldn't bias the sample for the
> users so they can decide whether to use it or not?

Sure.  There's some good math in various ACM papers on this.  The
basics are that block-based sampling should be accompanied by an
increased sample size, or else you are lowering your confidence level.
But since block-based sampling lets you increase your sample size
without increasing I/O or RAM usage, you *can* take a larger sample ...
a *much* larger sample if you have small rows.
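
The arithmetic there is simple enough to sketch (the numbers below are
assumptions for the example, not measurements from anywhere):

# Same I/O budget for both strategies; rows-per-block is assumed.
ROWS_PER_BLOCK = 100    # small rows: ~100 tuples per 8KB block
BLOCKS_READ = 300       # fixed I/O budget

row_sample = BLOCKS_READ                     # ~1 useful row per block read
block_sample = BLOCKS_READ * ROWS_PER_BLOCK  # whole block comes along anyway

print("row-based sample:  ", row_sample)     # 300
print("block-based sample:", block_sample)   # 30000

The caveat is that those 30,000 rows are not 30,000 independent
observations, which is what the correlation correction below has to
account for.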

The algorithms for deriving stats from a block-based sample are a bit
more complex, because the code needs to determine the level of physical
correlation in the blocks sampled and adjust the stats based on that.
So there would be an increase in CPU time.  As a result, we'd probably
give some advice like "random sampling for small tables, block-based
for large ones".
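
The standard tool for that correction, from cluster-sampling theory, is
the design effect, deff = 1 + (m - 1) * rho, where m is rows per block
and rho is the intraclass correlation.  Here's a hedged sketch of the
idea in Python (my invention for illustration, not code from any
patch):

from statistics import mean, pvariance

def effective_sample_size(blocks):
    """blocks: one list of numeric values per sampled block."""
    values = [v for b in blocks for v in b]
    n = len(values)
    m = mean(len(b) for b in blocks)          # average rows per block
    total_var = pvariance(values)
    if total_var == 0:
        return n
    # Between-block share of the variance stands in for rho.
    rho = min(1.0, max(0.0,
              pvariance([mean(b) for b in blocks]) / total_var))
    deff = 1 + (m - 1) * rho                  # design effect
    return n / deff

# Fully clustered blocks act like ~1 independent row per block ...
clustered = [[i] * 100 for i in range(10)]
# ... while well-mixed blocks keep the full nominal sample.
mixed = [[(i * 7 + j * 13) % 50 for j in range(100)] for i in range(10)]

print("clustered: n_eff = %.0f" % effective_sample_size(clustered))  # 10
print("mixed:     n_eff = %.0f" % effective_sample_size(mixed))      # 1000

A deff near 1 means the sampled blocks are as good as random rows; a
deff near m means each block is effectively one observation.  That
per-block bookkeeping on top of the global pass is where the extra CPU
time goes.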

> I think increasing the MCV is too simplistic since we don't really
> have any basis for any particular value. I think what we need are some
> statistics nerds to come along and say here's this nice tool from
> which you can make the following predictions and understand how
> increasing or decreasing the data set size affects the accuracy of the
> predictions.

Agreed.

Nathan?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

