Re: AW: Call for alpha testing: planner statistics revision s

From: Tom Lane
Subject: Re: AW: Call for alpha testing: planner statistics revision s
Date:
Msg-id: 1731.992874911@sss.pgh.pa.us
In response to: AW: Call for alpha testing: planner statistics revision s  (Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>)
List: pgsql-hackers
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> Imho that is not optimal :-) ** ducks head, to evade flying hammer **
> 1. the random sample approach should be explicitly requested with some 
> syntax extension

I don't think so ... with the current implementation you *must* do
approximate ANALYZE for large tables, or face memory overflow.
We can debate where the threshold should be, but you can't get around
the fact that approximation is essential with large tables.

> 2. the sample size should also be tuneable with some analyze syntax 
> extension (the dba chooses the tradeoff between accuracy and runtime)

The sample size is already driven by the largest SET STATISTICS value
for any of the columns of the table being analyzed.  I'm not sure if we
need a user-tweakable multiplier or not.  The current multiplier is 300
(ie, 3000 sample rows with the default SET STATISTICS target of 10).
This is not a random choice; there is some theory behind it:
    * The following choice of minrows is based on the paper
    * "Random sampling for histogram construction: how much is enough?"
    * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
    * Proceedings of ACM SIGMOD International Conference on Management
    * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
    * says that for table size n, histogram size k, maximum relative
    * error in bin size f, and error probability gamma, the minimum
    * random sample size is
    *    r = 4 * k * ln(2*n/gamma) / f^2
    * Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
    *    r = 305.82 * k
    * Note that because of the log function, the dependence on n is
    * quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
    * bin size error with probability 0.99.  So there's no real need
    * to scale for n, which is a good thing because we don't necessarily
    * know it at this point.
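
For concreteness, here is a small standalone sketch (my own illustration, not
code from analyze.c) that plugs the quoted numbers into that formula; the
table size n, target k, and variable names are just assumptions for the
example:

    /*
     * Sketch: evaluate the Chaudhuri/Motwani/Narasayya bound quoted above,
     *    r = 4 * k * ln(2*n/gamma) / f^2,
     * to show where the ~300*k multiplier comes from.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  n = 1000000.0;  /* assumed table size: 1 million rows */
        double  f = 0.5;        /* maximum relative error in bin size */
        double  gamma_ = 0.01;  /* error probability */
        int     k = 10;         /* histogram size = default stats target */

        /* minimum random sample size per Corollary 1 to Theorem 5 */
        double  r = 4.0 * k * log(2.0 * n / gamma_) / (f * f);

        printf("r = %.2f rows (%.2f * k)\n", r, r / k);
        /* for these inputs, prints roughly r = 3058.22 rows (305.82 * k) */
        return 0;
    }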
 

> 3. if at all, an automatic analyze should do the samples on small tables,
> and accurate stats on large tables

Other way 'round, surely?  It already does that: if your table has fewer
rows than the sampling target, they all get used.

> When on the other hand the optimizer does a "mistake" on a huge table
> the difference is easily a matter of hours, thus you want accurate stats.

Not if it takes hours to get the stats.  I'm more interested in keeping
ANALYZE cheap and encouraging DBAs to run it frequently, so that the
stats stay up-to-date.  It doesn't matter how perfect the stats were
when they were made, if the table has changed since then.

> Because we do not want the dba to decide which statistics are optimal,
> there should probably be an analyze helper application that is invoked
> with "vacuum analyze database optimal" or some such, that also decides 
> whether a table was sufficiently altered to justify new stats gathering
> or vacuum.

And on what are you going to base "sufficiently altered"?
        regards, tom lane

