Обсуждение: number of rows in analyze

Поиск
Список
Период
Сортировка

number of rows in analyze

От
Willy-Bas Loos
Дата:
Hi,

I'ver been wondering how to set the number of rows that are scanned by
analyze. (I want to increase it)
I couln't find it at first but when i read closer i found that is does:
"The largest statistics target among the columns being analyzed
determines the number of table rows sampled to prepare the
statistics."
(http://www.postgresql.org/docs/8.4/interactive/sql-analyze.html)


My question is *HOW* does it "detirmine the number of table rows
sampled" with the statistics target?
What is the formula behind it?

I am not familiar enough with the source code to figure that out
myself. sry about that.
Would be good to include in the docs if we find out.

Cheers,

WBL

--
"Patriotism is the conviction that your country is superior to all
others because you were born in it." -- George Bernard Shaw

Re: number of rows in analyze

От
"ktm@rice.edu"
Дата:
On Mon, Aug 01, 2011 at 04:42:23PM +0200, Willy-Bas Loos wrote:
> Hi,
>
> I'ver been wondering how to set the number of rows that are scanned by
> analyze. (I want to increase it)
> I couln't find it at first but when i read closer i found that is does:
> "The largest statistics target among the columns being analyzed
> determines the number of table rows sampled to prepare the
> statistics."
> (http://www.postgresql.org/docs/8.4/interactive/sql-analyze.html)
>
>
> My question is *HOW* does it "detirmine the number of table rows
> sampled" with the statistics target?
> What is the formula behind it?
>
> I am not familiar enough with the source code to figure that out
> myself. sry about that.
> Would be good to include in the docs if we find out.
>
> Cheers,
>
> WBL
>
Peruse the above reference a bit more carefully. It is actually
explained pretty well.

Regards,
Ken

Re: number of rows in analyze

От
Tom Lane
Дата:
Willy-Bas Loos <willybas@gmail.com> writes:
> "The largest statistics target among the columns being analyzed
> determines the number of table rows sampled to prepare the
> statistics."
> (http://www.postgresql.org/docs/8.4/interactive/sql-analyze.html)

> My question is *HOW* does it "detirmine the number of table rows
> sampled" with the statistics target?
> What is the formula behind it?

From src/backend/commands/analyze.c:

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *        r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
         *        r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;

> I am not familiar enough with the source code to figure that out
> myself. sry about that.
> Would be good to include in the docs if we find out.

Why?  This is purely an implementation detail.  From the user's
viewpoint, either the stats are good enough or they're not --- the exact
number of rows sampled doesn't seem that interesting.

            regards, tom lane

Re: number of rows in analyze

От
Willy-Bas Loos
Дата:
Ok, so 300 times the statistics target!
Thanks a lot for the exact code and formula!

On Mon, Aug 1, 2011 at 4:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Why?  This is purely an implementation detail.  From the user's
> viewpoint, either the stats are good enough or they're not --- the exact
> number of rows sampled doesn't seem that interesting.
>
>                        regards, tom lane

Because the manual says that i can influence the number of rows that
are sampled by increasing the statistics target.
But i don't know how sensitive that value is, i want to know what it
is that i am about to do.

btw i don't like that the number of statistics records are coupled
with the number of rows sampled. it's not only confusing, it doesn't
seem optimal to me.
I guess i will look up the paper to find out more. I didn't make an
academic study about it (yet :P ) .

Cheers!

WBL



--
"Patriotism is the conviction that your country is superior to all
others because you were born in it." -- George Bernard Shaw