From: Jim C. Nasby
Subject: Re: Sort and index
Date:
Msg-id: 20050424220146.GN58835@decibel.org
In reply to: Re: Sort and index  (Tom Lane)
Responses: Re: Sort and index  (Manfred Koizar)
List: pgsql-performance

Discussion tree:

Sort and index  (Andrei Gaspar)
 Re: Sort and index  ("Dave Held")
  Re: Sort and index  (Andrei Gaspar)
  Re: Sort and index  (Michael Fuhr)
   Re: Sort and index  (Andrei Gaspar)
  Re: Sort and index  ("Jim C. Nasby")
   Re: Sort and index  (Tom Lane)
    Re: Sort and index  ("Jim C. Nasby")
     Re: Sort and index  ("Jim C. Nasby")
      Re: Sort and index  (Tom Lane)
       Re: Sort and index  ("Jim C. Nasby")
        Re: Sort and index  (Tom Lane)
         Re: Sort and index  ("Jim C. Nasby")
          Re: Sort and index  (Manfred Koizar)
           Re: Sort and index  ("Jim C. Nasby")
            Re: Sort and index  (Manfred Koizar)
             Re: Sort and index  ("Jim C. Nasby")

On Sat, Apr 23, 2005 at 01:00:40AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <> writes:
> >> Feel free to propose better cost equations.
>
> > Where would I look in code to see what's used now?
>
> All the gold is hidden in src/backend/optimizer/path/costsize.c.
>
>             regards, tom lane

After setting up a second test that orders the table by a highly
non-correlated column, I think I've found part of the problem. The
estimated index scan cost for (project_id, id, date) is
0.00..100117429.34 while the estimate for work_units is
0.00..103168408.62; almost no difference, even though project_id
correlation is .657 while work_units correlation is .116. This is with
random_page_cost set to 1.1; if I set it much higher I can't force the
index scan (BTW, would it make more sense to set the cost of a disabled
seqscan to either pages or tuples * disable_cost?), but even with only a
10% overhead on random page fetches it seems logical that the two
estimates should be much farther apart. If you look at the results of
the initial run (http://stats.distributed.net/~decibel/timing.log),
you'll see that the cost of the index scan is way overestimated. Looking
at the code, run_cost is calculated as

    run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

where csquared is indexCorrelation^2. Why is indexCorrelation squared?
The comments say a linear interpolation between min_IO and max_IO is
used, but ISTM that if it were linear then indexCorrelation would be
used instead of csquared.
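
To make the effect concrete, here's a toy standalone sketch of that
interpolation (the min_IO_cost/max_IO_cost endpoints below are made-up
placeholders, not real planner numbers). Squaring means a correlation
of .657 only moves the estimate .432 of the way from max_IO_cost toward
min_IO_cost, while .116 moves it a mere .013 of the way, so both
estimates end up hugging the max end:

    /*
     * Toy comparison of the squared vs. straight-linear interpolation
     * done in cost_index() (src/backend/optimizer/path/costsize.c).
     * The cost endpoints here are hypothetical placeholders.
     */
    #include <stdio.h>

    static double
    interp_run_cost(double min_io, double max_io, double corr, int squared)
    {
        double  c = squared ? corr * corr : corr;

        /* costsize.c: run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost) */
        return max_io + c * (min_io - max_io);
    }

    int
    main(void)
    {
        double  min_io = 1.0e6;     /* pretend fully-correlated cost */
        double  max_io = 1.0e8;     /* pretend fully-random cost */
        double  corrs[] = {0.657, 0.116};
        int     i;

        for (i = 0; i < 2; i++)
            printf("corr %.3f: squared %.4g, linear %.4g\n",
                   corrs[i],
                   interp_run_cost(min_io, max_io, corrs[i], 1),
                   interp_run_cost(min_io, max_io, corrs[i], 0));
        return 0;
    }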

By the way, I'm running a test for ordering by work_units right now, and
I included code to allocate and zero 3.3G of memory (out of 4G) between
steps to clear the kernel buffers. This brought the seqscan times up to
~6800 seconds, so it seems there was in fact buffering going on in the
first test. The second test has been running an index scan for over 14
hours now, so clearly a seqscan+sort is the way to go for a highly
uncorrelated index (at least one that won't fit in
effective_cache_size).
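
For reference, the buffer-clearing step was essentially the following
(a minimal sketch; 3.3G is the figure from the test above, and whether
a single malloc() that large succeeds depends on the platform and
address-space limits):

    /*
     * Sketch of the cache-flushing step: allocate and zero ~3.3G so
     * the kernel evicts cached relation pages before the next timing
     * run.  Run it between test steps.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        size_t  bytes = (size_t) 3300 * 1024 * 1024;
        char   *buf = malloc(bytes);

        if (buf == NULL)
        {
            perror("malloc");
            return 1;
        }
        memset(buf, 0, bytes);  /* touch every page so it is really allocated */
        free(buf);
        return 0;
    }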
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

