Re: Super Optimizing Postgres

From: mlw
Subject: Re: Super Optimizing Postgres
Date:
Msg-id: 3BF6733A.1D8A48BF@mohawksoft.com
In reply to: Re: Super Optimizing Postgres  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
Tom Lane wrote:
> 
> Justin Clift <justin@postgresql.org> writes:
> > I think it's an interesting thought of having a program which will test
> > a system and work out the Accurate and Correct values for this.
> 
> I think if you start out with the notion that there is an Accurate
> and Correct value for these parameters, you've already lost the game.

Pardon me, but this is a very scary statement. If you believe that this is
true, then the planner/optimizer is inherently flawed.

If the numbers are meaningless, they should not be used. If they are not
meaningless, then it must be possible to tune them.

In my example, two computers have exactly the same hardware, except that one has
a 5400 RPM IDE drive and the other has a 10,000 RPM IDE drive. These machines
should not use the same settings; it is obvious that a sequential-scan block
read on one will be faster than on the other.

> They're inherently fuzzy numbers because they are parameters of an
> (over?) simplified model of reality.

Very true, and this also scares me. Relating processing time to disk I/O seems
like a very questionable approach these days. Granted, this strategy was
probably devised when computer systems were a lot simpler, but today, with
internal disk caching, CPU instruction caches, pipelining, L2 caching
techniques, clock multiplication, RAID controllers, and so on, the picture is
far more complicated.

That being said, a running system should exhibit "consistent" performance, and
that should be measurable.

> 
> It would be interesting to try to fit the model to reality on a wide
> variety of queries, machines & operating environments, and see what
> numbers we come up with.  But there's always going to be a huge fuzz
> factor involved.  Because of that, I'd be *real* wary of any automated
> tuning procedure.  Without a good dollop of human judgement in the loop,
> an automated parameter-setter will likely go off into never-never land
> :-(

It is an interesting problem:

It appears that "sequential scan" is the root measure for the system. Everything
else is biased off of it. A good "usable" number for the sequential scan would
need to be established.
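For reference, the planner's cost arithmetic is built on exactly that
convention: the cost of one sequential page fetch is defined as 1.0, and every
estimate comes out roughly as

    cost ~= (sequential pages) * 1.0
          + (random pages)     * random_page_cost
          + (tuples processed) * cpu_tuple_cost
          + (index tuples)     * cpu_index_tuple_cost
          + (operator calls)   * cpu_operator_cost

so any error in the sequential-scan baseline is inherited by every other
parameter.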

Working on the assumption that a server will have multiple back-ends running, we
would start (n) threads (or forks), where (n) is tunable by the admin based on
the expected concurrency of their system. There would be two disk I/O test
routines: one performing a sequential scan, the other a series of random page
reads (a rough sketch appears below).

(n)/2 threads would perform sequential scans.
(n)/2 threads would perform random page reads.

(Perhaps we can even have the admin select the ratio between random and
sequential? Or let the admin choose how many of each?)
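A minimal sketch of what such a profiler might look like, assuming a standalone
C program outside the backend; BLCKSZ matches PostgreSQL's default block size,
while NREADS, the program name, and the output format are all invented:

/*
 * calib.c -- hypothetical I/O calibration sketch, not an existing tool.
 * Forks n workers against a file larger than RAM: the first n/2 read it
 * sequentially, the rest read random pages, and each reports its average
 * microseconds per 8K page read.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/wait.h>

#define BLCKSZ  8192            /* PostgreSQL's default page size */
#define NREADS  4096            /* reads per worker: an assumption */

static double
time_reads(const char *path, long npages, int sequential)
{
    char        buf[BLCKSZ];
    struct timeval start, stop;
    long        i;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        exit(1);
    }
    gettimeofday(&start, NULL);
    for (i = 0; i < NREADS; i++)
    {
        long        page = sequential ? i % npages : random() % npages;

        if (pread(fd, buf, BLCKSZ, (off_t) page * BLCKSZ) != BLCKSZ)
        {
            perror("pread");
            exit(1);
        }
    }
    gettimeofday(&stop, NULL);
    close(fd);
    return ((stop.tv_sec - start.tv_sec) * 1000000.0 +
            (stop.tv_usec - start.tv_usec)) / NREADS;
}

int
main(int argc, char **argv)
{
    struct stat st;
    int         n, i;

    if (argc != 3 || stat(argv[1], &st) < 0)
    {
        fprintf(stderr, "usage: calib <file-bigger-than-RAM> <n-workers>\n");
        return 1;
    }
    n = atoi(argv[2]);
    for (i = 0; i < n; i++)
    {
        if (fork() == 0)        /* child: run one test routine */
        {
            int         seq = (i < n / 2);

            srandom(getpid());  /* distinct random sequence per worker */
            printf("%s: %.1f us/read\n", seq ? "sequential" : "random",
                   time_reads(argv[1], st.st_size / BLCKSZ, seq));
            _exit(0);
        }
    }
    while (wait(NULL) > 0)      /* parent: wait for all workers */
        ;
    return 0;
}

Averaging each group's output would give the sequential baseline and the
random-to-sequential ratio described next.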

The result of the I/O profiling would be a reasonable average number of
microseconds for a sequential scan, plus a ratio to random page reads. This
would be done on files whose size is larger than the available memory of the
machine, to ensure the files do not stay in the OS cache. Each routine would
make several iterations before quitting.
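For example (invented numbers): if the sequential workers settle around 150
microseconds per page and the random workers around 600, the measured ratio is
600/150 = 4.0, which happens to be the current default for random_page_cost. A
system that measured 2.0 or 8.0 instead would have learned something real about
itself.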

This should produce a reasonable picture of the user's system in action.

We could then apply standard code-profiling techniques to representative test
routines for each of the cpu_xxx settings, and compare the profile of those
routines against the measured time of a sequential scan.

The real trick is the code-profiling "test routines": what kind of processing do
the cpu_xxx settings represent?
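That is the open question; but assuming a trivial comparison loop as a stand-in
test routine, converting its profile into planner units might look like the
following sketch (seq_us_per_page would come from the disk benchmark above; the
loop body and all names here are invented):

#include <stdio.h>
#include <sys/time.h>

#define NOPS    10000000L       /* iterations: an assumption */

int
main(void)
{
    struct timeval start, stop;
    volatile long acc = 0;      /* volatile keeps the loop from being optimized away */
    long        i;
    double      op_us,
                seq_us_per_page = 150.0;    /* measured by the I/O profiler */

    gettimeofday(&start, NULL);
    for (i = 0; i < NOPS; i++)
        acc += (i & 1) ? 1 : -1;    /* stand-in for one "operator" */
    gettimeofday(&stop, NULL);

    op_us = ((stop.tv_sec - start.tv_sec) * 1000000.0 +
             (stop.tv_usec - start.tv_usec)) / NOPS;
    printf("cpu_operator_cost ~= %.6f (acc = %ld)\n",
           op_us / seq_us_per_page, acc);
    return 0;
}

Whether one addition-plus-branch is representative of an operator call is
exactly the judgement call that would have to be made for each of the cpu_xxx
settings.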

