Discussion: Questions on query planner, join types, and work_mem

From:
Peter Hussey
Date:

I have spent the last couple of weeks digging into a Postgres performance problem that ultimately boiled down to this:  the planner was choosing to use hash joins on a set of join keys that were much larger than the configured work_mem.  We found we could make the  performance much better by either
1) increasing work_mem to 500MB or more, or
2) forcing the planner to choose index-backed nested loops by turning off hash and merge joins as well as bitmap and sequential scans. 

Now we are trying to decide which of these paths to choose, and asking why the planner doesn't handle this for us.

Background:  LabKey builds an open source platform for biomedical research data.  The platform consists of a Tomcat web application and a relational database.  We support two databases, Postgres and SQL Server.  We started with SQL Server because we were very familiar with it; two of our technical team came from the SQL Server development team.  We chose Postgres because we assessed that it was the open source database most likely to be able to handle our application requirements for capacity and for handling complex, nested, generated SQL.  Postgres is now the default database for our platform and most of our key customers use it.  In general we've been very satisfied with Postgres' performance and compatibility, but our customers are starting to hit situations where we really need to be able to understand why a particular operation is slow.  We are currently recommending version 8.4 and using that ourselves.

The core of the problem query was

SELECT * INTO snapshot_table FROM
  (SELECT ... FROM tableA A LEFT OUTER JOIN tableB B
     ON (A.lsid = B.lsid) AND A.datasetid = ?) query1

The join column, lsid, is a poor choice for a join column, as it is a long varchar value (average length 101 characters) that only becomes unique far out toward the right-hand end of the string.  But we are stuck with this choice.  I can post the SQL query and table definitions if it will help, but changes to either of those would be risky and difficult, whereas setting the work_mem value or forcing nested loop joins is less risky.

The performance curve looks something like this:

Join Type      work_mem(MB)     time to populate snapshot (min)
______________________________________________________________
Hash              50                        85
Hash              200                       38
Hash              400                       21
Hash              500                       12
Hash             1000                       12
_______________________________________________________________
NestedLoop        50                        15
NestedLoop        200                       11
NestedLoop        400                       11
NestedLoop        500                       10
NestedLoop       1000                       10
________________________________________________________

Table A contains about 3.5 million rows, and table B contains about 4.4 million rows.  By looking at the EXPLAIN ANALYZE reports I concluded that the planner seemed to be accurately determining the approximate number of rows returned on each side of the join node.  I also noticed that in the work_mem = 50 test, the hash join query execution was using over a GB of space in pgsql_tmp, space that grew and shrank slowly over the course of the test.

Now for the questions:
1)  If we tell the customer to set his work_mem value to 500MB or 1GB in postgresql.conf, what problems might they see?  The documentation and the guidelines we received from Rupinder Singh in support suggest a much lower value, e.g. a max work_mem of 10MB.  Other documentation, such as the "Guide to Posting Slow Query Questions", suggests at least testing up to 1GB.  What is a reasonable maximum to configure for all connections?

2) How is work_mem used by a query execution?  For example, does each hash table in an execution get allocated a full work_mem's worth of memory?  Is this memory released when the query is finished, or does it stay attached to the connection or some other object?

3) Is there a reason why the planner doesn't seem to recognize the condition when the hash table won't fit in the current work_mem, and choose a low-memory plan instead?

Excuse the long-winded post; I was trying to give the facts and nothing but the facts.

Thanks,
Peter Hussey
LabKey Software

From:
Andres Freund
Date:

Hi,

On Tue, Jul 27, 2010 at 04:08:16PM -0700, Peter Hussey wrote:
> Now for the questions:
> 1)  If we tell the customer to set his work_mem value to 500MB or 1GB in
> postgres.config, what problems might they see?  the documentation and the
> guidelines we received from Rupinder Singh in support suggest a much lower
> value, e.g. a max work_mem of 10MB.  Other documentation such as the "Guide
> to Posting Slow Query Questions" suggest at least testing up to 1GB.  What
> is a reasonable maximum to configure for all connnections?
Well. That depends on the amount of expected concurrency and available
memory. Obviously you can set it much higher in an OLAPish, low-concurrency
setting than in an OLTP environment.

That setting is quite complex to estimate, in my opinion. For one, the
actual usage depends on the complexity of the queries; for another, to
be halfway safe you have to use avail_mem/(max_connections *
max_nodes_of_most_complex_query), which is often a very pessimistic
and unusably low estimate.
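To make that rule of thumb concrete, here is a back-of-the-envelope sketch; all the numbers are invented for illustration, not taken from Peter's system:

```sql
-- Pessimistic work_mem ceiling: avail_mem / (max_connections * nodes),
-- where "nodes" is the number of sort/hash nodes in the most complex query.
-- Assuming 16GB available memory, max_connections = 100, and 4 such nodes:
SELECT (16 * 1024) / (100 * 4) AS work_mem_mb;  -- 40 (MB, integer division)
```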

> 2) How is work_mem used by a query execution?  For example, does each hash
> table in an execution get allocated a full work_mem's worth of memory ?   Is
> this memory released when the query is finished, or does it stay attached to
> the connection or some other object?
Each node of the query can use one work_mem worth of data (sometimes a
bit more). The memory is released after the query finishes (or
possibly earlier, depending on the structure of the query).
The specific allocation pattern and implementation details (of malloc)
influence how and when that memory is actually returned to the OS.

> 3) is there a reason why the planner doesn't seem to recognize the condition
> when the hash table won't fit in the current work_mem, and choose a
> low-memory plan instead?
Hard to say without more information. Bad estimates, maybe? Best to show
your query plan (EXPLAIN ANALYZE), the table definitions, and some
details about the hardware (e.g. whether it has 1GB of memory or
256GB).

Andres

From:
Tom Lane
Date:

Peter Hussey <> writes:
> I have spent the last couple of weeks digging into a Postgres performance
> problem that ultimately boiled down to this:  the planner was choosing to
> use hash joins on a set of join keys that were much larger than the
> configured work_mem.

What Postgres version is this, exactly?  ("8.4" is not the answer I want.)

> the join column, lsid, is a poor choice for a join column as it is a long
> varchar value (avg length 101 characters) that us only gets unique way out
> on the right hand side.

Hm, but it is unique eventually?  It's not necessarily bad for hashing
as long as that's so.

> 1)  If we tell the customer to set his work_mem value to 500MB or 1GB in
> postgres.config, what problems might they see?

That would almost certainly be disastrous.  If you have to follow the
hack-work_mem path, I'd suggest increasing it locally in the session
executing the problem query, and only for the duration of that query.
Use SET, or even SET LOCAL.
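A minimal sketch of that approach (the 500MB figure comes from Peter's tests; the query body is elided as in the original post):

```sql
BEGIN;
SET LOCAL work_mem = '500MB';  -- reverts automatically at COMMIT/ROLLBACK
SELECT * INTO snapshot_table FROM (SELECT ...) query1;
COMMIT;
```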

> 2) How is work_mem used by a query execution?

Well, the issue you're hitting is that the executor is dividing the
query into batches to keep the size of the in-memory hash table below
work_mem.  The planner should expect that and estimate the cost of
the hash technique appropriately, but seemingly it's failing to do so.
Since you didn't provide EXPLAIN ANALYZE output, though, it's hard
to be sure.

> 3) is there a reason why the planner doesn't seem to recognize the condition
> when the hash table won't fit in the current work_mem, and choose a
> low-memory plan instead?

That's the question, all right.  I wonder if it's got something to do
with the wide-varchar nature of the join key ... but again that's just
speculation with no facts.  Please show us EXPLAIN ANALYZE results
for the hash plan with both small and large work_mem, as well as for
the nestloop plan.

            regards, tom lane

From:
Alvaro Herrera
Date:

Excerpts from Tom Lane's message of Tue Jul 27 20:05:02 -0400 2010:
> Peter Hussey <> writes:

> > 2) How is work_mem used by a query execution?
>
> Well, the issue you're hitting is that the executor is dividing the
> query into batches to keep the size of the in-memory hash table below
> work_mem.  The planner should expect that and estimate the cost of
> the hash technique appropriately, but seemingly it's failing to do so.
> Since you didn't provide EXPLAIN ANALYZE output, though, it's hard
> to be sure.

Hmm, I wasn't aware that hash joins worked this way wrt work_mem.  Is
this visible in the explain output?  If it's something subtle (like an
increased total cost), may I suggest that it'd be a good idea to make it
explicit somehow in the machine-readable outputs?

From:
Jayadevan M
Date:

Hello,
> the join column, lsid, is a poor choice for a join column as it is a
> long varchar value (avg length 101 characters) that us only gets
> unique way out on the right hand side.
Would a join on a substring of the 'way out on the right hand side' part (did
you mean 'rightmost characters', or 'only when we take almost all the 101
characters'?), together with a function-based index, help?
Regards,
Jayadevan
Regards,
Jayadevan


From:
Tom Lane
Date:

Alvaro Herrera <> writes:
> Excerpts from Tom Lane's message of Tue Jul 27 20:05:02 -0400 2010:
>> Well, the issue you're hitting is that the executor is dividing the
>> query into batches to keep the size of the in-memory hash table below
>> work_mem.  The planner should expect that and estimate the cost of
>> the hash technique appropriately, but seemingly it's failing to do so.

> Hmm, I wasn't aware that hash joins worked this way wrt work_mem.  Is
> this visible in the explain output?

As of 9.0, any significant difference between "Hash Batches" and
"Original Hash Batches" would be a cue that the planner blew the
estimate.  For Peter's problem, we're just going to have to look
to see if the estimated cost changes in a sane way between the
small-work_mem and large-work_mem cases.
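For reference, on 9.0 that cue would look something like the following; this is an illustrative sketch, not Peter's actual plan output:

```sql
EXPLAIN ANALYZE
SELECT ... FROM tableA a LEFT JOIN tableB b ON a.lsid = b.lsid;
-- The Hash node then reports a line of the form
--   Buckets: 1024  Batches: 64 (originally 8)  Memory Usage: ...kB
-- where "Batches" much larger than "originally" means the planner's
-- work_mem-based batch estimate was blown at execution time.
```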

            regards, tom lane

From:
Tom Lane
Date:

Peter Hussey <> writes:
> Using the default of 1MB work_mem, the planner chooses a hash join plan :
> "Hash Left Join  (cost=252641.82..11847353.87 rows=971572 width=111) (actual
> time=124196.670..280461.604 rows=968080 loops=1)"
> ...
> For the same default 1MB work_mem, a nested loop plan is better
> "Nested Loop Left Join  (cost=8.27..15275401.19 rows=971572 width=111)
> (actual time=145.015..189957.023 rows=968080 loops=1)"
> ...

Hm.  A nestloop with nearly a million rows on the outside is pretty
scary.  The fact that you aren't unhappy with that version of the plan,
rather than the hash, indicates that the "object" table must be
fully cached in memory, otherwise the repeated indexscans would be a
lot slower than this:

> "  ->  Index Scan using uq_object on object obj  (cost=0.00..3.51 rows=1
> width=95) (actual time=0.168..0.170 rows=1 loops=968080)"
> "        Index Cond: ((sd.lsid)::text = (obj.objecturi)::text)"

My take on it is that the estimate of the hash plan's cost isn't bad;
what's bad is that the planner is mistakenly estimating the nestloop as
being worse.  What you need to do is adjust the planner's cost
parameters so that it has a better idea of the true cost of repeated
index probes in your environment.  Crank up effective_cache_size if
you didn't already, and experiment with lowering random_page_cost.
See the list archives for more discussion of these parameters.
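A session-level sketch of that experiment (the values here are placeholders to be tuned, not recommendations):

```sql
SET effective_cache_size = '4GB';  -- roughly shared_buffers + OS cache
SET random_page_cost = 2.0;        -- cheaper random access for a well-cached DB
EXPLAIN ANALYZE SELECT ...;        -- then re-check which plan the planner picks
```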

            regards, tom lane

From:
Peter Hussey
Date:

I already had effective_cache_size set to 500MB.

I experimented with lowering random_page_cost to 3, then 2.  It made no difference in the choice of plan that I could see.  In the EXPLAIN ANALYZE output the estimated costs of the nested loop were in fact lowered, but so were the costs of the hash join plan, and the hash join remained the lowest predicted cost in all tests I tried.

What seems wrong to me is that the hash join strategy shows almost no difference in estimated costs as work_mem goes from 1MB to 500MB. The cost function decreases by 1%, but the actual time for the query to execute decreases by 86% as work_mem goes from 1MB to 500MB.

My questions are still
1)  Does the planner have any component of cost calculations based on the size of work_mem, and if so why do those calculations  seem to have so little effect here?

2) Why is the setting of work_mem something left to the admin and/or developer?  Couldn't the optimizer say how much it thinks it needs to build a hash table based on size of the keys and estimated number of rows?

It is difficult for a software development platform like ours to take advantage of suggestions to set work_mem, or to change the cost function, or to turn join strategies on or off for individual queries.  The SQL we issue is formed by user interaction with the product and is rarely static.  How would we know when to turn something on or off?  That's why I'm looking for a configuration solution that I can set on a database-wide basis and have it work well for all queries.

thanks
Peter


--
Peter Hussey
LabKey Software
206-667-7193 (office)
206-291-5625 (cell)
From:
Robert Haas
Date:

On Mon, Aug 2, 2010 at 5:23 PM, Peter Hussey <> wrote:
> I already had effective_cache_size set to 500MB.
>
> I experimented with lowering  random_page_cost to 3 then 2.  It made no
> difference in the choice of plan that I could see.  In the explain analyze
> output the estimated costs of nested loop were in fact lowererd, but so were
> the costs of the hash join plan, and the hash join remained the lowest
> predicted costs in all tests i tried.

What do you get if you set random_page_cost to a small value such as 0.01?

> What seems wrong to me is that the hash join strategy shows almost no
> difference in estimated costs as work_mem goes from 1MB to 500MB. The cost
> function decreases by 1%, but the actual time for the query to execute
> decreases by 86% as work_mem goes from 1MB to 500MB.

Wow.  It would be interesting to find out how many batches are being
used.  Unfortunately, releases prior to 9.0 don't display that
information.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:
Tom Lane
Date:

Peter Hussey <> writes:
> My questions are still
> 1)  Does the planner have any component of cost calculations based on the
> size of work_mem,

Sure.

> and if so why do those calculations  seem to have so
> little effect here?

Since you haven't provided sufficient information to let someone else
reproduce what you're seeing, it's pretty hard to say.  It might have
something to do with the particularly wide join key values you're using,
but that's mere speculation based on the one tidbit you provided.  There
might be some other effect altogether that's making it do the wrong thing.

> 2) Why is the setting of work_mem something left to the admin and/or
> developer?

Because we're not smart enough to find a way to avoid that.

            regards, tom lane

From:
Hannu Krosing
Date:

On Mon, 2010-08-02 at 14:23 -0700, Peter Hussey wrote:
> I already had effective_cache_size set to 500MB.
>
> I experimented with lowering  random_page_cost to 3 then 2.

In the case of a fully cached database it is closer to 1.

> 2) Why is the setting of work_mem something left to the admin and/or
> developer?  Couldn't the optimizer say how much it thinks it needs to
> build a hash table based on size of the keys and estimated number of
> rows?

Yes, it can say how much it thinks it needs to build a hash table; the
part it can't figure out is how much it can afford, based on things like
the number of concurrent queries and how much work_mem they are using.
Any work_mem used will be subtracted from the total memory pool, also
affecting how much of the files the system caches.

> It is difficult for a software development platform like ours to take
> advantage of suggestions to set work_mem, or to change the cost
> function, or turn on/off join strategies for individual queries.  The
> SQL we issue is formed by user interaction with the product and rarely
> static.  How would we know when to turn something on or off?  That's
> why I'm looking for a configuration solution that I can set on a
> database-wide basis and have it work well for all queries.

Keep trying. The closer your conf gets to real conditions, the
better choices the optimiser can make ;)



--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



From:
Robert Haas
Date:

On Tue, Aug 3, 2010 at 3:03 AM, Hannu Krosing <> wrote:
> In case of fully cached database it is closer to 1.

In the case of a fully cached database I believe the correct answer
begins with a decimal point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:
"Kevin Grittner"
Date:

Peter Hussey <> wrote:

> I already had effective_cache_size set to 500MB.

That seems awfully small.  You do realize that this setting does not
cause PostgreSQL to allocate any memory; it merely advises how much
disk space is likely to be cached.  It should normally be set to the
sum of your shared_buffers setting and whatever your OS reports as
cached.  Setting it too small will discourage the optimizer from
picking plans which use indexes.

> I experimented with lowering  random_page_cost to 3 then 2.

As others have said, in a fully cached system that's still too high.
If the active portion of your database is fully cached, you should
set random_page_cost and seq_page_cost to the same value, and that
value should probably be in the range of 0.1 to 0.005.  It can get
trickier if the active portion is largely but not fully cached; we
have one server where we found, through experimentation, that we got
better plans overall with seq_page_cost = 0.3 and random_page_cost =
0.5 than any other settings we tried.
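Once a good pair of values is found by experimentation, they can be made sticky for a single database without editing postgresql.conf. A sketch, where the database name "mydb" and the 0.1 value are placeholders within the suggested range:

```sql
ALTER DATABASE mydb SET seq_page_cost = 0.1;
ALTER DATABASE mydb SET random_page_cost = 0.1;
-- New sessions connecting to mydb inherit these settings.
```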

-Kevin

From:
Hannu Krosing
Date:

On Wed, 2010-08-04 at 09:14 -0400, Robert Haas wrote:
> On Tue, Aug 3, 2010 at 3:03 AM, Hannu Krosing <> wrote:
> > In case of fully cached database it is closer to 1.
>
> In the case of a fully cached database I believe the correct answer
> begins with a decimal point.

The number 1 here was suggested in relation to seq_page_cost, which is
1.

For fully cached db there is no additional seek time for random access,
so seq_page_cost == random_page_cost.

Of course there are more variables than just *_page_cost, so if you nail
down any other one, you may end up with less than 1 for both page costs.

I have always used seq_page_cost = 1 in my thinking and adjusted others
relative to it.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



From:
Tom Lane
Date:

Hannu Krosing <> writes:
> Of course there are more variables than just *_page_cost, so if you nail
> down any other one, you may end with less than 1 for both page costs.

> I have always used seq_page_cost = 1 in my thinking and adjusted others
> relative to it.

Right, seq_page_cost = 1 is sort of the traditional reference point,
but you don't have to do it that way.  The main point here is that for
an all-in-RAM database, the standard page access costs are too high
relative to the CPU effort costs:

regression=# select name, setting from pg_settings where name like '%cost';
         name         | setting
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 random_page_cost     | 4
 seq_page_cost        | 1
(5 rows)

To model an all-in-RAM database, you can either dial down both
random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost
to 1 and increase all the CPU costs.  The former is less effort ;-)

It should be noted also that there's not all that much evidence backing
up the default values of the cpu_xxx_cost variables.  In the past those
didn't matter much because I/O costs always swamped CPU costs anyway.
But I can foresee us having to twiddle those defaults and maybe refine
the CPU cost model more, as all-in-RAM cases get more common.

            regards, tom lane

From:
Hannu Krosing
Date:

On Wed, 2010-08-04 at 14:00 -0400, Tom Lane wrote:
> Hannu Krosing <> writes:
> > Of course there are more variables than just *_page_cost, so if you nail
> > down any other one, you may end with less than 1 for both page costs.
>
> > I have always used seq_page_cost = 1 in my thinking and adjusted others
> > relative to it.
>
> Right, seq_page_cost = 1 is sort of the traditional reference point,
> but you don't have to do it that way.  The main point here is that for
> an all-in-RAM database, the standard page access costs are too high
> relative to the CPU effort costs:
>
> regression=# select name, setting from pg_settings where name like '%cost';
>          name         | setting
> ----------------------+---------
>  cpu_index_tuple_cost | 0.005
>  cpu_operator_cost    | 0.0025
>  cpu_tuple_cost       | 0.01
>  random_page_cost     | 4
>  seq_page_cost        | 1
> (5 rows)
>
> To model an all-in-RAM database, you can either dial down both
> random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost
> to 1 and increase all the CPU costs.  The former is less effort ;-)
>
> It should be noted also that there's not all that much evidence backing
> up the default values of the cpu_xxx_cost variables.  In the past those
> didn't matter much because I/O costs always swamped CPU costs anyway.
> But I can foresee us having to twiddle those defaults and maybe refine
> the CPU cost model more, as all-in-RAM cases get more common.

Especially the context switch + copy between shared buffers and system
disk cache will become noticeable at these speeds.

An easy way to test it is loading a table with a few indexes, once with
a shared_buffers value that is enough for only the main table, and once
with one that fits both the table and the indexes.


>             regards, tom lane


--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



From:
Hannu Krosing
Date:

On Wed, 2010-08-04 at 21:41 +0300, Hannu Krosing wrote:
> On Wed, 2010-08-04 at 14:00 -0400, Tom Lane wrote:

> > It should be noted also that there's not all that much evidence backing
> > up the default values of the cpu_xxx_cost variables.  In the past those
> > didn't matter much because I/O costs always swamped CPU costs anyway.
> > But I can foresee us having to twiddle those defaults and maybe refine
> > the CPU cost model more, as all-in-RAM cases get more common.
>
> Especially the context switch + copy between shared buffers and system
> disk cache will become noticeable at these speeds.
>
> An easy way to test it is loading a table with a few indexes, once with
> a shared_buffers value, which is senough for only the main table and
> once with one that fits both table and indexes,

OK, just to back this up, I ran the following test with 28MB and 128MB
shared_buffers.

create table sbuf_test(f1 float, f2 float, f3 float);
create index sbuf_test1 on sbuf_test(f1);
create index sbuf_test2 on sbuf_test(f2);
create index sbuf_test3 on sbuf_test(f3);

and then did 3 times the following for each shared_buffers setting

truncate sbuf_test;
insert into sbuf_test
select random(), random(), random() from generate_series(1,600000);

The main table size was 31MB; the indexes were 18MB each, for a total
size of 85MB.

In the case of 128MB shared buffers, the insert ran in 14 sec (+/- 1 sec).

In the case of 28MB shared buffers, the insert ran between 346 and 431 sec,
that is, 20-30 _times_ slower.

There was ample space for keeping the indexes in linux cache (it has 1GB
cached currently) though the system may have decided to start writing it
to disk, so I suspect that most of the time was spent copying random
index pages back and forth between shared buffers and disk cache.

I did not verify this, so there may be some other factors involved, but
this seems like the most obvious suspect.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



From:
Greg Smith
Date:

Hannu Krosing wrote:
> There was ample space for keeping the indexes in linux cache (it has 1GB
> cached currently) though the system may have decided to start writing it
> to disk, so I suspect that most of the time was spent copying random
> index pages back and forth between shared buffers and disk cache.
>

Low shared_buffers settings will result in the same pages being written
multiple times per checkpoint, particularly index pages, which is less
efficient than keeping them in the database cache and updating them
there.  This is a slightly different issue than just the overhead of
copying them back and forth; by keeping them in cache, you actually
reduce writes to the OS cache.  What I do to quantify that is...well,
the attached shows it better than I can describe; it only works on 9.0
or later as it depends on a feature I added for this purpose there.  It
measures exactly how much buffer cache churn happened during a test, in
this case creating a pgbench database.
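The underlying idea can be sketched without the attachment: snapshot the cluster-wide buffer statistics before and after the test and compare the deltas (these system-view columns exist in 8.3 and later):

```sql
SELECT now(), * FROM pg_stat_bgwriter;  -- snapshot before the test
-- ... run the workload, e.g. initialize a pgbench database ...
SELECT now(), * FROM pg_stat_bgwriter;  -- snapshot after; the growth in
-- buffers_checkpoint, buffers_clean, buffers_backend and buffers_alloc
-- shows how much buffer cache churn the workload caused
```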

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Greg Smith
Date:

This time with attachment...

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Tom Lane
Date:

Hannu Krosing <> writes:
> There was ample space for keeping the indexes in linux cache (it has 1GB
> cached currently) though the system may have decided to start writing it
> to disk, so I suspect that most of the time was spent copying random
> index pages back and forth between shared buffers and disk cache.

If you're on a platform that has oprofile, you could probably verify
that rather than just guess it ...

            regards, tom lane

From:
"Kevin Grittner"
Date:

Greg Smith <> wrote:

> What I do to quantify that is...well, the attached shows it better
> than I can describe; only works on 9.0 or later as it depends on a
> feature I added for this purpose there. It measures exactly how
> much buffer cache churn happened during a test, in this case
> creating a pgbench database.

I'm not entirely sure I understand what I'm supposed to get from
that.  On a 3GB workstation, a compile from a recent HEAD checkout,
with a default postgresql.conf file, I get this:

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:25:46.683766-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 0
buffers_alloc      | 73

Initializing pgbench
-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:27:49.062551-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 633866
buffers_alloc      | 832

I boost shared_buffers from 32MB to 320MB, restart, and get this:

-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:30:42.816719-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 0
buffers_alloc      | 0

Initializing pgbench
-[ RECORD 1 ]------+------------------------------
now                | 2010-08-04 14:32:40.750098-05
checkpoints_timed  | 0
checkpoints_req    | 0
buffers_checkpoint | 0
buffers_clean      | 0
maxwritten_clean   | 0
buffers_backend    | 630794
buffers_alloc      | 2523

So run time dropped from 123 seconds to 118 seconds, buffers_backend
dropped by less than 0.5%, and buffers_alloc went up.  Assuming this
is real, and not just "in the noise" -- what conclusions would you
draw from this?  Dedicating an additional 10% of my free memory got
me a 4% speed improvement?  Was I supposed to try with other scales?
Which ones?

-Kevin

From:
Hannu Krosing
Date:

On Wed, 2010-08-04 at 22:03 +0300, Hannu Krosing wrote:
> On Wed, 2010-08-04 at 21:41 +0300, Hannu Krosing wrote:
> > Especially the context switch + copy between shared buffers and system
> > disk cache will become noticeable at these speeds.
> >
> > An easy way to test it is loading a table with a few indexes, once with
> > a shared_buffers value which is enough for only the main table and
> > once with one that fits both table and indexes,

I re-ran the test and checked idx_blks_read for the 28MB case:

hannu=# select * from pg_statio_user_indexes where relname =
'sbuf_test';
 schemaname |  relname  | indexrelname | idx_blks_read | idx_blks_hit
------------+-----------+--------------+---------------+--------------
 hannu      | sbuf_test | sbuf_test1   |         71376 |      1620908
 hannu      | sbuf_test | sbuf_test2   |         71300 |      1620365
 hannu      | sbuf_test | sbuf_test3   |         71436 |      1619619


This means that a total of 214112 index blocks were read back from the
disk cache (and obviously at least some of these had to be copied the
other way as well).

This seems to indicate about 1 ms for moving pages over the user/system
boundary. (Intel Core2 Duo T7500 @ 2.20GHz, Ubuntu 9.10, 4GB RAM)

For 128MB shared buffers the total idx_blks_read for the 3 indexes was
about 6300.
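
A rough cross-check of that estimate (my arithmetic, not from the thread: assuming the slow runs cost about 400 s more than the fast 14 s run, and that roughly as many pages were written back as were read):

```sql
-- ~400 s of extra run time spread over ~2 x 214112 page transfers
SELECT 400.0 * 1000 / (214112 * 2) AS ms_per_transfer;  -- roughly 0.9
```

which lands in the same ballpark as the ~1 ms figure.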


> ok, just to back this up I ran the following test with 28MB and 128MB
> shared buffers.
>
> create table sbuf_test(f1 float, f2 float, f3 float);
> create index sbuf_test1 on sbuf_test(f1);
> create index sbuf_test2 on sbuf_test(f2);
> create index sbuf_test3 on sbuf_test(f3);
>
> and then did 3 times the following for each shared_buffers setting
>
> truncate sbuf_test;
> insert into sbuf_test
> select random(), random(), random() from generate_series(1,600000);
>
> the main table size was 31MB, indexes were 18MB each for total size of
> 85MB
>
> in case of 128MB shared buffers, the insert ran in 14 sec (+/- 1 sec)
>
> in case of 28MB shared buffers, the insert took between 346 and 431 sec,
> that is 20-30 _times_ slower.
>
> There was ample space for keeping the indexes in linux cache (it has 1GB
> cached currently) though the system may have decided to start writing it
> to disk, so I suspect that most of the time was spent copying random
> index pages back and forth between shared buffers and disk cache.
>
> I did not verify this, so there may be some other factors involved, but
> this seems like the most obvious suspect.
>
> --
> Hannu Krosing   http://www.2ndQuadrant.com
> PostgreSQL Scalability and Availability
>    Services, Consulting and Training
>
>
>



--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



From:
Greg Smith
Date:

Kevin Grittner wrote:
> Assuming this is real, and not just "in the noise" -- what conclusions would you
> draw from this?

I was trying to demonstrate the general ability of pg_stat_bgwriter
snapshots taken at points in time to directly measure the buffer activity
Hannu was theorizing about, not necessarily to show a useful benchmark of
any sort.  Watching pgbench create a database isn't all that
interesting unless you either a) increase the database scale so that
at least one timed checkpoint kicks in, or b) turn on archive_mode so
the whole WAL COPY optimization is defeated.  More on this topic later;
I just happened to have that little example script ready to demonstrate
the measurement concept.
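
The snapshot-diff idea can be sketched in plain SQL (a minimal sketch; the 9.0-era pg_stat_bgwriter columns are assumed, and `bg_before` is a made-up name, not from Greg's script):

```sql
-- take a baseline snapshot before the test
CREATE TEMP TABLE bg_before AS
    SELECT now() AS ts, * FROM pg_stat_bgwriter;

-- ... run the workload here ...

-- then diff the counters to see the buffer churn it caused
SELECT b.buffers_checkpoint - a.buffers_checkpoint AS by_checkpoint,
       b.buffers_clean      - a.buffers_clean      AS by_bgwriter,
       b.buffers_backend    - a.buffers_backend    AS by_backends,
       b.buffers_alloc      - a.buffers_alloc      AS allocated
  FROM pg_stat_bgwriter b, bg_before a;
```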

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Greg Smith
Date:

Hannu Krosing wrote:
> Do you mean "written to disk", or written out from shared_buffers to
> disk cache ?
>

The latter turns into the former eventually, so both, really.  The kernel
will do some amount of write combining for you if you're lucky.  But not
in all cases; it may decide to write something out to physical disk
before the second write shows up.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
"Kevin Grittner"
Date:

Hannu Krosing <> wrote:

> This seems to indicate about 1 ms for moving pages over
> user/system boundary. (Intel Core2 Duo T7500 @ 2.20GHz, Ubuntu
> 9.10, 4GB RAM)

Using Greg's test script on a box with two cores like this:

Intel(R) Pentium(R) D CPU 3.40GHz
Linux kgrittn-desktop 2.6.31-22-generic #60-Ubuntu SMP Thu May 27
00:22:23 UTC 2010 i686 GNU/Linux

Dividing the run time by accumulated buffers_backend, it comes to
less than 0.2 ms per dirty buffer flushed.  If I get a few spare
ticks I'll try again while checking what vmstat and oprofile say
about how much of that went to things besides the transfer from
shared buffers to the OS.  I mean, it's possible I was waiting on
actual disk I/O at some point.
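
(The arithmetic, assuming the 118 s / 630794-buffer run from upthread, is just:

```sql
SELECT 118.0 * 1000 / 630794 AS ms_per_buffer;  -- ~0.19
```

which is where the "less than 0.2 ms" comes from.)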

-Kevin

From:
"Kevin Grittner"
Date:

Greg Smith <> wrote:

> Was trying to demonstrate the general ability of pg_stat_bgwriter
> snapshots at points in time to directly measure the buffer
> activity Hannu was theorizing about, not necessarily show a useful
> benchmark of any sort with that.

Ah, OK.  Sorry I didn't pick up on that; I was struggling to tease
out some particular effect you expected to see in the numbers from
that particular run.  :-/

-Kevin

From:
Hannu Krosing
Date:

On Wed, 2010-08-04 at 15:16 -0400, Greg Smith wrote:
> Hannu Krosing wrote:
> > There was ample space for keeping the indexes in linux cache (it has 1GB
> > cached currently) though the system may have decided to start writing it
> > to disk, so I suspect that most of the time was spent copying random
> > index pages back and forth between shared buffers and disk cache.
> >
>
> Low shared_buffers settings will result in the same pages more often
> being written multiple times per checkpoint,

Do you mean "written to disk", or written out from shared_buffers to
disk cache ?

>  particularly index pages,
> which is less efficient than keeping in the database cache and updating
> them there.  This is a slightly different issue than just the overhead
> of copying them back and forth; by keeping them in cache, you actually
> reduce writes to the OS cache.

That's what I meant. Both writes to and reads from the OS cache take a
significant amount of time once you are not doing real disk I/O.

> What I do to quantify that is...well,
> the attached shows it better than I can describe; only works on 9.0 or
> later as it depends on a feature I added for this purpose there.  It
> measures exactly how much buffer cache churn happened during a test, in
> this case creating a pgbench database.
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
>    www.2ndQuadrant.us
>
>



From:
Bruce Momjian
Date:

Tom Lane wrote:
> Hannu Krosing <> writes:
> > Of course there are more variables than just *_page_cost, so if you nail
> > down any other one, you may end with less than 1 for both page costs.
>
> > I have always used seq_page_cost = 1 in my thinking and adjusted others
> > relative to it.
>
> Right, seq_page_cost = 1 is sort of the traditional reference point,
> but you don't have to do it that way.  The main point here is that for
> an all-in-RAM database, the standard page access costs are too high
> relative to the CPU effort costs:
>
> regression=# select name, setting from pg_settings where name like '%cost';
>          name         | setting
> ----------------------+---------
>  cpu_index_tuple_cost | 0.005
>  cpu_operator_cost    | 0.0025
>  cpu_tuple_cost       | 0.01
>  random_page_cost     | 4
>  seq_page_cost        | 1
> (5 rows)
>
> To model an all-in-RAM database, you can either dial down both
> random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost
> to 1 and increase all the CPU costs.  The former is less effort ;-)
>
> It should be noted also that there's not all that much evidence backing
> up the default values of the cpu_xxx_cost variables.  In the past those
> didn't matter much because I/O costs always swamped CPU costs anyway.
> But I can foresee us having to twiddle those defaults and maybe refine
> the CPU cost model more, as all-in-RAM cases get more common.

This confused me.  If we are assuming the data is in
effective_cache_size, why are we adding sequential/random page cost to
the query cost routines?

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

From:
Robert Haas
Date:

On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian <> wrote:
> This confused me.  If we are assuming the data is in
> effective_cache_size, why are we adding sequential/random page cost to
> the query cost routines?

See the comments for index_pages_fetched().  We basically assume that
all data starts uncached at the beginning of each query - in fact,
each plan node.  effective_cache_size only measures the chances that
if we hit the same block again later in the execution of something
like a nested-loop-with-inner-indexscan, it'll still be in cache.

It's an extremely weak knob, and unless you have tables or indices
that are larger than RAM, the only mistake you can make is setting it
too low.
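
(Put together with Tom's cost advice upthread, an all-in-RAM box might be tuned along these lines; a hypothetical postgresql.conf fragment with illustrative values, not a recommendation from this thread:

```
seq_page_cost = 0.1          # dial both page costs down together...
random_page_cost = 0.1       # ...so there is no seek penalty in RAM
effective_cache_size = 3GB   # rough guess at the OS cache on a 4GB box
```

The last value errs on the high side, which per the above is the safer direction.)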

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:
Bruce Momjian
Date:

Robert Haas wrote:
> On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian <> wrote:
> > This confused me.  If we are assuming the data is in
> > effective_cache_size, why are we adding sequential/random page cost to
> > the query cost routines?
>
> See the comments for index_pages_fetched().  We basically assume that
> all data starts uncached at the beginning of each query - in fact,
> each plan node.  effective_cache_size only measures the chances that
> if we hit the same block again later in the execution of something
> like a nested-loop-with-inner-indexscan, it'll still be in cache.
>
> It's an extremely weak knob, and unless you have tables or indices
> that are larger than RAM, the only mistake you can make is setting it
> too low.

The attached patch documents that there is no assumption that data
remains in the disk cache between queries.  I thought this information
might be helpful.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77cacdd..520170b 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** SET ENABLE_SEQSCAN TO OFF;
*** 2424,2430 ****
          space.  This parameter has no effect on the size of shared
          memory allocated by <productname>PostgreSQL</productname>, nor
          does it reserve kernel disk cache; it is used only for estimation
!         purposes.  The default is 128 megabytes (<literal>128MB</>).
         </para>
        </listitem>
       </varlistentry>
--- 2424,2432 ----
          space.  This parameter has no effect on the size of shared
          memory allocated by <productname>PostgreSQL</productname>, nor
          does it reserve kernel disk cache; it is used only for estimation
!         purposes.  The system also does not assume data remains in
!         the disk cache between queries.  The default is 128 megabytes
!         (<literal>128MB</>).
         </para>
        </listitem>
       </varlistentry>

From:
Bruce Momjian
Date:

Bruce Momjian wrote:
> Robert Haas wrote:
> > On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian <> wrote:
> > > This confused me.  If we are assuming the data is in
> > > effective_cache_size, why are we adding sequential/random page cost to
> > > the query cost routines?
> >
> > See the comments for index_pages_fetched().  We basically assume that
> > all data starts uncached at the beginning of each query - in fact,
> > each plan node.  effective_cache_size only measures the chances that
> > if we hit the same block again later in the execution of something
> > like a nested-loop-with-inner-indexscan, it'll still be in cache.
> >
> > It's an extremely weak knob, and unless you have tables or indices
> > that are larger than RAM, the only mistake you can make is setting it
> > too low.
>
> The attached patch documents that there is no assumption that data
> remains in the disk cache between queries.  I thought this information
> might be helpful.

Applied.

--
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +