Re: lru_multiplier and backend page write-outs

От: Peter Schuller
Тема: Re: lru_multiplier and backend page write-outs
Дата: ,
(см: обсуждение, исходный текст)
Ответ на: Re: lru_multiplier and backend page write-outs  (Greg Smith)
Список: pgsql-performance

Скрыть дерево обсуждения

lru_multiplier and backend page write-outs  (Peter Schuller, )
 Re: lru_multiplier and backend page write-outs  (Greg Smith, )
  Re: lru_multiplier and backend page write-outs  (Peter Schuller, )
   Re: lru_multiplier and backend page write-outs  (Greg Smith, )
    Re: lru_multiplier and backend page write-outs  (Peter Schuller, )

> > no table was ever large enough that 256k buffers would ever be filled by
> > the process of vacuuming a single table.
> Not 256K buffers--256K, 32 buffers.


> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECT:s (of the stat table) with
> > only a second or two in between.
> Writes to the database when only doing read operations are usually related
> to hint bits:

Sorry, I didn't mean to imply read-only operations (I did read the
hint bits information a while back though). What I meant was that
while I was constantly generating the insert/delete/update activity, I
was selecting the bg writer stats with only a second or two in
between. The intent was to convey that the count of backend written
pages was systematically and constantly (as in a few hundreds per
handful of seconds) increasing, in spite of no long running vacuum and
the buffer cache not being close to full.

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently.  No one process can get the big
> picture without doing something intrusive like that, and on a busy system
> the overhead of collecting more data to know how exactly far ahead the
> cleaning is can drag down overall performance.  A lot can happen while the
> background writer is sleeping.


> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list of
> buffers that are free, which right now is usually empty on the theory that
> cached data is always more useful than a reserved buffer.  If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's hard
> to pin down whether that actually improved worst-case latency though. And
> moving in that direction always seems to reduce average throughput even in
> write-heavy benchmarks.


> The important thing to remember is that the underlying OS has its own read
> and write caching mechanisms here, and unless the PostgreSQL ones are
> measurably better than those you might as well let the OS manage the
> problem instead.

The problem though is that though the OS may be good in the common
cases it is designed for, it can have specific features that are
directly counter-productive if your goals do not line up with that of
the commonly designed-for use case (in particular, if you care about
latency a lot and not necessarily about absolute max throughput).

For example, in Linux up until recently if not still, there is the
1024 per-inode buffer limit that limited the number of buffers written
as a result of expiry, which means that when PostgreSQL does its
fsync(), you may end up having a lot more to write out than what would
have been the case if the centisecs_expiry had been enforced,
regardless of whether PostgreSQL was tuned to write dirty pages out
sufficiently aggressively. If the amount built up exceeds the capacity
of the RAID controller cache...

I had a case where I suspect this was exaserbating the
situation. Manually doing a 'sync' on the system every few seconds
noticably helped (the theory being, because it forced page write-outs
to happen earlier and in smaller storms).

>  It's easy to demonstrate that's happening when you give
> a decent amount of memory to shared_buffers, it's much harder to prove
> that's the case for an improved write scheduling algorithm.  Stepping back
> a bit, you might even consider that one reason PostgreSQL has grown as
> well as it has in scalability is exactly because it's been riding
> improvements the underlying OS in many of these cases, rather than trying
> to do all the I/O scheduling itself.

Sure. In this case with the backend writes, I am nore interesting in
understanding better what is happening and having better indications
of when backends block on I/O, than necessarily having a proven
improvement in throughput or latency. It makes it easier to reason
about what is happening when you *do* have a measured performance

Thanks for all the insightful information.

/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <>'
Key retrieval: Send an E-Mail to 
E-Mail:  Web:


В списке pgsql-performance по дате сообщения:

От: "David Rees"
Сообщение: Re: Create and drop temp table in 8.3.4
От: "Віталій Тимчишин"
Сообщение: Re: PostgreSQL OR performance