Discussion: checkpointer continuous flushing


checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's 
August 2014 patch for flushing while writing during checkpoints, with some 
documentation and configuration warnings added.

For the initial patch, see:
  http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:
  http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, 
and in general to get better latency-bound performance.

Flushes are managed along with pg's throttled writes instead of waiting for the
checkpointer's final "fsync", which induces occasional stalls. From
"pgbench -P 1 ...", such stalls look like this:
  progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043    # ok
  progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828  # stalled
  progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346   # ...
  progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964   # restart
  progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326    # ok
 

I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first one, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below ran
on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...
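
Below is a minimal, illustrative sketch (not the patch's actual code) of the
two kinds of hint involved; "flush_range" is a hypothetical helper name. On
Linux, sync_file_range() initiates writeback of the given range without
waiting for it; elsewhere, posix_fadvise(POSIX_FADV_DONTNEED) only advises
the kernel that the pages are no longer needed, which may start writeback as
a side effect.

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Hint the OS to start writing this file range to disk now;
     * neither variant waits for the data to actually reach the disk. */
    static int
    flush_range(int fd, off_t offset, off_t nbytes)
    {
    #if defined(__linux__)
        /* Linux: initiate writeback of the range, return immediately */
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    #else
        /* portable fallback: advise that the range can leave the page cache */
        return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    #endif
    }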

The Linux-specific "sync_file_range" approach was suggested among other ideas
by Theodore Ts'o on Robert Haas's blog in March 2014:
  http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html

Two GUC variables control whether the feature is activated for writes of 
dirty pages issued by the checkpointer and the bgwriter. Given that the settings 
may improve or degrade performance, having GUCs seems justified.  In 
particular the stalling issue disappears with SSDs.

The effect is significant on a series of tests shown below with scale 10 
pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw 
RAID), with shared_buffers=1GB checkpoint_completion_target=0.8 
checkpoint_timeout=30s, unless stated otherwise.

Note: I know that this checkpoint_timeout is too small for a normal 
config, but the point is to test how checkpoints behave, so the test 
triggers as many checkpoints as possible, hence the minimum timeout 
setting. I have also done some tests with larger timeout.


(1) THROTTLED PGBENCH

The objective of the patch is to be able to reduce the latency of transactions
under a moderate load. This first series of tests focuses on this point with
the help of pgbench -R (rate) and -L (skip/count late transactions).
The measure counts transactions which were skipped or beyond the expected
latency limit while targeting a transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during  100 seconds, and latency limit is 100
ms),over 256 runs, 7 hours per case:
 
  flush     | percent of skipped  cp  | bgw | & out of latency limit transactions  off | off | 6.5 %  off |  on | 6.1 %
 on | off | 0.4 %   on |  on | 0.4 %
 

* Same as above (100 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10mn:

  flush     | percent of skipped
  cp  | bgw | & out of latency limit transactions
  off | off | 1.3 %
  off |  on | 1.5 %
   on | off | 0.6 %
   on |  on | 0.6 %
 

* Same as first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

  flush     | percent of skipped
  cp  | bgw | & out of latency limit transactions
  off | off | 8.0 %
  off |  on | 8.0 %
   on | off | 0.4 %
   on |  on | 0.4 %
 

* Same as above (150 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10mn:

  flush     | percent of skipped
  cp  | bgw | & out of latency limit transactions
  off | off | 1.7 %
  off |  on | 1.9 %
   on | off | 0.7 %
   on |  on | 0.6 %
 

Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.

bgwriter_flush_to_disk does not seem to have a significant impact on these 
tests, maybe because pg shared_buffers size is much larger than the 
database, so the bgwriter is seldom active.


(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the 
impact of these options on tps figures and their variability.

* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.
      flush   | performance on ...
   cp  | bgw | 512 100-seconds runs | 1s intervals (over 51200 seconds)
   off | off | 691 +- 36 tps        | 691 +- 236 tps
   off |  on | 677 +- 29 tps        | 677 +- 230 tps
    on | off | 655 +- 23 tps        | 655 +- 130 tps
    on |  on | 657 +- 22 tps        | 657 +- 130 tps
 

On this first test, setting checkpoint_flush_to_disk reduces the performance by
5%, but the per-second standard deviation is nearly halved; that is, the
performance is more stable over the runs, although lower.
The effect of bgwriter_flush_to_disk is inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with  checkpoint_timeout=10mn and shared_buffers=256MB
(atleast 6 checkpoints  during the run, probably more because segments are filled more often than  every 10mn):
 
       flush   | performance ... (stddev over per second tps)     off | off | 877 +- 179 tps     off |  on | 880 +- 183
tps     on | off | 896 +- 131 tps      on |  on | 888 +- 132 tps
 

On this second short test, setting checkpoint_flush_to_disk seems to maybe 
slightly improve performance (maybe 2% ?) and significantly reduces 
variability, so it looks like a good move.

* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)
      flush   | performance on ...
   cp  | bgw | 32 100-seconds runs | 1s intervals (over 3200 seconds)
   off | off | 1970 +- 60 tps      | 1970 +- 783 tps
   off |  on | 1928 +- 61 tps      | 1928 +- 813 tps
    on | off | 1578 +- 45 tps      | 1578 +- 631 tps
    on |  on | 1594 +- 47 tps      | 1594 +- 618 tps
 

On this test both the average and the standard deviation are reduced by 20%.
This does not look like a win.


CONCLUSION

This approach is simple and significantly improves pg fsync behavior under
moderate load, where the database stays mostly responsive.  Under full load,
the situation may be improved or degraded, depending on the workload.


OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with 
Linux io-scheduler priority (ioprio_set) looks relevant only with the 
"cfq" scheduler on actual hard disks, but does not work with other 
schedulers, especially "deadline", which seems more advisable for Pg, nor 
with hardware RAID, which is a common setting.

Also, Theodore Ts'o suggested to use "sync_file_range" to check whether 
the writes have reached the disk, and possibly to delay the actual 
fsync/checkpoint conclusion if not... I have not tried that, the 
implementation is not as trivial, and I'm not sure what to do when the 
completion target is coming, but possibly that could be an interesting 
option to investigate. Preliminary tests by adding a sleep between the 
writes and the final fsync did not yield very good results.

I've also played with numerous other options (changing checkpointer 
throttling parameters, reducing checkpoint timeout to 1 second, playing 
around with various kernel settings), but that did not seem to be very 
effective for the problem at hand.


I also attached a test script I used, that can be adapted if someone wants 
to collect some performance data. I also have some basic scripts to 
extract and compute stats, ask if needed.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Langote
Date:
Hi Fabien,

On 2015-06-01 PM 08:40, Fabien COELHO wrote:
> 
> Turning "checkpoint_flush_to_disk = on" reduces significantly the number
> of late transactions. These late transactions are not uniformly distributed,
> but are rather clustered around times when pg is stalled, i.e. more or less
> unresponsive.
> 
> bgwriter_flush_to_disk does not seem to have a significant impact on these
> tests, maybe because pg shared_buffers size is much larger than the database,
> so the bgwriter is seldom active.
> 

Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?

Thanks,
Amit




Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> Not that the GUC naming is the most pressing issue here, but do you think
> "*_flush_on_write" describes what the patch does?

It is currently "*_flush_to_disk". In Andres Freund's version the name is 
"sync_on_checkpoint_flush", but I did not find it very clear. Using 
"*_flush_on_write" instead, as you suggest, would be fine as well; it 
emphasizes the "when/how" it occurs instead of the final "destination", 
why not...

About words: checkpoint "write"s pages, but this really means passing the 
pages to the memory manager, which will think about it... "flush" seems to 
suggest a more effective write, but really it may mean the same, the page 
is just passed to the OS. So "write/flush" is really "to OS" and not "to 
disk". I like the data to be on "disk" in the end, and as soon as 
possible, hence the choice to emphasize that point.

Now I would really be okay with anything that people find simple to 
understand, so any opinion is welcome!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

It's nice to see the topic being picked up.

If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental.  Did you
consider combining both approaches?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Jun 1, 2015 at 5:10 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's August 2014 patch for flushing while writing during checkpoints, with some documentation and configuration warnings added.

For the initial patch, see:

  http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:

  http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, and in general to get better latency-bound performance.


-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
  XLogRecPtr recptr;
  ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
   buf->tag.forkNum,
   buf->tag.blockNum,
   bufToWrite,
-  false);
+  false,
+  flush_to_disk);

Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?

I remember sometime back there was some discussion regarding
sorting writes during checkpoint; one idea could be to try to
check this idea along with that patch.  I just saw that Andres has
also given the same suggestion, which indicates that it is important
to see both things together.

Also here another related point is that I think currently even fsync
requests are not in order of the files as they are stored on disk so
that also might cause random I/O?

Yet another idea could be to allow BGWriter to also fsync the dirty
buffers; that may have the side effect of not being able to clear the dirty
pages at the speed required by the system, but I think if that happens one
can think of having multiple BGWriter tasks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> If I see correctly you picked up the version without sorting durch
> checkpoints. I think that's not going to work - there'll be too many
> situations where the new behaviour will be detrimental.  Did you
> consider combining both approaches?

Yes, I thought that it was a more complex patch with uncertain/less clear 
benefits, and as this simpler version was already effective enough as it 
was, I decided to start with that and try to have reasonable proof of 
benefits so that it could get through.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> [...]
>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>> and in general to get better latency-bound performance.
>
> Won't this lead to more-unsorted writes (random I/O) as the
> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> per files or order of blocks on disk?

Yep, probably. Under "moderate load" this is not an issue. The 
io-scheduler and other hd firmware will probably reorder writes anyway. 
Also, if several data are updated together, they are likely to be 
already neighbours in memory as well as on disk.

> I remember sometime back there was some discusion regarding
> sorting writes during checkpoint, one idea could be try to
> check this idea along with that patch.  I just saw that Andres has
> also given same suggestion which indicates that it is important
> to see both the things together.

I would rather separate them, unless this is a blocker. This version seems 
already quite effective and very light. ISTM that adding a sort phase 
would mean reworking significantly how the checkpointer processes pages.

> Also here another related point is that I think currently even fsync
> requests are not in order of the files as they are stored on disk so
> that also might cause random I/O?

I think that currently the fsync is on the file handler, so what happens 
depends on how fsync is implemented by the system.

> Yet another idea could be to allow BGWriter to also fsync the dirty
> buffers,

ISTM that it is done with this patch with "bgwriter_flush_to_disk=on".

> that may have side impact of not able to clear the dirty pages at speed 
> required by system, but I think if that happens one can think of having 
> multiple BGwriter tasks.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 15:15:39 +0200, Fabien COELHO wrote:
> >Won't this lead to more-unsorted writes (random I/O) as the
> >FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> >per files or order of blocks on disk?
> 
> Yep, probably. Under "moderate load" this is not an issue. The io-scheduler
> and other hd firmware will probably reorder writes anyway.

They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.

> Also, if several
> data are updated together, probably they are likely to be already neighbours
> in memory as well as on disk.

No, that's not how it'll happen outside of simplistic cases where you
start with an empty shared_buffers. Shared buffers are maintained by a
simplified LRU, so how often individual blocks are touched will define
the buffer replacement.

> >I remember sometime back there was some discusion regarding
> >sorting writes during checkpoint, one idea could be try to
> >check this idea along with that patch.  I just saw that Andres has
> >also given same suggestion which indicates that it is important
> >to see both the things together.
> 
> I would rather separate them, unless this is a blocker.

I think it is a blocker.

> This version seems
> already quite effective and very light. ISTM that adding a sort phase would
> mean reworking significantly how the checkpointer processes pages.

Meh. The patch for that wasn't that big.

The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> I would rather separate them, unless this is a blocker.
>
> I think it is a blocker.

Hmmm. This is an argument...

>> This version seems already quite effective and very light. ISTM that 
>> adding a sort phase would mean reworking significantly how the 
>> checkpointer processes pages.
>
> Meh. The patch for that wasn't that big.

Hmmm. I think it should be implemented as Tom suggested, that is per 
chunks of shared buffers, in order to avoid allocating a "large" amount of memory.

> The problem with doing this separately is that without the sorting this
> will be slower for throughput in a good number of cases. So we'll have
> yet another GUC that's very hard to tune.

ISTM that the two aspects are orthogonal, which would suggest two gucs 
anyway.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 15:42:14 +0200, Fabien COELHO wrote:
> >>This version seems already quite effective and very light. ISTM that
> >>adding a sort phase would mean reworking significantly how the
> >>checkpointer processes pages.
> >
> >Meh. The patch for that wasn't that big.
> 
> Hmmm. I think it should be implemented as Tom suggested, that is per chunks
> of shared buffers, in order to avoid allocating a "large" memory.

I don't necessarily agree. But that's really just a minor implementation
detail. The actual problem is sorting & fsyncing in a way that deals
efficiently with tablespaces, i.e. doesn't write to tablespaces
one-by-one.  Not impossible, but it requires some thought.

> >The problem with doing this separately is that without the sorting this
> >will be slower for throughput in a good number of cases. So we'll have
> >yet another GUC that's very hard to tune.
> 
> ISTM that the two aspects are orthogonal, which would suggests two gucs
> anyway.

They're pretty closely linked from their performance impact. IMO this
feature, if done correctly, should result in better performance in 95+%
of the workloads and be enabled by default. And that'll not be possible
without actually writing mostly sequentially.

It's also not just the sequential writes making this important, it's
also that it allows to do the final fsync() of the individual segments
as soon as their last buffer has been written out. That's important
because it means the file will get fewer writes done independently
(i.e. backends writing out dirty buffers) which will make the final
fsync more expensive.

It might be that we want two different gucs, but I don't think we can
release without both features.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Hmmm. I think it should be implemented as Tom suggested, that is per chunks
>> of shared buffers, in order to avoid allocating a "large" memory.
>
> I don't necessarily agree. But that's really just a minor implementation
> detail.

Probably.

> The actual problem is sorting & fsyncing in a way that deals efficiently 
> with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> Not impossible, but it requires some thought.

Hmmm... I would have neglected this point in a first approximation,
but I agree that not interleaving tablespaces could indeed lose some 
performance.

>> ISTM that the two aspects are orthogonal, which would suggests two gucs
>> anyway.
>
> They're pretty closely linked from their performance impact.

Sure.

> IMO this feature, if done correctly, should result in better performance 
> in 95+% of the workloads

To demonstrate that would require time...

> and be enabled by default.

I did not have such an ambition with the submitted patch :-)

> And that'll not be possible without actually writing mostly 
> sequentially.

> It's also not just the sequential writes making this important, it's 
> also that it allows to do the final fsync() of the individual segments 
> as soon as their last buffer has been written out.

Hmmm... I'm not sure this would have a large impact. The writes are 
throttled as much as possible, so fsync will catch plenty of other writes 
anyway, if there are some.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 17:01:50 +0200, Fabien COELHO wrote:
> >The actual problem is sorting & fsyncing in a way that deals efficiently
> >with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> >Not impossible, but it requires some thought.
> 
> Hmmm... I would have neglected this point in a first approximation,
> but I agree that not interleaving tablespaces could indeed loose some
> performance.

I think it'll be a hard-to-diagnose performance regression. So we'll
have to fix it. That argument actually was the blocker in previous
attempts...

> >IMO this feature, if done correctly, should result in better performance
> >in 95+% of the workloads
> 
> To demonstrate that would require time...

Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluate performance for those.

> >and be enabled by default.
> 
> I did not had such an ambition with the submitted patch:-)

I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when completely unnecessarily.

> >And that'll not be possible without actually writing mostly sequentially.
> 
> >It's also not just the sequential writes making this important, it's also
> >that it allows to do the final fsync() of the individual segments as soon
> >as their last buffer has been written out.
> 
> Hmmm... I'm not sure this would have a large impact. The writes are
> throttled as much as possible, so fsync will catch plenty other writes
> anyway, if there are some.

That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> IMO this feature, if done correctly, should result in better performance
>>> in 95+% of the workloads
>>
>> To demonstrate that would require time...
>
> Well, that's part of the contribution process. Obviously you can't test
> 100% of the problems, but you can work hard with coming up with very
> adversarial scenarios and evaluate performance for those.

I did spend time (well, a machine spent time, really) to collect some 
convincing data for the simple version without sorting to demonstrate that 
it brings a clear value, which seems not to be enough...

> I don't think we want yet another tuning knob that's hard to tune
> because it's critical for one factor (latency) but bad for another
> (throughput); especially when completely unnecessarily.

Hmmm.

My opinion is that throughput is given too much attention in general, but 
if both can be kept/improved, this would be easier to sell, obviously.


>>> It's also not just the sequential writes making this important, it's also
>>> that it allows to do the final fsync() of the individual segments as soon
>>> as their last buffer has been written out.
>>
>> Hmmm... I'm not sure this would have a large impact. The writes are
>> throttled as much as possible, so fsync will catch plenty other writes
>> anyway, if there are some.
>
> That might be the case in a database with a single small table;
> i.e. where all the writes go to a single file. But as soon as you have
> large tables (i.e. many segments) or multiple tables, a significant part
> of the writes issued independently from checkpointing will be outside
> the processing of the individual segment.

Statistically, I think that it would reduce the number of unrelated writes 
taken in a fsync by about half: the last table to be written on a 
tablespace, at the end of the checkpoint, will have accumulated 
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint 
time, while the first table will have avoided most of them.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 18:59:05 +0200, Fabien COELHO wrote:
> 
> >>>IMO this feature, if done correctly, should result in better performance
> >>>in 95+% of the workloads
> >>
> >>To demonstrate that would require time...
> >
> >Well, that's part of the contribution process. Obviously you can't test
> >100% of the problems, but you can work hard with coming up with very
> >adversarial scenarios and evaluate performance for those.
> 
> I did spent time (well, a machine spent time, really) to collect some
> convincing data for the simple version without sorting to demonstrate that
> it brings a clear value, which seems not to be enough...

"which seems not to be enough" - man. It's trivial to make things
faster/better/whatever if you don't care about regressions in other
parts. And if we'd add a guc for each of these cases we'd end up with
thousands of them.

> My opinion is that throughput is given too much attention in general, but if
> both can be kept/improved, this would be easier to sell, obviously.

Your priorities are not everyone's. That's life.


> >That might be the case in a database with a single small table;
> >i.e. where all the writes go to a single file. But as soon as you have
> >large tables (i.e. many segments) or multiple tables, a significant part
> >of the writes issued independently from checkpointing will be outside
> >the processing of the individual segment.
> 
> Statistically, I think that it would reduce the number of unrelated writes
> taken in a fsync by about half: the last table to be written on a
> tablespace, at the end of the checkpoint, will have accumulated
> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
> time, while the first table will have avoided most of them.

That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.



Re: checkpointer continuous flushing

From
Amit Langote
Date:
Hi,

On 2015-06-02 PM 07:19, Fabien COELHO wrote:
> 
>> Not that the GUC naming is the most pressing issue here, but do you think
>> "*_flush_on_write" describes what the patch does?
> 
> It is currently "*_flush_to_disk". In Andres Freund version the name is
> "sync_on_checkpoint_flush", but I did not found it very clear. Using
> "*_flush_on_write" instead as your suggest, would be fine as well, it
> emphasizes the "when/how" it occurs instead of the final "destination", why
> not...
> 
> About words: checkpoint "write"s pages, but this really mean passing the pages
> to the memory manager, which will think about it... "flush" seems to suggest a
> more effective write, but really it may mean the same, the page is just passed
> to the OS. So "write/flush" is really "to OS" and not "to disk". I like the
> data to be on "disk" in the end, and as soon as possible, hence the choice to
> emphasize that point.
> 
> Now I would really be okay with anything that people find simple to
> understand, so any opinion is welcome!
> 

It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by one-big-sync-at-the-end-of-writes by doing
the sync in step with writes (which do abide by the
checkpoint_completion_target). Given that impression, it seems *_sync_on_write
may even do the job.


Again, this is a minor issue.

By the way, I tend to agree with others here that a good balance needs to be
found such that this sync-blocks-one-at-a-time-in-random-order approach
does not hurt generalized workloads too much, although it seems to help with
solving the latency problem that you set out to solve.

Thanks,
Amit




Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>> [...]
>>>
>>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>>> and in general to get better latency-bound performance.
>>
>>
>> Won't this lead to more-unsorted writes (random I/O) as the
>> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
>> per files or order of blocks on disk?
>
>
> Yep, probably. Under "moderate load" this is not an issue. The io-scheduler and other hd firmware will probably reorder writes anyway. Also, if several data are updated together, probably they are likely to be already neighbours in memory as well as on disk.
>
>> I remember sometime back there was some discusion regarding
>> sorting writes during checkpoint, one idea could be try to
>> check this idea along with that patch.  I just saw that Andres has
>> also given same suggestion which indicates that it is important
>> to see both the things together.
>
>
> I would rather separate them, unless this is a blocker. This version seems already quite effective and very light. ISTM that adding a sort phase would mean reworking significantly how the checkpointer processes pages.
>

I agree with you that if we have to add a sort phase, there is additional
work and that work could be significant depending on the design we
choose; however without that, this patch can have an impact on many kinds
of workloads. Even in your mail, in one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
it has shown a 20% degradation, which is quite significant, and the test also
seems to be representative of the workload which many users in the real world
will use.

Now one can say that for such workloads turn the new knob to off, but
in reality it could be difficult to predict if the load is always moderate.
I think users might be able to predict that at the table level, but in spite of that
I don't think having any such knob can give us a ticket to flush the buffers
in random order.

>> Also here another related point is that I think currently even fsync
>> requests are not in order of the files as they are stored on disk so
>> that also might cause random I/O?
>
>
> I think that currently the fsync is on the file handler, so what happens depends on how fsync is implemented by the system.
>

That can also lead to random I/O if the fsyncs for different files are not in
the order in which the files are actually stored on disk.


>> Yet another idea could be to allow BGWriter to also fsync the dirty
>> buffers,
>
>
> ISTM That it is done with this patch with "bgwriter_flush_to_disk=on".
>

I think the patch just issues an async operation, not the actual flush.  The
reason I have suggested this is that in your tests, when checkpoint_timeout
is small, it seems there is a good gain in performance; that means that if we
keep on flushing dirty buffers at regular intervals, the system's performance
is good, and BGWriter is the process where that can be done conveniently
apart from checkpoint.  One might think that if the same can be achieved by
using a shorter checkpoint_timeout interval, then why do these incremental
flushes by bgwriter; but in reality I think checkpoint is responsible for
other things as well besides dirty buffers, so we can't leave everything till
the checkpoint happens.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> That might be the case in a database with a single small table; i.e. 
>>> where all the writes go to a single file. But as soon as you have 
>>> large tables (i.e. many segments) or multiple tables, a significant 
>>> part of the writes issued independently from checkpointing will be 
>>> outside the processing of the individual segment.
>>
>> Statistically, I think that it would reduce the number of unrelated writes
>> taken in a fsync by about half: the last table to be written on a
>> tablespace, at the end of the checkpoint, will have accumulated
>> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
>> time, while the first table will have avoided most of them.
>
> That's disregarding that a buffer written out by a backend starts to get
> written out by the kernel after ~5-30s, even without a fsync triggering
> it.

I meant my argument with "continuous flushing" activated, so there is no 
up-to-30-seconds delay induced by the memory manager. Hmmm, maybe I did not 
understand your argument.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> It is currently "*_flush_to_disk". In Andres Freund version the name is
>> "sync_on_checkpoint_flush", but I did not found it very clear. Using
>> "*_flush_on_write" instead as your suggest, would be fine as well, it
>> emphasizes the "when/how" it occurs instead of the final "destination", why
>> not...
> [...]
>
> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If 
> I understand this and the previous discussion(s) correctly, the patch 
> tries to alleviate the problems caused by one-big-sync-at-the 
> end-of-writes by doing the sync in step with writes (which do abide by 
> the checkpoint_completion_target). Given that impression, it seems 
> *_sync_on_write may even do the job.

I disagree with this one, because the sync is only *initiated*, not done. 
For this reason I think that "flush" is a better word. I understand 
"sync" as "committed to disk". For the data to be synced, the call should use 
the "wait after" option, which is a partial "fsync", but that would 
be terrible for performance as all checkpointed pages would be written one 
by one, without any opportunity for reordering them.

For what it's worth and for the record, Linux sync_file_range 
documentation says "This is an asynchronous flush-to-disk operation" to 
describe the corresponding option. This is probably where I took it.
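
As a hedged illustration (not the patch itself) of this distinction, using
the Linux sync_file_range() flags:

    #define _GNU_SOURCE
    #include <fcntl.h>

    void
    flush_vs_sync(int fd, off_t off, off_t len)
    {
        /* "flush": only *initiate* writeback of the range; returns at once */
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

        /* "sync": additionally wait until the range is on disk, which is
         * close to a range-limited fsync and would serialize the writes */
        sync_file_range(fd, off, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
    }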

So two contenders:
  *_flush_to_disk
  *_flush_on_write

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> I agree with you that if we have to add a sort phase, there is additional
> work and that work could be significant depending on the design we
> choose, however without that, this patch can have impact on many kind
> of workloads, even in your mail in one of the tests
> ("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
> it has shown 20% degradation which is quite significant and test also
> seems to be representative of the workload which many users in real-world
> will use.

Yes, I do agree with the 4 clients, but I doubt that many users run their 
application at maximum available throughput all the time (like always 
driving foot to the floor). So for me throttled runs are more 
representative of real life.

> Now one can say that for such workloads turn the new knob to off, but
> in reality it could be difficult to predict if the load is always moderate.

Hmmm. The switch says "I prefer stable (say latency-bounded) performance"; 
if you run a web site you probably want that.

Anyway, I'll look at sorting when I have some time.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Langote
Date:
Fabien,

On 2015-06-03 PM 02:53, Fabien COELHO wrote:
> 
>>
>> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
>> understand this and the previous discussion(s) correctly, the patch tries to
>> alleviate the problems caused by one-big-sync-at-the end-of-writes by doing
>> the sync in step with writes (which do abide by the
>> checkpoint_completion_target). Given that impression, it seems
>> *_sync_on_write may even do the job.
> 
> I desagree with this one, because the sync is only *initiated*, not done. For
> this reason I think that "flush" seems a better word. I understand "sync" as
> "committed to disk". For the data to be synced, it should call with the "wait
> after" option, which is a partial "fsync", but that would be terrible for
> performance as all checkpointed pages would be written one by one, without any
> opportunity for reordering them.
> 
> For what it's worth and for the record, Linux sync_file_range documentation
> says "This is an asynchronous flush-to-disk operation" to describe the
> corresponding option. This is probably where I took it.
> 

Ah, okay! I didn't quite think about the async aspect here. But, I sure do
hope that the added mechanism turns out to be *less* async than the kernel's own
dirty cache handling to achieve the hoped-for gain.

> So two contenders:
> 
>   *_flush_to_disk
>   *_flush_on_write
> 

Yep!

Regards,
Amit




Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> They pretty much can't if you flush things frequently. That's why I
> think this won't be acceptable without the sorting in the checkpointer.


* VERSION 2 "WORK IN PROGRESS".

The implementation is more a proof-of-concept for having feedback than
clean code. What it does:
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now".

 - added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki
   but with a smaller and static buffer allocated once. Also,
   sorting is done by chunks in the current version.

 - also added: sync/advise calls are now merged if possible,
   so fewer calls are used, especially when buffers are sorted,
   but also if there are few files (see the sketch after this list).
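
As a rough sketch of the sort-then-merge idea above (illustrative names and
structures only, not the patch's code): buffers to write are sorted by
(file, block), then flush hints for contiguous blocks of the same file are
merged into a single call.

    #include <stdlib.h>

    typedef struct { int file; unsigned block; } ToWrite;

    static int
    cmp_to_write(const void *a, const void *b)
    {
        const ToWrite *x = a, *y = b;
        if (x->file != y->file)
            return x->file < y->file ? -1 : 1;
        if (x->block != y->block)
            return x->block < y->block ? -1 : 1;
        return 0;
    }

    /* issue one flush hint per contiguous run of blocks within a file */
    static void
    flush_sorted(ToWrite *buf, int n,
                 void (*flush)(int file, unsigned first, unsigned count))
    {
        int i = 0;

        qsort(buf, n, sizeof(ToWrite), cmp_to_write);
        while (i < n)
        {
            int j = i + 1;

            /* extend the run while blocks are consecutive in the same file */
            while (j < n && buf[j].file == buf[i].file &&
                   buf[j].block == buf[j - 1].block + 1)
                j++;
            flush(buf[i].file, buf[i].block, (unsigned) (j - i));
            i = j;
        }
    }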
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1" scale 10  (simple update pgbench
with a mostly-write activity),  with checkpoint_completion_target=0.8
and shared_buffers=1GB.

Contrary to v1, I have not tested bgwriter flushing as the impact
on the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled max 100 ms latency over 6400 seconds  with checkpoint_timeout=30s
  flush | sort | late transactions
    off |  off | 6.0 %
    off |   on | 6.1 %
     on |  off | 0.4 %
     on |   on | 0.4 % (93% improvement)
 

- 100 tps throttled max 100 ms latency over 4000 seconds  with checkpoint_timeout=10mn
  flush | sort | late transactions
    off |  off | 1.5 %
    off |   on | 0.6 % (?!)
     on |  off | 0.8 %
     on |   on | 0.6 % (60% improvement)
 

- 150 tps throttled max 100 ms latency over 19600 seconds (5.5 hours)  with checkpoint_timeout=30s
  flush | sort | late transactions
    off |  off | 8.5 %
    off |   on | 8.1 %
     on |  off | 0.5 %
     on |   on | 0.4 % (95% improvement)
 

- full speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
    off |  off | 676 +- 230
    off |   on | 683 +- 213
     on |  off | 712 +- 130
     on |   on | 725 +- 116 (7.2% avg/50% stddev improvements)
 

- full speed pgbench over 4000 seconds with checkpoint_timeout=10mn

  flush | sort | tps performance over per second data
    off |  off | 885 +- 188
    off |   on | 940 +- 120 (6%/36%!)
     on |  off | 778 +- 245 (hmmm... not very consistent?)
     on |   on | 927 +- 108 (4.5% avg/43% stddev improvements)
 

- full speed bgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s
  flush | sort | tps performance over per second data    off |  off | 2012 +- 747    off |   on | 2086 +- 708     on |
off| 2099 +- 459     on |   on | 2114 +- 422 (5% avg/44% stddev improvements)
 


* CONCLUSION :

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall the option effects, not surprisingly, are quite (with exceptions)
orthogonal:
 - latency is essentially improved (60 to 95% reduction) by flushing
 - throughput is improved (4 to 7% better) thanks to sorting
 

In detail, some loads may benefit more from only one option activated.
Also on SSD probably both options would have limited benefit.

Usual caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.


* LOOK FOR THOUGHTS

- The bgwriter flushing option seems ineffective, it could be removed  from the patch?

- Move fsync as early as possible, suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync duration
at the end of the checkpoint is small: on more than 5525 checkpoint
fsyncs, 0.5% are above 1 second when flush is on, but the figure rises
to 24% when it is off... This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.


- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they are not distributed 
randomly over tablespaces, inducing lower performance on such systems.

How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as tablespaces, each one doing
its own independent throttling for a tablespace? For some obscure reason 
there are 2 tablespaces by default (pg_global and  pg_default), that would 
mean at least 2 threads.

Alternatively, maybe it can be done from one thread, but it would probably 
involve some strange hocus-pocus to switch frequently between tablespaces.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Cédric Villemain
Date:
On 07/06/2015 16:53, Fabien COELHO wrote:
> +		/* Others: say that data should not be kept in memory...
> +		 * This is not exactly what we want to say, because we want to write
> +		 * the data for durability but we may need it later nevertheless.
> +		 * It seems that Linux would free the memory *if* the data has
> +		 * already been written do disk, else it is ignored.
> +		 * For FreeBSD this may have the desired effect of moving the
> +		 * data to the io layer.
> +		 */
> +		rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +						   POSIX_FADV_DONTNEED);
> +

It looks a bit hazardous; do you have a benchmark for FreeBSD?

Sources says:
    case POSIX_FADV_DONTNEED:
        /*
         * Flush any open FS buffers and then remove pages
         * from the backing VM object.  Using vinvalbuf() here
         * is a bit heavy-handed as it flushes all buffers for
         * the given vnode, not just the buffers covering the
         * requested range.
 

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Cédric,

> It looks a bit hazardous, do you have a benchmark for freeBSD ?

No, I just consulted the FreeBSD man page for posix_fadvise. If someone can 
run tests on something with HDDs which is not Linux, that would be nice.

> Sources says:
>     case POSIX_FADV_DONTNEED:
>         /*
>          * Flush any open FS buffers and then remove pages
>          * from the backing VM object.  Using vinvalbuf() here
>          * is a bit heavy-handed as it flushes all buffers for
>          * the given vnode, not just the buffers covering the
>          * requested range.

It is indeed heavy-handed, but that would probably trigger the expected 
behavior which is to start writing to disk, so I would expect to see 
benefits similar to those of "sync_file_range" on Linux.

Buffer writes from bgwriter & checkpointer are throttled, which reduces 
the potential impact of a "heavy-handed" approach in the kernel.

Now if on some platforms the behavior is absurd, obviously it would be 
better to turn the feature off on those.

Note that this is already used by pg in "initdb", but the impact would 
probably be very small anyway.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:

Hello,

Here is version 3, including many performance tests with various settings, 
representing about 100 hours of pgbench run. This patch aims at improving 
checkpoint I/O behavior so that tps throughput is improved, late 
transactions are less frequent, and overall performances are more stable.


* SOLILOQUIZING

> - The bgwriter flushing option seems ineffective, it could be removed
>  from the patch?

I did that.

> - Move fsync as early as possible, suggested by Andres Freund?
>
> My opinion is that this should be left out for the nonce.

I did that.

> - Take into account tablespaces, as pointed out by Andres Freund?
>
> Alternatively, maybe it can be done from one thread, but it would probably 
> involve some strange hocus-pocus to switch frequently between tablespaces.

I did the hocus-pocus approach, including a quasi-proof (not sure what 
this mathematical object is :-) in comments to show how/why it works.


* PATCH CONTENTS
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now".

 - as version 2: checkpoint buffer sorting based on a 2007 patch by
   Takahiro Itagaki but with a smaller and static buffer allocated once.
   Also, sorting is done by chunks of 131072 pages in the current version,
   with a guc to change this value.

 - as version 2: sync/advise calls are now merged if possible,
   so fewer calls will be used, especially when buffers are sorted,
   but also if there are few files written.

 - new: the checkpointer balances its page writes per tablespace.
   This is done by choosing to write pages for a tablespace for which
   the progress ratio (written/to_write) is beyond the overall progress
   ratio for all tablespaces, and by doing that in a round-robin manner
   so that all tablespaces regularly get some attention. No threads.
   (A sketch of such a selection loop follows after this list.)

 - new: some more documentation is added.

 - removed: "bgwriter_flush_to_write" is removed, as there was no clear
   benefit on the (simple) tests. It could be considered for another patch.

 - question: I'm not sure I understand the checkpointer memory management.
   There is some exception handling in the checkpointer main. I wonder
   whether the allocated memory would be lost in such an event and should
   be reallocated.  The patch currently assumes that the memory is kept.
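
As referenced in the list above, here is an illustrative sketch (hypothetical
names, not the patch's code) of one way such per-tablespace balancing can be
expressed: each time the checkpointer picks a page to write, it chooses an
unfinished tablespace whose write progress (written/to_write) is the lowest,
so that all tablespaces advance together instead of being written one after
the other.

    typedef struct
    {
        int written;    /* pages already written for this tablespace */
        int to_write;   /* pages to write for this tablespace this checkpoint */
    } TsProgress;

    /* return the index of the most lagging unfinished tablespace, or -1 */
    static int
    next_tablespace(const TsProgress *ts, int nts)
    {
        int    best = -1;
        double best_ratio = 0.0;
        int    i;

        for (i = 0; i < nts; i++)
        {
            double ratio;

            if (ts[i].written >= ts[i].to_write)
                continue;       /* nothing left to write for this tablespace */

            ratio = (double) ts[i].written / ts[i].to_write;
            if (best < 0 || ratio < best_ratio)
            {
                best = i;
                best_ratio = ratio;
            }
        }
        return best;
    }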
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:
 . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
 . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
 . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
 . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s
 

Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.

- full-speed 1-client
     options   | tps performance over per second data
  flush | sort |    tiny    |    small     |   medium     |    large
    off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
    off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
     on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
     on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
 

- full speed 4-clients
      options  | tps performance over per second data
  flush | sort |    tiny     |     small     |    medium
    off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
    off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
     on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
     on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467
 

- 100-tps 1-client max 100-ms latency
     options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off |  6.31 | 29.44 | 30.74
    off |   on |  6.23 |  8.93 |  7.12
     on |  off |  0.44 |  7.01 |  8.14
     on |   on |  0.59 |  0.83 |  1.84
 

- 200-tps 1-client max 100-ms latency
     options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off | 10.00 | 50.61 | 45.51
    off |   on |  8.82 | 12.75 | 12.89
     on |  off |  0.59 | 40.48 | 42.64
     on |   on |  0.53 |  1.76 |  2.59
 

- 400-tps 1-client (or 4 for medium) max 100-ms latency
     options   | percent of late transactions
  flush | sort | tiny | small | medium
    off |  off | 12.0 | 64.28 | 68.6
    off |   on | 11.3 | 22.05 | 22.6
     on |  off |  1.1 | 67.93 | 67.9
     on |   on |  0.6 |  3.24 |  3.1
 


* CONCLUSION :

For most of these HDD tests, when both options are activated the tps 
throughput is improved (+3 to +300%), late transactions are reduced (by 
91% to 97%) and overall the performance is more stable (tps standard 
deviation is typically halved).

The option effects are somehow orthogonal:
 - latency is essentially limited by flushing, although sorting also
   contributes.

 - throughput is mostly improved thanks to sorting, with some occasional
   small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In 
particular, flushing may have a small adverse effect on throughput in some 
conditions, although not always. With SSDs both options would 
probably have limited benefit.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-06-17 08:24:38 +0200, Fabien COELHO wrote:
> Here is version 3, including many performance tests with various settings,
> representing about 100 hours of pgbench run. This patch aims at improving
> checkpoint I/O behavior so that tps throughput is improved, late
> transactions are less frequent, and overall performances are more stable.

First off: This is pretty impressive stuff. Being at pgcon, I don't have
time to look into this in detail, but I do plan to comment more
extensively.

> >- Move fsync as early as possible, suggested by Andres Freund?
> >
> >My opinion is that this should be left out for the nonce.

"for the nonce" - what does that mean?

> I did that.

I'm doubtful that it's a good idea to separate this out, if you did.

>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>    Also, sorting is done by chunks of 131072 pages in the current version,
>    with a guc to change this value.

I think it's a really bad idea to do this in chunks. That'll mean we'll
frequently uselessly cause repetitive random IO, often interleaved. That
pattern is horrible for SSDs too. We should always try to do this at
once, and only fail back to using less memory if we couldn't allocate
everything.

> * PERFORMANCE TESTS
> 
> Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
> random write activity on one table), checkpoint_completion_target=0.8, with
> different settings on a 16GB 8-core host:
> 
>  . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>  . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
>  . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
>  . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.

> Note: figures noted with a star (*) had various issues during their run, so
> pgbench progress figures were more or less incorrect, thus the standard
> deviation computation is not to be trusted beyond "pretty bad".
> 
> Caveat: these are only benches on one host at a particular time and
> location, which may or may not be reproducible nor be representative
> as such of any other load.  The good news is that all these tests tell
> the same thing.
> 
> - full-speed 1-client
> 
>      options   | tps performance over per second data
>   flush | sort |    tiny    |    small     |   medium     |    large
>     off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
>     off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
>      on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
>      on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
> 
> - full speed 4-clients
> 
>       options  | tps performance over per second data
>   flush | sort |    tiny     |     small     |    medium
>     off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
>     off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
>      on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
>      on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467
> 
> - 100-tps 1-client max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort |  tiny | small | medium
>     off |  off |  6.31 | 29.44 | 30.74
>     off |   on |  6.23 |  8.93 |  7.12
>      on |  off |  0.44 |  7.01 |  8.14
>      on |   on |  0.59 |  0.83 |  1.84
> 
> - 200-tps 1-client max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort |  tiny | small | medium
>     off |  off | 10.00 | 50.61 | 45.51
>     off |   on |  8.82 | 12.75 | 12.89
>      on |  off |  0.59 | 40.48 | 42.64
>      on |   on |  0.53 |  1.76 |  2.59
> 
> - 400-tps 1-client (or 4 for medium) max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort | tiny | small | medium
>     off |  off | 12.0 | 64.28 | 68.6
>     off |   on | 11.3 | 22.05 | 22.6
>      on |  off |  1.1 | 67.93 | 67.9
>      on |   on |  0.6 |  3.24 |  3.1
> 

So you've not run things at more serious concurrency, that'd be
interesting to see.

I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.

It's also curious that sorting is detrimental for full speed 'tiny'.

> * CONCLUSION :
> 
> For most of these HDD tests, when both options are activated the tps
> throughput is improved (+3 to +300%), late transactions are reduced (by 91%
> to 97%) and overall the performance is more stable (tps standard deviation
> is typically halved).
> 
> The option effects are somehow orthogonal:
> 
>  - latency is essentially limited by flushing, although sorting also
>    contributes.
> 
>  - throughput is mostly improved thanks to sorting, with some occasional
>    small positive or negative effect from flushing.
> 
> In detail, some loads may benefit more from only one option activated. In
> particular, flushing may have a small adverse effect on throughput in some
> conditions, although not always.

> With SSD probably both options would probably have limited benefit.

I doubt that. Small random writes have bad consequences for wear
leveling. You might not notice that with a short tests - again, I doubt
it - but it'll definitely become visible over time.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>> - Move fsync as early as possible, suggested by Andres Freund?
>>>
>>> My opinion is that this should be left out for the nonce.
>
> "for the nonce" - what does that mean?

  Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ...
    {for the nonce}, i. e. for the present time.

> I'm doubtful that it's a good idea to separate this out, if you did.

Actually I did, because as explained in another mail the fsync time when 
the other options are activated as reported in the logs is essentially 
null, so it would not bring significant improvements on these runs,
and also the patch changes enough things as it is.

So this is an evidence-based decision.

I also agree that it seems interesting in principle and should be 
beneficial in some cases, but I would rather keep that on a TODO list 
together with trying to do better things in the bgwriter and try to focus 
on the current proposal which already changes significantly the 
checkpointer throttling logic.

>>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>>    Also, sorting is done by chunks of 131072 pages in the current version,
>>    with a guc to change this value.
>
> I think it's a really bad idea to do this in chunks.

The small problem I see is that for a very large setting there could be 
several seconds or even minutes of sorting, which may or may not be 
desirable, so having some control on that seems a good idea.

Another argument is that Tom said he wanted that:-)

In practice the value can be set at a high value so that it is nearly 
always sorted in one go. Maybe value "0" could be made special and used to 
trigger this behavior systematically, and be the default.

> That'll mean we'll frequently uselessly cause repetitive random IO,

This is not an issue if the chunks are large enough, and anyway the guc 
allows to change the behavior as desired. As I said, keeping some control 
seems a good idea, and the "full sorting" can be made the default 
behavior.
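
For illustration, here is a minimal sketch of what sorting by chunks means.
This is not the patch's actual code; it assumes the CheckpointBufferIds
collection and a standard-signature bufcmp() comparator over buffer tags:

  /*
   * Minimal sketch of chunked sorting, for illustration only.  buf_ids[]
   * holds the buffer ids collected for this checkpoint; chunk_size comes
   * from the discussed GUC.  With chunk_size covering the whole array
   * (or a special 0 value) this degenerates into a single full sort.
   */
  static void
  sort_checkpoint_buffers(int *buf_ids, int num_to_write, int chunk_size)
  {
      int     start;

      if (chunk_size <= 0 || chunk_size > num_to_write)
          chunk_size = num_to_write;      /* full sort in one go */

      for (start = 0; start < num_to_write; start += chunk_size)
      {
          int     len = Min(chunk_size, num_to_write - start);

          /* bufcmp() compares buffer tags (tablespace, relation, fork, block) */
          qsort(buf_ids + start, len, sizeof(int), bufcmp);
      }
  }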

> often interleaved. That pattern is horrible for SSDs too. We should 
> always try to do this at once, and only fail back to using less memory 
> if we couldn't allocate everything.

The memory is needed anyway in order to avoid a double or significantly 
more heavy implementation for the throttling loop. It is allocated once on 
the first checkpoint. The allocation could be moved to the checkpointer 
initialization if this is a concern. The memory needed is one int per 
buffer, which is smaller than the 2007 patch.

>>  . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>
> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

The point of tiny was to trigger a lot of checkpoints. The size is pretty 
ridiculous anyway, as "tiny" implies. I think I did some tests on other 
versions of the patch and a longer checkpoint_timeout on a pretty small 
database that showed smaller benefit from the options, as one would 
expect. I'll try to re-run some.

> So you've not run things at more serious concurrency, that'd be
> interesting to see.

I do not have a box available for "serious concurrency".

> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.

> It's also curious that sorting is detrimental for full speed 'tiny'.

Yep.

>> With SSD both options would probably have limited benefit.
>
> I doubt that. Small random writes have bad consequences for wear
> leveling. You might not notice that with a short tests - again, I doubt
> it - but it'll definitely become visible over time.

Possibly. Testing such effects does not seem easy, though. At least I have 
not seen "write stalls" on SSD, which is my primary concern.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi,

On 2015-06-20 08:57:57 +0200, Fabien COELHO wrote:
> Actually I did, because as explained in another mail the fsync time when the
> other options are activated as reported in the logs is essentially null, so
> it would not bring significant improvements on these runs,
> and also the patch changes enough things as it is.
> 
> So this is an evidence-based decision.

Meh. You're testing on low concurrency.

> >> - as version 2: checkpoint buffer sorting based on a 2007 patch by
> >>   Takahiro Itagaki but with a smaller and static buffer allocated once.
> >>   Also, sorting is done by chunks of 131072 pages in the current version,
> >>   with a guc to change this value.
> >
> >I think it's a really bad idea to do this in chunks.
> 
> The small problem I see is that for a very large setting there could be
> several seconds or even minutes of sorting, which may or may not be
> desirable, so having some control on that seems a good idea.

If the sorting of the dirty blocks alone takes minutes, it'll never
finish writing that many buffers out. That's an utterly bogus argument.

> Another argument is that Tom said he wanted that:-)

I don't think he said that when we discussed this last.

> In practice the value can be set at a high value so that it is nearly always
> sorted in one go. Maybe value "0" could be made special and used to trigger
> this behavior systematically, and be the default.

You're just making things too complicated.

> >That'll mean we'll frequently uselessly cause repetitive random IO,
> 
> This is not an issue if the chunks are large enough, and anyway the guc
> allows to change the behavior as desired.

I don't think this is true. If two consecutive blocks are dirty, but you
sync them in two different chunks, you *always* will cause additional
random IO. Either the drive will have to skip the write for that block,
or the os will prefetch the data. More importantly with SSDs it voids
the wear leveling advantages.
> >often interleaved. That pattern is horrible for SSDs too. We should always
> >try to do this at once, and only fail back to using less memory if we
> >couldn't allocate everything.
> 
> The memory is needed anyway in order to avoid a double or significantly more
> heavy implementation for the throttling loop. It is allocated once on the
> first checkpoint. The allocation could be moved to the checkpointer
> initialization if this is a concern. The memory needed is one int per
> buffer, which is smaller than the 2007 patch.

There's a reason the 2007 patch (and my revision of it last year) did
what it did. You can't just access buffer descriptors without
locking. Besides, causing additional cacheline bouncing during the
sorting process is a bad idea.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Jim Nasby
Дата:
On 6/20/15 2:57 AM, Fabien COELHO wrote:
>>>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>>>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>>>    Also, sorting is done by chunks of 131072 pages in the current
>>> version,
>>>    with a guc to change this value.
>>
>> I think it's a really bad idea to do this in chunks.
>
> The small problem I see is that for a very large setting there could be
> several seconds or even minutes of sorting, which may or may not be
> desirable, so having some control on that seems a good idea.

ISTM a more elegant way to handle that would be to start off with a very 
small number of buffers and sort larger and larger lists while the OS is 
busy writing/syncing.

> Another argument is that Tom said he wanted that:-)

Did he elaborate why? I don't see him on this thread (though I don't 
have all of it).

> In practice the value can be set at a high value so that it is nearly
> always sorted in one go. Maybe value "0" could be made special and used
> to trigger this behavior systematically, and be the default.

It'd be nice if it was just self-tuning, with no GUC.

It looks like it'd be much better to get this committed without more 
than we have now than to do without it though...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> So this is an evidence-based decision.
>
> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box.

I do not see the link between high concurrency and whether moving fsync as 
early as possible would have a large performance impact. I think it might 
be interesting if bgwriter is doing a lot of writes, but I'm not sure 
under which configuration & load that would be.

>>> I think it's a really bad idea to do this in chunks.
>>
>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's a utterly bogus argument.

Well, if in the future you have 8 TB of memory (I've seen a 512 GB memory 
server a few weeks ago) and set shared_buffers=2TB, then if I'm not mistaken 
in the worst case you may have 256 million 8 kB buffers to checkpoint. Then 
it really depends on the I/O subsystem used by the box, but if you 
bought 8 TB of RAM you would probably have a nice I/O subsystem attached.

>> Another argument is that Tom said he wanted that:-)
>
> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.
>
> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change 
the checkpoint_sort stuff to a boolean.

In the reported performance tests, there is usually just one chunk anyway, 
sometimes two, so this gives an idea of the overall performance effect.

>> This is not an issue if the chunks are large enough, and anyway the guc
>> allows to change the behavior as desired.
>
> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think that it could be a small number if the chunks are large, i.e. the 
performance benefit of sorting larger and larger chunks is decreasing.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I do not understand wear leveling done by SSD firmware.

>>> often interleaved. That pattern is horrible for SSDs too. We should always
>>> try to do this at once, and only fail back to using less memory if we
>>> couldn't allocate everything.
>>
>> The memory is needed anyway in order to avoid a double or significantly more
>> heavy implementation for the throttling loop. It is allocated once on the
>> first checkpoint. The allocation could be moved to the checkpointer
>> initialization if this is a concern. The memory needed is one int per
>> buffer, which is smaller than the 2007 patch.
>
> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can because the sorting is really "advisory", i.e. 
the checkpointer will work fine if the sorting is wrong or not done at 
all, as it is now, when the checkpointer writes buffers. The only 
condition is that the buffers must not be moved with their "to write in 
this checkpoint" flag, but this is also necessary for the current 
checkpointer stuff to work.

Moreover, this trick already pre-exists in the patch I submitted: 
some tests are done without locking, but the actual "buffer write" does 
the locking and skips the buffer if the earlier unlocked test turns out to 
be wrong, as described in comments in the code.
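
For illustration only, a rough sketch of that pattern (not the submitted
code; num_to_write and num_written belong to the surrounding BufferSync()
loop, and the other names come from the patch excerpts quoted elsewhere in
this thread):

  for (i = 0; i < num_to_write; i++)
  {
      int         buf_id = CheckpointBufferIds[i];
      volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

      /* unlocked peek at the flags: may be stale, which is harmless */
      if (!(bufHdr->flags & BM_CHECKPOINT_NEEDED))
          continue;

      /*
       * SyncOneBuffer() takes the buffer header lock and re-checks that
       * the buffer is still valid and dirty, so a stale peek above only
       * costs a skipped or useless call, never a wrong write.
       */
      if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
          num_written++;
  }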

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. The impact would be to multiply the memory required by 3 or 4 
(buf_id, relation, forknum, offset), instead of just buf_id, and I 
understood that memory was a concern.

Moreover, once the sort process get the lines which contain the sorting 
data from the buffer descriptor in its cache, I think that it should be 
pretty much okay. Incidentally, they would probably have been brought to 
cache by the scan to collect them. Also, I do not think that the sorting 
time for 128000 buffers, and possible cache misses, was a big issue, but I 
do not have a measure to defend that. I could try to collect some data 
about that.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Jim,

>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> ISTM a more elegant way to handle that would be to start off with a very 
> small number of buffers and sort larger and larger lists while the OS is busy 
> writing/syncing.

You really have to have done a significant part/most/all of sorting before 
starting to write.

>> Another argument is that Tom said he wanted that:-)
>
> Did he elaborate why? I don't see him on this thread (though I don't have all 
> of it).

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it has more to do with memory management.

>> In practice the value can be set at a high value so that it is nearly
>> always sorted in one go. Maybe value "0" could be made special and used
>> to trigger this behavior systematically, and be the default.
>
> It'd be nice if it was just self-tuning, with no GUC.

Hmmm. It can easily be turned into a boolean, but otherwise I have no 
clue about how to decide whether to sort and/or flush.

> It looks like it'd be much better to get this committed without more than we 
> have now than to do without it though...

Yep, I think the figures are definitely encouraging.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
<sorry, resent stalled post, wrong from>

> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

Here are some tests with longer timeout:

tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min         max_wal_size=1GB warmup=600 time=4000
  flsh |      full speed tps      | percent of late tx, 4 clients, for tps:
  /srt |  1 client  |  4 clients  |  100 |  200 |  400 |  800 | 1200 | 1600
  N/N  | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
  N/Y  | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
  Y/N  | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
  Y/Y  | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13

There seems to be a small 1-2% performance benefit with 4 clients, this is 
reversed for 1 client, there are significantly and consistently less late 
transactions when options are activated, the performance is more stable
(standard deviation reduced by 10-18%).

The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times 
over in 5 minutes, so the checkpoint basically writes everything in 220 
seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less 
in order in memory, so may be written out in order anyway.


medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min          max_wal_size=4GB warmup=1200 time=7500
  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run, 
flushing has a positive (+20% with 1 client) or negative (-8 % with 4 
clients) on throughput, and late transactions are reduced by 92-95% when 
both options are activated.

At 550 tps checkpoints are xlog-triggered and write about 1/3 of the 
database (170000 buffers to write every 220-260 seconds, 4 MB/s).

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Mon, Jun 22, 2015 at 1:41 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> <sorry, resent stalled post, wrong from>
>
>> It'd be interesting to see numbers for tiny, without the overly small
>> checkpoint timeout value. 30s is below the OS's writeback time.
>
>
> Here are some tests with longer timeout:
>
> tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
>          max_wal_size=1GB warmup=600 time=4000
>
>   flsh |      full speed tps      | percent of late tx, 4 clients, for tps:
>   /srt |  1 client  |  4 clients  |  100 |  200 |  400 |  800 | 1200 | 1600
>   N/N  | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
>   N/Y  | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
>   Y/N  | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
>   Y/Y  | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13
>
> There seems to be a small 1-2% performance benefit with 4 clients, this is reversed for 1 client, there are significantly and consistently less late transactions when options are activated, the performance is more stable
> (standard deviation reduced by 10-18%).
>
> The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times over in 5 minutes, so the checkpoint basically writes everything in 220 seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less in order in memory, so may be written out in order anyway.
>
>
> medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
>           max_wal_size=4GB warmup=1200 time=7500
>
>   flsh |      full speed tps       | percent of late tx, 4 clients
>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>
> The effect of sorting is very positive (+150% to 270% tps). On this run, flushing has a positive (+20% with 1 client) or negative (-8 % with 4 clients) on throughput, and late transactions are reduced by 92-95% when both options are activated.
>

Why there is dip in performance with multiple clients, can it be
due to reason that we started doing more stuff after holding bufhdr
lock in below code?

BufferSync()
{
..
for (buf_id = 0; buf_id < NBuffers; buf_id++)
  {
  volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1719,185 @@ BufferSync(int flags)
 
  if ((bufHdr->flags & mask) == mask)
  {
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
  bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
  num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
  }
..
}


-
BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
  num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seems to give any consideration to
clock-sweep point which might not be helpful for cases when checkpoint
could have flushed soon-to-be-recycled buffers.  I think flushing the
sorted buffers w.r.t tablespaces is a good idea, but not giving any
preference to clock-sweep point seems to me that we would loose in
some cases by this new change.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
>>           max_wal_size=4GB warmup=1200 time=7500
>>
>>   flsh |      full speed tps       | percent of late tx, 4 clients
>>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>>
>> The effect of sorting is very positive (+150% to 270% tps). On this run,
> flushing has a positive (+20% with 1 client) or negative (-8 % with 4
> clients) on throughput, and late transactions are reduced by 92-95% when
> both options are activated.
>
> Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients 
compared to 1 client?

> can it be due to reason that we started doing more stuff after holding 
> bufhdr lock in below code?

I think it is very unlikely that the buffer being locked would be 
simultaneously requested by one of the 4 clients for an UPDATE, so I do 
not think it should have a significant impact.

> BufferSync() [...]

> BufferSync()
> {
> ..
> - buf_id = StrategySyncStart(NULL, NULL);
> - num_to_scan = NBuffers;
> + active_spaces = nb_spaces;
> + space = 0;
>  num_written = 0;
> - while (num_to_scan-- > 0)
> +
> + while (active_spaces != 0)
> ..
> }
>
> The changed code doesn't seems to give any consideration to
> clock-sweep point

Indeed.

> which might not be helpful for cases when checkpoint could have flushed 
> soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t 
> tablespaces is a good idea, but not giving any preference to clock-sweep 
> point seems to me that we would loose in some cases by this new change.

I do not see how to do both, as these two orders seem more or less 
unrelated?  The traditional assumption is that the I/O are very slow and 
they are to be optimized first, so going for buffer ordering to be nice to 
the disk looks like the priority.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.

It helps. I've done a few runs, where the very-very-bad situation is 
improved to... I would say very-bad:

medium3: scale=200 shared_buffers=4GB checkpoint_timeout=15min
         max_wal_size=4GB warmup=1200 time=6000 clients=4 synchronous_commit=off

  flush sort |  tps | percent of seconds offline
  off   off  |  296 | 83% offline
  off   on   | 1496 | 33% offline
  on    off  | 1641 | 59% offline
  on    on   | 1515 | 31% offline

The offline figure is the percentage of seconds in the 6000 seconds run 
where 0.0 tps are reported, or where nothing is reported because pgbench 
is stuck.

It is somewhat better... on an abysmal scale: sorting and flushing reduced 
the offline time by a factor of 2.6. Too bad it is so high to begin with. 
The tps is improved by a factor of 5 with either option.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Jim Nasby
Дата:
On 6/22/15 11:59 PM, Fabien COELHO wrote:
>> which might not be helpful for cases when checkpoint could have
>> flushed soon-to-be-recycled buffers. I think flushing the sorted
>> buffers w.r.t tablespaces is a good idea, but not giving any
>> preference to clock-sweep point seems to me that we would loose in
>> some cases by this new change.
>
> I do not see how to do both, as these two orders seem more or less
> unrelated?  The traditional assumption is that the I/O are very slow
> and they are to be optimized first, so going for buffer ordering to be
> nice to the disk looks like the priority.

The point is that it's already expensive for backends to advance the 
clock; if they then have to wait on IO as well it gets REALLY expensive. 
So we want to avoid that.

Other than that though, it is pretty orthogonal, so perhaps another 
indication that the clock should be handled separately from both 
backends and bgwriter...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Tue, Jun 23, 2015 at 10:29 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
          max_wal_size=4GB warmup=1200 time=7500

  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8 % with 4
clients) on throughput, and late transactions are reduced by 92-95% when
both options are activated.

Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients compared to 1 client?

What do you mean by "negative (-8 % with 4  clients) on throughput"
in above sentence?  I thought by that you mean that there is dip
in TPS with patch as compare to HEAD at 4 clients.

Also I am not completely sure what's +- means in your data above?
 
can it be due to reason that we started doing more stuff after holding bufhdr lock in below code?

I think it is very unlikely that the buffer being locked would be simultaneously requested by one of the 4 clients for an UPDATE, so I do not think it should have a significant impact.

BufferSync() [...]

BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
 num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seems to give any consideration to
clock-sweep point

Indeed.

which might not be helpful for cases when checkpoint could have flushed soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t tablespaces is a good idea, but not giving any preference to clock-sweep point seems to me that we would loose in some cases by this new change.

I do not see how to do both, as these two orders seem more or less unrelated?

I understand your point and I also don't have any specific answer
for it at this moment, the point of worry is that it should not lead
to degradation of certain cases as compare to current algorithm.
The workload where it could effect is when your data doesn't fit
in shared buffers, but can fit in RAM.
 

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>>>   flsh |      full speed tps       | percent of late tx, 4 clients
>>>>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>>>>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>>>>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>>>>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>>>>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>>>>
>>>> The effect of sorting is very positive (+150% to 270% tps). On this run,
>>>>
>>> flushing has a positive (+20% with 1 client) or negative (-8 % with 4
>>> clients) on throughput, and late transactions are reduced by 92-95% when
>>> both options are activated.
>>>
>>> Why there is dip in performance with multiple clients,
>>
>> I'm not sure to see the "dip". The performances are better with 4 clients
>> compared to 1 client?
>
> What do you mean by "negative (-8 % with 4 clients) on throughput" in 
> above sentence? I thought by that you mean that there is dip in TPS with 
> patch as compare to HEAD at 4 clients.

Ok, I misunderstood your question. I thought you meant a dip between 1 
client and 4 clients. I meant that when flush is turned on tps goes down 
by 8% (743 to 681 tps) on this particular run. Basically tps improvements 
mostly come from "sort", and "flush" has uncertain effects on tps 
(throughput), but much more on latency and performance stability (lower late 
rate, lower standard deviation).

Note that I'm not comparing to HEAD in the above tests, but with the new 
options deactivated, which should be more or less comparable to current 
HEAD, i.e. there is no sorting nor flushing done, but this is not strictly 
speaking HEAD behavior. Probably I should get some figures with HEAD as 
well to check the "more or less" assumption.

> Also I am not completely sure what's +- means in your data above?

The first figure before "+-" is the tps, the second after is its standard 
deviation computed in per-second traces. Some runs are very bad, with 
pgbench stuck at times, which results in a stddev larger than the average; 
they are noted with "*".

> I understand your point and I also don't have any specific answer
> for it at this moment, the point of worry is that it should not lead
> to degradation of certain cases as compare to current algorithm.
> The workload where it could effect is when your data doesn't fit
> in shared buffers, but can fit in RAM.

Hmmm. My point of view is still that the logical priority is to optimize 
for disk IO first, then look for compatible RAM optimisations later.

I can run tests with a small shared_buffers, but probably it would just 
trigger a lot of checkpoints, or worse rely on the bgwriter to find space, 
which would generate random IOs.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> I do not see how to do both, as these two orders seem more or less
>> unrelated?  The traditional assumption is that the I/O are very slow
>> and they are to be optimized first, so going for buffer ordering to be
>> nice to the disk looks like the priority.
>
> The point is that it's already expensive for backends to advance the clock; 
> if they then have to wait on IO as well it gets REALLY expensive. So we want 
> to avoid that.

I do not know what this clock stuff does. Note that the checkpoint buffer 
scan is done once at the beginning of the checkpoint and its time is 
relatively small compared to everything else in the checkpoint.

If this scan is an issue, it can be done in reverse order, or in some 
other order, but I think it is better to do it in order for better cache 
behavior, although the effect should be marginal.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> Besides, causing additional cacheline bouncing during the
>> sorting process is a bad idea.
>
> Hmmm. The impact would be to multiply the memory required by 3 or 4 (buf_id, 
> relation, forknum, offset), instead of just buf_id, and I understood that 
> memory was a concern.
>
> Moreover, once the sort process get the lines which contain the sorting data 
> from the buffer descriptor in its cache, I think that it should be pretty 
> much okay. Incidentally, they would probably have been brought to cache by 
> the scan to collect them. Also, I do not think that the sorting time for 
> 128000 buffers, and possible cache misses, was a big issue, but I do not have 
> a measure to defend that. I could try to collect some data about that.

I've collected some data by adding a "sort time" measure, with 
checkpoint_sort_size=10000000 so that sorting is in one chunk, and done 
some large checkpoints:

LOG:  checkpoint complete: wrote 41091 buffers (6.3%); 0 transaction log file(s) added, 0 removed, 0 recycled;
  sort=0.024 s, write=0.488 s, sync=8.790 s, total=9.837 s; sync files=41, longest=8.717 s, average=0.214 s;
  distance=404972 kB, estimate=404972 kB

LOG:  checkpoint complete: wrote 212124 buffers (32.4%); 0 transaction log file(s) added, 0 removed, 0 recycled;
  sort=0.078 s, write=128.885 s, sync=1.269 s, total=131.646 s; sync files=43, longest=1.155 s, average=0.029 s;
  distance=2102950 kB, estimate=2102950 kB

LOG:  checkpoint complete: wrote 384427 buffers (36.7%); 0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.120 s, write=83.995 s, sync=13.944 s, total=98.035 s; sync files=9, longest=13.724 s, average=1.549 s;
  distance=3783305 kB, estimate=3783305 kB

LOG:  checkpoint complete: wrote 809211 buffers (77.2%); 0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.358 s, write=138.146 s, sync=14.943 s, total=153.124 s; sync files=13, longest=14.871 s, average=1.149 s;
  distance=8075338 kB, estimate=8075338 kB

Summary of these checkpoints:

  #buffers   size   sort
     41091  328MB  0.024
    212124  1.7GB  0.078
    384427  2.9GB  0.120
    809211  6.2GB  0.358

Sort times are pretty negligible compared to the whole checkpoint time,
and under 0.1 s/GB of buffers sorted.

On a 512 GB server with shared_buffers=128GB (25%), this suggests a 
worst-case checkpoint sort of a few seconds, and then you have a hundred GB 
to write anyway. Projecting to a 1 TB checkpoint in the next decade, sorting 
would still take under a minute... but then you have 1 TB of data to dump.

As a comparison point, I've done the large checkpoint with the default 
sort size of 131072:

LOG:  checkpoint complete: wrote 809211 buffers (77.2%); 0 transaction log file(s) added, 0 removed, 1 recycled;
  sort=0.251 s, write=152.377 s, sync=15.062 s, total=167.453 s; sync files=13, longest=14.974 s, average=1.158 s;
  distance=8075338 kB, estimate=8075338 kB

The 0.251 sort time is to be compared to 0.358. Well, n.log(n) is not too 
bad, as expected.


These figures suggest that sorting time and associated cache misses are 
not a significant issue and thus are not worth bothering much about, and 
also that probably a simple boolean option would be quite acceptable 
instead of the chunk approach.

Attached is an updated version of the patch which turns the sort option 
into a boolean, and also include the sort time in the checkpoint log.

There is still an open question about whether the sorting buffer 
allocation is lost on some signals and should be reallocated in such 
event.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Wed, Jun 24, 2015 at 9:50 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,

flushing has a positive (+20% with 1 client) or negative (-8 % with 4
clients) on throughput, and late transactions are reduced by 92-95% when
both options are activated.

Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients
compared to 1 client?

What do you mean by "negative (-8 % with 4 clients) on throughput" in above sentence? I thought by that you mean that there is dip in TPS with patch as compare to HEAD at 4 clients.

Ok, I misunderstood your question. I thought you meant a dip between 1 client and 4 clients. I meant that when flush is turned on tps goes down by 8% (743 to 681 tps) on this particular run.

This 8% might matter if the dip is bigger with more clients and
more aggressive workload.  Do you know what could lead to this
dip, because if we know what is the reason then it will be more
predictable to know if this is the max dip that could happen or it
could lead to bigger dip in other cases.
 
Basically tps improvements mostly come from "sort", and "flush" has uncertain effects on tps (throughput), but much more on latency and performance stability (lower late rate, lower standard deviation).


I agree that performance stability is important, but not sure if it
is a good idea to sacrifice the throughput for it.  If sort + flush always
gives better results, then isn't it better to perform these actions
together under one option.
 
Note that I'm not comparing to HEAD in the above tests, but with the new options deactivated, which should be more or less comparable to current HEAD, i.e. there is no sorting nor flushing done, but this is not strictly speaking HEAD behavior. Probably I should get some figures with HEAD as well to check the "more or less" assumption.

Also I am not completely sure what's +- means in your data above?

The first figure before "+-" is the tps, the second after is its standard deviation computed in per-second traces. Some runs are very bad, with pgbench stuck at times, which results in a stddev larger than the average; they are noted with "*".

I understand your point and I also don't have any specific answer
for it at this moment, the point of worry is that it should not lead
to degradation of certain cases as compare to current algorithm.
The workload where it could effect is when your data doesn't fit
in shared buffers, but can fit in RAM.

Hmmm. My point of view is still that the logical priority is to optimize for disk IO first, then look for compatible RAM optimisations later.

It is not only about RAM optimisation which we can do later, but also
about avoiding regression in existing use-cases.


With Regards,
Amit Kapila.

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> [...]
>> Ok, I misunderstood your question. I thought you meant a dip between 1
>> client and 4 clients. I meant that when flush is turned on tps goes down by
>> 8% (743 to 681 tps) on this particular run.
>
> This 8% might matter if the dip is bigger with more clients and
> more aggressive workload.  Do you know what could lead to this
> dip, because if we know what is the reason then it will be more
> predictable to know if this is the max dip that could happen or it
> could lead to bigger dip in other cases.

I do not know the cause of the dip, and whether it would increase with 
more clients. I do not have a box for such tests. If someone can provide 
the box, I can provide test scripts:-)

The first, although higher, measure is really very unstable, with pg 
totally unresponsive (offline, really) at times.

I think that the flush option may always have a risk of (small) 
detrimental effects on tps, because there are two steady states: one with 
pg only doing wal-logged transactions with great tps, and one with pg 
doing the checkpoint at nought tps. If this is on the same disk, even at 
best the combination means that probably each operation will hamper the 
other one a little bit, so the combined tps performance would/could be 
lower than doing one after the other and having pg offline 50% of the 
time...

Please also note that this 8% "dip" still means 681 tps (with the dip) vs 
198 tps (no options at all), a 3.4x improvement compared to pg's current behavior.

>> Basically tps improvements mostly come from "sort", and "flush" has
>> uncertain effects on tps (throuput), but much more on latency and
>> performance stability (lower late rate, lower standard deviation).
>
> I agree that performance stability is important, but not sure if it
> is a good idea to sacrifice the throughput for it.

See discussion above. I think better stability may imply slightly lower 
throughput on some load. That is why there are options and DBA to choose 
them:-)

> If sort + flush always gives better results, then isn't it better to 
> perform these actions together under one option.

Sure, but that is not currently the case. Also what is done is very 
orthogonal, so I would tend to keep these separate. If one is always 
beneficial and it is wished that it should be always activated, then the 
option could be removed.

>> Hmmm. My point of view is still that the logical priority is to optimize
>> for disk IO first, then look for compatible RAM optimisations later.
>
> It is not only about RAM optimisation which we can do later, but also
> about avoiding regression in existing use-cases.

Hmmm. Currently I have not seen really significant regressions, only that 
some options have a less positive impact on some loads.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Note that I'm not comparing to HEAD in the above tests, but with the new 
> options deactivated, which should be more or less comparable to current 
> HEAD, i.e. there is no sorting nor flushing done, but this is not strictly 
> speaking HEAD behavior. Probably I should get some figures with HEAD as well 
> to check the "more or less" assumption.

Just for answering myself on this point, I tried current HEAD vs patch v4 
with sort OFF + flush OFF: the figures are indeed quite comparable (see 
below), so although the internal implementation is different, the 
performance when both options are off is still a reasonable approximation 
of the performance without the patch, as I was expecting. What patch v4 
still does with OFF/OFF which is not done by HEAD is balancing writes 
among tablespaces, but there is only one disk on these tests so it does 
not matter.

tps & stddev full speed:
                            HEAD          OFF/OFF
  tiny 1 client           727 +- 227     221 +- 246
  small 1 client          158 +- 316     158 +- 325
  medium 1 client         148 +- 285     157 +- 326
  tiny 4 clients         2088 +- 786    2074 +- 699
  small 4 clients         192 +- 648     188 +- 560
  medium 4 clients        220 +- 654     220 +- 648
 

percent of late transactions:
                            HEAD       OFF/OFF
  tiny 4 clients 100 tps     6.31        6.67
  small 4c 100 tps          35.68       35.23
  medium 4c 100 tps         37.38       38.00
  tiny 4c 200 tps            9.06        9.10
  small 4c 200 tps          51.65       51.16
  medium 4c 200 tps         51.35       50.20
  tiny 4 clients 400 tps    11.4        10.5
  small 4 clients 400 tps   66.4        67.6

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-06-26 21:47:30 +0200, Fabien COELHO wrote:
> tps & stddev full speed:

>                             HEAD         OFF/OFF
> 
>  tiny 1 client          727 +- 227     221 +- 246

Huh?




Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>>                             HEAD         OFF/OFF
>>
>>  tiny 1 client          727 +- 227     221 +- 246
>
> Huh?

Indeed, just to check that someone was reading this magnificent mail:-)

Just a typo because I reformatted the figures for simpler comparison. 221 
is really 721, quite close to 727.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Attached is an updated version of the patch which turns the sort option into 
> a boolean, and also include the sort time in the checkpoint log.
>
> There is still an open question about whether the sorting buffer allocation 
> is lost on some signals and should be reallocated in such event.

In such case, probably the allocation should be managed from 
CheckpointerMain, and the lazy allocation could remain for other callers 
(I guess just "initdb").


More open questions:
 - best name for the flush option (checkpoint_flush_to_disk,
   checkpoint_flush_on_write, checkpoint_flush, ...)

 - best name for the sort option (checkpoint_sort,
   checkpoint_sort_buffers, checkpoint_sort_ios, ...)


Other nice-to-have inputs:
 - tests on a non-linux system with posix_fadvise   (FreeBSD? others?)
 - tests on a large dedicated box


Attached are some scripts to help with testing, if someone feels like 
that:
 - cp_test.sh: run some tests, to adapt to one's setup...
 - cp_test_count.pl: show percent of late transactions
 - avg.py: show stats about stuff
   sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py
   *BEWARE* that if pgbench got stuck some "0" data are missing,
   look for the actual tps in the output file and for the line
   count to check whether it is the case... some currently submitted
   patch on pgbench helps, see https://commitfest.postgresql.org/5/199/

-- 
Fabien.

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello,

Attached is a very minor v5 update which does a rebase & completes the 
cleanup of doing a full sort instead of a chunked sort.

>> Attached is an updated version of the patch which turns the sort option 
>> into a boolean, and also include the sort time in the checkpoint log.
>> 
>> There is still an open question about whether the sorting buffer allocation 
>> is lost on some signals and should be reallocated in such event.
>
> In such case, probably the allocation should be managed from 
> CheckpointerMain, and the lazy allocation could remain for other callers (I 
> guess just "initdb").
>
>
> More open questions:
>
> - best name for the flush option (checkpoint_flush_to_disk,
>     checkpoint_flush_on_write, checkpoint_flush, ...)
>
> - best name for the sort option (checkpoint_sort,
>     checkpoint_sort_buffers, checkpoint_sort_ios, ...)
>
>
> Other nice-to-have inputs:
>
> - tests on a non-linux system with posix_fadvise
>   (FreeBSD? others?)
>
> - tests on a large dedicated box
>
>
> Attached are some scripts to help with testing, if someone feels like that:
>
> - cp_test.sh: run some tests, to adapt to one's setup...
>
> - cp_test_count.pl: show percent of late transactions
>
> - avg.py: show stats about stuff
>
>   sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py
>
>   *BEWARE* that if pgbench got stuck some "0" data are missing,
>   look for the actual tps in the output file and for the line
>   count to check whether it is the case... some currently submitted
>   patch on pgbench helps, see https://commitfest.postgresql.org/5/199/

As this pgbench patch is now in master, pgbench is less likely to get 
stuck, but check nevertheless that the number of progress lines matches the 
expected number.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Heikki Linnakangas
Дата:
On 07/26/2015 06:01 PM, Fabien COELHO wrote:
> Attached is very minor v5 update which does a rebase & completes the
> cleanup of doing a full sort instead of a chunked sort.

Some thoughts on this:

* I think we should drop the "flush" part of this for now. It's not as
clearly beneficial as the sorting part, and adds a great deal more code
complexity. And it's orthogonal to the sorting patch, so we can deal
with it separately.

* Is it really necessary to parallelize the I/O among tablespaces? I can
see the point, but I wonder if it makes any difference in practice.

* Is there ever any harm in sorting the buffers? The GUC is useful for
benchmarking, but could we leave it out of the final patch?

* Do we need to worry about exceeding the 1 GB allocation limit in
AllocateCheckpointBufferIds? It's enough to have 2 TB of shared_buffers.
That's a lot, but it's not totally crazy these days that someone might
do that. At the very least, we need to lower the maximum of
shared_buffers so that you can't hit that limit.

I ripped out the "flushing" part, keeping only the sorting. I refactored
the logic in BufferSync() a bit. There's now a separate function,
nextCheckpointBuffer(), that returns the next buffer ID from the sorted
list. The tablespace-parallelization behaviour is encapsulated there,
keeping the code in BufferSync() much simpler. See attached. Needs some
minor cleanup and commenting still before committing, and I haven't done
any testing besides a simple "make check".

- Heikki


Вложения

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Heikki,

Thanks for having a look at the patch.

> * I think we should drop the "flush" part of this for now. It's not as 
> clearly beneficial as the sorting part, and adds a great deal more code 
> complexity. And it's orthogonal to the sorting patch, so we can deal with it 
> separately.

I agree that it is orthogonal and that the two features could be in 
distinct patches. The flush part is the first patch I really submitted 
because it has significant effect on latency, and I was told to mix it 
with sorting...

The flushing part really helps to keep "write stalls" under control in 
many cases, for instance:

- 400-tps 1-client (or 4 for medium) max 100-ms latency
      options   | percent of late transactions
   flush | sort | tiny | small | medium
     off |  off | 12.0 | 64.28 | 68.6
     off |   on | 11.3 | 22.05 | 22.6
      on |  off |  1.1 | 67.93 | 67.9
      on |   on |  0.6 |  3.24 |  3.1

The "percent of late transactions" is really the fraction of time the 
database is unreachable because of write stalls... So sort without flush 
is clearly not enough.
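
For reference, the flushing part boils down to asking the kernel to start
writeback of a just-written range instead of letting everything pile up
until the final fsync. A simplified sketch, not the patch itself; the guard
macro names and the helper are only illustrative (sync_file_range needs
_GNU_SOURCE on Linux):

  #include <fcntl.h>

  static void
  flush_written_range(int fd, off_t offset, off_t nbytes)
  {
  #if defined(HAVE_SYNC_FILE_RANGE)
      /* Linux: start asynchronous writeback of the given range */
      (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
  #elif defined(HAVE_POSIX_FADVISE)
      /* portable fallback: dropping the pages also pushes dirty ones out */
      (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
  #endif
  }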

Another thing suggested by Andres is to fsync as early as possible, but 
this is not a simple patch because it intermixes things which are currently 
in distinct parts of checkpoint processing, so I already decided that this 
would be for another submission.

> * Is it really necessary to parallelize the I/O among tablespaces? I can see 
> the point, but I wonder if it makes any difference in practice.

I think that if someone bothers with tablespaces there is no reason to kill 
them behind her back. Without sorting you may hope that tablespaces will be 
touched randomly enough, but once buffers are sorted you can probably find 
cases where it would write on one tablespace and then on the other.

So I think that it really should be kept.
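
To picture the idea, here is an illustrative sketch only, not the patch's
actual balancing code: after sorting, each tablespace owns a contiguous
range of the sorted buffer-id array, and the writer interleaves those
ranges instead of draining them one after the other.

  typedef struct TableSpaceRange
  {
      int     start;      /* first index in the sorted buffer-id array */
      int     count;      /* number of buffers for this tablespace */
      int     done;       /* how many have been written so far */
  } TableSpaceRange;

  static int
  next_buffer_to_write(TableSpaceRange *spaces, int nb_spaces, int *cursor,
                       const int *sorted_buf_ids)
  {
      int     tries;

      for (tries = 0; tries < nb_spaces; tries++)
      {
          TableSpaceRange *sp = &spaces[*cursor];

          *cursor = (*cursor + 1) % nb_spaces;
          if (sp->done < sp->count)
              return sorted_buf_ids[sp->start + sp->done++];
      }
      return -1;          /* all tablespaces exhausted */
  }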

> * Is there ever any harm in sorting the buffers? The GUC is useful for 
> benchmarking, but could we leave it out of the final patch?

I think that the performance figures show that it is basically always beneficial, 
so the guc may be left out. However on SSD it is unclear to me whether it 
is just a loss of time or whether it helps, say with wear-leveling. Maybe 
best to keep it? Anyway it is definitely needed for testing.

> * Do we need to worry about exceeding the 1 GB allocation limit in 
> AllocateCheckpointBufferIds? It's enough to have 2 TB of shared_buffers. That's a 
> lot, but it's not totally crazy these days that someone might do that. At the 
> very least, we need to lower the maximum of shared_buffers so that you can't 
> hit that limit.

Yep.

> I ripped out the "flushing" part, keeping only the sorting. I refactored 
> the logic in BufferSync() a bit. There's now a separate function,
> nextCheckpointBuffer(), that returns the next buffer ID from the sorted 
> list. The tablespace-parallelization behaviour is encapsulated there,

I do not understand the new tablespace-parallelization logic: there is no 
test about the tablespace of the buffer in the selection process... Note 
that I did write a proof for the one I put in, and also did some detailed 
testing on the side because I'm always wary of proofs, especially mine :-)

I notice that you assume that table space numbers are always small and 
contiguous. Is that a fact? I was feeling more at ease with relying on a 
hash table to avoid such an assumption.

> keeping the code in BufferSync() much simpler. See attached. Needs some 
> minor cleanup and commenting still before committing, and I haven't done 
> any testing besides a simple "make check".

Hmmm..., just another detail, the patch does not sort:
  + if (checkpoint_sort && num_to_write > 1 && false)


I'll resubmit a patch with only the sorting part, and do the kind of 
restructuring you suggest which is a good thing.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
> * I think we should drop the "flush" part of this for now. It's not as
> clearly beneficial as the sorting part, and adds a great deal more code
> complexity. And it's orthogonal to the sorting patch, so we can deal with it
> separately.

I don't agree. For one I've seen it cause rather big latency
improvements, and we're horrible at that. But more importantly I think
the requirements of the flush logic influences how exactly the sorting
is done. Splitting them will just make it harder to do the flushing in a
not too big patch.

> * Is it really necessary to parallelize the I/O among tablespaces? I can see
> the point, but I wonder if it makes any difference in practice.

Today it's somewhat common to have databases that are bottlenecked on
write IO and all those writes being done by the checkpointer. If we
suddenly do the writes to individual tablespaces separately and
sequentially we'll be bottlenecked on the peak IO of a single
tablespace.

> * Is there ever any harm in sorting the buffers? The GUC is useful for
> benchmarking, but could we leave it out of the final patch?

Agreed.

> * Do we need to worry about exceeding the 1 GB allocation limit in
> AllocateCheckpointBufferIds? It's enough to have 2 TB of shared_buffers. That's
> a lot, but it's not totally crazy these days that someone might do that. At
> the very least, we need to lower the maximum of shared_buffers so that you
> can't hit that limit.

We can just use the _huge variant?
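
i.e. something like this (a sketch only, keeping the allocation local to the
checkpointer and reusing the names from the patch excerpts above):

  /*
   * Sketch: the "huge" allocation variant is not limited by the 1 GB
   * MaxAllocSize cap, so the per-buffer int array fits even with very
   * large shared_buffers settings.
   */
  CheckpointBufferIds = (int *)
      MemoryContextAllocHuge(TopMemoryContext, sizeof(int) * (Size) NBuffers);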

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi,

On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
> I ripped out the "flushing" part, keeping only the sorting. I refactored the
> logic in BufferSync() a bit. There's now a separate function,
> nextCheckpointBuffer(), that returns the next buffer ID from the sorted
> list. The tablespace-parallelization behaviour is encapsulated there,
> keeping the code in BufferSync() much simpler. See attached. Needs some
> minor cleanup and commenting still before committing, and I haven't done any
> testing besides a simple "make check".

Thought it'd be useful to review the current version as well. Some of
what I'm commenting on you'll probably already have though of under the
label of "minor cleanup".

>  /*
> + * Array of buffer ids of all buffers to checkpoint.
> + */
> +static int *CheckpointBufferIds = NULL;
> +
> +/* Compare checkpoint buffers
> + */

Should be at the beginning of the file. There's a bunch more cases of that.


> +/* Compare checkpoint buffers
> + */
> +static int bufcmp(const int * pa, const int * pb)
> +{
> +    BufferDesc
> +        *a = GetBufferDescriptor(*pa),
> +        *b = GetBufferDescriptor(*pb);
> +
> +    /* tag: rnode, forkNum (different files), blockNum
> +     * rnode: { spcNode (ignore: not really needed),
> +     *   dbNode (ignore: this is a directory), relNode }
> +     * spcNode: table space oid, not that there are at least two
> +     * (pg_global and pg_default).
> +     */
> +    /* compare relation */
> +    if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
> +        return -1;
> +    else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
> +        return 1;
> +    if (a->tag.rnode.relNode < b->tag.rnode.relNode)
> +        return -1;
> +    else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
> +        return 1;
> +    /* same relation, compare fork */
> +    else if (a->tag.forkNum < b->tag.forkNum)
> +        return -1;
> +    else if (a->tag.forkNum > b->tag.forkNum)
> +        return 1;
> +    /* same relation/fork, so same segmented "file", compare block number
> +     * which are mapped on different segments depending on the number.
> +     */
> +    else if (a->tag.blockNum < b->tag.blockNum)
> +        return -1;
> +    else /* should not be the same block anyway... */
> +        return 1;
> +}

This definitely needs comments about ignoring the normal buffer header
locking.

Why are we ignoring the database directory? I doubt it'll make a huge
difference, but grouping metadata affecting operations by directory
helps.

> +
> +static void
> +AllocateCheckpointBufferIds(void)
> +{
> +    /* Safe worst case allocation, all buffers belong to the checkpoint...
> +     * that is pretty unlikely.
> +     */
> +    CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
> +}

(wrong comment style...)

Heikki, you were concerned about the size of the allocation of this,
right? I don't think it's relevant - we used to allocate an array of
that size for the backend's private buffer pin array until 9.5, so in
theory we should be safe against that. NBuffers is limited to INT_MAX/2
in guc.c, which ought to be sufficient?

> +    /*
> +     * Lazy allocation: this function is called through the checkpointer,
> +     * but also by initdb. Maybe the allocation could be moved to the callers.
> +     */
> +    if (CheckpointBufferIds == NULL)
> +        AllocateCheckpointBufferIds();
> +
> 

I don't think it's a good idea to allocate this on every round. That
just means a lot of page table entries have to be built and torn down
regularly. It's not like checkpoints only run for 1% of the time or
such.

FWIW, I still think it's a much better idea to allocate the memory once
in shared buffers. It's not like that makes us need more memory overall,
and it'll be huge page allocations if configured. I also think that
sooner rather than later we're going to need more than one process
flushing buffers, and then it'll need to be moved there.
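
For illustration, a one-time allocation at postmaster startup could look
roughly like this (a sketch only; the function and structure names are
hypothetical, not from the patch):

    /*
     * Sketch: reserve the checkpoint buffer-id array in shared memory at
     * postmaster startup instead of palloc'ing it lazily in the
     * checkpointer.  Requires storage/shmem.h; names are illustrative.
     */
    static int *CheckpointBufferIds = NULL;

    Size
    CheckpointBufferIdsShmemSize(void)
    {
        return mul_size(NBuffers, sizeof(int));
    }

    void
    CheckpointBufferIdsShmemInit(void)
    {
        bool    found;

        CheckpointBufferIds = (int *)
            ShmemInitStruct("Checkpoint BufferIds",
                            CheckpointBufferIdsShmemSize(),
                            &found);
    }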

> +    /*
> +     * Sort buffer ids to help find sequential writes.
> +     *
> +     * Note: buffers are not locked in anyway, but that does not matter,
> +     * this sorting is really advisory, if some buffer changes status during
> +     * this pass it will be filtered out later.  The only necessary property
> +     * is that marked buffers do not move elsewhere.
> +     */

That reasoning makes it impossible to move the fsyncing of files into
the loop (whenever we move to a new file). That's not nice. The
formulation with "necessary property" doesn't seem very clear to me?

How about:

/*
 * Note: Buffers are not locked in any way during sorting, but that's ok:
 * A change in the buffer header is only relevant when it changes the
 * buffer's identity. If the identity has changed it'll have been written
 * out by BufferAlloc(), so there's no need for checkpointer to write it
 * out anymore. The buffer might also get written out by a backend or
 * bgwriter, but that's equally harmless.
 */

>          Also, qsort implementation
> +     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
> +     * because of these possible concurrent changes.

Hm. Is that actually the case for our qsort implementation? If the pivot
element changes its identity won't the result be pretty much random?

> +
> +    if (checkpoint_sort && num_to_write > 1 && false)
> +    {

&& false - Huh?

> +        qsort(CheckpointBufferIds, num_to_write,  sizeof(int),
> +                  (int(*)(const void *, const void *)) bufcmp);
> +

Ick, I'd rather move the typecasts to the comparator.
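
For instance, a sketch with the qsort()-compatible signature, doing the same
tag comparisons as the quoted code but casting inside the comparator:

    static int
    bufcmp(const void *pa, const void *pb)
    {
        BufferDesc *a = GetBufferDescriptor(*(const int *) pa);
        BufferDesc *b = GetBufferDescriptor(*(const int *) pb);

        if (a->tag.rnode.spcNode != b->tag.rnode.spcNode)
            return a->tag.rnode.spcNode < b->tag.rnode.spcNode ? -1 : 1;
        if (a->tag.rnode.relNode != b->tag.rnode.relNode)
            return a->tag.rnode.relNode < b->tag.rnode.relNode ? -1 : 1;
        if (a->tag.forkNum != b->tag.forkNum)
            return a->tag.forkNum < b->tag.forkNum ? -1 : 1;
        if (a->tag.blockNum != b->tag.blockNum)
            return a->tag.blockNum < b->tag.blockNum ? -1 : 1;
        return 0;
    }

    /* ... and the call site then needs no cast: */
    qsort(CheckpointBufferIds, num_to_write, sizeof(int), bufcmp);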

> +        for (i = 1; i < num_to_write; i++)
> +        {
> +            bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);
> +
> +            spc = bufHdr->tag.rnode.spcNode;
> +            if (spc != lastspc && (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0)
> +            {
> +                if (allocatedSpc <= j)
> +                {
> +                    allocatedSpc = j + 5;
> +                    spcStatus = (TableSpaceCheckpointStatus *)
> +                        repalloc(spcStatus, sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
> +                }
> +
> +                spcStatus[j].index_end = spcStatus[j + 1].index = i;
> +                j++;
> +                lastspc = spc;
> +            }
> +        }
> +        spcStatus[j].index_end = num_to_write;

This really deserves some explanation.

Regards,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

Thanks for your comments. Some answers and new patches included.

>> + /*
>> + * Array of buffer ids of all buffers to checkpoint.
>> + */
>> +static int *CheckpointBufferIds = NULL;
>
> Should be at the beginning of the file. There's a bunch more cases of that.

done.

>> +/* Compare checkpoint buffers
>> + */
>> +static int bufcmp(const int * pa, const int * pb)
>> +{
>> +    BufferDesc
>> +        *a = GetBufferDescriptor(*pa),
>> +        *b = GetBufferDescriptor(*pb);
>
> This definitely needs comments about ignoring the normal buffer header
> locking.

Added.

> Why are we ignoring the database directory? I doubt it'll make a huge
> difference, but grouping metadata affecting operations by directory
> helps.

I wanted to do the minimal comparisons to order buffers per file, so I 
skipped everything else. My idea of a checkpoint is a lot of data in a few 
files (at least compared to the data...), so I do not think that it is 
worth it. I may be proven wrong!

>> +static void
>> +AllocateCheckpointBufferIds(void)
>> +{
>> +    /* Safe worst case allocation, all buffers belong to the checkpoint...
>> +     * that is pretty unlikely.
>> +     */
>> +    CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
>> +}
>
> (wrong comment style...)

Fixed.

> Heikki, you were concerned about the size of the allocation of this, 
> right? I don't think it's relevant - we used to allocate an array of 
> that size for the backend's private buffer pin array until 9.5, so in 
> theory we should be safe agains that. NBuffers is limited to INT_MAX/2 
> in guc.ċ, which ought to be sufficient?

I think that there is no issue with the current shared_buffers limit. I 
could allocate and use 4 GB on my laptop without problem. I added a cast 
to ensure that unsigned int are used for the size computation.

>> +    /*
>> +     * Lazy allocation: this function is called through the checkpointer,
>> +     * but also by initdb. Maybe the allocation could be moved to the callers.
>> +     */
>> +    if (CheckpointBufferIds == NULL)
>> +        AllocateCheckpointBufferIds();
>> +
>
> I don't think it's a good idea to allocate this on every round.
> That just means a lot of page table entries have to be built and torn 
> down regularly. It's not like checkpoints only run for 1% of the time or 
> such.

Sure. It is not allocated on every round, it is allocated once on the 
first checkpoint, the variable tested is static. There is no free. Maybe
the allocation could be moved to the callers, though.

> FWIW, I still think it's a much better idea to allocate the memory once
> in shared buffers.

Hmmm. The memory does not need to be shared with other processes?

> It's not like that makes us need more memory overall, and it'll be huge 
> page allocations if configured. I also think that sooner rather than 
> later we're going to need more than one process flushing buffers, and 
> then it'll need to be moved there.

That is an argument. I think that it could wait for the need to actually 
arise.

>> +    /*
>> +     * Sort buffer ids to help find sequential writes.
>> +     *
>> +     * Note: buffers are not locked in anyway, but that does not matter,
>> +     * this sorting is really advisory, if some buffer changes status during
>> +     * this pass it will be filtered out later.  The only necessary property
>> +     * is that marked buffers do not move elsewhere.
>> +     */
>
> That reasoning makes it impossible to move the fsyncing of files into 
> the loop (whenever we move to a new file). That's not nice.

I do not see why. Moving fsync ahead is definitely an idea that you 
already pointed out; I have given it some thought, and it would require 
a careful implementation and some restructuring. For instance, you do not 
want to issue fsync right after having done writes, you want to wait a 
little bit so that the system has had time to write the buffers to disk.

> The formulation with "necessary property" doesn't seem very clear to me?

Removed.

> How about:
>
> /*
>  * Note: Buffers are not locked in any way during sorting, but that's ok:
>  * A change in the buffer header is only relevant when it changes the
>  * buffer's identity. If the identity has changed it'll have been written
>  * out by BufferAlloc(), so there's no need for checkpointer to write it
>  * out anymore. The buffer might also get written out by a backend or
>  * bgwriter, but that's equally harmless.
>  */

This new version included.

>>          Also, qsort implementation
>> +     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
>> +     * because of these possible concurrent changes.
>
> Hm. Is that actually the case for our qsort implementation?

I think that it is hard to write a qsort which would fail that. That would 
mean that it would compare the same items twice, which would be 
inefficient.

> If the pivot element changes its identity won't the result be pretty 
> much random?

That would be a very unlikely event, given the short time spent in qsort. 
Anyway, this is not a problem, and is the beauty of the "advisory" sort: 
if the sort is wrong because of any such rare event, it just means that the 
buffers would not be strictly in file order, which is currently the 
case anyway.... Well, too bad, but the correctness of the checkpoint does not 
depend on it; it just means that the checkpointer would come back twice 
on one file, no big deal.

>> +    if (checkpoint_sort && num_to_write > 1 && false)
>> +    {
>
> && false - Huh?

Probably Heikki tests.

>> +        qsort(CheckpointBufferIds, num_to_write,  sizeof(int),
>> +                  (int(*)(const void *, const void *)) bufcmp);
>> +
>
> Ick, I'd rather move the typecasts to the comparator.

Done.

>> +        for (i = 1; i < num_to_write; i++)
>> +        { [...]
>
> This really deserves some explanation.

I think that this version does not work. I've reinstated my version and a 
lot of comments in the attached patches.

Please find attached two combined patches which provide both features one 
after the other.

(a) shared buffer sorting

 - I took Heikki's hint about restructuring the buffer selection in a
   separate function, which makes the code much more readable.

 - I also followed Heikki's intention (I think) that only active
   table spaces are considered in the switching loop.

(b) add asynchronous flushes on top of the previous sort patch



I think that the many performance results I reported show that the 
improvements need both features, and one feature without the other is much 
less effective at improving responsiveness, which is my primary concern.
The TPS improvements are just a side effect.

I did not remove the gucs: I think they could be kept so that people can 
test around with them, and they may be removed in the future? I would 
also be fine if they are removed.

There are a lot of comments in some places. I think that they should be 
kept because the code is subtle.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-10 19:07:12 +0200, Fabien COELHO wrote:
> I think that there is no issue with the current shared_buffers limit. I
> could allocate and use 4 GB on my laptop without problem. I added a cast to
> ensure that unsigned int are used for the size computation.

You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.

> >>+    /*
> >>+     * Lazy allocation: this function is called through the checkpointer,
> >>+     * but also by initdb. Maybe the allocation could be moved to the callers.
> >>+     */
> >>+    if (CheckpointBufferIds == NULL)
> >>+        AllocateCheckpointBufferIds();
> >>+
> >
> >I don't think it's a good idea to allocate this on every round.
> >That just means a lot of page table entries have to be built and torn down
> >regularly. It's not like checkpoints only run for 1% of the time or such.
> 
> Sure. It is not allocated on every round, it is allocated once on the first
> checkpoint, the variable tested is static. There is no free. Maybe
> the allocation could be moved to the callers, though.

Well, then everytime the checkpointer is restarted.

> >FWIW, I still think it's a much better idea to allocate the memory once
> >in shared buffers.
> 
> Hmmm. The memory does not need to be shared with other processes?

The point is that it's done at postmaster startup, and we're pretty much
guaranteed that the memory will be available.

> >It's not like that makes us need more memory overall, and it'll be huge
> >page allocations if configured. I also think that sooner rather than later
> >we're going to need more than one process flushing buffers, and then it'll
> >need to be moved there.
> 
> That is an argument. I think that it could wait for the need to actually
> arise.

Huge pages are used today.

> >>+    /*
> >>+     * Sort buffer ids to help find sequential writes.
> >>+     *
> >>+     * Note: buffers are not locked in anyway, but that does not matter,
> >>+     * this sorting is really advisory, if some buffer changes status during
> >>+     * this pass it will be filtered out later.  The only necessary property
> >>+     * is that marked buffers do not move elsewhere.
> >>+     */
> >
> >That reasoning makes it impossible to move the fsyncing of files into the
> >loop (whenever we move to a new file). That's not nice.
> 
> I do not see why.

Because it means that the sorting isn't necessarily correct. I.e. we
can't rely on it to determine whether a file has already been fsynced.

> >>         Also, qsort implementation
> >>+     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
> >>+     * because of these possible concurrent changes.
> >
> >Hm. Is that actually the case for our qsort implementation?
> 
> I think that it is hard to write a qsort which would fail that. That would
> mean that it would compare the same items twice, which would be inefficient.

What? The same two elements aren't frequently compared pairwise with
each other, but of course an individual element is frequently compared
with other elements. Consider what happens when the chosen pivot element
changes its identity after having already divided half the input. The two partitions
will not be divided in any meaningful way anymore. I don't see how
this will result in a meaningful sort.

> >If the pivot element changes its identity won't the result be pretty much
> >random?
> 
> That would be a very unlikely event, given the short time spent in
> qsort.

Meh, we don't want to rely on "likeliness" on such things.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> You can't allocate 4GB with palloc(), it has a builtin limit against
> allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

>> [...]
> Well, then everytime the checkpointer is restarted.

Hm...

> The point is that it's done at postmaster startup, and we're pretty much
> guaranteed that the memory will be available.

Ok ok, I stop resisting... I'll have a look.

Would it also fix the 1 GB palloc limit on the same go? I guess so...


>>> That reasoning makes it impossible to move the fsyncing of files into the
>>> loop (whenever we move to a new file). That's not nice.
>>
>> I do not see why.
>
> Because it means that the sorting isn't necessarily correct. I.e. we
> can't rely on it to determine whether a file has already been fsynced.

Ok, I understand your point.

Then the file would be fsynced twice: if the fsync is done properly (data 
have already been flushed to disk) then it would not cost much, and doing 
it sometimes twice on some file would not be a big issue. The code could 
also detect such event and log a warning, which would give a hint about 
how often it occurs in practice.

>>> Hm. Is that actually the case for our qsort implementation?
>>
>> I think that it is hard to write a qsort which would fail that. That would
>> mean that it would compare the same items twice, which would be inefficient.
>
> What? The same two elements aren't frequently compared pairwise with 
> each other, but of course an individual element is frequently compared 
> with other elements.

Sure.

> Consider what happens when the chosen pivot element changes its identity 
> after already dividing half. The two partitions will not be divided in 
> any meaning full way anymore. I don't see how this will results in a 
> meaningful sort.

It would be partly meaningful, which is enough for performance, and does 
not matter for correctness: currently buffers are not sorted at all and it 
works, even if it does not work well.

>>> If the pivot element changes its identity won't the result be pretty much
>>> random?
>>
>> That would be a very unlikely event, given the short time spent in
>> qsort.
>
> Meh, we don't want to rely on "likeliness" on such things.

My main argument is that even if it occurs, and the qsort result is partly 
wrong, it does not change correctness, it just mean that the actual writes 
will be less in order than wished. If it occurs, one pivot separation 
would be quite strange, but then others would be right, so the buffers 
would be "partly sorted".

Another issue I see is that even if buffers are locked within cmp, the 
status may change between two cmp... I do not think that locking all 
buffers for sorting them is an option. So on the whole, I think that 
locking buffers for sorting is probably not possible with the simple (and 
efficient) lightweight approach used in the patch.

The good news, as I argued before, is that the order is only advisory to 
help with performance, but the correctness is really that all checkpoint 
buffers are written and fsync is called in the end, and does not depend on 
the buffer order. That is how it currently works anyway.

If you block on this then I'll go for a heavyweight approach, but that would 
be a waste of memory in my opinion, hence my argumentation for the 
lightweight approach.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Ok ok, I stop resisting... I'll have a look.

Here is a v7 a&b version which uses shared memory instead of palloc.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>Hello Andres,
>
>> You can't allocate 4GB with palloc(), it has a builtin limit against
>> allocating more than 1GB.
>
>Argh, too bad, I assumed very naively that palloc was malloc in
>disguise.

It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.

>Then the file would be fsynced twice: if the fsync is done properly
>(data 
>have already been flushed to disk) then it would not cost much, and
>doing 
>it sometimes twice on some file would not be a big issue. The code
>could 
>also detect such event and log a warning, which would give a hint about
>
>how often it occurs in practice.

Right. At the cost of keeping track of all files...



>>>> If the pivot element changes its identity won't the result be
>pretty much
>>>> random?
>>>
>>> That would be a very unlikely event, given the short time spent in
>>> qsort.
>>
>> Meh, we don't want to rely on "likeliness" on such things.
>
>My main argument is that even if it occurs, and the qsort result is
>partly 
>wrong, it does not change correctness, it just mean that the actual
>writes 
>will be less in order than wished. If it occurs, one pivot separation 
>would be quite strange, but then others would be right, so the buffers 
>would be "partly sorted".

It doesn't matter for correctness today, correct. But it makes it impossible to rely on it, too.

>Another issue I see is that even if buffers are locked within cmp, the 
>status may change between two cmp...

Sure. That's not what I'm suggesting. Earlier versions of the patch kept an
array of buffer headers exactly because of that.

>I do not think that locking all
>buffers for sorting them is an option. So on the whole, I think that
>locking buffers for sorting is probably not possible with the simple
>(and
>efficient) lightweight approach used in the patch.

Yes, the other version has a higher space overhead. I'm not convinced that's
meaningful in comparison to shared buffers in space.

And rather doubtful it's a loss performance-wise in a loaded server. All the
buffer headers are touched on other cores and doing the sort with indirection
will greatly increase bus traffic.

>The good news, as I argued before, is that the order is only advisory
>to 
>help with performance, but the correctness is really that all
>checkpoint 
>buffers are written and fsync is called in the end, and does not depend
>on 
>the buffer order. That is how it currently works anyway

It's not particularly desirable to have a performance feature that works less
well if the server is heavily and concurrently loaded. The likelihood of bogus
sort results will increase with the churn rate in shared buffers.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: checkpointer continuous flushing

От
Michael Paquier
Дата:
On Tue, Aug 11, 2015 at 4:28 AM, Andres Freund wrote:
> On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO wrote:
>>> You can't allocate 4GB with palloc(), it has a builtin limit against
>>> allocating more than 1GB.
>>
>>Argh, too bad, I assumed very naively that palloc was malloc in
>>disguise.
>
> It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.

palloc_huge does not exist yet ;)
There is either repalloc_huge or palloc_extended now, though
implementing one would be trivial.
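
For instance (a sketch only), the worst-case array could be allocated past
the 1 GB limit with the MCXT_ALLOC_HUGE flag:

    /* Sketch: bypass the 1 GB palloc() limit for the worst-case array. */
    CheckpointBufferIds = (int *)
        palloc_extended(mul_size(NBuffers, sizeof(int)),
                        MCXT_ALLOC_HUGE);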
-- 
Michael



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> [...] Right. At the cost of keeping track of all files...

Sure. Pg already tracks all files, and probably some more tracking would 
be necessary for an early fsync feature to know which files are already 
fsync'ed and which are not yet fsync'ed.

> Yes, the other version has a higher space overhead.

Yep, this is my concern.

> I'm not convinced that's meaningful in comparison to shared buffers in 
> space. And rather doubtful it's a loss performance-wise in a loaded 
> server. All the buffer headers are touched on other cores and doing the 
> sort with indirection will greatly increase bus traffic.

The measures I collected and reported showed that the sorting time is 
basically insignificant, so bus traffic induced by sorting does not seem 
to be an issue.

> [...] It's not particularly desirable to have a performance feature that 
> works less well if the server is heavily and concurrently loaded. The 
> likelihood of bogus sort results will increase with the churn rate in 
> shared buffers.

Hm.

In conclusion I'm not convinced that it is worth the memory, but I'm also 
tired of arguing, and hopefully nobody else cares about a few more bytes 
per shared_buffers, so why should I care?

Here is a v8, I reduced the memory overhead of the "heavy weight" approach 
from 24 to 16 bytes per buffer, so it is medium weight:-). It might be 
compacted further down to 12 bytes by combining the 2 bits of forkNum 
either with relNode or blockNum, and use a uint64_t comparison field with
all data so that the comparison code would be simpler and faster.
I also fixed the computation of the shmem size which I had not updated
when switching to shmem.

The patches still include the two gucs, but it is easy to remove one or the 
other. They are useful if someone wants to test. The default is on for 
sort, and off for flush. Maybe it should be on for both.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Here is a v8,

I collected a few performance figures with this patch on an old box with 8 
cores, 16 GB, RAID 1 HDD, under Ubuntu precise.
  postgresql.conf:
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB

  init> pgbench -i -s 250
  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled "simple update" test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

    sort/flush : percent of skipped/late transactions
     on   on   :  2.7
     on   off  : 16.2
     off  on   : 68.4
     off  off  : 68.7

  # 200 tps
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

    sort/flush : percent of skipped/late transactions
     on   on   :  2.7
     on   off  :  9.5
     off  on   : 47.4
     off  off  : 48.8

The large "percent of skipped/late transactions" is to be understood as
"fraction of time with postgresql offline because of a write stall".

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

    sort/flush : tps avg & stddev (percent of time below 10.0 tps)
     on   on   : 631 +- 131 (0.1%)
     on   off  : 564 +- 303 (12.0%)
     off  on   : 167 +- 315 (76.8%) # stuck...
     off  off  : 177 +- 305 (71.2%) # ~ current pg

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

    sort/flush : tps avg & stddev (percent of time below 10.0 tps)
     on   on   : 1058 +- 455 (0.1%)
     on   off  : 1056 +- 942 (32.8%)
     off  on   :  170 +- 500 (88.3%) # stuck...
     off  off  :  209 +- 506 (82.0%) # ~ current pg

The combined features provide a tps speedup of 3-5 on these runs, and 
allow some control over write stalls. Flushing is not effective on 
unsorted buffers, at least in these examples.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi Fabien,

On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
> >     sort/flush : tps avg & stddev (percent of time below 10.0 tps)
>      on   on   : 631 +- 131 (0.1%)
>      on   off  : 564 +- 303 (12.0%)
>      off  on   : 167 +- 315 (76.8%) # stuck...
>      off  off  : 177 +- 305 (71.2%) # ~ current pg

What exactly do you mean with 'stuck'?

- Andres



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:
> +void
> +PerformFileFlush(FileFlushContext * context)
> +{
> +    if (context->ncalls != 0)
> +    {
> +        int rc;
> +
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +
> +        /* Linux: tell the memory manager to move these blocks to io so
> +         * that they are considered for being actually written to disk.
> +         */
> +        rc = sync_file_range(context->fd, context->offset, context->nbytes,
> +                             SYNC_FILE_RANGE_WRITE);
> +
> +#elif defined(HAVE_POSIX_FADVISE)
> +
> +        /* Others: say that data should not be kept in memory...
> +         * This is not exactly what we want to say, because we want to write
> +         * the data for durability but we may need it later nevertheless.
> +         * It seems that Linux would free the memory *if* the data has
> +         * already been written to disk, else the "dontneed" call is ignored.
> +         * For FreeBSD this may have the desired effect of moving the
> +         * data to the io layer, although the system does not seem to
> +         * take into account the provided offset & size, so it is rather
> +         * rough...
> +         */
> +        rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +                           POSIX_FADV_DONTNEED);
> +
> +#endif
> +
> +        if (rc < 0)
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not flush block " INT64_FORMAT
> +                            " on " INT64_FORMAT " blocks in file \"%s\": %m",
> +                            context->offset / BLCKSZ,
> +                            context->nbytes / BLCKSZ,
> +                            context->filename)));
> +    }

I'm a bit wary that this might cause significant regressions on
platforms not supporting sync_file_range, but support posix_fadvise()
for workloads that are bigger than shared_buffers. Consider what happens
if the workload does *not* fit into shared_buffers but *does* fit into
the OS's buffer cache. Suddenly reads will go to disk again, no?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>>     sort/flush : tps avg & stddev (percent of time below 10.0 tps)
>>      on   on   : 631 +- 131 (0.1%)
>>      on   off  : 564 +- 303 (12.0%)
>>      off  on   : 167 +- 315 (76.8%) # stuck...
>>      off  off  : 177 +- 305 (71.2%) # ~ current pg
>
> What exactly do you mean with 'stuck'?

I mean that during the I/O storms induced by the checkpoint, pgbench 
sometimes gets stuck, i.e. does not report its progression every second (I 
run with "-P 1"). This occurs when sort is off, either with or without 
flush, for instance an extract from the off/off medium run:
 progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
 progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
 progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
 progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan

There is a 239.4 second gap in pgbench output. This occurs from time to 
time and may represent a significant part of the run, and I count these 
"stuck" times as 0 tps. Sometimes pgbench is stuck performance-wise but 
manages nevertheless to report a "0.0 tps" every second, as above after it 
becomes unstuck.

The actual origin of the issue with a stuck client (pgbench, libpq, OS, 
postgres...) is unclear to me, but the whole system does not behave well 
under an I/O storm anyway, and I have not succeeded in understanding where 
pgbench is stuck when it does not report its progress. I tried some runs 
with gdb but it did not get stuck and reported a lot of "0.0 tps" during 
the storms.


Here are a few more figures with the v8 version of the patch, on a host
with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported
the medium case, and the small case was run afterwards.

  small postgresql.conf:
    shared_buffers = 2GB
    checkpoint_timeout = 300s # this is the default
    checkpoint_completion_target = 0.8
    # initialization: pgbench -i -s 120

  medium postgresql.conf: ## ALREADY REPORTED
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB
    # initialization: pgbench -i -s 250

  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    3.5    2.7
     on   off  :   24.6   16.2
     off  on   :   66.1   68.4
     off  off  :   63.2   68.7

  # 200 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    1.9    2.7
     on   off  :   14.3    9.5
     off  on   :   45.6   47.4
     off  off  :   47.4   48.8

  # 100 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    0.9    1.8
     on   off  :    9.3    7.9
     off  on   :    5.0   13.0
     off  off  :   31.2   31.9

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 564 +- 148 ( 0.1%)   631 +- 131 ( 0.1%)
     on   off  : 470 +- 340 (21.7%)   564 +- 303 (12.0%)
     off  on   : 157 +- 296 (66.2%)   167 +- 315 (76.8%)
     off  off  : 154 +- 251 (61.5%)   177 +- 305 (71.2%)

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 757 +- 417 ( 0.1%)  1058 +- 455 ( 0.1%)
     on   off  : 752 +- 893 (48.4%)  1056 +- 942 (32.8%)
     off  on   : 173 +- 521 (83.0%)   170 +- 500 (88.3%)
     off  off  : 199 +- 512 (82.5%)   209 +- 506 (82.0%)

In all cases, "sort on & flush on" provides the best results, with a tps 
speedup of 3-5, and overall high responsiveness (& lower latency).

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:

<Oops, stalled post, sorry wrong "From", resent..>


Hello Andres,

>> +        rc = posix_fadvise(context->fd, context->offset, [...]
> 
> I'm a bit wary that this might cause significant regressions on
> platforms not supporting sync_file_range, but support posix_fadvise()
> for workloads that are bigger than shared_buffers. Consider what happens
> if the workload does *not* fit into shared_buffers but *does* fit into
> the OS's buffer cache. Suddenly reads will go to disk again, no?

That is an interesting question!

My current thinking is "maybe yes, maybe no":-), as it may depend on the OS 
implementation of posix_fadvise, so it may differ between OS.

This is a reason why I think that flushing should be kept a guc, even if the 
sort guc is removed and always on. The sync_file_range implementation is 
clearly always very beneficial for Linux, and the posix_fadvise may or may 
not induce a good behavior depending on the underlying system.

This is also a reason why the default value for the flush guc is currently 
set to false in the patch. The documentation should advise to turn it on for 
Linux and to test otherwise. Or if Linux is assumed to be often a host, then 
maybe to set the default to on and to suggest that on some systems it may be 
better to have it off. (Another reason to keep it "off" is that I'm not sure 
about what happens with such HD flushing features on virtual servers).

Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host 
and it was as bad as Linux (namely the database and even the box was offline 
for long minutes...), and if you can avoid that having to read back some data 
may be not that bad a down payment.

The issue is largely mitigated if the data is not removed from 
shared_buffers, because the OS buffer is just a copy of already held data. 
What I would do on such systems is to increase shared_buffers and keep 
flushing on, that is to count less on the system cache and more on postgres 
own cache.

Overall, I'm not convinced that the practice of relying on the OS cache is a 
good one, given what it does with it, at least on Linux.

Now, if someone could provide a dedicated box with posix_fadvise (say 
FreeBSD, maybe others...) for testing that would allow to provide data 
instead of speculating... and then maybe to decide to change its default 
value.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-17 15:21:22 +0200, Fabien COELHO wrote:
> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
> implementation of posix_fadvise, so it may differ between OS.

As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.

> This is a reason why I think that flushing should be kept a guc, even if the
> sort guc is removed and always on. The sync_file_range implementation is
> clearly always very beneficial for Linux, and the posix_fadvise may or may
> not induce a good behavior depending on the underlying system.

That's certainly an argument.

> This is also a reason why the default value for the flush guc is currently
> set to false in the patch. The documentation should advise to turn it on for
> Linux and to test otherwise. Or if Linux is assumed to be often a host, then
> maybe to set the default to on and to suggest that on some systems it may be
> better to have it off. 

I'd say it should then be an os-specific default. No point in making
people work for it needlessly on linux and/or elsewhere.

> (Another reason to keep it "off" is that I'm not sure about what
> happens with such HD flushing features on virtual servers).

I don't see how that matters? Either the host will entirely ignore
flushing, and thus the sync_file_range and the fsync won't cost much, or
fsync will be honored, in which case the pre-flushing is helpful.


> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
> and it was as bad as Linux (namely the database and even the box was offline
> for long minutes...), and if you can avoid that having to read back some
> data may be not that bad a down payment.

I don't see how that'd alleviate my fear. Sure, the latency for many
workloads will be better, but I don't see how that argument says anything
about the reads? And we'll not just use this in cases where it'd be
beneficial...

> The issue is largely mitigated if the data is not removed from
> shared_buffers, because the OS buffer is just a copy of already hold data.
> What I would do on such systems is to increase shared_buffers and keep
> flushing on, that is to count less on the system cache and more on postgres
> own cache.

That doesn't work that well for a bunch of reasons. For one it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* it be available should a query or
another program need lots of memory.

> Overall, I'm not convince that the practice of relying on the OS cache is a
> good one, given what it does with it, at least on Linux.

The alternatives aren't super realistic near-term though. Using direct
IO efficiently on the set of operating systems we support is
*hard*. It's more or less trivial to hack pg up to use direct IO for
relations/shared_buffers, but it'll perform utterly horribly in many
many cases.

To pick one thing out: Without the OS buffering writes any write will
have to wait for the disks, instead of being asynchronous. That'll make
writes performed by backends a massive bottleneck.

> Now, if someone could provide a dedicated box with posix_fadvise (say
> FreeBSD, maybe others...) for testing that would allow to provide data
> instead of speculating... and then maybe to decide to change its default
> value.

Testing, as an approximation, how it turns out to work on linux would be
a good step.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>>> [...] posix_fadvise().
>>
>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>> implementation of posix_fadvise, so it may differ between OS.
>
> As long as fadvise has no 'undirty' option, I don't see how that
> problem goes away. You're telling the OS to throw the buffer away, so
> unless it ignores it that'll have consequences when you read the page
> back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out 
*but* keep them nevertheless. As the buffer is kept, the OS page is a 
duplicate, and freeing it should not harm, at least immediately.

The situation is different if the memory is reused in between, which is 
the work of the bgwriter I think, based on LRU/LFU heuristics, but such 
writes are not flushed by the current patch.

Now, if a buffer was recently updated it should not be selected by the 
bgwriter, if the LRU/LFU heuristics work as expected, which mitigates the 
issue somewhat...

To sum up, I agree that it is indeed possible that flushing with 
posix_fadvise could reduce read OS-memory hits on some systems for some 
workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm 
fine with that.

> [...] I'd say it should then be an os-specific default. No point in 
> making people work for it needlessly on linux and/or elsewhere.

Ok. Version 9 attached does that, "on" for Linux, "off" for others because 
of the potential issues you mentioned.
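
One simple way to express such an OS-dependent default (a sketch only; the
macro name is illustrative, not necessarily what v9 uses):

    /*
     * Default the flush GUC to on where sync_file_range() is available
     * (Linux), off where only posix_fadvise() is.
     */
    #if defined(HAVE_SYNC_FILE_RANGE)
    #define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
    #else
    #define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
    #endif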

>> (Another reason to keep it "off" is that I'm not sure about what
>> happens with such HD flushing features on virtual servers).
>
> I don't see how that matters? Either the host will entirely ignore
> flushing, and thus the sync_file_range and the fsync won't cost much, or
> fsync will be honored, in which case the pre-flushing is helpful.

Possibly. I know that I do not know:-)  The distance between the database 
and real hardware is so great in VM, that I think that it may have any 
effect, including good, bad or none:-)

>> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
>> and it was as bad as Linux (namely the database and even the box was offline
>> for long minutes...), and if you can avoid that having to read back some
>> data may be not that bad a down payment.
>
> I don't see how that'd alleviate my fear.

I'm trying to mitigate your fears, not to alleviate them:-)

> Sure, the latency for many workloads will be better, but I don't how 
> that argument says anything about the reads?

It just says that there may be a compromise, better in some cases, possibly 
not so in others, because posix_fadvise does not really say what the 
database would like to say to the OS; this is why I wrote such a large 
comment about it in the source file in the first place.

> And we'll not just use this in cases it'd be beneficial...

I'm fine if it is off by default for some systems. If people want to avoid 
write stalls they can use the option, but it may have an adverse effect on 
the tps in some cases; that's life? Not using the option also has adverse 
effects in some cases, because you have write stalls... and currently you 
do not have the choice, so it would be progress.

>> The issue is largely mitigated if the data is not removed from
>> shared_buffers, because the OS buffer is just a copy of already hold data.
>> What I would do on such systems is to increase shared_buffers and keep
>> flushing on, that is to count less on the system cache and more on postgres
>> own cache.
>
> That doesn't work that well for a bunch of reasons. For one it's
> completely non-adaptive. With the OS's page cache you can rely on free
> memory being used for caching *and* it be available should a query or
> another program need lots of memory.

Yep. I was thinking about a dedicated database server, not a shared one.

>> Overall, I'm not convince that the practice of relying on the OS cache is a
>> good one, given what it does with it, at least on Linux.
>
> The alternatives aren't super realistic near-term though. Using direct
> IO efficiently on the set of operating systems we support is
> *hard*. [...]

Sure.  This is not necessarily what I had in mind.

Currently pg "write"s stuff to the OS, and then suddenly calls "fsync" out 
of the blue, hoping that in between the OS will actually have done a good 
job with the underlying hardware.  This is pretty naive, the fsync 
generates write storms, and the database is offline: trying to improve 
these things is the motivation for this patch.

Now if you think of the bgwriter, it does pretty much the same, and 
probably may generate plenty of random I/Os, because the underlying 
LRU/LFU heuristics used to select buffers does not care about the file 
structures.

So I think that to get good performance the database must take some 
control over the OS. That does not mean that direct I/O needs to be 
involved, although maybe it could, but this patch shows that it is not 
needed to improve things.

>> Now, if someone could provide a dedicated box with posix_fadvise (say
>> FreeBSD, maybe others...) for testing that would allow to provide data
>> instead of speculating... and then maybe to decide to change its default
>> value.
>
> Testing, as an approximation, how it turns out to work on linux would be
> a good step.

Do you mean testing with posix_fadvise on Linux?

I did think about it, but the documented behavior of this call on Linux is 
disappointing: if the buffer has been written to disk, it is freed by the 
OS. If not, nothing is done. Given that the flush is called pretty close 
after writes, mostly the buffer will not have been written to disk yet, 
and the call would just be a no-op... So I concluded that there is no 
point in trying that on Linux because it will have no effect other than 
losing some time, IMO.

Really, a useful test would be FreeBSD, where posix_fadvise does move 
things to disk, although the actual offsets & length are ignored, but I do 
not think that it would be a problem. I do not know about other systems 
and what they do with posix_fadvise.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Tue, Aug 18, 2015 at 1:02 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Andres,

[...] posix_fadvise().

My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
implementation of posix_fadvise, so it may differ between OS.

As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out *but* keep them nevertheless. As the buffer is kept, the OS page is a duplicate, and freeing it should not harm, at least immediately.


This theory could make sense if we can predict in some way that
the data we are flushing out of the OS cache won't be needed soon.
After the flush, we can only rely on the data still being found in
shared_buffers to the extent that its usage_count is high; otherwise it
could be replaced at any moment by a backend needing the buffer when there
is no free buffer.  Now here one way to think is that if the usage_count is
low, then anyway it's okay to assume that this won't be needed in the near
future; however I don't think relying only on usage_count for such a thing
is a good idea.

To sum up, I agree that it is indeed possible that flushing with posix_fadvise could reduce read OS-memory hits on some systems for some workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm fine with that.


One point to think about here is on what basis a user can decide to turn
this option on; is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.

In general, providing an option is a good idea if the user can decide with
ease when to use that option, or if we can give some clear recommendation
for it; otherwise one has to recommend "test your workload
with this option, and if it works then great, else don't use it", which might also
be okay in some cases, but it is better to be clear.


One minor point: while glancing through the patch, I noticed that a couple
of multiline comments are not written in the style which is usually used
in the code (keep the first line empty).

+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out

+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */
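
For reference, the usual style keeps the first comment line empty, e.g.:

    /*
     * Status of buffers to checkpoint for a particular tablespace,
     * used internally in BufferSync.
     */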



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> So the option is best kept as "off" for now, without further data, I'm
>> fine with that.
>
> One point to think here is on what basis user can decide make
> this option on, is it predictable in any way?
> I think one case could be when the data set fits in shared_buffers.

Yep.

> In general, providing an option is a good idea if user can decide with 
> ease when to use that option or we can give some clear recommendation 
> for the same otherwise one has to recommend that test your workload with 
> this option and if it works then great else don't use it which might 
> also be okay in some cases, but it is better to be clear.

My opinion, which is not backed by any data (anyone can feel free to 
provide a FreeBSD box for testing...) is that if you have a significant 
write load it would mostly be an improvement to have the flush option 
on when running on non-Linux systems which provide posix_fadvise.

If you have a lot of reads and few writes, then postgresql currently works 
reasonably enough, which is why people do not complain too much about 
write stalls, and I expect that the situation would not be significantly 
degraded.

Now there are competing positive and negative effects induced by using 
posix_fadvise, and moreover its implementation varies from OS to OS, so 
without running some experiments it is hard to be definite.

> One minor point, while glancing through the patch, I noticed that couple
> of multiline comments are not written in the way which is usually used
> in code (Keep the first line as empty).

Indeed.

Please find attached a v10, where I have reviewed comments for style & 
contents, and also slightly extended the documentation about the flush 
option to hint that it is essentially useful for high write loads. Without 
further data, I think it is not obvious how to give more definite advice.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Tue, Aug 18, 2015 at 12:38 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

So the option is best kept as "off" for now, without further data, I'm
fine with that.

One point to think here is on what basis user can decide make
this option on, is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.

Yep.

In general, providing an option is a good idea if user can decide with ease when to use that option or we can give some clear recommendation for the same otherwise one has to recommend that test your workload with this option and if it works then great else don't use it which might also be okay in some cases, but it is better to be clear.

My opinion, which is not backed by any data (anyone can feel free to provide a FreeBSD box for testing...) is that it would mostly be an improvement if you have a significant write load to have the flush option on when running on non-Linux systems which provide posix_fadvise.

If you have a lot of reads and few writes, then postgresql currently works reasonably enough, which is why people do not complain too much about write stalls, and I expect that the situation would not be significantly degraded.

Now there are competing positive and negative effects induced by using posix_fadvise, and moreover its implementation varries from OS to OS, so without running some experiments it is hard to be definite.


Sure, I think what can help here is a test case or cases (in the form of a script
file or some other form, to test this behaviour of the patch) which you can write
and post here, so that others can use that to get the data and share it.
Of course, that is not mandatory to proceed with this patch, but it can still
help you to prove your point, as you might not have access to different
kinds of systems to run the tests.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Sure, I think what can help here is a testcase/'s (in form of script file
> or some other form, to test this behaviour of patch) which you can write
> and post here, so that others can use that to get the data and share it.

Sure... note that I already did that on this thread, without any echo... 
but I can do it again...

Tests should be run on a dedicated host. If it has n cores, I suggest to 
share them between postgres checkpointer & workers and pgbench threads so 
as to avoid thread competition to use cores. With 8 cores I used up to 2 
threads & 4 clients, so that there is 2 core left for the checkpointer and 
other stuff (i.e. I also run iotop & htop in parallel...). Although it may 
seem conservative to do so, I think that the point of the test is to 
exercise checkpoints and not to test the process scheduler of the OS.

Here are the latest versions of my test scripts:
 (1) cp_test.sh <name> <test>

Run "test" with setup "name". Currently it runs 4000 seconds pgbench with 
the 4 possible on/off combinations for sorting & flushing, after some 
warmup. The 4000 second is chosen so that there are a few checkpoint 
cycles. For larger checkpoint times, I suggest to extend the run time to 
see at least 3 checkpoints during the run.

More test settings can be added to the 2 "case"s. Postgres settings,
especially shared_buffers, should be set to a pertinent value wrt the 
memory of the test host.

The test runs with the postgres version found in the PATH, so ensure that the 
right version is found!
 (2) cp_test_count.py one-test-output.log

For rate limited runs, look at the final figures and compute the number of 
late & skipped transactions. This can also be done by hand.
 (3) avg.py

For full speed runs, compute stats about per second tps:
  sh> grep 'progress:' one-test-output.log | cut -d' ' -f4 | \
        ./avg.py --limit=10 --length=4000
  warning: 633 missing data, extending with zeros
  avg over 4000: 199.290575 ± 512.114070 [0.000000, 0.000000, 4.000000, 5.000000, 2280.900000]
  percent of values below 10.0: 82.5%

The figures I reported are the 199 (average tps), 512 (standard deviation 
on per second figures), 82.5% (percent of time below 10 tps, aka postgres 
is basically unresponsive). In brackets, the min, q1, median, q3 and max tps 
seen in the run.

> Ofcourse, that is not mandatory to proceed with this patch, but still can
> help you to prove your point as you might not have access to different
> kind of systems to run the tests.

I agree that more tests would be useful to decide which default value for 
the flushing option is the better. For Linux, all tests so far suggest 
"on" is the best choice, but for other systems that use posix_fadvise, it 
is really an open question.

Another option would be to give me a temporary access for some available 
host, I'm used to running these tests...

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Wed, Aug 19, 2015 at 12:13 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Sure, I think what can help here is a testcase/'s (in form of script file
or some other form, to test this behaviour of patch) which you can write
and post here, so that others can use that to get the data and share it.

Sure... note that I already did that on this thread, without any echo... but I can do it again...

 
Thanks.

I have tried your scripts and found some problem while using avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py --limit=10 --length=300
: No such file or directory

I didn't get chance to poke into avg.py script (the command without
avg.py works fine). Python version on the m/c, I planned to test is
Python 2.7.5.

Today while reading the first patch (checkpoint-continuous-flush-10-a),
I have given some thought to below part of patch which I would like
to share with you.

+static int
+NextBufferToWrite(
+    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+    int *pspace, int num_to_write, int num_written)
+{
+    int space = *pspace, buf_id = -1, index;
+
+    /*
+     * Select a tablespace depending on the current overall progress.
+     *
+     * The progress ratio of each unfinished tablespace is compared to
+     * the overall progress ratio to find one which is not in advance
+     * (i.e. overall ratio > tablespace ratio,
+     *  i.e. tablespace written/to_write > overall written/to_write
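
As I read it, the balancing idea is roughly the following (a self-contained
sketch with illustrative names, not the patch's actual code): pick an
unfinished tablespace whose own progress is not ahead of the overall
progress.

    #include <stdint.h>

    typedef struct
    {
        int num_to_write;   /* checkpoint pages counted for this tablespace */
        int num_written;    /* pages written so far for this tablespace */
    } SpaceProgress;

    /*
     * Return the index of an unfinished tablespace whose progress ratio is
     * not ahead of the overall progress ratio, or -1 if all are finished.
     */
    static int
    pick_tablespace(const SpaceProgress *spc, int nspaces,
                    int overall_to_write, int overall_written)
    {
        int i;

        for (i = 0; i < nspaces; i++)
        {
            if (spc[i].num_written >= spc[i].num_to_write)
                continue;       /* this tablespace is finished */

            /*
             * spc[i] ratio <= overall ratio, compared by cross-multiplication
             * to stay in integer arithmetic.
             */
            if ((uint64_t) spc[i].num_written * overall_to_write <=
                (uint64_t) overall_written * spc[i].num_to_write)
                return i;
        }
        return -1;
    }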



Here, I think the above calculation can go for a toss if a backend or the bgwriter
starts writing buffers while the checkpoint is in progress.  The tablespace
written parameter won't be able to account for the ones written by backends
or the bgwriter.  Now it may not be a big thing to worry about, but I find Heikki's
version worth considering: he has not changed the overall idea of this patch, but
the calculations are somewhat simpler and hence there is less chance of going
wrong.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

> I have tried your scripts and found some problem while using avg.py
> script.
> grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
> --limit=10 --length=300
> : No such file or directory

> I didn't get chance to poke into avg.py script (the command without
> avg.py works fine). Python version on the m/c, I planned to test is
> Python 2.7.5.

Strange... What does "/usr/bin/env python" say? Can the script be started 
on its own at all? I think that the script should work both with python2 
and python3, at least it does on my laptop...

> Today while reading the first patch (checkpoint-continuous-flush-10-a),
> I have given some thought to below part of patch which I would like
> to share with you.
>
> + * Select a tablespace depending on the current overall progress.
> + *
> + * The progress ratio of each unfinished tablespace is compared to
> + * the overall progress ratio to find one with is not in advance
> + * (i.e. overall ratio > tablespace ratio,
> + *  i.e. tablespace written/to_write > overall written/to_write

> Here, I think above calculation can go for toss if backend or bgwriter
> starts writing buffers when checkpoint is in progress.  The tablespace
> written parameter won't be able to consider the one's written by backends
> or bgwriter.

Sure... This is *already* the case with the current checkpointer: the 
schedule is performed with respect to the initial number of buffers it 
thinks it will have to write, and if someone else writes these buffers then 
the schedule is skewed a little bit, or more... I have not changed this 
logic, but I extended it to handle several tablespaces.

If this (the checkpointer progress evaluation used for its schedule is 
sometimes wrong because of other writes) is proven to be a major 
performance issue, then the processes which writes the checkpointed 
buffers behind its back should tell the checkpointer about it, probably 
with some shared data structure, so that the checkpointer can adapt its 
schedule.

This is an independent issue that may be worth addressing some day. My 
opinion is that when the bgwriter or backends kick in to write buffers, 
they are basically generating random I/Os on HDDs and killing tps and 
latency, so it is a very bad time anyway; thus I'm not sure that this is 
the next problem to address to improve pg performance and responsiveness.

> Now it may not big thing to worry but I find Heikki's version worth 
> considering, he has not changed the overall idea of this patch, but the 
> calculations are somewhat simpler and hence less chance of going wrong.

I do not think that Heikki's version worked wrt balancing writes over 
tablespaces, and I'm not sure it worked at all. However, I reused some of 
his ideas to simplify and improve the code.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Sun, Aug 23, 2015 at 12:33 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

I have tried your scripts and found some problem while using avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
--limit=10 --length=300
: No such file or directory

I didn't get chance to poke into avg.py script (the command without
avg.py works fine). Python version on the m/c, I planned to test is
Python 2.7.5.

Strange... What does "/usr/bin/env python" say?

Python 2.7.5 (default, Apr  9 2015, 11:07:29) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
 
Can the script be started on its own at all?

I have tried like below, which results in the same error; I also tried a
few other variations but could not succeed.
./avg.py 
: No such file or directory
 

Here, I think above calculation can go for toss if backend or bgwriter
starts writing buffers when checkpoint is in progress.  The tablespace
written parameter won't be able to consider the one's written by backends
or bgwriter.

Sure... This is *already* the case with the current checkpointer, the schedule is performed with respect to the initial number of buffers it think it will have to write, and if someone else writes these buffers then the schedule is skewed a little bit, or more... I have not changed this logic, but I extended it to handle several tablespaces.


I don't know how good or bad it is to build further on somewhat skewed
logic, but the point is that unless it is required, why use it?
 
I do not think that Heikki version worked wrt to balancing writes over tablespaces,

I also think that it doesn't balance over tablespaces, but the question
is why we need to balance over tablespaces at all: can we reliably
predict in some way that balancing over tablespaces will help the
workload?  I think here we are doing more engineering than is required
for this patch.  
 
and I'm not sure it worked at all.

Okay, his version might have some bugs, but then those could be
fixed as well.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> Can the script be started on its own at all?
>
> I have tried like below which results in same error, also I tried few
> other variations but could not succeed.
> ./avg.py

Hmmm... Ensure that the script is readable and executable:
  sh> chmod a+rx ./avg.py

Also check the file:
  sh> file ./avg.py
  ./avg.py: Python script, UTF-8 Unicode text executable

>> Sure... This is *already* the case with the current checkpointer, the
>> schedule is performed with respect to the initial number of buffers it
>> think it will have to write, and if someone else writes these buffers then
>> the schedule is skewed a little bit, or more... I have not changed this
>
> I don't know how good or bad it is to build  further on somewhat skewed
> logic,

The logic is no more skewed than it is with the current version: your 
remark about the estimation, which may be wrong in some cases, is clearly 
valid, but it is orthogonal (independent, unrelated, different) to what is 
addressed by this patch.

I currently have no reason to believe that the issue you raise is a major 
performance issue, but if so it may be addressed by another patch by 
whoever want to do so.

What I have done is to demonstrate that generating a lot of random I/Os is 
a major performance issue (well, sure), and this patch addresses this 
point and provides major speedups (x3-5) and latency reductions (from over 
60% unavailability to nearly full availability) for high OLTP write loads, 
by reordering and flushing checkpoint buffers in a sensible way.

> but the point is that unless it is required why to use it.

This is really required to avoid predictable performance regressions, see 
below.

>> I do not think that Heikki version worked wrt to balancing writes over
>> tablespaces,
>
> I also think that it doesn't balances over tablespaces, but the question 
> is why do we need to balance over tablespaces, can we reliably predict 
> in someway which indicates that performing balancing over tablespace can 
> help the workload.

The reason for the tablespace balancing is that in the current postgres 
buffers are written more or less randomly, so it is (probably) implicitly 
and statistically balanced over tablespaces because of this randomness, 
and indeed, AFAIK, people with multi-tablespace setups have not complained 
that postgres was using the disks sequentially.

However, once the buffers are sorted per file, the order becomes 
deterministic and there is no more implicit balancing, which means that if 
someone has a pg setup with several disks it will write sequentially on 
these instead of in parallel.

This regression was pointed out by Andres Freund, I agree that such a 
regression for high end systems must be avoided, hence the tablespace 
balancing.

> I think here we are doing more engineering than required for this patch.

I do not think so, I think that Andres remark is justified to avoid a 
performance regression on high end systems which use tablespaces, which is 
really undesirable.

About the balancing code, it is not that difficult, even if it is not 
trivial: the point is to select the tablespace for which the progress 
ratio (written/to_write) is below the overall progress ratio, so that it 
catches up, and to do so in a round-robin manner, so that all tablespaces 
get to write things. I have also both written a proof of and tested the 
logic (in a separate script).
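
To make the idea concrete, here is a minimal sketch of the selection
(illustrative only; the struct and function names are made up, this is
not the patch code):

typedef struct
{
	int		written;	/* buffers already written in this tablespace */
	int		to_write;	/* buffers this tablespace has to write in total */
} SpcProgress;

/*
 * Round-robin over the tablespaces, starting after the last one picked,
 * and return the first unfinished one whose own progress ratio does not
 * exceed the overall progress ratio, so that it catches up.
 * Ratios are compared without divisions: a/b <= c/d  <=>  a*d <= c*b.
 */
static int
pick_lagging_tablespace(SpcProgress *spc, int nspc, int last,
						int overall_written, int overall_to_write)
{
	for (int i = 1; i <= nspc; i++)
	{
		int		s = (last + i) % nspc;

		if (spc[s].written >= spc[s].to_write)
			continue;			/* this tablespace is already finished */

		if ((double) spc[s].written * overall_to_write <=
			(double) overall_written * spc[s].to_write)
			return s;			/* lagging or on schedule: write from it */
	}
	return -1;					/* all tablespaces are done */
}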

-- 
Fabien.



Re: checkpointer continuous flushing

От
Michael Paquier
Дата:
On Mon, Aug 24, 2015 at 4:15 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> [stuff]

Moved to next CF 2015-09.
-- 
Michael



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-18 09:08:43 +0200, Fabien COELHO wrote:
> Please find attached a v10, where I have reviewed comments for style &
> contents, and also slightly extended the documentation about the flush
> option to hint that it is essentially useful for high write loads. Without
> further data, I think it is not obvious to give more definite advices.

v10b misses the checkpoint_sort part of the patch, and thus cannot be applied.

Andres



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> Please find attached a v10, where I have reviewed comments for style & 
>> contents, and also slightly extended the documentation about the flush 
>> option to hint that it is essentially useful for high write loads. 
>> Without further data, I think it is not obvious to give more definite 
>> advices.
>
> v10b misses the checkpoint_sort part of the patch, and thus cannot be 
> applied.

Yes, indeed, the second part is expected to be applied on top of v10a.

Please find attached the cumulated version (v10a + v10b).

-- 
Fabien.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-08-27 14:32:39 +0200, Fabien COELHO wrote:
> >v10b misses the checkpoint_sort part of the patch, and thus cannot be
> >applied.
> 
> Yes, indeed, the second part is expected to be applied on top of v10a.

Oh, sorry. I'd somehow assumed they were two variants of the same patch
(one with "slim" sorting and the other without).



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>> v10b misses the checkpoint_sort part of the patch, and thus cannot be
>>> applied.
>>
>> Yes, indeed, the second part is expected to be applied on top of v10a.
>
> Oh, sorry. I'd somehow assumed they were two variants of the same patch
> (one with "slim" sorting and the other without).

The idea is that these two features could be committed separately. 
However, experiments show that flushing is really efficient when sorting 
is done first, and moreover the two features conflict, so I've made two 
dependent patches.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Mon, Aug 24, 2015 at 12:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Also check the file:
>
>   sh> file ./avg.py
>   ./avg.py: Python script, UTF-8 Unicode text executable
>

There were some CRLF line terminators; after removing those, it worked
fine. Here are the results of some of the tests done for the sorting patch
(checkpoint-continuous-flush-10-a):

Config Used
----------------------
M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB


Test details
------------------
warmup=60
scale=300
max_connections=150
shared_buffers=8GB
checkpoint_timeout=2min
time=7200
synchronous_commit=on
max_wal_size=5GB

parallelism - 128 clients, 128 threads

Sort - off
avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000, 10975.500000, 13105.950000, 21729.000000]
percent of values below 10.0: 19.5%

Sort - on
avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]
percent of values below 10.0: 18.6%

Before going to conclusion, let me try to explain above data (I am
explaining again even though Fabien has explained, to make it clear
if someone has not read his mail)

Let's try to understand with data for sorting - off option

avg over 7200: 8256.382528 ± 6218.769282

8256.382528 - average tps for 7200s pgbench run 
6218.769282 - standard deviation on per second figures

[0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]

These 5 values can be read as the minimum TPS, q1, median TPS, q3 and
maximum TPS over the 7200s pgbench run.  As far as I understand, q1 and
q3 are the first and third quartiles (medians of the lower and upper
halves of the values), which I didn't focus on much.

percent of values below 10.0: 19.5%

Above means percent of time the result is below 10 tps.

Now about the test results: these tests were done for pgbench full speed
runs, and the above results indicate that with sorting on there is
approximately a 1.5% improvement in average TPS and a ~1% improvement in
the fraction of tps values below 10, while there is almost no improvement
in the median or maximum TPS values; instead they are slightly lower when
sorting is on, which could be due to run-to-run variation.

I have done more tests as well, varying the duration and number of clients
while keeping the other configuration the same as above, but the results
are quite similar.

The results of the sorting patch for the tests done indicate that the win
is not big enough with just doing sorting during checkpoints; we should
consider the flush patch along with sorting.  I would like to perform some
tests with both patches together (sort + flush), unless somebody else
thinks that the sorting patch alone is beneficial and that we should test
some other kind of scenarios to see its benefit.

>
> The reason for the tablespace balancing is that in the current postgres buffers are written more or less randomly, so it is (probably) implicitely and statistically balanced over tablespaces because of this randomness, and indeed, AFAIK, people with multi tablespace setup have not complained that postgres was using the disks sequentially.
>
> However, once the buffers are sorted per file, the order becomes deterministic and there is no more implicit balancing, which means that if someone has a pg setup with several disks it will write sequentially on these instead of in parallel.
>

What if tablespaces are not on separate disks, or there is not enough
hardware support to make writes parallel?  I think for such cases it might
be better to do it sequentially.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB

Wow! Thanks for trying the patch on such high-end hardware!

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

What is the OS? The FS?

> warmup=60

Quite short, but probably ok.

> scale=300

Means about 4-4.5 GB base.

> time=7200
> synchronous_commit=on

> shared_buffers=8GB

This is small wrt hardware, but given the scale setup I think that it 
should not matter much.

> max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

> checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

You did not update checkpoint_completion_target, which means 0.5, so the 
checkpoint is scheduled to run in at most 1 minute, which suggests at 
least 130 MB/s write performance for the checkpoint.

> parallelism - 128 clients, 128 threads

Given 192 hw threads, I would have tried using 128 clients & 64 threads, so 
that each pgbench client has its own dedicated postgres in a thread, and 
postgres processes are not competing with pgbench. Now, as pgbench is 
mostly sleeping, that probably does not matter much... I may also be 
totally wrong:-)

> Sort - off
> avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000,
> 10975.500000, 13105.950000, 21729.000000]
> percent of values below 10.0: 19.5%

The max performance is consistent with 128 threads * 200 (random) writes 
per second.

> Sort - on
> avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000,
> 10946.000000, 13084.000000, 20289.900000]
> percent of values below 10.0: 18.6%

This is really a small improvement, probably within the error interval of 
the measurement. I would not put much trust in 1.5% tps or 0.9% 
availability improvements.

I think that we could conclude that on your (great) setup, with these 
configuration parameters, this patch does not harm performance. This is a 
good thing, even if I would have hoped to see better performance.

> Before going to conclusion, let me try to explain above data (I am
> explaining again even though Fabien has explained, to make it clear
> if someone has not read his mail)
>
> Let's try to understand with data for sorting - off option
>
> avg over 7200: 8256.382528 ± 6218.769282
>
> 8256.382528 - average tps for 7200s pgbench run
> 6218.769282 - standard deviation on per second figures
>
> [0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]
>
> These 5 values can be read as minimum TPS, q1, median TPS, q3,
> maximum TPS over 7200s pgbench run.  As far as I understand q1
> and q3 median of subset of values which I didn't focussed much.

q1 = 84 means that 25% of the time the performance was below 84 tps, about 
1% of the average performance, which I would translate as "pg is pretty 
unresponsive 25% of the time".

This is the kind of issue I really want to address, the eventual tps 
improvements are just a side effect.

> percent of values below 10.0: 19.5%
>
> Above means percent of time the result is below 10 tps.

Which means "postgres is really unresponsive 19.5% of the time".

If you count zeros, you will get "postgres was totally unresponsive X% of 
the time".

> Now about test results, these tests are done for pgbench full speed runs
> and the above results indicate that there is approximately 1.5%
> improvement in avg. TPS and ~1% improvement in tps values which are
> below 10 with sorting on and there is almost no improvement in median or
> maximum TPS values, instead they or slightly less when sorting is
> on which could be due to run-to-run variation.

Yes, I agree.

> I have done more tests as well by varying time and number of clients
> keeping other configuration same as above, but the results are quite
> similar.

Given the hardware, I would suggest to raise checkpoint_timeout, 
shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8. 
I would expect that it should improve performance both with and without 
sorting.

It would be interesting to have informations from checkpoint logs 
(especially how many buffers written in how long, whether checkpoints are 
time or xlog triggered, ...).

> The results of sorting patch for the tests done indicate that the win is 
> not big enough with just doing sorting during checkpoints,

ISTM that you do too much generalization: the win is not big "under this 
configuration and hardware".

I think that the patch may have very small influence under some 
conditions, but should not degrade performance significantly, and on the 
other hand it should provide great improvements under some (other) 
conditions.

So having no performance degradation is a good result, even if I would 
have hoped for better results.  It would be interesting to understand why 
random disk writes do not perform too poorly on this box: size of the I/O 
queue, kind of (expensive:-) disks, write caches, file system, RAID 
level...

> we should consider flush patch along with sorting.

I also think that it would be interesting.

> I would like to perform some tests with both the patches together (sort 
> + flush) unless somebody else thinks that sorting patch alone is 
> beneficial and we should test some other kind of scenarios to see it's 
> benefit.

Yep. Is it a Linux box? If not, does it support posix_fadvise()?

>> The reason for the tablespace balancing is [...]
>
> What if tablespaces are not on separate disks

I would expect that it might very slightly degrade performance, but only 
marginally.

> or not enough hardware support to make Writes parallel?

I'm not sure that balancing or not writes over tablespaces would change 
anything to an I/O bottleneck which is not the disk write performance, so 
I would say "no impact" in that case.

> I think for such cases it might be better to do it sequentially.

Writing sequentially to different disks would be a bug, and would degrade 
performance significantly on a setup with several disks, up to dividing 
the performance by the number of disks... so I do not think that a patch 
which predictably and significantly degrades performance on high-end 
hardware is a reasonable option.

If you want to be able to deactivate balancing, it could be done with a 
guc, but I cannot see good reasons to want to do that: it would complicate 
the code, and it does not make much sense to use many tablespaces on one 
disk, while anyone who uses several tablespaces on several disks is 
probably expecting to see her expensive disks actually used in parallel.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Mon, Aug 31, 2015 at 12:40 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>> IBM POWER-8 24 cores, 192 hardware threads
>> RAM = 492GB
>
>
> Wow! Thanks for trying the patch on such high-end hardware!
>
> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>

Speed of Reads -
Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Copy speed - 

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s


> What is the OS? The FS?
>

OS info -
Linux <m/c addr> 3.10.0-123.1.2.el7.ppc64 #1 SMP Wed Jun 4 15:23:17 EDT 2014 ppc64 ppc64 ppc64 GNU/Linux

FS - ext4


>> shared_buffers=8GB
>
>
> This is small wrt hardware, but given the scale setup I think that it should not matter much.
>

Yes, I was testing the case for Read-Write transactions when all the data
fits in shared_buffers, so this is okay.

>> max_wal_size=5GB
>
>
> Hmmm... Maybe quite small given the average performance?
>

We can check with a larger value, but do you expect different
results, and why?

>> checkpoint_timeout=2min
>
>
> This seems rather small. Are the checkpoints xlog or time triggered?
>

I wanted to test by triggering more checkpoints, but I can test with a
larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?


> You did not update checkpoint_completion_target, which means 0.5 so that the checkpoint is scheduled to run in at most 1 minute, which suggest at least 130 MB/s write performance for the checkpoint.
>

The value used in your script was 0.8 for checkpoint_completion_target
which I have not changed during tests.

>> parallelism - 128 clients, 128 threads
>
>
> Given 192 hw threads, I would have tried used 128 clients & 64 threads, so that each pgbench client has its own dedicated postgres in a thread, and that postgres processes are not competing with pgbench. Now as pgbench is mostly sleeping, probably that does not matter much... I may also be totally wrong:-)
>

In the next run, I can use 64 threads; let's settle on the other parameters
first, for which you expect there could be a clear win with the first patch.

>
>
> Given the hardware, I would suggest to raise checkpoint_timeout, shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8. I would expect that it should improve performance both with and without sorting.
>

I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for scale factor 300 data, and checkpoint_completion_target
is already 0.8 in my previous tests.  Let's try with checkpoint_timeout = 10 min
and max_wal_size = 15GB; do you have any other suggestion?

> It would be interesting to have informations from checkpoint logs (especially how many buffers written in how long, whether checkpoints are time or xlog triggered, ...).
>
>> The results of sorting patch for the tests done indicate that the win is not big enough with just doing sorting during checkpoints,
>
>
> ISTM that you do too much generalization: The win is not big "under this configuration and harware".
>

Hmm.. nothing like that, this was based on a couple of tests done by
me, and I am open to doing some more if you or anybody else feels that the
first patch (checkpoint-continuous-flush-10-a) alone can give a benefit;
in fact I started these tests with the intention of seeing whether the
first patch gives a benefit, so that it could be evaluated and eventually
committed separately.

> I think that the patch may have very small influence under some conditions, but should not degrade performance significantly, and on the other hand it should provide great improvements under some (other) conditions.
>

True, let us try to find conditions/scenarios where you think it can give
a big boost; suggestions are welcome.

>>
>> What if tablespaces are not on separate disks
>
>
> I would expect that it might very slightly degrade performance, but only marginally.
>
>
> If you want to be able to disactivate balancing, it could be done with a guc, but I cannot see good reasons to want to do that: it would complicate the code and it does not make much sense to use many tablespaces on one disk, while anyone who uses several tablespaces on several disks is probably expecting to see her expensive disks actually used in parallel.
>

I think we can leave this for the committer to take a call, or for anybody
else who has an opinion, because there is nothing wrong in what you
have done, but I am not clear whether there is a clear need for it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>
> Speed of Reads -
> Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
> Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

> Copy speed -
>
> dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
> 262144+0 records in
> 262144+0 records out
> 2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"?? Looks more 
like several SSD... Or the file is kept in memory and not committed to 
disk yet? Try a "sync" afterwards??

If these are SSD, or if there is some SSD cache on top of the HDD, I would 
not expect the patch to do much, because the SSD random I/O writes are 
pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.

>>> max_wal_size=5GB
>>
>> Hmmm... Maybe quite small given the average performance?
>
> We can check with larger value, but do you expect some different
> results and why?

Because checkpoints are xlog triggered (which depends on max_wal_size) or 
time triggered (which depends on checkpoint_timeout). Given the large tps, 
I expect that the WAL is filled very quickly hence may trigger checkpoints 
every ... that is the question.

>>> checkpoint_timeout=2min
>>
>> This seems rather small. Are the checkpoints xlog or time triggered?
>
> I wanted to test by triggering more checkpoints, but I can test with
> larger checkpoint interval as wel like 5 or 10 mins. Any suggestions?

For a +2 hours test, I would suggest 10 or 15 minutes.

It would be useful to know about checkpoint stats before suggesting values 
for max_wal_size and checkpoint_timeout.

> [...] The value used in your script was 0.8 for 
> checkpoint_completion_target which I have not changed during tests.

Ok.

>>> parallelism - 128 clients, 128 threads [...]
> In next run, I can use it with 64 threads, lets settle on other parameters
> first for which you expect there could be a clear win with the first patch.

Ok.

>> Given the hardware, I would suggest to raise checkpoint_timeout, 
>> shared_buffers and max_wal_size, [...]. I would expect that it should 
>> improve performance both with and without sorting.
>
> I don't think increasing shared_buffers would have any impact, because
> 8GB is sufficient for 300 scale factor data,

It fits at the beginning, but when updates and inserts are performed 
postgres adds new pages (update = delete + insert), and the deleted space 
is eventually reclaimed by vacuum later on.

Now if space is available in the page it is reused, so what really happens 
is not that simple...

At 8500 tps the disk space extension for tables may be up to 3 MB/s at the 
beginning, and would evolve but should be at least about 0.6 MB/s (insert 
in history, assuming updates are performed in page), on average.

So whether the database fits in 8 GB shared buffer during the 2 hours of 
the pgbench run is an open question.

> checkpoint_completion_target is already 0.8 in my previous tests.  Lets 
> try with checkpoint_timeout = 10 min and max_wal_size = 15GB, do you 
> have any other suggestion?

Maybe shared_buffers = 32GB to ensure that it is a "in buffer" run ?

>> It would be interesting to have informations from checkpoint logs 
>> (especially how many buffers written in how long, whether checkpoints 
>> are time or xlog triggered, ...).

Information still welcome.

> Hmm.. nothing like that, this was based on couple of tests done by
> me and I am open to do some more if you or anybody feels that the
> first patch (checkpoint-continuous-flush-10-a) can alone gives benefit,
> in-fact I have started these tests with the intention to see if first
> patch gives benefit, then that could be evaluated and eventually
> committed separately.

Ok.

My initial question remains: is the setup using HDDs? For SSD there should 
be probably no significant benefit with sorting, although it should not 
harm, and I'm not sure about flushing.

> True, let us try to find conditions/scenarios where you think it can give
> big boost, suggestions are welcome.

HDDs?

> I think we can leave this for committer to take a call or if anybody
> else has any opinion, because there is nothing wrong in what you
> have done, but I am not clear if there is a clear need for the same.

I may have an old box available with two disks, so that I can run some 
tests with table spaces, but with very few cores.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:

On Tue, Sep 1, 2015 at 5:30 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

Speed of Reads -
Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

Yes, there is no SSD in the system, I have confirmed the same.  They are
RAID spinning drives.
 


Copy speed -

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"??

One thing to notice is that if I don't remove the output file (output.img)
the speed is much slower; see the output below. I think this means that in
our case we will get ~320 MB/s.

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.28086 s, 1.7 GB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.72301 s, 319 MB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.73963 s, 319 MB/s

If I remove the file each time:

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.2855 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27725 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27417 s, 1.7 GB/s

rm /data/akapila/output.img


 
Looks more like several SSD... Or the file is kept in memory and not committed to disk yet? Try a "sync" afterwards??


 
If these are SSD, or if there is some SSD cache on top of the HDD, I would not expect the patch to do much, because the SSD random I/O writes are pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.


Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configuration where this patch (or patches) can give a
benefit, or where we want to verify whether there is any regression, as I
have access to this m/c for a very limited time.  This m/c might get
formatted soon for some other purpose.

 
max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

We can check with larger value, but do you expect some different
results and why?

Because checkpoints are xlog triggered (which depends on max_wal_size) or time triggered (which depends on checkpoint_timeout). Given the large tps, I expect that the WAL is filled very quickly hence may trigger checkpoints every ... that is the question.

checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

I wanted to test by triggering more checkpoints, but I can test with
larger checkpoint interval as wel like 5 or 10 mins. Any suggestions?

For a +2 hours test, I would suggest 10 or 15 minutes.


Okay, let's keep it at 10 minutes.

I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for 300 scale factor data,

It fits at the beginning, but when updates and inserts are performed postgres adds new pages (update = delete + insert), and the deleted space is eventually reclaimed by vacuum later on.

Now if space is available in the page it is reused, so what really happens is not that simple...

At 8500 tps the disk space extension for tables may be up to 3 MB/s at the beginning, and would evolve but should be at least about 0.6 MB/s (insert in history, assuming updates are performed in page), on average.

So whether the database fits in 8 GB shared buffer during the 2 hours of the pgbench run is an open question.


With this kind of configuration, I have noticed that more than 80%
of updates are HOT updates and there is not much bloat, so I think it won't
cross the 8GB limit, but still I can keep it at 32GB if you have any doubts.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
>
> Yes, there is no SSD in system. I have confirmed the same.  There are RAID
> spinning drives.

Ok...

I guess that there is some kind of cache to explain these great tps 
figures, probably on the RAID controller. What does "lspci" say? Does 
"hdparm" suggest that the write cache is enabled? It would be fine if the 
I/O system has a BBU, but that could also hide some of the patch 
benefits...

A tentative explanation for the similar figures with and without sorting 
could be that, depending on the controller cache size (maybe 1GB or more) 
and firmware, the I/O system reorders disk writes so that they are 
basically sequential, and the fact that pg sorts them beforehand has little 
or no impact. This may also be helped by the fact that buffers are not 
really in random order to begin with, as the warmup phase does an initial 
"select stuff from table".

There could be other possible factors such as the file system details, 
"WAFL" hacks... the tricks are endless:-)

Checking for the right explanation would involve removing the 
unconditional select warmup to use only a long and random warmup, and 
probably trying a much larger than cache database, and/or disabling the 
write cache, reading the hardware documentation in detail... But this is 
also a lot of bother and time.

Maybe the simplest approach would be to disable the write cache for the 
test. Is that possible?

>> Woops, 1.6 GB/s write... same questions, "rotating plates"??
>
> One thing to notice is that if I don't remove the output file 
> (output.img) the speed is much slower, see the below output. I think 
> this means in our case we will get ~320 MB/s

I would say that the OS was doing something here, and 320 MB/s looks more 
like an actual HDD RAID system sequential write performance.

>> If these are SSD, or if there is some SSD cache on top of the HDD, I would
>> not expect the patch to do much, because the SSD random I/O writes are
>> pretty comparable to sequential I/O writes.
>>
>> I would be curious whether flushing helps, though.
>
> Yes, me too. I think we should try to reach on consensus for exact 
> scenarios and configuration where this patch('es) can give benefit or we 
> want to verify if there is any regression as I have access to this m/c 
> for a very-very limited time.  This m/c might get formatted soon for 
> some other purpose.

Yep, it would be great if you have time for a flush test before it 
disappears... I think it is advisable to disable the write cache as it may 
also hide the impact of flushing.

>> So whether the database fits in 8 GB shared buffer during the 2 hours of
>> the pgbench run is an open question.
>
> With this kind of configuration, I have noticed that more than 80%
> of updates are HOT updates, not much bloat, so I think it won't
> cross 8GB limit, but still I can keep it to 32GB if you have any doubts.

The problem with performance tests is that you want to test one thing, but 
there are many factors that intervene and you may end up testing something 
else, such as lock contention or the process scheduler or whatever, rather 
than what you were trying to put in evidence. So I would suggest being on 
the safe side and using the larger value.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>> I would be curious whether flushing helps, though.
>> 
>> Yes, me too. I think we should try to reach on consensus for exact 
>> scenarios and configuration where this patch('es) can give benefit or we 
>> want to verify if there is any regression as I have access to this m/c for 
>> a very-very limited time.  This m/c might get formatted soon for some other 
>> purpose.
>
> Yep, it would be great if you have time for a flush test before it 
> disappears... I think it is advisable to disable the write cache as it may 
> also hide the impact of flushing.

Still thinking... Depending on the results, it might be interesting to 
have these tests run with the write cache enabled as well, to check how 
much it interferes positively with performance.

I would guess "quite a lot".

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi,

Here's a bunch of comments on this (hopefully the latest?) version of
the patch:

* I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
  foresee other callsites needing similar logic? Wouldn't it be just as
  easy to put this logic into the checkpointing code?

* We don't do one-line ifs; function parameters are always in the same line as the function name

* Wouldn't a binary heap over the tablespaces + progress be nicer? If you
  make the sorting criterion include the tablespace id you wouldn't need
  the lookahead loop in NextBufferToWrite().  Isn't the current approach
  O(NBuffers^2) in the worst case?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> Here's a bunch of comments on this (hopefully the latest?)

Who knows?! :-)

> version of the patch:
>
> * I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
>  forsee other callsites needing similar logic?

I foresee that the bgwriter should also do something more sensible than 
generating random I/Os over HDDs, and this is also true for workers... But 
this is for another time, maybe.

> Wouldn't it be just as easy to put this logic into the checkpointing 
> code?

Not sure it would simplify anything, because the checkpointer currently 
knows about buffers but flushing is about files, which are hidden from 
view.

Doing it with this API change means that the code does not have to compute 
twice in which file is a buffer: The buffer/file boundary has to be broken 
somewhere anyway so that flushing can be done when needed, and the 
solution I took seems the simplest way to do it, without having to make 
the checkpointer too much file concious.

> * We don't do one-line ifs;

Ok, I'll return them.

> function parameters are always in the same line as the function name

Ok, I'll try to improve.

> * Wouldn't a binary heap over the tablespaces + progress be nicer?

I'm not sure where it would fit exactly.

Anyway, I think it would complicate the code significantly (compared to 
the straightforward array), so I would not do anything like that without a 
strong incentive, such as an actual failing case.

Moreover such a data structure would probably require some kind of pointer 
(probably 8 bytes added per node, maybe more), and the amount of memory is 
already a concern, at least to me, and moreover it has to reside in shared 
memory which does not simplify allocation of tree data structures.

> If you make the sorting criterion include the tablespace id you wouldn't 
> need the lookahead loop in NextBufferToWrite().

Yep, I thought of it. It would mean 4 more bytes per buffer, and bsearch 
to find the boundaries, so significantly less simple code. I think that 
the current approach is ok as the number of tablespace should be small.

It may be improved upon later if there is a motivation to do so.
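
For illustration only, a comparator along these lines (the struct and
field names are made up for the sketch, this is not the patch code) is
what including the tablespace in the sort key would look like; the
per-tablespace boundaries then become contiguous runs in the sorted
array:

typedef struct
{
	unsigned int	tsId;		/* tablespace */
	unsigned int	relNode;	/* relation file */
	int				forkNum;	/* fork within the relation */
	unsigned int	blockNum;	/* block within the fork */
	int				buf_id;		/* the buffer itself */
} ToWriteItem;

/* qsort() comparator: tablespace first, then relation, fork and block. */
static int
to_write_cmp(const void *pa, const void *pb)
{
	const ToWriteItem *a = pa;
	const ToWriteItem *b = pb;

	if (a->tsId != b->tsId)
		return (a->tsId < b->tsId) ? -1 : 1;
	if (a->relNode != b->relNode)
		return (a->relNode < b->relNode) ? -1 : 1;
	if (a->forkNum != b->forkNum)
		return (a->forkNum < b->forkNum) ? -1 : 1;
	if (a->blockNum != b->blockNum)
		return (a->blockNum < b->blockNum) ? -1 : 1;
	return 0;
}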

> Isn't the current approach O(NBuffers^2) in the worst case?

ISTM that the overall lookahead complexity is Nbuffers * Ntablespace: 
buffers are scanned once for each tablespace. I assume that the number of 
tablespaces is kept low, and having simpler code which uses less memory 
seems a good idea.

ISTM that using the tablespace in the sorting criterion would reduce the 
complexity to ln(NBuffers) * Ntablespace for finding the boundaries, and 
then Ntablespace * (NBuffers/Ntablespace) = NBuffers for scanning, at the 
expense of more memory and code complexity.

So this is a voluntary design decision.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Here is a rebased two-part v11.

> * We don't do one-line ifs;

I've found one instance.

> function parameters are always in the same line as the function name

ISTM that I did that, or maybe I did not understand what I've done wrong.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Petr Jelinek
Дата:
On 2015-09-06 19:05, Fabien COELHO wrote:
>
> Here is a rebased two-part v11.
>
>> function parameters are always in the same line as the function name
>
> ISTM that I did that, or maybe I did not understand what I've done wrong.
>

I see one instance of this issue
+static int
+NextBufferToWrite(
+    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+    int *pspace, int num_to_write, int num_written)

Also
+static int bufcmp(const void * pa, const void * pb)
+{

should IMHO be formatted as
+static int
+bufcmp(const void * pa, const void * pb)
+{


And I think we generally put the struct typedefs at the top of the C 
file and don't mix them with function definitions (I am talking about 
the TableSpaceCheckpointStatus and TableSpaceCountEntry).

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Petr,

>>> function parameters are always in the same line as the function name
>> 
>> ISTM that I did that, or maybe I did not understand what I've done wrong.
>
> I see one instance of this issue
> +static int
> +NextBufferToWrite(
> +    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
> +    int *pspace, int num_to_write, int num_written)

Ok, I was looking for function calls.

> should IMHO be formatted as
> +static int
> +bufcmp(const void * pa, const void * pb)
> +{

Indeed.

> And I think we generally put the struct typedefs at the top of the C file and 
> don't mix them with function definitions (I am talking about the 
> TableSpaceCheckpointStatus and TableSpaceCountEntry).

Ok, moved up.

Thanks for the hints!  Two-part v12 attached fixes these.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
> >Wouldn't it be just as easy to put this logic into the checkpointing code?
> 
> Not sure it would simplify anything, because the checkpointer currently
> knows about buffers but flushing is about files, which are hidden from
> view.

It'd not really simplify things, but it'd keep it local.

> >* Wouldn't a binary heap over the tablespaces + progress be nicer?
> 
> I'm not sure where it would fit exactly.

Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.

> Anyway, I think it would complicate the code significantly (compared to the
> straightforward array)

I doubt it. I mean instead of your GetNext you'd just do:

   next_tblspc = DatumGetPointer(binaryheap_first(heap));
   if (next_tblspc == 0)
       return 0;
   next_tblspc.progress += next_tblspc.progress_slice;
   binaryheap_replace_first(PointerGetDatum(next_tblspc));

   return next_tblspc.nextbuf++;


progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.
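
Roughly, a sketch of that approach against the binaryheap.h API could look
as follows (everything except the binaryheap_* calls and the Datum macros
is made up for illustration; this is not the patch):

#include "postgres.h"
#include "lib/binaryheap.h"

typedef struct
{
	double	progress;		/* share of this tablespace's work already done */
	double	progress_slice;	/* progress increment per buffer written */
	int		nextbuf;		/* next index into this tablespace's sorted buffers */
	int		num_left;		/* buffers still to write for this tablespace */
} TsProgress;

/* binaryheap keeps the element that compares greatest on top, so return
 * the opposite of the progress comparison to pop the least advanced one. */
static int
ts_progress_cmp(Datum a, Datum b, void *arg)
{
	TsProgress *ta = (TsProgress *) DatumGetPointer(a);
	TsProgress *tb = (TsProgress *) DatumGetPointer(b);

	if (ta->progress < tb->progress)
		return 1;
	if (ta->progress > tb->progress)
		return -1;
	return 0;
}

/* Return the next buffer to write, or -1 when everything is done. */
static int
next_buffer_to_write(binaryheap *heap)
{
	TsProgress *ts;

	if (binaryheap_empty(heap))
		return -1;
	ts = (TsProgress *) DatumGetPointer(binaryheap_first(heap));
	if (ts->num_left == 0)
	{
		binaryheap_remove_first(heap);	/* this tablespace is finished */
		return next_buffer_to_write(heap);
	}
	ts->num_left--;
	ts->progress += ts->progress_slice;
	binaryheap_replace_first(heap, PointerGetDatum(ts));
	return ts->nextbuf++;
}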

> Moreover such a data structure would probably require some kind of pointer
> (probably 8 bytes added per node, maybe more), and the amount of memory is
> already a concern, at least to me, and moreover it has to reside in shared
> memory which does not simplify allocation of tree data structures.

I'm not seeing where you'd need an extra pointer? Maybe the
misunderstanding is that I'm proposing to do a heap over the
*tablespaces*, not the actual buffers.

> >If you make the sorting criterion include the tablespace id you wouldn't
> >need the lookahead loop in NextBufferToWrite().
> 
> Yep, I thought of it. It would mean 4 more bytes per buffer, and bsearch to
> find the boundaries, so significantly less simple code.

What for would you need to bsearch?


> I think that the current approach is ok as the number of tablespace
> should be small.

Right that's often the case.

> >Isn't the current approach O(NBuffers^2) in the worst case?
> 
> ISTM that the overall lookahead complexity is Nbuffers * Ntablespace:
> buffers are scanned once for each tablespace.

Which in the worst case is NBuffers * 2...

> ISTM that using a tablespace in the sorting would reduce the complexity to
> ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
> (Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
> memory and code complexity.

Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Mon, Sep 7, 2015 at 3:09 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
> > >Wouldn't it be just as easy to put this logic into the checkpointing code?
> >
> > Not sure it would simplify anything, because the checkpointer currently
> > knows about buffers but flushing is about files, which are hidden from
> > view.
>
> It'd not really simplify things, but it'd keep it local.
>

How about using the value of guc (checkpoint_flush_to_disk) and
AmCheckpointerProcess to identify whether to do async flush in FileWrite?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> It'd not really simplify things, but it'd keep it local.
>
> How about using the value of guc (checkpoint_flush_to_disk) and 
> AmCheckpointerProcess to identify whether to do async flush in 
> FileWrite?

ISTM that what you suggest would just replace the added function arguments 
with global variables to communicate and keep the necessary data for 
managing the asynchronous flushing, which is called per tablespace
(1) on file changes (2) when the checkpointer is going to sleep.

Although it can be done obviously, I prefer to have functions arguments 
rather than global variables, on principle.

Also, because of (2) and of the dependency on the number of tablespaces 
being flushed, the flushing stuff cannot be fully hidden from the 
checkpointer anyway.

Also I think that probably the bgwriter should do something similar, so 
function parameters would be useful to drive flushing from it, rather than 
adding yet another set of global variables, or share the same variables 
for somehow different purposes.

So having these added parameters look reasonable to me.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Sat, Sep 5, 2015 at 12:26 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I would be curious whether flushing helps, though.

Yes, me too. I think we should try to reach on consensus for exact scenarios and configuration where this patch('es) can give benefit or we want to verify if there is any regression as I have access to this m/c for a very-very limited time.  This m/c might get formatted soon for some other purpose.

Yep, it would be great if you have time for a flush test before it disappears... I think it is advisable to disable the write cache as it may also hide the impact of flushing.

Still thinking... Depending on the results, it might be interesting to have these tests run with the write cache enabled as well, to check how much it interferes positively with performance.


I have done some tests with both the patches(sort+flush) and below
are results:

M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB




Test - 1 (Data Fits in shared_buffers)
--------------------------------------------------------
non-default settings used in script provided by Fabien upthread

used below options for pgbench and the same is used for rest
of tests as well.

fw)  ## full speed parallel write pgbench
run="FW"
opts="-M prepared -P 1 -T $time $para"
;;

warmup=1000
scale=300
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=15GB

para="-j 64 -c 128"
checkpoint_completion_target=0.8

checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"

Flush - off and Sort - off
avg over 7203: 27480.350104 ± 12791.098857 [0.000000, 16009.400000, 32109.200000, 37629.000000, 51671.400000]
percent of values below 10.0: 2.8%

Flush - off and Sort - on
avg over 7200: 27482.501264 ± 12552.036065 [0.000000, 16587.250000, 31225.950000, 37516.450000, 51296.900000]
percent of values below 10.0: 2.8%

Flush - on and Sort - off
avg over 7200: 25214.757292 ± 11059.709509 [5268.000000, 14188.400000, 26472.450000, 35626.100000, 51479.000000]
percent of values below 10.0: 0.0%

Flush - on and Sort - on
avg over 7200: 26819.631722 ± 10589.745016 [5191.700000, 16825.450000, 29429.750000, 35707.950000, 51475.100000]
percent of values below 10.0: 0.0%

For this test run, the best results are when both the sort and flush options
are enabled: the value of the lowest TPS is increased substantially without
sacrificing much on the average or median TPS values (though there is a ~9%
dip in the median TPS value).  When only sorting is enabled, there is neither
a significant gain nor any loss.  When only flush is enabled, there is
significant degradation in both the average and median TPS values, ~8%
and ~21% respectively.


Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
warmup=1000
scale=3000
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=25GB

para="-j 64 -c 128"
checkpoint_completion_target=0.8

checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"

Flush - off and Sort - off
avg over 7200: 5050.059444 ± 4884.528702 [0.000000, 98.100000, 4699.100000, 10125.950000, 13631.000000]
percent of values below 10.0: 7.7%

Flush - off and Sort - on
avg over 7200: 6194.150264 ± 4913.525651 [0.000000, 98.100000, 8982.000000, 10558.000000, 14035.200000]
percent of values below 10.0: 11.0%

Flush - on and Sort - off
avg over 7200: 2771.327472 ± 1860.963043 [287.900000, 2038.850000, 2375.500000, 2679.000000, 12862.000000]
percent of values below 10.0: 0.0%

Flush - on and Sort - on
avg over 7200: 6110.617722 ± 1939.381029 [1652.200000, 5215.100000, 5724.000000, 6196.550000, 13828.000000]
percent of values below 10.0: 0.0%


For this test run, again the best results are when both the sort and flush
options are enabled: the value of the lowest TPS is increased substantially,
and the average and median TPS values have also increased, by
~21% and ~22% respectively.  When only sorting is enabled, there is a
significant gain in average and median TPS values, but then there is also
an increase in the number of times when TPS is below 10, which is bad.
When only flush is enabled, there is significant degradation in both the
average and median TPS values, ~82% and ~97% respectively; now I am not
sure if such a big degradation could be expected for this case or it's just
a problem in this run, as I have not repeated this test.


Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
Same configuration and settings as above, but this time, I have enforced
Flush to use posix_fadvise() rather than sync_file_range()  (basically changed
code to comment out sync_file_range() and enable posix_fadvise()).

Flush - on and Sort - on
avg over 7200: 3400.915069 ± 739.626478 [1642.100000, 2965.550000, 3271.900000, 3558.800000, 6763.000000]
percent of values below 10.0: 0.0%

On using posix_fadvise(), the results for the best case (both flush and sort
on) show significant degradation in average and median TPS values,
by ~48% and ~43%, which indicates that using posix_fadvise()
with the current options is probably not the best way to achieve the flush.


Overall, I think this patch (sort+flush) brings a lot of value to the table in
terms of stabilizing the TPS during checkpoints; however, some of the
cases, like the use of posix_fadvise() and the case (all data fits in
shared_buffers) where the value of the median TPS is regressed, could be
investigated to see what can be done to improve them.  I think more tests
can be done to confirm the benefit or regression of this patch, but for now
this is the best I can do.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

> I have done some tests with both the patches(sort+flush) and below
> are results:

Thanks a lot for these runs on this great harware!

> Test - 1 (Data Fits in shared_buffers)

Rounded for easier comparison:
  flush/sort
  off off: 27480.4 ± 12791.1 [   0, 16009, 32109, 37629, 51671] (2.8%)
  off on : 27482.5 ± 12552.0 [   0, 16587, 31226, 37516, 51297] (2.8%)

The two cases above are pretty indistinguishable; sorting has no impact. 
The 2.8% means more than 1 minute offline per hour (not necessarily a 
whole minute, it may be distributed over the whole hour).
  on  off: 25214.8 ± 11059.7 [5268, 14188, 26472, 35626, 51479] (0.0%)
  on  on : 26819.6 ± 10589.7 [5192, 16825, 29430, 35708, 51475] (0.0%)

> For this test run, the best results are when both the sort and flush 
> options are enabled, the value of lowest TPS is increased substantially 
> without sacrificing much on average or median TPS values (though there 
> is ~9% dip in median TPS value).  When only sorting is enabled, there is 
> neither significant gain nor any loss.  When only flush is enabled, 
> there is significant degradation in both average and median value of TPS 
> ~8% and ~21% respectively.

I interpret the five numbers in brackets as an indicator of performance 
stability: they should be equal for perfect stability. Once they show some 
stability, the next point for me is to focus on the average performance. I 
do not see a median decrease as a big issue if the average is reasonably 
good.

Thus I essentially note the -2.5% dip on average of on-on vs off-on. I 
would say that it is probably significant, although it might be in the 
error margin of the measure. Not sure whether the little stddev reduction 
is really significant. Anyway the benefit is clear: 100% availability.

Flushing without sorting is a bad idea (tm), not a surprise.

> Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
 flush/sort
 off off: 5050.1 ± 4884.5 [   0,   98, 4699, 10126, 13631] ( 7.7%)
 off on : 6194.2 ± 4913.5 [   0,   98, 8982, 10558, 14035] (11.0%)
 on  off: 2771.3 ± 1861.0 [ 288, 2039, 2375,  2679, 12862] ( 0.0%)
 on  on : 6110.6 ± 1939.3 [1652, 5215, 5724,  6196, 13828] ( 0.0%)

I'm not sure that the off-on vs on-on -1.3% avg tps dip is significant, 
but it may be. With both flushing and sorting pg becomes fully available,
and the standard deviation is divided by more than 2, so the benefit is 
clear.

> For this test run, again the best results are when both the sort and flush
> options are enabled: the lowest TPS value is increased substantially,
> and the average and median TPS values have also increased, by
> ~21% and ~22% respectively.  When only sorting is enabled, there is a
> significant gain in average and median TPS values, but there is also
> an increase in the number of times the TPS is below 10, which is bad.
> When only flush is enabled, there is significant degradation in both the
> average and median value of TPS, of ~82% and ~97% respectively; I am not
> sure if such a big degradation could be expected for this case or whether it
> is just a problem with this run, as I have not repeated this test.

Yes, I agree that it is strange that sorting without flushing on its own 
both improves performance (+20% tps) and seems to degrade availability at 
the same time. A rerun would have helped to check whether it is a fluke or 
whether it is reproducible.

> Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
> ----------------------------------------------------------------------------------------
> Same configuration and settings as above, but this time, I have enforced
> Flush to use posix_fadvise() rather than sync_file_range()  (basically
> changed code to comment out sync_file_range() and enable posix_fadvise()).
>
> On using posix_fadvise(), the results for the best case (both flush and sort
> on) show significant degradation in average and median TPS values
> by ~48% and ~43%, which indicates that using posix_fadvise()
> with the current options is probably not the best way to achieve the flush.

Yes, indeed.

The way posix_fadvise is implemented on Linux is somewhere between no effect 
and a bad effect (the buffer is erased). You hit the latter quite strongly... 
As you are doing a "does not fit in shared buffers" test, it is essential 
that buffers are kept in RAM, but posix_fadvise on Linux just instructs the 
kernel to erase the buffer from memory if it was already passed to the I/O 
subsystem, which given the probably large I/O device cache on your host 
should happen pretty quickly, so later reads must fetch the data back from 
the device (either cache or disk), which means a drop in performance.

Note that the FreeBSD implementation seems more convincing, although less 
good than Linux's sync_file_range function. I've no idea about other systems.
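
To make the difference concrete, here is a minimal sketch of the two flush 
hints being compared (illustration only, not the patch's code; fd, offset 
and nbytes are assumed to describe the dirty range of a checkpointed file):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Linux only: start writeback of the range, pages stay in the cache. */
    static void
    hint_flush_linux(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /*
     * Portable variant: on Linux this may also evict the pages once they
     * are written, which is what hurts the larger-than-shared_buffers case.
     */
    static void
    hint_flush_portable(int fd, off_t offset, off_t nbytes)
    {
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    }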

> Overall, I think this patch (sort+flush) brings a lot of value to the table 
> in terms of stabilizing the TPS during checkpoints, however some of the 
> cases, like the use of posix_fadvise() and the case where all data fits in 
> shared_buffers and the median TPS regresses, could be 
> investigated to see what can be done to improve them.  I think more 
> tests can be done to confirm the benefits or regressions of this patch, but 
> for now this is the best I can do.

Thanks a lot, again, for these tests!

I think that we may conclude, on these run:

(1) sorting seems not to harm performance, and may help a lot.

(2) Linux flushing with sync_file_range may degrade a little raw tps
    average in some case, but definitely improves performance stability
    (always 100% availability when on!).

(3) posix_fadvise on Linux is a bad idea... the good news is that it
    is not needed there:-) How good or bad an idea it is on other
    systems is an open question...

These results are consistent with the current default values in the patch: 
sorting is on by default, flushing is on with Linux and off otherwise 
(posix_fadvise).

Also, as the effect on other systems is unclear, I think it is best to 
keep both settings as GUCs for now.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Tue, Sep 8, 2015 at 8:09 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Thanks a lot, again, for these tests!
>
> I think that we may conclude, on these run:
>
> (1) sorting seems not to harm performance, and may help a lot.
>

I agree with the first part, but about helping a lot I am not sure: based on
the tests conducted by me, among all the runs it has shown an improvement
in average TPS in only one case, and that too with an increase in the number
of times the TPS is below 10.

> (2) Linux flushing with sync_file_range may degrade a little raw tps
>     average in some case, but definitely improves performance stability
>     (always 100% availability when on !).
>

Agreed, I think the benefit is quite clear, but it would be better if we try
to do some more tests for the case (data fits in shared_buffers) where
we saw a small regression, just to make sure that the regression is indeed small.

> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>     is not needed there:-) How good or bad an idea it is on other system
>     is an open question...
>

I don't know what is the best way to verify that; if somebody else has
access to such a machine, please help to get that verified.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>> I think that we may conclude, on these run:
>>
>> (1) sorting seems not to harm performance, and may help a lot.
>
> I agree with the first part, but about helping a lot I am not sure

I'm focussing on the "sort" dimension alone, that is I'm comparing the 
average tps performance with sorting against the same test without sorting. 
There are 4 cases from your tests, if I'm not mistaken:
 - T1 flush=off  27480 -> 27482 :    +0.0%
 - T1 flush=on   25214 -> 26819 :    +6.3%
 - T2 flush=off   5050 ->  6194 :   +22.6%
 - T2 flush=on    2771 ->  6110 :  +120.4%

The average improvement induced by sort=on is +50%; if you do not agree on 
"a lot", maybe we can agree on "significantly" :-)

> based on the tests conducted by me, among all the runs it has shown an 
> improvement in average TPS in only one case, and that too with an increase 
> in the number of times the TPS is below 10.

>> (2) Linux flushing with sync_file_range may degrade a little raw tps
>>     average in some case, but definitely improves performance stability
>>     (always 100% availability when on !).
>
> Agreed, I think the benefit is quite clear, but it would be better if we try
> to do some more tests for the case (data fits in shared_buffers) where
> we saw a small regression, just to make sure that the regression is indeed small.

I've already reported a lot of tests (several hundred hours on two 
different hosts), and I did not have such a dip, but the hardware was more 
"usual" or "casual", really different from your runs.

If you can run more tests, great!

I think that the main safeguard to handle the (small) uncertainty is to 
keep gucs to control these features.

>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>     is not needed there:-) How good or bad an idea it is on other system
>>     is an open question...
>
> I don't know what is the best way to verify that; if somebody else has
> access to such a machine, please help to get that verified.

Yep. There have been such calls on this thread, which have not been very 
effective up to now.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Robert Haas
Дата:
On Tue, Sep 8, 2015 at 11:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>     is not needed there:-) How good or bad an idea it is on other system
>>     is an open question...
>
> I don't know what is the best way to verify that; if somebody else has
> access to such a machine, please help to get that verified.

Why wouldn't we just leave it out then?  Putting it in when the one
platform we've tried it on shows a regression makes no sense.  We
shouldn't include it and then remove it if someone can prove it's bad;
we should only include it in the first place if we have good
benchmarks showing that it is good.

Does anyone have a big Windows box they can try this on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>>     is not needed there:-) How good or bad an idea it is on other system
>>>     is an open question...
>>
>> I don't know what is the best way to verify that; if somebody else has
>> access to such a machine, please help to get that verified.
>
> Why wouldn't we just leave it out then? Putting it in when the one 
> platform we've tried it on shows a regression makes no sense.  We 
> shouldn't include it and then remove it if someone can prove it's bad; 
> we should only include it in the first place if we have good benchmarks 
> showing that it is good.
>
> Does anyone have a big Windows box they can try this on?

Just a box with a disk would be enough, it does not need to be big!

As I wrote before, FreeBSD would be a good candidate because the 
posix_fadvise seems much more reasonable than on Linux, and should be 
profitable, so it would be a pity to remove it.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> As I wrote before, FreeBSD would be a good candidate because the
> posix_fadvise seems much more reasonable than on Linux, and should be
> profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>>> Wouldn't it be just as easy to put this logic into the checkpointing 
>>> code?
>>
>> Not sure it would simplify anything, because the checkpointer currently
>> knows about buffers but flushing is about files, which are hidden from
>> view.
>
> It'd not really simplify things, but it'd keep it local.

Ok, it would be local, but it would also mean that the checkpointer would 
have to deal explicitly with files, whereas it currently does not have 
to.

I think that the current buffer/file boundary is, on engineering principle, 
a good one, so I tried to break it as little as possible to enable the 
feature, and I wanted to avoid having to do a buffer-to-file translation 
twice, once in the checkpointer and once when writing the buffer.

>>> * Wouldn't a binary heap over the tablespaces + progress be nicer?
>>
>> I'm not sure where it would fit exactly.
>
> Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> progress, progress_inc, nextbuf) where the comparator compares the progress.

It would replace what is currently an array. The balancing code needs to 
enumerate all tablespaces in a round-robin way so as to ensure that all 
tablespaces are given some attention, otherwise you could have a balance 
on two tablespaces but others could be left out. The array makes this 
property straightforward.


>> Anyway, I think it would complicate the code significantly (compared to the
>> straightforward array)
>
> I doubt it. I mean instead of your GetNext you'd just do:
>    next_tblspc = DatumGetPointer(binaryheap_first(heap));
>    if (next_tblspc == 0)
>        return 0;
>    next_tblspc.progress += next_tblspc.progress_slice;
>    binaryheap_replace_first(PointerGetDatum(next_tblspc));
>
>    return next_tblspc.nextbuf++;

Compared to the array, this tree approach would require ln(Ntablespace) 
work to extract and reinsert the tablespace under progress, so there is no 
complexity advantage.

Moreover, given that in most cases there are 1 or 2 tablespaces, a tree 
structure is really on the heavy side.

> progress_slice is the number of buffers in the tablespace divided by the
> number of total buffers, to avoid doing any sort of expensive math in
> the more frequently executed path.

If there are many buffers, I'm not too sure about rounding issues and the 
like, so the current approach with a rational seems more secure.

> [...] I'm not seing where you'd need an extra pointer?

Indeed, I misunderstood.

> [...] What for would you need to bsearch?

To find the tablespace boundaries in the sorted buffer array in 
log(NBuffers) * Ntablespace, instead of NBuffers.

>> I think that the current approach is ok as the number of tablespace
>> should be small.
>
> Right that's often the case.

Yep.

>> ISTM that using a tablespace in the sorting would reduce the complexity to
>> ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
>> (Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
>> memory and code complexity.
>
> Afaics finding the boundaries can be done as part of the enumeration of
> tablespaces in BufferSync(). That code needs to be moved, but that's not
> too bad. I don't see the code be that much more complicated?

Hmmm, you are proposing to replace proved and heavily tested code with a 
more complex tree data structure distributed quite differently around the 
source, with no very clear benefit.

So I would prefer to keep the code as is, that is pretty straightforward, 
and wait for a strong incentive before doing anything fancier. ISTM that 
there are other places in pg that need attention more than further tweaking 
this patch.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-09-09 21:29:12 +0200, Fabien COELHO wrote:
> >Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> >progress, progress_inc, nextbuf) where the comparator compares the progress.
> 
> It would replace what is currently an array.

It'd still be one afterwards.

> The balancing code needs to enumerate all tablespaces in a round-robin
> way so as to ensure that all tablespaces are given some attention,
> otherwise you could have a balance on two tablespaces but others could
> be left out. The array makes this property straightforward.

Why would a heap as I've described it require that?

> >>Anyway, I think it would complicate the code significantly (compared to the
> >>straightforward array)
> >
> >I doubt it. I mean instead of your GetNext you'd just do:
> >   next_tblspc = DatumGetPointer(binaryheap_first(heap));
> >   if (next_tblspc == 0)
> >       return 0;
> >   next_tblspc.progress += next_tblspc.progress_slice;
> >   binaryheap_replace_first(PointerGetDatum(next_tblspc));
> >
> >   return next_tblspc.nextbuf++;
> 
> Compared to the array, this tree approach would require ln(Ntablespace) work
> to extract and reinsert the tablespace under progress, so there is no
> complexity advantage.

extract/reinsert is actually O(1).


> >progress_slice is the number of buffers in the tablespace divided by the
> >number of total buffers, to avoid doing any sort of expensive math in
> >the more frequently executed path.
> 
> If there are many buffers, I'm not too sure about rounding issues and the
> like, so the current approach with a rational seems more secure.

Meh. The amount of imbalance introduced by rounding won't matter.

> >>ISTM that using a tablespace in the sorting would reduce the complexity to
> >>ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
> >>(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
> >>memory and code complexity.
> >
> >Afaics finding the boundaries can be done as part of the enumeration of
> >tablespaces in BufferSync(). That code needs to be moved, but that's not
> >too bad. I don't see the code be that much more complicated?
> 
> Hmmm, you are proposing to replace proved and heavily tested code with a
> more complex tree data structure distributed quite differently around the
> source, with no very clear benefit.

There's no "proved and heavily tested code" touched here.

> So I would prefer to keep the code as is, that is pretty straightforward,
> and wait for a strong incentive before doing anything fancier.

I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> As I wrote before, FreeBSD would be a good candidate because the 
>> posix_fadvise seems much more reasonable than on Linux, and should be 
>> profitable, so it would be a pity to remove it.
>
> Why do you think it's different on fbsd? Also, why is it unreasonable 
> that DONTNEED removes stuff from the cache?

Yep, I agree that this part is a bad point, obviously, but at least there 
is also some advantage: I understood that buffers are actually pushed 
towards the disk when calling posix_fadvise with DONTNEED on FreeBSD, so 
in-buffer tests should see better performance, and out-of-buffer in-memory 
tests would probably be degraded, as Amit's test showed on Linux. As an admin 
I can choose, if I know whether I run in-buffer or not.

On Linux either the call is ignored (if the page is not written yet) or 
the page is coldly removed, so it has either no effect or a bad effect, 
basically.

So I think that the current off default when running with posix_fadvise is 
reasonable, and in some case turning it on can probably provide a better 
performance stability, esp for in-buffer runs.

Now, frankly, I do not care much about FreeBSD or Windows, so I'm fine 
with dropping posix_fadvise if this is a blocker.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> It would replace what is currently an array.
>
> It'd still be one afterwards.
> [...]
> extract/reinsert is actually O(1).

Hm, strange. I probably did not understand at all the heap structure 
you're suggesting. No big deal.

> [...] Why would a heap as I've described it require that?

Hmmm... The heap does *not* require anything, the *balancing* requires 
this property.

> [...] There's no "proved and heavily tested code" touched here.

I've prooved and tested heavily the submitted patch based on an array, 
that you want to replace with some heap, so I think that my point stands.

Moreover, I do not see a clear benefit in changing the data structure.

>> So I would prefer to keep the code as is, that is pretty straightforward,
>> and wait for a strong incentive before doing anything fancier.
>
> I find the proposed code not particularly pretty, so I don't really buy
> the straightforwardness argument.

No big deal. From my point of view, the data structure change you're 
suggesting does not bring significant value, so there is no good reason to 
do it.

If you want to submit another patch, this is free software, please 
proceed.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Wed, Sep 9, 2015 at 2:31 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>>> I think that we may conclude, on these run:
>>>
>>> (1) sorting seems not to harm performance, and may help a lot.
>>
>>
>> I agree with first part, but about helping a lot, I am not sure
>
>
> I'm focussing on the "sort" dimension alone, that is I'm comparing the average tps performance with sorting against the same test without sorting. There are 4 cases from your tests, if I'm not mistaken:
>
>  - T1 flush=off  27480 -> 27482 :    +0.0%
>  - T1 flush=on   25214 -> 26819 :    +6.3%
>  - T2 flush=off   5050 ->  6194 :   +22.6%
>  - T2 flush=on    2771 ->  6110 :  +120.4%
>

There is a clear win only in cases where sort is used with flush; apart
from that, using sort alone doesn't have any clear advantage.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Jeff Janes
Дата:
On Wed, Sep 9, 2015 at 12:12 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> As I wrote before, FreeBSD would be a good candidate because the
> posix_fadvise seems much more reasonable than on Linux, and should be
> profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?


It seems kind of silly that it means "No one, even people I am not aware of and have no right to speak for, needs this" as opposed to "I don't need this, don't keep it around on my behalf."

Cheers,

Jeff

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Amit,

>>  - T1 flush=off  27480 -> 27482 :    +0.0%
>>  - T1 flush=on   25214 -> 26819 :    +6.3%
>>  - T2 flush=off   5050 ->  6194 :   +22.6%
>>  - T2 flush=on    2771 ->  6110 :  +120.4%
>
> There is a clear win only in cases where sort is used with flush; apart
> from that, using sort alone doesn't have any clear advantage.

Indeed, I agree that the improvement is much smaller without flushing, 
although it is there somehow (+0.0 & +22.6 => +11.3% on average).

Well, at least we may agree that it is "somehow significantly better" ?:-)

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Thanks for the hints!  Two-part v12 attached fixes these.

Here is a v13, which is just a rebase after 1aba62ec.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello,

> [...] If you make the sorting criterion include the tablespace id you 
> wouldn't need the lookahead loop in NextBufferToWrite().

I'm considering this precise point, i.e. including the tablespace as
a sorting criterion.

Currently the array used for sorting is 16 bytes per buffer (although I 
wrote 12 in another mail, I was wrong...). The data include the bufid (4 
bytes), the relation & fork num (8 bytes, but really 4 bytes + 2 bits are 
used), and the block number (4 bytes) which is the offset within the 
relation. These 3 data combined allow finding the file and the offset 
within that file, for a given buffer id.

I'm concerned that these 16 bytes are already significant and I do not 
want to extend them any more. I was already pretty happy with the previous 
version with 4 bytes per buffer.

Now as the number of tablespaces is expected to be very small (1, 2, maybe 
3), there is no problem packing it within the unused 30 bits of forknum. 
That would mean some masking and casts here and there, so it would not be 
very beautiful, but it would make it easy to find the buffers for a given 
tablespace, and indeed remove the lookahead stuff in the next buffer 
function, as you suggest.
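
For illustration, a minimal sketch of such a packed sort entry (field names 
and the exact layout are made up for the example, this is not the patch's 
actual structure):

    #include <stdint.h>

    /*
     * Hypothetical 16-byte sort entry: the fork number needs only 2 bits,
     * so the remaining 30 bits of the same word can carry a small
     * per-checkpoint tablespace index.
     */
    typedef struct CkptSortEntry
    {
        uint32_t    tsidx_forknum;  /* (tablespace index << 2) | forknum */
        uint32_t    relnode;        /* relation file node */
        uint32_t    blocknum;       /* block number within the relation */
        int32_t     buf_id;         /* buffer id */
    } CkptSortEntry;

    #define ENTRY_FORKNUM(e)  ((e)->tsidx_forknum & 0x3u)
    #define ENTRY_TSINDEX(e)  ((e)->tsidx_forknum >> 2)

Comparing tsidx_forknum first when sorting would make all buffers of a given 
tablespace contiguous in the sorted array, which is what removes the need 
for the lookahead loop.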


My question is: would that be acceptable, or would someone object to the 
use of masks and things like that?  The benefit would be a simpler/more 
direct next buffer function, but some more tinkering around the sorting 
criterion to use a packed representation.

Note that I do not think that it would have any actual impact on 
performance... it would only make a difference if there were really many 
tablespaces (the scanning complexity would be Nbuffer instead of 
Nbuffer*Ntablespace, but as Ntablespace is small...).  My motivation is 
rather to help the patch get through, so I'm fine if this is not needed.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi,

On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> 
> >Thanks for the hints!  Two-part v12 attached fixes these.
> 
> Here is a v13, which is just a rebase after 1aba62ec.

I'm working on this patch, to get it into a state I think it'd be
commitable.

In my performance testing it showed that calling PerformFileFlush() only
at segment boundaries and in CheckpointWriteDelay() can lead to rather
spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
problematic because it only is triggered while on schedule, and not when
behind. My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
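
(For illustration, a sketch of what such a capped flush accumulation could 
look like; the structure and function names here are hypothetical, not the 
patch's actual FileAsynchronousFlush()/FileFlushContext code, and the 
32-buffer cap is simply the figure mentioned above:)

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define BLCKSZ             8192
    #define FLUSH_MAX_BUFFERS  32      /* 32 * 8 kB = 256 kB */

    /* Hypothetical accumulator for a contiguous dirty range of one file. */
    typedef struct PendingFlush
    {
        int     fd;             /* file of the accumulated range */
        off_t   offset;         /* start of the range */
        int     nbuffers;       /* buffers accumulated so far */
    } PendingFlush;

    /* Ask the kernel to start writing the accumulated range, then reset. */
    static void
    flush_pending(PendingFlush *pf)
    {
        if (pf->nbuffers > 0)
            (void) sync_file_range(pf->fd, pf->offset,
                                   (off_t) pf->nbuffers * BLCKSZ,
                                   SYNC_FILE_RANGE_WRITE);
        pf->nbuffers = 0;
    }

    /* Note one just-written buffer; flush early once the cap is reached. */
    static void
    note_written_buffer(PendingFlush *pf, int fd, off_t offset)
    {
        if (pf->nbuffers == 0 || fd != pf->fd ||
            offset != pf->offset + (off_t) pf->nbuffers * BLCKSZ)
        {
            flush_pending(pf);      /* range broken: flush what we have */
            pf->fd = fd;
            pf->offset = offset;
        }
        pf->nbuffers++;
        if (pf->nbuffers >= FLUSH_MAX_BUFFERS)
            flush_pending(pf);
    }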

I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().

As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic
doesn't work correctly if tablespaces aren't actually sorted. I'm
actually inclined to fix this by simply removing the flag to
enable/disable sorting.

Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.

I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.


I'll post this after some more cleanup & testing.


I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers are written out a
second, 100ms is a rather long sleep. For another, that we only sleep
100ms when the write rate is low makes the checkpoint finish rather
quickly - on a slow disk (say microsd) that can cause unnecessary
slowdowns for concurrent activity.  ISTM we should calculate the sleep
time in a better way. The SIGHUP behaviour is also weird.   Anyway, this
probably belongs on a new thread.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> Here is a v13, which is just a rebase after 1aba62ec.
>
> I'm working on this patch, to get it into a state I think it'd be
> commitable.

I'll review it carefully. Also, if you can include some performance 
figures it would help, even if I'll do some more runs.

> In my performance testing it showed that calling PerformFileFlush() only 
> at segment boundaries and in CheckpointWriteDelay() can lead to rather 
> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is 
> problematic because it only is triggered while on schedule, and not when 
> behind.

When behind, the PerformFileFlush should be called on segment boundaries.
The idea was not to go to sleep without flushing, and to do it as little 
as possible.

> My testing seems to show that just adding a limit of 32 buffers to
> FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small. Not sure what a good 
"limit" would be. It could depend whether pages are close or not.

> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> might even be possible to later approximate that on windows using
> FlushViewOfFile().

I'm not sure that mmap/msync can be used for this purpose, because there 
is no real control it seems about where the file is mmapped.

> As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic 
> doesn't work correctly if tablespaces aren't actually sorted. I'm 
> actually inclined to fix this by simply removing the flag to 
> enable/disable sorting.

I do not think that there is a significant downside to having sort always 
on, but showing it requires being able to test it, hence the guc. The 
point of the guc is to demonstrate that the feature is harmless:-)

> Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
> so many places looks ugly, I want to push that to the underlying
> functions. If we add a different flushing approach we shouldn't have to
> touch several places that don't actually really care.

I agree that it is pretty ugly, but I do not think that you can remove 
them all. You need at least one for checking the guc and one for enabling 
the feature. Maybe their number could be reduced if the functions are 
switched to do-nothing stubs which are called nevertheless, but I was not 
keen on leaving unused code around when there is neither sync_file_range nor 
posix_fadvise.

> I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> seems to work well, with a bit less code actually.

Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 
element set in most case is an improvement.

> I'll post this after some more cleanup & testing.

I'll have a look when it is ready.

> I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
> particularly good. In high throughput workloads the 100ms sleep is too
> long, leading to bursty IO behaviour. If 1k+ buffers are written out a
> second, 100ms is a rather long sleep. For another, that we only sleep
> 100ms when the write rate is low makes the checkpoint finish rather
> quickly - on a slow disk (say microsd) that can cause unnecessary
> slowdowns for concurrent activity.  ISTM we should calculate the sleep
> time in a better way.

I also noted this point, but I'm not sure how to have a better approach, 
so I let it as it is. I tried 50 ms & 200 ms on some runs, without 
significant effect on performance for the test I ran then. The point of 
having not too small a value is that it provide some significant work to 
the IO subsystem without overflowing it. On average it does not matter. 
I'm unsure how it would interact with flushing. So I decided not to do 
anything about it. Maybe it should be a guc, but I would not know how to 
choose it.

> The SIGHUP behaviour is also weird.  Anyway, this probably belongs on a 
> new thread.

Probably. I did not try to look at that.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Mon, Oct 19, 2015 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
>
>
> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> might even be possible to later approximate that on windows using
> FlushViewOfFile().
>

I think this idea is worth exploring, especially because we can have a
Windows equivalent for this optimisation.  Can this option by any
chance lead to an increase in memory usage, as mmap has to
map the file(s)?


With Regards,
Amit Kapila.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-10-19 21:14:55 +0200, Fabien COELHO wrote:
> >In my performance testing it showed that calling PerformFileFlush() only
> >at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >problematic because it only is triggered while on schedule, and not when
> >behind.
> 
> When behind, the PerformFileFlush should be called on segment
> boundaries.

That means it's flushing up to a gigabyte of data at once. Far too
much. The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.

> >My testing seems to show that just adding a limit of 32 buffers to
> >FileAsynchronousFlush() leads to markedly better results.
> 
> Hmmm. 32 buffers means 256 KB, which is quite small.

Why? The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small. If you flush much more
sync_file_range starts to do work in the foreground.

> >I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> >sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> >might even be possible to later approximate that on windows using
> >FlushViewOfFile().
> 
> I'm not sure that mmap/msync can be used for this purpose, because there is
> no real control it seems about where the file is mmapped.

I'm not following? Why does it matter where a file is mapped?

I have had a friend (Christian Kruse, thanks!)  confirm that at least on
OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
that should be the case on freebsd too.

> >Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
> >so many places looks ugly, I want to push that to the underlying
> >functions. If we add a different flushing approach we shouldn't have to
> >touch several places that don't actually really care.
> 
> I agree that it is pretty ugly, but I do not think that you can remove them
> all.

Sure, never said all. But most.

> >I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> >seems to work well, with a bit less code actually.
> 
> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> set in most case is an improvement.

Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walks the array
multiple times. And I did see setups with ~15 tablespaces.

> >I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
> >particularly good. In high throughput workloads the 100ms sleep is too
> >long, leading to bursty IO behaviour. If 1k+ buffers are written out a
> >second, 100ms is a rather long sleep. For another, that we only sleep
> >100ms when the write rate is low makes the checkpoint finish rather
> >quickly - on a slow disk (say microsd) that can cause unnecessary
> >slowdowns for concurrent activity.  ISTM we should calculate the sleep
> >time in a better way.
> 
> I also noted this point, but I'm not sure how to have a better approach, so
> I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
> effect on performance for the test I ran then. The point of having not too
> small a value is that it provide some significant work to the IO subsystem
> without overflowing it.

I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>>> In my performance testing it showed that calling PerformFileFlush() only
>>> at segment boundaries and in CheckpointWriteDelay() can lead to rather
>>> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
>>> problematic because it only is triggered while on schedule, and not when
>>> behind.
>>
>> When behind, the PerformFileFlush should be called on segment
>> boundaries.
>
> That means it's flushing up to a gigabyte of data at once. Far too
> much.

Hmmm. I do not get it. There would not be gigabytes, there would be as 
much as was written since the last sleep, about 100 ms ago, which is not 
likely to be gigabytes?

> The implementation pretty always will go behind schedule for some
> time. Since sync_file_range() doesn't flush in the foreground I don't
> think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that 
is the point of the call before sleeping.

>>> My testing seems to show that just adding a limit of 32 buffers to
>>> FileAsynchronousFlush() leads to markedly better results.
>>
>> Hmmm. 32 buffers means 256 KB, which is quite small.
>
> Why?

Because the point of sorting is to generate sequential writes so that the 
HDD has a lot of aligned stuff to write without moving the head, and 32 is 
rather small for that.

> The aim is to not overwhelm the request queue - which is where the
> coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so 
I'd rather have 128? Also, it can be changed, so maybe it should really be 
a guc?

> If you flush much more sync_file_range starts to do work in the 
> foreground.

Argh, too bad. I would have hoped that it would just deal with it in an 
asynchronous way; this is not a "fsync" call, just a flush advice.

>>> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
>>> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
>>> might even be possible to later approximate that on windows using
>>> FlushViewOfFile().
>>
>> I'm not sure that mmap/msync can be used for this purpose, because there is
>> no real control it seems about where the file is mmapped.
>
> I'm not following? Why does it matter where a file is mapped?

Because it should be in shared buffers where pg needs it? You probably 
do not want to mmap all pg data files in user space for a large 
database? Or if so, currently the OS keeps the data in memory if it has 
enough space, but if you go the mmap route this cache management would be 
pg's responsibility, if I understand mmap and your intentions correctly.

> I have had a friend (Christian Kruse, thanks!)  confirm that at least on
> OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
> that should be the case on freebsd too.

Good. My concern is how mmap could be used, though, not the flushing part.

>> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
>> set in most case is an improvement.
>
> Yes, it'll not matter that much in many cases. But I rather disliked the
> NextBufferToWrite() implementation, especially that it walks the array
> multiple times. And I did see setups with ~15 tablespaces.

ISTM that it is rather an argument for taking the tablespace into the 
sorting, not necessarily for a binary heap.

>> I also noted this point, but I'm not sure how to have a better approach, so
>> I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
>> effect on performance for the test I ran then. The point of having not too
>> small a value is that it provide some significant work to the IO subsystem
>> without overflowing it.
>
> I don't think that makes much sense. All a longer sleep achieves is
> creating a larger burst of writes afterwards. We should really sleep
> adaptively.

It sounds reasonable, but what would be the criterion?

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-10-21 07:49:23 +0200, Fabien COELHO wrote:
>
> Hello Andres,
>
> >>>In my performance testing it showed that calling PerformFileFlush() only
> >>>at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >>>spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >>>problematic because it only is triggered while on schedule, and not when
> >>>behind.
> >>
> >>When behind, the PerformFileFlush should be called on segment
> >>boundaries.
> >
> >That means it's flushing up to a gigabyte of data at once. Far too
> >much.
>
> Hmmm. I do not get it. There would not be gigabytes,

I said 'up to a gigabyte' not gigabytes. But it actually can be more
than one if you're unlucky.

> there would be as much as was written since the last sleep, about 100
> ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot. And even so, it's pretty easy to get into
checkpoint scenarios with ~500 mbyte/s as a writeout rate. Only issuing
a sync_file_range() 10 times for that is obviously problematic.

> >The implementation pretty always will go behind schedule for some
> >time. Since sync_file_range() doesn't flush in the foreground I don't
> >think it's important to do the flushing in concert with sleeping.
>
> For me it is important to avoid accumulating too large flushes, and that is
> the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

> >>>My testing seems to show that just adding a limit of 32 buffers to
> >>>FileAsynchronousFlush() leads to markedly better results.
> >>
> >>Hmmm. 32 buffers means 256 KB, which is quite small.
> >
> >Why?
>
> Because the point of sorting is to generate sequential writes so that the
> HDD has a lot of aligned stuff to write without moving the head, and 32 is
> rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue. You can have merging
between IOs from either side.  But more importantly you can't merge that
many requests together anyway.

> >The aim is to not overwhelm the request queue - which is where the
> >coalescing is done. And usually that's rather small.
>
> That is an argument. How small, though? It seems to be 128 by default, so
> I'd rather have 128? Also, it can be changed, so maybe it should really be a
> guc?

I couldn't see any benefits above (and below) 32 on a 20 drive system,
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

> >If you flush much more sync_file_range starts to do work in the
> >foreground.
>
> Argh, too bad. I would have hoped that it would just deal with it in an
> asynchronous way,

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

> this is not a "fsync" call, just a flush advise.

sync_file_range isn't fadvise().

> Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
instead of sync_file_range().
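
Spelled out, a minimal sketch of that suggestion (not code from the patch; 
error handling is rudimentary and offset is assumed to be page-aligned):

    #include <sys/mman.h>
    #include <sys/types.h>

    /*
     * Hint writeback of a file range by mapping it, issuing an asynchronous
     * msync, and unmapping again.  Whether MS_ASYNC actually initiates
     * writeback of already-dirty page cache pages is platform-dependent.
     */
    static int
    hint_flush_mmap(int fd, off_t offset, size_t nbytes)
    {
        void   *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

        if (p == MAP_FAILED)
            return -1;
        if (msync(p, nbytes, MS_ASYNC) != 0)
        {
            (void) munmap(p, nbytes);
            return -1;
        }
        return munmap(p, nbytes);
    }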

> >>Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> >>set in most case is an improvement.
> >
> >Yes, it'll not matter that much in many cases. But I rather disliked the
> >NextBufferToWrite() implementation, especially that it walkes the array
> >multiple times. And I did see setups with ~15 tablespaces.
>
> ISTM that it is rather an argument for taking the tablespace into the
> sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?
    ts_heap = binaryheap_allocate(nb_spaces,
                                  ts_progress_cmp,
                                  NULL);

    spcContext = (FileFlushContext *)
        palloc(sizeof(FileFlushContext) * nb_spaces);

    for (i = 0; i < nb_spaces; i++)
    {
        TableSpaceCheckpointStatus *spc = &spcStatus[i];

        spc->progress_slice = ((float8) num_to_write) / (float8) spc->num_to_write;

        ResetFileFlushContext(&spcContext[i]);
        spc->flushContext = &spcContext[i];

        binaryheap_add_unordered(ts_heap, PointerGetDatum(&spcStatus[i]));
    }
    binaryheap_build(ts_heap);

and then

    while (!binaryheap_empty(ts_heap))
    {
        TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
            DatumGetPointer(binaryheap_first(ts_heap));
...
        ts->progress += ts->progress_slice;
        ts->num_written++;
...
        if (ts->num_written == ts->num_to_write)
        {
...
            binaryheap_remove_first(ts_heap);
        }
        else
        {
            /* update heap with the new progress */
            binaryheap_replace_first(ts_heap, PointerGetDatum(ts));
        }

> >>I also noted this point, but I'm not sure how to have a better approach, so
> >>I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
> >>effect on performance for the test I ran then. The point of having not too
> >>small a value is that it provide some significant work to the IO subsystem
> >>without overflowing it.
> >
> >I don't think that makes much sense. All a longer sleep achieves is
> >creating a larger burst of writes afterwards. We should really sleep
> >adaptively.
>
> It sounds reasonable, but what would be the criterion?

What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
   (computed via max_wal_size these days). I.e. is the percentage of
   used up WAL bigger than the percentage of written out buffers.

2) Are we on schedule based on checkpoint_timeout. I.e. is the
   percentage of checkpoint_timeout already passed bigger than the
   percentage of buffers written out.

So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.

It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.
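
As a rough illustration of that calculation (a sketch only, with made-up 
names, not the checkpointer's actual code):

    /*
     * Hypothetical adaptive sleep: spread the time remaining until the
     * checkpoint deadline over the buffers still to write, and never sleep
     * longer than the current fixed 100 ms.  Returns microseconds to sleep.
     */
    static long
    adaptive_sleep_usecs(long usecs_until_deadline, int buffers_remaining)
    {
        long    per_buffer;

        if (buffers_remaining <= 0 || usecs_until_deadline <= 0)
            return 0;               /* behind schedule: do not sleep */

        per_buffer = usecs_until_deadline / buffers_remaining;

        return (per_buffer > 100000L) ? 100000L : per_buffer;
    }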


I'm running benchmarks right now, they'll take a bit to run to
completion.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> there would be as much as was written since the last sleep, about 100
>> ms ago, which is not likely to be gigabytes?
>
> In many cases we don't sleep all that frequently - after one 100ms sleep
> we're already behind a lot.

I think that "being behind" is not a problem as such, it is really the way 
the scheduler has been designed and works, by keeping pace with time & 
wall progress by little bursts of writes. If you reduce the sleep time a 
lot then it would end up having writes interleaved with small sleeps, but 
then this would be bad for performance as the OS would lose the ability 
to write much data sequentially on the disk.

It does not mean that the default 100 ms is a good figure, but the "being 
behind" is a feature, not an issue as such.

> And even so, it's pretty easy to get into checkpoint scenarios with ~500 
> mbyte/s as a writeout rate.

Hmmmm. Not with my hardware:-)

> Only issuing a sync_file_range() 10 times for that is obviously 
> problematic.

Hmmm. Then it should depend on the expected write capacity of the 
underlying disks...

>>> The implementation pretty always will go behind schedule for some
>>> time. Since sync_file_range() doesn't flush in the foreground I don't
>>> think it's important to do the flushing in concert with sleeping.
>>
>> For me it is important to avoid accumulating too large flushes, and that is
>> the point of the call before sleeping.
>
> I don't follow this argument. It's important to avoid large flushes,
> therefore we potentially allow large flushes to accumulate?

On my simple test hardware the flushes are not large, I think, so the 
problem does not arise. Maybe I should check.

>>>>> My testing seems to show that just adding a limit of 32 buffers to
>>>>> FileAsynchronousFlush() leads to markedly better results.
>>>>
>>>> Hmmm. 32 buffers means 256 KB, which is quite small.
>>>
>>> Why?
>>
>> Because the point of sorting is to generate sequential writes so that the
>> HDD has a lot of aligned stuff to write without moving the head, and 32 is
>> rather small for that.
>
> A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
> data back. It just puts it into the write queue.

Yes.

> You can have merging between IOs from either side.  But more importantly 
> you can't merge that many requests together anyway.

Probably.

>>> The aim is to not overwhelm the request queue - which is where the
>>> coalescing is done. And usually that's rather small.
>>
>> That is an argument. How small, though? It seems to be 128 by default, so
>> I'd rather have 128? Also, it can be changed, so maybe it should really be a
>> guc?
>
> I couldn't see any benefits above (and below) 32 on a 20 drive system,

So it is one kind of (big) hardware. Assuming that pages are contiguous, 
how much is written on each disk depends on the RAID type, the stripe 
size, and when it is really written depends on the various cache (in the 
RAID HW card if any, on the disk, ...), so whether 32 at the OS level is 
the right size is pretty unclear to me. I would have said the larger the 
better, but indeed you should avoid blocking.

> so I doubt it's worthwhile. It's actually good for interactivity to
> allow other requests into the queue concurrently - otherwise other
> reads/writes will obviously have a higher latency...

Sure. Now on my tests, with my (old & little) hardware it seemed quite 
smooth. What I'm driving at is that what is good may be relative and 
depend on the underlying hardware, which makes it not obvious to choose 
the right parameter.

>>> If you flush much more sync_file_range starts to do work in the
>>> foreground.
>>
>> Argh, too bad. I would have hoped that the would just deal with in an
>> asynchronous way,
>
> It's even in the man page:
> "Note  that  even  this  may  block if you attempt to write more than
> request queue size."

Hmmm. What about choosing "request queue size * 0.5", then ?

>> Because it should be in shared buffers where pg needs it?
>
> Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
> instead of sync_file_range().

I think that I do not really understand how it may work, but possibly it 
could.

>> ISTM that it is rather an argument for taking the tablespace into the
>> sorting, not necessarily for a binary heap.
>
> I don't understand your problem with that. The heap specific code is
> small, smaller than your NextBufferToWrite() implementation?

You have not yet posted the updated version of the patch.

The complexity of the round-robin scan on the array is O(1) and very few 
instructions, plus some stop condition which is mostly true, I think, if the 
writes are balanced between tablespaces; there is no dynamic allocation 
in the data structure (it is an array). The binary heap is O(log(n)): 
probably there are dynamic allocations and frees when extracting/inserting 
something, there are function calls to rebalance the tree, and so on. Ok, 
"n" is expected to be small.

So basically, for me it is not obviously superior to the previous version. 
Now I'm also tired, so if it works reasonably I'll be fine with it.

> [... code extract ...]

>>> I don't think that makes much sense. All a longer sleep achieves is
>>> creating a larger burst of writes afterwards. We should really sleep
>>> adaptively.
>>
>> It sounds reasonable, but what would be the criterion?
>
> What IsCheckpointOnSchedule() does is essentially to calculate progress
> for two things:
> 1) Are we on schedule based on WAL segments until CheckPointSegments
>   (computed via max_wal_size these days). I.e. is the percentage of
>   used up WAL bigger than the percentage of written out buffers.
>
> 2) Are we on schedule based on checkpoint_timeout. I.e. is the
>   percentage of checkpoint_timeout already passed bigger than the
>   percentage of buffers written out.

> So the trick is just to compute the number of work items (e.g. buffers
> to write out) and divide the remaining time by it. That's how long you
> can sleep.

See discussion above. ISTM that the "bursts" are a useful feature of the 
checkpoint scheduler, especially with sorted buffers & flushes. You want 
to provide grouped writes that will be easily written to disk together. 
You do not want page writes issued one by one and interleaved with 
small sleeps.

> It's slightly trickier for WAL and I'm not sure it's equally
> important. But even there it shouldn't be too hard to calculate the
> amount of time till we're behind on schedule and only sleep that long.

The scheduler stops writing as soon as it has overtaken the progress, so 
it should be a very small time, but if you do that you would end up 
writing pages one by one, which is not desirable at all.

> I'm running benchmarks right now, they'll take a bit to run to
> completion.

Good.

I'm looking forward to having a look at the updated version of the patch.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> Here is a v13, which is just a rebase after 1aba62ec.

And here's v14. It's not something entirely ready. A lot of details have
changed, I unfortunately don't remember them all. But there are more
important things than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually
   correct. There are two problems: First, using the actual 'fd' number to
   reference a to-be-flushed file isn't meaningful. If there are lots
   of files open, fds get reused within fd.c. That part is fixed well enough
   by referencing the File instead of the fd. The bigger problem is that the
   infrastructure doesn't deal with files being closed. There can, which
   isn't that hard to trigger, be smgr invalidations causing the smgr handle
   and thus the file to be closed.

   I think this means that the entire flushing infrastructure actually
   needs to be hoisted up, onto the smgr/md level.

2) I noticed that sync_file_range() blocked far more often than I'd
   expected. Reading the kernel code that turned out to be caused by a
   pessimization in the kernel introduced years ago - in many situation
   SFR_WRITE waited for the writes. A fix for this will be in the 4.4
   kernel.

3) I found that latency wasn't improved much for workloads that are
   significantly bigger than shared buffers. The problem here is that
   neither bgwriter nor the backends have, so far, done
   sync_file_range() calls. That meant that the old problem of having
   gigabytes of dirty data that periodically get flushed out, still
   exists. Having these do flushes mostly attacks that problem.


Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers flushing and sorting is anywhere from a small
to a massive improvement, both in throughput and latency. Even without
the patch from 2), although fixing that improves things further.



What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%. The performance was still much more regular than before, i.e. no
more multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch the write-loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there the
checkpointer basically makes it less likely that victim buffers
need to be written either by the backends themselves or by
bgwriter. That means that the sorted checkpoint writes can, indirectly,
increase the number of unsorted writes by other processes :(

My benchmarking suggests that the effect is larger the shorter the
checkpoint timeout is. That seems to intuitively make sense, given the
above explanation attempt. If the checkpoint takes longer the clock hand
will almost certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet lagged, I still spent a fair amount of time thinking about
it. Seems to suggest that we need to bring back the setting to
enable/disable sorting :(


What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
   handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
   later can contain multiple elements like checkpoint, bgwriter,
   backends, ddl, bulk-writes. That seems better than adding GUCs for
   these separately. Then make the flush locations in the patch
   configurable using that.
3) I think we should remove the sort timing from the checkpoint logging
   before commit. It'll always be pretty short.


Greetings,

Andres Freund

Вложения

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> And here's v14. It's not something entirely ready.

I'm going to have a careful look at it.

> A lot of details have changed, I unfortunately don't remember them all. 
> But there are more important things than the details of the patch.
>
> I've played *a lot* with this patch. I found a bunch of issues:
>
> 1) The FileFlushContext infrastructure isn't actually
>   correct. There are two problems: First, using the actual 'fd' number to
>   reference a to-be-flushed file isn't meaningful. If there are lots
>   of files open, fds get reused within fd.c.

Hmm.

My assumption is that a file being used (i.e. with modified pages, being 
used for writes...) would not be closed before everything is cleared...

After some poking in the code, I think that this issue may indeed be 
there, although the probability of hitting it is close to 0, but alas not 
0:-)

To fix it, ISTM that it is enough to hold a "do not close" lock on the 
file while a flush is in progress (a short time), which would prevent 
mdclose from doing its stuff.

> That part is easily enough fixed by referencing the File instead of the fd. The bigger 
> problem is that the infrastructure doesn't deal with files being closed. 
> There can, which isn't that hard to trigger, be smgr invalidations 
> causing the smgr handle and thus the file to be closed.
>
> I think this means that the entire flushing infrastructure actually
> needs to be hoisted up, onto the smgr/md level.

Hmmm. I'm not sure that it is necessary, see my suggestion above.

> 2) I noticed that sync_file_range() blocked far more often than I'd
>   expected. Reading the kernel code that turned out to be caused by a
>   pessimization in the kernel introduced years ago - in many situation
>   SFR_WRITE waited for the writes. A fix for this will be in the 4.4
>   kernel.

Alas, Pg cannot help issues in the kernel.

> 3) I found that latency wasn't improved much for workloads that are
>   significantly bigger than shared buffers. The problem here is that
>   neither bgwriter nor the backends have, so far, done
>   sync_file_range() calls. That meant that the old problem of having
>   gigabytes of dirty data that periodically get flushed out, still
>   exists. Having these do flushes mostly attacks that problem.

I'm conscious that the patch only addresses *checkpointer* writes, not 
those from the bgwriter or backends. I agree that these will need to 
be addressed at some point as well, but given the time it takes to get a patch 
through, the more complex the slower (the sort propositions are 10 years old), 
I think this should be postponed for later.

> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers flushing and sorting is anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things further.

This is consistent with my experiments: sorting improves things, and 
flushing on top of sorting improves things further.

> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to a noticeable reduction in throughput. Up to
> 30%.

I did not see such behavior in the many tests I ran. Could you share more 
precise details so that I can try to reproduce this performance 
regression? (available memory, shared buffers, db size, ...).

> The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(

I'm quite surprised at such a large effect on throughput, though.

This explanation seems to suggest that if bgwriter/worker writes are 
sorted and/or coordinated with the checkpointer somehow then all would be 
well?

ISTM that this explanation could be checked by looking at whether 
bgwriter/worker writes are especially large compared to checkpointer 
writes in those cases with reduced throughput? The data is in the log.



> My benchmarking suggest that that effect is the larger, the shorter the
> checkpoint timeout is.

Hmmm. The shorter the timeout, the more likely the sorting is NOT to be 
effective, the more likely we go back to random I/Os, and maybe the more 
likely to see some effect of the sync strategy stuff.

> That seems to intuitively make sense, given the above explanation 
> attempt. If the checkpoint takes longer the clock hand will almost 
> certainly soon overtake checkpoints 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>   handling the issue of smgr invalidations.

Not sure that much is necessary, see above.

> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>   later can contain multiple elements like checkpoint, bgwriter,
>   backends, ddl, bulk-writes. That seems better than adding GUCs for
>   these separately. Then make the flush locations in the patch
>   configurable using that.

My 0.02€ on this point: I have not seen much of this style of guc 
elsewhere. The only ones I found while scanning the postgres configuration file are 
*_path and *_libraries. It seems to me that this would depart 
significantly from the usual style, so one guc per case, or one shared guc 
but with only on/off, would blend in more cleanly with the usual style.

> 3) I think we should remove the sort timing from the checkpoint logging
>   before commit. It'll always be pretty short.

I added it to show that it was really short, in response to concerns that 
my approach of just sorting through indexes (to reduce the memory needed, 
instead of copying the data to be sorted) might induce significant 
performance issues. I proved my point, but peer pressure made me switch 
to the larger-memory approach anyway.

I think it should be kept while the features are under testing. I do not 
think that it harms in any way.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi,

On 2015-11-12 15:31:41 +0100, Fabien COELHO wrote:
> >A lot of details have changed, I unfortunately don't remember them all.
> >But there are more important things than the details of the patch.
> >
> >I've played *a lot* with this patch. I found a bunch of issues:
> >
> >1) The FileFlushContext context infrastructure isn't actually
> >  correct. There's two problems: First, using the actual 'fd' number to
> >  reference a to-be-flushed file isn't meaningful. If there  are lots
> >  of files open, fds get reused within fd.c.
> 
> Hmm.
> 
> My assumption is that a file being used (i.e. with modifie pages, being used
> for writes...) would not be closed before everything is cleared...

That's likely, but far from guaranteed.

> After some poking in the code, I think that this issue may indeed be there,
> although the probability of hitting it is close to 0, but alas not 0:-)

I did hit it...

> To fix it, ITSM that it is enough to hold a "do not close lock" on the file
> while a flush is in progress (a short time) that would prevent mdclose to do
> its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct? If so, how would you deal with FileClose()
being called?


> >3) I found that latency wasn't improved much for workloads that are
> >  significantly bigger than shared buffers. The problem here is that
> >  neither bgwriter nor the backends have, so far, done
> >  sync_file_range() calls. That meant that the old problem of having
> >  gigabytes of dirty data that periodically get flushed out, still
> >  exists. Having these do flushes mostly attacks that problem.
> 
> I'm concious that the patch only addresses *checkpointer* writes, not those
> from bgwrither or backends writes. I agree that these should need to be
> addressed at some point as well, but given the time to get a patch through,
> the more complex the slower (sort propositions are 10 years old), I think
> this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might influence the checkpointer's
design as well.

> >What I did not expect, and what confounded me for a long while, is that
> >for workloads where the hot data set does *NOT* fit into shared buffers,
> >sorting often led to a noticeable reduction in throughput. Up to
> >30%.
> 
> I did not see such behavior in the many tests I ran. Could you share more
> precise details so that I can try to reproduce this performance regression?
> (available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from occurring.

I'll show actual numbers at some point, yes. I tried three different systems:

* my laptop, 16 GB RAM, 840 EVO 1TB as storage. With 2GB shared_buffers.
  Tried checkpoint timeouts from 60 to 300s. I could see issues in workloads
  ranging from scale 300 to 5000. Throughput regressions are visible for both
  sync_commit on/off workloads. Here the largest regressions were visible.

* my workstation: 24GB RAM, 2x E5520, a) RAID 10 of 4 4TB, 7.2krpm devices,
  b) RAID 1 of 2 m4 512GB SSDs. One of the latter was killed during the test.
  Both showed regressions, but smaller.

* EC2 d2.8xlarge, 244 GB RAM, 24 x 2000 HDD, 64GB shared_buffers. I tried
  scale 3000, 8000, 15000. Here sorting, without flushing, didn't lead much
  to regressions.


I think generally the regressions were visible with a) noticeable shared
buffers, b) workload not fitting into shared buffers, c) significant
throughput, leading to high cache replacement ratios.


Another thing that's worthwhile to mention, while not surprising, is
that the benefits of this patch are massively smaller when WAL and data
are separated onto different disks.  For workloads fitting into
shared_buffers I saw no performance difference - not particularly
surprising. I guess if you'd construct a case where the data, not WAL,
is the bottleneck that'd be different.  Also worthwhile to mention that
the separate-disks setup was noticeably faster.

> >The performance was still much more regular than before, i.e. no
> >more multi-second periods without any transactions happening.
> >
> >By now I think I know what's going on: Before the sorting portion of the
> >patch the write-loop in BufferSync() starts at the current clock hand,
> >by using StrategySyncStart(). But after the sorting that obviously
> >doesn't happen anymore - buffers are accessed in their sort order. By
> >starting at the current clock hand and moving on from there the
> >checkpointer basically makes it less likely that victim buffers
> >need to be written either by the backends themselves or by
> >bgwriter. That means that the sorted checkpoint writes can, indirectly,
> >increase the number of unsorted writes by other processes :(
> 
> I'm quite surprised at such a large effect on throughput, though.

Me too.


> This explanation seems to suggest that if bgwriter/workders write are sorted
> and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.


> ISTM that this explanation could be checked by looking whether
> bgwriter/workers writes are especially large compared to checkpointer writes
> in those cases with reduced throughput? The data is in the log.

What do you mean with "large"? Numerous?

> >My benchmarking suggest that that effect is the larger, the shorter the
> >checkpoint timeout is.
> 
> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
> effective

You mean, as evidenced by the results, or is that what you'd actually
expect?


> >2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
> >  later can contain multiple elements like checkpoint, bgwriter,
> >  backends, ddl, bulk-writes. That seems better than adding GUCs for
> >  these separately. Then make the flush locations in the patch
> >  configurable using that.
> 
> My 0,02€ on this point: I have not seen much of this style of guc elsewhere.
> The only one I found while scanning the postgres file are *_path and
> *_libraries. It seems to me that this would depart significantly from the
> usual style, so one guc per case, or one shared guc but with only on/off,
> would blend in more cleanly with the usual style.

Such a guc would allow one 'on' and 'off' setting, and either would
hopefully be the norm. That seems advantageous to me.


> >3) I think we should remove the sort timing from the checkpoint logging
> >  before commit. It'll always be pretty short.
> 
> I added it to show that it was really short, in response to concerns that my
> approach of just sorting through indexes to reduce the memory needed instead
> of copying the data to be sorted did not induce significant performance
> issues. I prooved my point, but peer pressure made me switch to larger
> memory anyway.

Grumble. I'm getting a bit tired about this topic. This wasn't even
remotely primarily about sorting speed, and you damn well know it.


> I think it should be kept while the features are under testing. I do not
> think that it harms in anyway.

That's why I said we should remove it *before commit*.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> To fix it, ITSM that it is enough to hold a "do not close lock" on the file
>> while a flush is in progress (a short time) that would prevent mdclose to do
>> its stuff.
>
> Could you expand a bit more on this? You're suggesting something like a
> boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.

> If that, how would you deal with FileClose() being called?

Just wait for the mutex, which would be held while flushes are accumulated 
into the flush context and released after the flush is performed and the 
fd is not necessary anymore for this purpose, which is expected to be 
short (at worst between the wake & sleep of the checkpointer, and just one 
file at a time).

>> I'm concious that the patch only addresses *checkpointer* writes, not those
>> from bgwrither or backends writes. I agree that these should need to be
>> addressed at some point as well, but given the time to get a patch through,
>> the more complex the slower (sort propositions are 10 years old), I think
>> this should be postponed for later.
>
> I think we need to have at least a PoC of all of the relevant
> changes. We're doing these to fix significant latency and throughput
> issues, and if the approach turns out not to be suitable for
> e.g. bgwriter or backends, that might have influence over checkpointer's
> design as well.

Hmmm. See below.

>>> What I did not expect, and what confounded me for a long while, is that
>>> for workloads where the hot data set does *NOT* fit into shared buffers,
>>> sorting often led to a noticeable reduction in throughput. Up to
>>> 30%.
>>
>> I did not see such behavior in the many tests I ran. Could you share more
>> precise details so that I can try to reproduce this performance regression?
>> (available memory, shared buffers, db size, ...).
>
>
> I generally found that I needed to disable autovacuum's analyze to get
> anything even close to stable numbers. The issue in described in
> http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
> otherwise badly kicks in. I basically just set
> autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from occuring.
>
> I'll show actual numbers at some point yes. I tried three different systems:
>
> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
>  shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to do tests with much larger timeouts. I 
would advise against a short timeout, esp. in a high-throughput system; the 
whole point of the checkpointer is to accumulate as many changes as 
possible.

I'll look into that.

>> This explanation seems to suggest that if bgwriter/workders write are sorted
>> and/or coordinated with the checkpointer somehow then all would be well?
>
> Well, you can't easily sort bgwriter/backend writes stemming from cache
> replacement. Unless your access patterns are entirely sequential the
> data in shared buffers will be laid out in a nearly entirely random
> order.  We could try sorting the data, but with any reasonable window,
> for many workloads the likelihood of actually achieving much with that
> seems low.

Maybe the sorting could be shared with others so that everybody uses the 
same order?

That would suggest to have one global sorting of buffers, maybe maintained 
by the checkpointer, which could be used by all processes that need to 
scan the buffers (in file order), instead of scanning them in memory 
order.

For this purpose, I think that the initial index-based sorting would 
suffice. Could be resorted periodically with some delay maintained in a 
guc, or when significant buffer changes have occurred (reads & writes).

>> ISTM that this explanation could be checked by looking whether
>> bgwriter/workers writes are especially large compared to checkpointer writes
>> in those cases with reduced throughput? The data is in the log.
>
> What do you mean with "large"? Numerous?

I mean the number of buffers written by bgwriter/worker is greater than 
what is written by the checkpointer. If all fits in shared buffers, 
bgwriter/worker mostly do not need to write anything and the checkpointer 
does all the writes.

The larger the memory needed, the more likely workers/bgwriter will have 
to kick in and generate random I/Os because nothing sensible is currently 
done, so this is consistent with your findings, although I'm surprised 
that it would have a large effect on throughput, as already said.

>> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
>> effective
>
> You mean, as evidenced by the results, or is that what you'd actually
> expect?

What I would expect...

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2015-11-12 17:44:40 +0100, Fabien COELHO wrote:
> 
> >>To fix it, ITSM that it is enough to hold a "do not close lock" on the file
> >>while a flush is in progress (a short time) that would prevent mdclose to do
> >>its stuff.
> >
> >Could you expand a bit more on this? You're suggesting something like a
> >boolean in the vfd struct?
> 
> Basically yes, I'm suggesting a mutex in the vdf struct.

I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.

> >* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
> > shared_buffers. Tried checkpoint timeouts from 60 to 300s.
> 
> Hmmm. This is quite short.

Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.


> >Well, you can't easily sort bgwriter/backend writes stemming from cache
> >replacement. Unless your access patterns are entirely sequential the
> >data in shared buffers will be laid out in a nearly entirely random
> >order.  We could try sorting the data, but with any reasonable window,
> >for many workloads the likelihood of actually achieving much with that
> >seems low.
> 
> Maybe the sorting could be shared with others so that everybody uses the
> same order?
> 
> That would suggest to have one global sorting of buffers, maybe maintained
> by the checkpointer, which could be used by all processes that need to scan
> the buffers (in file order), instead of scanning them in memory order.

Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.


> >>Hmmm. The shorter the timeout, the more likely the sorting NOT to be
> >>effective
> >
> >You mean, as evidenced by the results, or is that what you'd actually
> >expect?
> 
> What I would expect...

I don't see why then? If you very quickly write lots of data the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?


Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello,

>> Basically yes, I'm suggesting a mutex in the vdf struct.
>
> I can't see that being ok. I mean what would that thing even do? VFD
> isn't shared between processes, and if we get a smgr flush we have to
> apply it, or risk breaking other things.

Probably something is eluding my comprehension:-)

My basic assumption is that the fopen & fd are per process, so we just have 
to deal with the one in the checkpointer process, so it is enough that the 
checkpointer does not close the file while it is flushing things to it?

>>> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
>>> shared_buffers. Tried checkpoint timeouts from 60 to 300s.
>>
>> Hmmm. This is quite short.
>
> Indeed. I'd never do that in a production scenario myself. But
> nonetheless it showcases a problem.

I would say that it would render sorting ineffective because all the 
rewriting is done by bgwriter or workers, which does not totally explain 
why the throughput would be worse than before; I would expect it to be as 
bad as before...

>>> Well, you can't easily sort bgwriter/backend writes stemming from cache
>>> replacement. Unless your access patterns are entirely sequential the
>>> data in shared buffers will be laid out in a nearly entirely random
>>> order.  We could try sorting the data, but with any reasonable window,
>>> for many workloads the likelihood of actually achieving much with that
>>> seems low.
>>
>> Maybe the sorting could be shared with others so that everybody uses the
>> same order?
>>
>> That would suggest to have one global sorting of buffers, maybe maintained
>> by the checkpointer, which could be used by all processes that need to scan
>> the buffers (in file order), instead of scanning them in memory order.
>
> Uh. Cache replacement is based on an approximated LRU, you can't just
> remove that without serious regressions.

I understand that, but there is a balance to find. Generating random I/Os 
is very bad for performance, so the decision process must combine LRU/LFU 
heuristics with considering things in some order as well.

>>>> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
>>>> effective
>>>
>>> You mean, as evidenced by the results, or is that what you'd actually
>>> expect?
>>
>> What I would expect...
>
> I don't see why then? If you very quickly writes lots of data the OS
> will continously flush dirty data to the disk, in which case sorting is
> rather important?

What I have in mind is: the shorter the timeout, the fewer neighboring 
buffers will be touched, so the fewer nice sequential writes will be found 
by sorting them, so the weaker the positive impact on performance...

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>> Basically yes, I'm suggesting a mutex in the vdf struct.
>> 
>> I can't see that being ok. I mean what would that thing even do? VFD
>> isn't shared between processes, and if we get a smgr flush we have to
>> apply it, or risk breaking other things.
>
> Probably something is eluding my comprehension:-)
>
> My basic assumption is that the fopen & fd is per process, so we just have to 
> deal with the one in the checkpointer process, so it is enough that the 
> checkpointer does not close the file while it is flushing things to it?

Hmmm...

Maybe I'm a little bit too optimistic here, because it seems that I'm 
suggesting to create a deadlock if the checkpointer both has buffers 
waiting to be flushed and wishes to close the very same file that holds them.

So when wanting to close the file, the checkpointer should rather flush the 
outstanding flushes in wait and then close the fd, which suggests some 
global variable to hold the flush context so that this can be done.

Hmmm.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Wed, Nov 11, 2015 at 1:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> > Here is a v13, which is just a rebase after 1aba62ec.
>
>
> 3) I found that latency wasn't improved much for workloads that are
>    significantly bigger than shared buffers. The problem here is that
>    neither bgwriter nor the backends have, so far, done
>    sync_file_range() calls. That meant that the old problem of having
>    gigabytes of dirty data that periodically get flushed out, still
>    exists. Having these do flushes mostly attacks that problem.
>
>
> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers flushing and sorting is anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things further.
>
>
>
> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to a noticeable reduction in throughput. Up to
> 30%. The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(
>

That sounds like a tricky problem.  I think the way to improve the current
situation is to change the buffer allocation algorithm such that, instead of
the backend issuing the write for a dirty buffer, it will just continue to find
the next free buffer when it finds that the selected buffer is dirty, and if it
could not find a non-dirty buffer after a certain number of attempts, it will
signal the bgwriter to write out some buffers.  Now the writing algorithm of the
bgwriter has to be such that it picks the buffers in chunks from the
checkpoint-list, sorts them and then writes them.  Checkpoint also uses the same
checkpoint-list to flush the dirty buffers.  This will ensure that the writes
will always be sorted writes irrespective of which process does the writes.
There could be multiple ways to form this checkpoint-list, and one of the ways
could be that MarkBufferDirty() adds the buffer to such a list.  I think
following such a mechanism could solve the problem of unsorted writes in the
system, but it raises a question: what kind of latency could such a mechanism
introduce for a backend which signals the bgwriter after not finding a non-dirty
buffer for a certain number of attempts?  I think if we sense this could be a
problematic case, then we can make both the bgwriter and the checkpointer always
start from the next victim buffer and then traverse the checkpoint-list.
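
A toy standalone sketch of the allocation side of that idea (every name below is
invented for illustration; the real logic would live in StrategyGetBuffer() and
the bgwriter, and would be considerably more involved):

#include <stdbool.h>
#include <stdio.h>

#define NBUF 64
#define MAX_SKIP_ATTEMPTS 8      /* illustrative threshold, not a real GUC */

static bool dirty[NBUF];
static int  hand;                /* toy clock hand */
static int  bgwriter_chunks;

/* toy stand-in: bgwriter writes out one chunk of dirty buffers, here simply
 * in ascending buffer order to represent "sorted" writes */
static void
bgwriter_flush_sorted_chunk(void)
{
    int written = 0;

    bgwriter_chunks++;
    for (int i = 0; i < NBUF && written < 16; i++)
        if (dirty[i])
        {
            dirty[i] = false;
            written++;
        }
}

/* sketch of the idea: a backend looking for a victim keeps sweeping past
 * dirty buffers instead of writing them itself; after too many attempts it
 * asks the bgwriter for a sorted write-out and retries */
static int
get_victim_buffer(void)
{
    for (;;)
    {
        for (int attempts = 0; attempts < MAX_SKIP_ATTEMPTS; attempts++)
        {
            int buf = hand++ % NBUF;

            if (!dirty[buf])
                return buf;          /* clean buffer: recycle it directly */
        }
        bgwriter_flush_sorted_chunk();   /* no unsorted backend write */
    }
}

int main(void)
{
    for (int i = 0; i < NBUF; i++)
        dirty[i] = true;             /* worst case: pool entirely dirty */

    for (int i = 0; i < 100; i++)
    {
        int victim = get_victim_buffer();
        dirty[victim] = true;        /* backend immediately redirties it */
    }
    printf("sorted bgwriter write-outs triggered: %d\n", bgwriter_chunks);
    return 0;
}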
 
>
> My benchmarking suggest that that effect is the larger, the shorter the
> checkpoint timeout is. That seems to intuitively make sense, given the
> above explanation attempt. If the checkpoint takes longer the clock hand
> will almost certainly soon overtake checkpoints 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>    handling the issue of smgr invalidations.
> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>    later can contain multiple elements like checkpoint, bgwriter,
>    backends, ddl, bulk-writes. That seems better than adding GUCs for
>    these separately. Then make the flush locations in the patch
>    configurable using that.
> 3) I think we should remove the sort timing from the checkpoint logging
>    before commit. It'll always be pretty short.
>

It seems for now you have left out the windows specific implementation
in pg_flush_data().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Hmmm...
>
> Maybe I'm a little bit too optimistic here, because it seems that I'm 
> suggesting to create a dead lock if the checkpointer has both buffers to 
> flush in waiting and wishes to close the very same file that holds them.
>
> So on wanting to close the file the checkpointer should rather flushes the 
> outstanding flushes in wait and then close the fd, which suggest some global 
> variable to hold flush context so that this can be done.
>
> Hmmm.

On third (fourth, fifth:-) thoughts:

The vfd (virtual file descriptor?) structure in the checkpointer could 
keep a pointer to the current flush if it concerns this fd, so that if it 
decides to close if while there is a write in progress (I'm still baffled 
at why and when the checkpointer process would take such a decision, maybe 
while responding to some signals, because it seems that there is no such 
event in the checkpointer loop itself...) then on close the process could 
flush before close, or just close which probably would induce flushing, 
but at least cleanup the structure so that the the closed fd would not be 
flushed after being closed and result in an error.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Tomas Vondra
Дата:
Hi,

I'm planning to do some thorough benchmarking of the patches proposed in 
this thread, on various types of hardware (10k SAS drives and SSDs). But 
is that actually needed? I see Andres did some testing, as he posted 
a summary of the results on 11/12, but I don't see any actual results or 
even info about what benchmarks were done (pgbench?).

If yes, do we only want to compare 0001-ckpt-14-andres.patch against 
master, or do we need to test one of the previous Fabien's patches?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Tomas,

> I'm planning to do some thorough benchmarking of the patches proposed in this 
> thread, on various types of hardware (10k SAS drives and SSDs). But is that 
> actually needed? I see Andres did some testing, as he posted summary of the 
> results on 11/12, but I don't see any actual results or even info about what 
> benchmarks were done (pgbench?).
>
> If yes, do we only want to compare 0001-ckpt-14-andres.patch against master, 
> or do we need to test one of the previous Fabien's patches?

My 0.02€,

Although I disagree with some aspects of Andres' patch, I'm not a committer 
and I'm tired of arguing. I'm just planning to do minor changes to Andres' 
version to fix a potential issue if the file is closed while flushing is 
in progress, but that will not change the overall shape of it.

So testing on Andres version seems relevant to me.

For SSDs the performance impact should be limited. For disks it should be 
significant if there is no big cache in front of them. There were some 
concerns raised for some loads in the thread (shared memory smaller than 
needed, I think?); if you can include such cases that would be great. My 
guess is that it will not be very beneficial in those cases because the 
writing is mostly done by the bgwriter & workers, and these writes are 
still random.

-- 
Fabien.

Re: checkpointer continuous flushing

От
Michael Paquier
Дата:
On Thu, Dec 17, 2015 at 4:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Tomas,
>
>> I'm planning to do some thorough benchmarking of the patches proposed in
>> this thread, on various types of hardware (10k SAS drives and SSDs). But is
>> that actually needed? I see Andres did some testing, as he posted summary of
>> the results on 11/12, but I don't see any actual results or even info about
>> what benchmarks were done (pgbench?).
>>
>> If yes, do we only want to compare 0001-ckpt-14-andres.patch against
>> master, or do we need to test one of the previous Fabien's patches?
>
>
> My 0.02€,
>
> Although I disagree with some aspects of Andres patch, I'm not a committer
> and I'm tired of arguing. I'm just planing to do minor changes to Andres
> version to fix a potential issue if the file is closed which flushing is in
> progress, but that will not change the overall shape of it.
>
> So testing on Andres version seems relevant to me.
>
> For SSD the performance impact should be limited. For disk it should be
> significant if there is no big cache in front of it. There were some
> concerns raised for some loads in the thread (shared memory smaller than
> needed I think?), if you can include such cases that would be great. My
> guess is that it should be not very beneficial in this case because the
> writing is mostly done by bgwriter & worker in this case, and these are
> still random.

As there are still plans to move on regarding tests (and because this
patch makes a difference), this is moved to next CF.
--
Michael



Re: checkpointer continuous flushing

От
Tomas Vondra
Дата:
Hi,

On 12/16/2015 08:27 PM, Fabien COELHO wrote:
>
> Hello Tomas,
>
>> I'm planning to do some thorough benchmarking of the patches proposed
>> in this thread, on various types of hardware (10k SAS drives and
>> SSDs). But is that actually needed? I see Andres did some testing, as
>> he posted summary of the results on 11/12, but I don't see any actual
>> results or even info about what benchmarks were done (pgbench?).
>>
>> If yes, do we only want to compare 0001-ckpt-14-andres.patch against
>> master, or do we need to test one of the previous Fabien's patches?
>
> My 0.02€,
>
> Although I disagree with some aspects of Andres patch, I'm not a
> committer and I'm tired of arguing. I'm just planing to do minor changes
> to Andres version to fix a potential issue if the file is closed which
> flushing is in progress, but that will not change the overall shape of it.
>
> So testing on Andres version seems relevant to me.

The patch no longer applies to master. Can someone rebase it?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-06 21:01:47 +0100, Tomas Vondra wrote:
> >Although I disagree with some aspects of Andres patch, I'm not a
> >committer and I'm tired of arguing. I'm just planing to do minor changes
> >to Andres version to fix a potential issue if the file is closed which
> >flushing is in progress, but that will not change the overall shape of it.

Are you working on that aspect?

> >So testing on Andres version seems relevant to me.
> 
> The patch no longer applies to master. Can someone rebase it?

I'm working on an updated version, trying to mitigate the performance
regressions I observed.

Andres



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
<Ooops, wrong from address, resent, sorry for the noise>

Hello Andres,

>>> Although I disagree with some aspects of Andres patch, I'm not a
>>> committer and I'm tired of arguing. I'm just planing to do minor changes
>>> to Andres version to fix a potential issue if the file is closed which
>>> flushing is in progress, but that will not change the overall shape of 
>>> it.
> 
> Are you working on that aspect?

I read your patch and I know what I want to try in order to have a small and simple 
fix. I must admit that I have not really understood under which conditions the 
checkpointer would decide to close a file, but that does not mean that the 
potential issue should not be addressed.

Also, I gave some thought to what should be done for bgwriter random IOs. 
The idea is to implement some per-file sorting there and then do some LRU/LFU 
combining. It would not interact much with the checkpointer, so for me the two 
issues should be kept separate and this should not preclude changing the 
checkpointer, esp. given the significant performance benefit of the patch.

However, all this is still in my stack of things to do, and I did not have much 
time in the Fall for that. I may have more time in the coming weeks. I'm fine 
if things are updated and performance figures are collected in between, I'll 
take it from where it is when I have time, if something remains to be done.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> I read your patch and I know what I want to try to have a small and simple
> fix. I must admit that I have not really understood in which condition the
> checkpointer would decide to close a file, but that does not mean that the
> potential issue should not be addressed.

There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
to be flushed at the same time.

But more importantly: You designed the API to be generic because you
wanted it to be usable for other purposes as well. And for that it
certainly needs to deal with that.

> Also, I gave some thoughts about what should be done for bgwriter random
> IOs. The idea is to implement some per-file sorting there and then do some
> LRU/LFU combing. It would not interact much with the checkpointer, so for me
> the two issues should be kept separate and this should not preclude changing
> the checkpointer, esp. given the significant performance benefit of the
> patch.

Well, the problem is that the patch significantly regresses some cases
right now. So keeping them separate isn't particularly feasible.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello,

>> I read your patch and I know what I want to try to have a small and simple
>> fix. I must admit that I have not really understood in which condition the
>> checkpointer would decide to close a file, but that does not mean that the
>> potential issue should not be addressed.
>
> There's a trivial example: Consider three tablespaces and
> max_files_per_process = 2. The balancing can easily cause three files
> being flushed at the same time.

Indeed. Thanks for this explanation!

> But more importantly: You designed the API to be generic because you
> wanted it to be usable for other purposes as well. And for that it
> certainly needs to deal with that.

Yes, I'm planning to try to do the minimum possible damage to the current 
API to fix the issue.

>> Also, I gave some thoughts about what should be done for bgwriter random
>> IOs. The idea is to implement some per-file sorting there and then do some
>> LRU/LFU combing. It would not interact much with the checkpointer, so for me
>> the two issues should be kept separate and this should not preclude changing
>> the checkpointer, esp. given the significant performance benefit of the
>> patch.
>
> Well, the problem is that the patch significantly regresses some cases
> right now. So keeping them separate isn't particularly feasible.

I have not seen significant regressions on my many test runs. In 
particular, I would not consider that having a tps dip in cases where 
postgresql is doing 0 tps most of the time anyway (i.e. pg is offline) 
because of random IO issues should be a blocker.

As I understood it, the regressions occur when the checkpointer is less 
used, i.e. bgwriter is doing most of the writes, but this does not change 
much whether the checkpointer sorts buffers or not, and the overall 
behavior of pg is very bad anyway in these cases.

Also I think that coupling the two issues is a recipe for never getting 
anything done in the end and keeping the current awful behavior:-(

The solution on the bgwriter front is somewhat similar to the checkpointer's, 
but from a code point of view there is minimal interaction, so I would 
really separate them, esp. as the bgwriter part will require extensive 
testing and discussion as well.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-07 12:50:07 +0100, Fabien COELHO wrote:
> >But more importantly: You designed the API to be generic because you
> >wanted it to be usable for other purposes as well. And for that it
> >certainly needs to deal with that.
> 
> Yes, I'm planning to try to do the minimum possible damage to the current
> API to fix the issue.

What's your thought there? Afaics it's infeasible to do the flushing at
the fd.c level.

Andres



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>> Yes, I'm planning to try to do the minimum possible damage to the current
>> API to fix the issue.
>
> What's your thought there? Afaics it's infeasible to do the flushing at
> the fd.c level.

I thought of adding a pointer to the current flush structure at the vfd 
level, so that on closing a file with a flush in progress the flush can be 
done and the structure properly cleaned up, hence later the checkpointer 
would see a clean thing and be able to skip it instead of generating 
flushes on a closed file or on a different file...

Maybe I'm missing something, but that is the plan I had in mind.
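
A rough standalone sketch of that back-pointer idea (all structure and function
names below are invented for illustration; the real code would live in fd.c and
in the patch's flush context):

#include <stdio.h>
#include <sys/types.h>

/* toy flush context: one accumulated range awaiting a flush call */
typedef struct ToyFlushContext
{
    int   fd;        /* file the pending range belongs to */
    off_t offset;    /* start of the accumulated dirty range */
    off_t nbytes;    /* length of the accumulated dirty range */
} ToyFlushContext;

/* toy vfd entry: the back-pointer is the addition being discussed */
typedef struct ToyVfd
{
    int              fd;
    ToyFlushContext *pending;   /* non-NULL while a flush is accumulating */
} ToyVfd;

static void
toy_issue_flush(ToyFlushContext *ctx)
{
    /* the real code would call sync_file_range()/posix_fadvise() here */
    printf("flush fd=%d offset=%lld len=%lld\n",
           ctx->fd, (long long) ctx->offset, (long long) ctx->nbytes);
    ctx->nbytes = 0;
}

static void
toy_file_close(ToyVfd *vfd)
{
    /* drain the outstanding flush and clear the back-pointer, so nothing
     * ever issues a flush against a closed (or reused) descriptor */
    if (vfd->pending != NULL)
    {
        toy_issue_flush(vfd->pending);
        vfd->pending = NULL;
    }
    /* close(vfd->fd) would happen here */
}

int main(void)
{
    ToyFlushContext ctx = { .fd = 42, .offset = 8192, .nbytes = 65536 };
    ToyVfd          vfd = { .fd = 42, .pending = &ctx };

    toy_file_close(&vfd);       /* the flush is drained before the close */
    return 0;
}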

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-07 13:07:33 +0100, Fabien COELHO wrote:
> 
> >>Yes, I'm planning to try to do the minimum possible damage to the current
> >>API to fix the issue.
> >
> >What's your thought there? Afaics it's infeasible to do the flushing tat
> >the fd.c level.
> 
> I thought of adding a pointer to the current flush structure at the vfd
> level, so that on closing a file with a flush in progress the flush can be
> done and the structure properly cleaned up, hence later the checkpointer
> would see a clean thing and be able to skip it instead of generating flushes
> on a closed file or on a different file...
> 
> Maybe I'm missing something, but that is the plan I had in mind.

That might work, although it'd not be pretty (not fatally so
though). But I'm inclined to go a different way: I think it's a mistake
to do flushing based on a single file. It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes. Imo
that means that we'd better track writes on a relfilenode + block number
level.
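
A standalone sketch of that alternative, with invented names: pending flushes
are remembered as (relfilenode, block) pairs in a small fixed-size array, and
when the array fills up it is sorted and neighbouring blocks are merged into
ranges before the flush calls would be issued:

#include <stdio.h>
#include <stdlib.h>

#define MAX_PENDING 32           /* illustrative limit, not a real GUC */

typedef struct PendingFlush
{
    int relfilenode;             /* which relation file */
    int block;                   /* block number within it */
} PendingFlush;

static PendingFlush pending[MAX_PENDING];
static int npending;

static int
cmp_pending(const void *a, const void *b)
{
    const PendingFlush *pa = a, *pb = b;

    if (pa->relfilenode != pb->relfilenode)
        return pa->relfilenode - pb->relfilenode;
    return pa->block - pb->block;
}

/* sort the outstanding requests, merge neighbouring blocks, "issue" ranges */
static void
flush_pending(void)
{
    qsort(pending, npending, sizeof(PendingFlush), cmp_pending);

    for (int i = 0; i < npending;)
    {
        int rel = pending[i].relfilenode;
        int start = pending[i].block;
        int end = start;

        while (i + 1 < npending &&
               pending[i + 1].relfilenode == rel &&
               pending[i + 1].block <= end + 1)
            end = pending[++i].block;

        printf("flush rel=%d blocks %d..%d\n", rel, start, end);
        i++;
    }
    npending = 0;
}

/* record a write; once the limit is hit, everything gets flushed at once */
static void
remember_write(int relfilenode, int block)
{
    pending[npending].relfilenode = relfilenode;
    pending[npending].block = block;
    if (++npending == MAX_PENDING)
        flush_pending();
}

int main(void)
{
    /* interleaved writes to two relations still come out as merged ranges */
    for (int b = 0; b < 16; b++)
    {
        remember_write(1000, b);
        remember_write(2000, 100 + b);
    }
    flush_pending();             /* drain whatever is left over */
    return 0;
}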

Andres



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> I thought of adding a pointer to the current flush structure at the vfd
>> level, so that on closing a file with a flush in progress the flush can be
>> done and the structure properly cleaned up, hence later the checkpointer
>> would see a clean thing and be able to skip it instead of generating flushes
>> on a closed file or on a different file...
>>
>> Maybe I'm missing something, but that is the plan I had in mind.
>
> That might work, although it'd not be pretty (not fatally so
> though).

Alas, any solution has to communicate somehow between the API levels, so 
it cannot be "pretty", although we should avoid the worst.

> But I'm inclined to go a different way: I think it's a mistake to do 
> flushing based on a single file. It seems better to track a fixed number
> of outstanding 'block flushes', independent of the file. Whenever the 
> number of outstanding blocks is exceeded, sort that list, and flush all 
> outstanding flush requests after merging neighbouring flushes.

Hmmm. I'm not sure I understand your strategy.

I do not think that flushing without a prior sorting would be effective, 
because there is no clear reason why buffers written together would then 
be next to each other and thus give sequential write benefits; we would 
just get flushed random IO. I tested that and it worked badly.

One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the 
thread, so it makes sense to limit this cost, hence the aggregation. This 
removed some performance regressions I had in some cases.

Also, the granularity of the buffer flush call is a file + offset + size, 
so necessarily it should be done this way (i.e. per file).

Once buffers are sorted per file and by offset within the file, written 
buffers are as close as possible one after the other, the merging is very 
easy to compute (it is done on the fly, no need to keep a list of 
buffers for instance), and it is optimally effective. When the 
checkpointed file changes we will never go back to it before the next 
checkpoint, so there is no reason not to flush right then.
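
To illustrate the "on the fly" merging with no stored list, a standalone toy
sketch (names invented): writes arrive already sorted by (file, block), only
the current range is kept, and a flush is issued whenever the next block is not
contiguous or belongs to another file:

#include <stdio.h>

/* current accumulated range; nothing else needs to be remembered */
static int cur_file = -1;
static int cur_start, cur_end;

static void
issue_range_flush(void)
{
    if (cur_file >= 0)
        printf("flush file=%d blocks %d..%d\n", cur_file, cur_start, cur_end);
    cur_file = -1;
}

/* called once per checkpoint write, in sorted (file, block) order */
static void
note_written_block(int file, int block)
{
    if (file == cur_file && block == cur_end + 1)
    {
        cur_end = block;             /* contiguous: just extend the range */
        return;
    }
    issue_range_flush();             /* gap or new file: flush what we had */
    cur_file = file;
    cur_start = cur_end = block;
}

int main(void)
{
    /* sorted checkpoint writes: two contiguous runs plus a lone block */
    int writes[][2] = { {7, 10}, {7, 11}, {7, 12}, {7, 40}, {9, 3}, {9, 4} };

    for (int i = 0; i < 6; i++)
        note_written_block(writes[i][0], writes[i][1]);
    issue_range_flush();             /* end of checkpoint: flush remainder */
    return 0;
}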

So basically I do not see a clear positive advantage to your suggestion, 
especially when taking into consideration the scheduling process of the 
scheduler:

In effect the checkpointer already works with little bursts of activity 
between sleep phases, so that it writes buffers a few at a time, so it may 
already work more or less as you expect, but not for the same reason.

The closest strategy that I experimented with, which is maybe close to your 
suggestion, was to manage a minimum number of buffers to write when awakened 
and to change the sleep delay in between, but I had no clear way to choose 
values, and the experiments I did did not show a significant performance 
impact when varying these parameters, so I kept that out. If you find a 
magic number of buffers which results in consistently better performance, 
fine with me, but this is independent of aggregating before or after.

> Imo that means that we'd better track writes on a relfilenode + block 
> number level.

I do not think that it is a better option. Moreover, the current approach 
has been proven to be very effective on hundreds of runs, so redoing it 
differently for the sake of it does not look like good resource 
allocation.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-07 16:05:32 +0100, Fabien COELHO wrote:
> >But I'm inclined to go a different way: I think it's a mistake to do
> >flusing based on a single file. It seems better to track a fixed number of
> >outstanding 'block flushes', independent of the file. Whenever the number
> >of outstanding blocks is exceeded, sort that list, and flush all
> >outstanding flush requests after merging neighbouring flushes.
> 
> Hmmm. I'm not sure I understand your strategy.
> 
> I do not think that flushing without a prior sorting would be effective,
> because there is no clear reason why buffers written together would then be
> next to the other and thus give sequential write benefits, we would just get
> flushed random IO, I tested that and it worked badly.

Oh, I was thinking of sorting & merging these outstanding flushes. Sorry
for not making that clear.


> One of the point of aggregating flushes is that the range flush call cost
> is significant, as shown by preliminary tests I did, probably up in the
> thread, so it makes sense to limit this cost, hence the aggregation. These
> removed some performation regression I had in some cases.

FWIW, my tests show that flushing for clean ranges is pretty cheap.


> Also, the granularity of the buffer flush call is a file + offset + size, so
> necessarily it should be done this way (i.e. per file).

What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.


> Once buffers are sorted per file and offset within file, then written
> buffers are as close as possible one after the other, the merging is very
> easy to compute (it is done on the fly, no need to keep the list of buffers
> for instance), it is optimally effective, and when the checkpointed file
> changes then we will never go back to it before the next checkpoint, so
> there is no reason not to flush right then.

Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces of about the same number of dirty buffers.


> So basically I do not see a clear positive advantage to your suggestion,
> especially when taking into consideration the scheduling process of the
> scheduler:

I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.



> >Imo that means that we'd better track writes on a relfilenode + block
> >number level.
> 
> I do not think that it is a better option. Moreover, the current approach
> has been proven to be very effective on hundreds of runs, so redoing it
> differently for the sake of it does not look like good resource allocation.

For a subset of workloads, yes.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> One of the point of aggregating flushes is that the range flush call cost
>> is significant, as shown by preliminary tests I did, probably up in the
>> thread, so it makes sense to limit this cost, hence the aggregation. These
>> removed some performation regression I had in some cases.
>
> FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but I had a few % tps regressions 
in some cases without aggregating, and aggregating was enough to avoid 
these small regressions.

>> Also, the granularity of the buffer flush call is a file + offset + size, so
>> necessarily it should be done this way (i.e. per file).
>
> What syscalls we issue, and at what level we track outstanding flushes,
> doesn't have to be the same.

Sure. But the current version is simple, efficient and proven by many 
runs, so there should be a very strong argument to justify a significant 
benefit to change the approach, and I see no such thing in your arguments.

For me the current approach is optimal for the checkpointer, because it 
takes advantage of all available information to perform a better job.

>> Once buffers are sorted per file and offset within file, then written
>> buffers are as close as possible one after the other, the merging is very
>> easy to compute (it is done on the fly, no need to keep the list of buffers
>> for instance), it is optimally effective, and when the checkpointed file
>> changes then we will never go back to it before the next checkpoint, so
>> there is no reason not to flush right then.
>
> Well, that's true if there's only one tablespace, but e.g. not the case
> with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there was one flushing 
structure per tablespace, each doing its own flushing on its files, so it 
should work the same, only the writing intensity is divided by the number 
of tablespaces? Or am I missing something?

>> So basically I do not see a clear positive advantage to your suggestion,
>> especially when taking into consideration the scheduling process of the
>> scheduler:
>
> I don't think it makes a big difference for the checkpointer alone, but
> it makes the interface much more suitable for other processes, e.g. the
> bgwriter, and normal backends.

Hmmm.

ISTM that the requirements are not exactly the same for the bgwriter and 
backends vs the checkpointer. The checkpointer has the advantage of being 
able to plan its IOs over the long term (volume & time are known...) and the 
implementation takes full benefit of this planning by sorting and 
scheduling and flushing buffers so as to generate as many sequential 
writes as possible.

The bgwriter and backends have a much shorter vision (a few seconds, or 
just one query being processed), so the solution will be less efficient and 
probably messier on the coding side. This is life. I do not see why not 
to take the benefit of full planning in the checkpointer just because 
other processes cannot do the same, especially as under plenty of loads 
the checkpointer does most of the writing and so is the limiting factor.

So I do not buy your suggestion for the checkpointer. Maybe it will be the 
way to go for bgwriter and backends, then fine for them.

>>> Imo that means that we'd better track writes on a relfilenode + block
>>> number level.
>>
>> I do not think that it is a better option. Moreover, the current approach
>> has been proven to be very effective on hundreds of runs, so redoing it
>> differently for the sake of it does not look like good resource allocation.
>
> For a subset of workloads, yes.

Hmmm. What I understood is that the workloads that have some performance 
regressions (regressions that I have *not* seen in the many tests I ran) 
are not due to checkpointer IOs, but rather occur in settings where most of the 
writes are done by backends or the bgwriter.

I do not see the point of rewriting the checkpointer for them, although 
obviously I agree that something has to be done also for the other 
processes.

Maybe if all the writes (bgwriter and checkpointer) were performed by the 
same process then some dynamic mixing and sorting and aggregating would 
make sense, but this is currently not the case, and it would probably have 
a quite limited effect.

Basically I do not understand how changing the flushing organisation as 
you suggest would improve the checkpointer performance significantly, for 
me it should only degrade the performance compared to the current version, 
as far as the checkpointer is concerned.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
> Hmmm. What I understood is that the workloads that have some performance
> regressions (regressions that I have *not* seen in the many tests I ran) are
> not due to checkpointer IOs, but rather in settings where most of the writes
> is done by backends or bgwriter.

As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
shared_buffers). That quite drastically alters the performance
characteristics here, because you suddenly have lots of synchronous read
IO thrown into the mix.


Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.


> I do not see the point of rewriting the checkpointer for them, although
> obviously I agree that something has to be done also for the other
> processes.

Rewriting the checkpointer and fixing the flush interface in a more
generic way aren't the same thing at all.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> Hmmm. What I understood is that the workloads that have some performance
>> regressions (regressions that I have *not* seen in the many tests I ran) are
>> not due to checkpointer IOs, but rather in settings where most of the writes
>> is done by backends or bgwriter.
>
> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers).

Indeed, I think I ran some, but not many with such characteristics.

> That quite drastically alters the performance characteristics here, 
> because you suddenly have lots of synchronous read IO thrown into the 
> mix.

If I understand this point correctly...

I would expect the overall performance to be abysmal in such a situation 
because you get only intermixed *random* reads and writes: as you point 
out, synchronous *random* reads (very slow), but on the write side the 
IOs are mostly random as well, even on the checkpointer side, because there is 
not much to aggregate to get sequential writes.

Now why would that degrade performance significantly? For me it should 
render the sorting/flushing less and less effective, and it would go back 
to the previous performance levels...

Or maybe it is only the flushing itself which degrades performance, as you 
point out, because then you have some synchronous (synced) writes as well 
as reads, as opposed to just the reads before, without the patch.

If this is indeed the issue, then the solution to avoid the regression is 
*not* to flush, so that the OS IO scheduler is less constrained in its job 
and can be slightly more effective (well, we are talking of abysmal random 
IO disk performance here, so "effective" would be somewhere between slightly 
more and slightly less very very very bad).

Maybe a trick could be not to aggregate and flush when buffers in the same 
file are too far apart anyway, for instance based on some threshold? 
This can be implemented locally when deciding whether to merge buffer 
flushes and whether to flush at all, so it would fit the current code quite 
simply.
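
For illustration, a minimal sketch of such a threshold test; everything
here, including the names FlushRange, schedule_flush, issue_flush and the
threshold value, is made up for this sketch and is not taken from the patch:

#include <stdint.h>

typedef uint32_t BlockNumber;            /* stand-in for PostgreSQL's BlockNumber */

#define NEIGHBOUR_FLUSH_THRESHOLD 16     /* blocks; made-up value */

typedef struct FlushRange
{
    int         fd;                      /* file of the pending range, -1 if none */
    BlockNumber start;                   /* first block of the pending range */
    BlockNumber end;                     /* last block of the pending range */
} FlushRange;

/* would call sync_file_range()/posix_fadvise() on blocks start..end of fd */
extern void issue_flush(int fd, BlockNumber start, BlockNumber end);

/*
 * Called for each buffer the checkpointer has just written, in sorted
 * order.  Extend the pending range only if the new block lands close
 * enough; otherwise flush the pending range and start a new one.
 */
static void
schedule_flush(FlushRange *range, int fd, BlockNumber blkno)
{
    if (range->fd == fd && blkno <= range->end + NEIGHBOUR_FLUSH_THRESHOLD)
    {
        range->end = blkno;
        return;
    }

    if (range->fd >= 0)
        issue_flush(range->fd, range->start, range->end);
    range->fd = fd;
    range->start = blkno;
    range->end = blkno;
}

The only change compared to unconditional merging is the distance test in
the first branch; buffers that are far apart in the file simply end up in
separate (or no) flush ranges.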

Now my understanding of the sync_file_range call is that it is advice 
to flush the stuff, but it is still asynchronous in nature, so whether it 
would impact performance that badly depends on the OS IO scheduler. Also, 
I would like to check whether, under the "regressed performance" (in tps 
terms) that you observed, pg is more or less responsive. It could be that 
the average performance is better but pg is offline longer on fsync. In 
that case, I would consider it better to have lower tps *if* pg 
responsiveness is significantly improved.
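
For reference, here is a minimal sketch of the call being discussed
(Linux-specific): with only the SYNC_FILE_RANGE_WRITE flag, sync_file_range
initiates writeback of the given range and returns without waiting for the
data to reach disk, although it can still block if the device's request
queue is congested. The flush_hint wrapper name is made up for the sketch.

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Ask the kernel to start writing back 'nbytes' bytes at 'offset' of an
 * already-written (possibly still dirty) file.  This only queues the
 * writeback; it is a hint, not a replacement for the final fsync issued
 * at the end of the checkpoint.
 */
static int
flush_hint(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}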

Would you have these measures for the regression runs you observed?

> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Ok, that is something to check, for confirmation or refutation.

Given the above discussion, I think my suggestion may be wrong: as the tps 
is low because of random read/write accesses, not many buffers get 
modified (so the bgwriter/backends won't need to make space) and the 
checkpointer does not have much to write (good), *but* all of it is random 
(bad).

>> I do not see the point of rewriting the checkpointer for them, although
>> obviously I agree that something has to be done also for the other
>> processes.
>
> Rewriting the checkpointer and fixing the flush interface in a more
> generic way aren't the same thing at all.

Hmmm, probably I misunderstood something in the discussion. It started 
with an implementation strategy, but it drifted into discussing a 
performance regression. I agree that these are two different subjects.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > I read your patch and I know what I want to try to have a small and simple
> > fix. I must admit that I have not really understood in which condition the
> > checkpointer would decide to close a file, but that does not mean that the
> > potential issue should not be addressed.
>
> There's a trivial example: Consider three tablespaces and
> max_files_per_process = 2. The balancing can easily cause three files
> being flushed at the same time.
>

Won't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way?  In particular, I am referring to the code below:

mdsync()
{
..
    /*
     * It is possible that the relation has been dropped or
     * truncated since the fsync request was entered.
     * Therefore, allow ENOENT, but only if we didn't fail
     * already on this file.  This applies both for
     * _mdfd_getseg() and for FileSync, since fd.c might have
     * closed the file behind our back.
     *
     * XXX is there any point in allowing more than one retry?
     * Don't see one at the moment, but easy to change the
     * test here if so.
     */
    if (!FILE_POSSIBLY_DELETED(errno) ||
        failures > 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m",
                        path)));
    else
        ereport(DEBUG1,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\" but retrying: %m",
                        path)));
}




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > I read your patch and I know what I want to try to have a small and
> simple
> > > fix. I must admit that I have not really understood in which condition
> the
> > > checkpointer would decide to close a file, but that does not mean that
> the
> > > potential issue should not be addressed.
> >
> > There's a trivial example: Consider three tablespaces and
> > max_files_per_process = 2. The balancing can easily cause three files
> > being flushed at the same time.
> >
> 
> Won't the same thing can occur without patch in mdsync() and can't
> we handle it in same way?  In particular, I am referring to below code:

I don't see how that corresponds - the problem is that the currently
proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
open in its 'pending flushes' struct. But since that isn't associated
with fd.c opening/closing files, that fd isn't very meaningful.


> mdsync()

That seems to address different issues.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 6:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> > On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > > I read your patch and I know what I want to try to have a small and
> > simple
> > > > fix. I must admit that I have not really understood in which condition
> > the
> > > > checkpointer would decide to close a file, but that does not mean that
> > the
> > > > potential issue should not be addressed.
> > >
> > > There's a trivial example: Consider three tablespaces and
> > > max_files_per_process = 2. The balancing can easily cause three files
> > > being flushed at the same time.
> > >
> >
> > Won't the same thing can occur without patch in mdsync() and can't
> > we handle it in same way?  In particular, I am referring to below code:
>
> I don't see how that's corresponding - the problem is that current
> proposed infrastructure keeps a kernel level (or fd.c in my versio) fd
> open in it's 'pending flushes' struct. But since that isn't associated
> with fd.c opening/closing files that fd isn't very meaningful.
>

Okay, but I think that is why you are worried that it is possible
to issue sync_file_range() on a closed file; is that right, or am I missing
something?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> Okay, but I think that is the reason why you are worried that it is possible
> to issue sync_file_range() on a closed file, is that right or am I missing
> something?

That's one potential issue. You can also fsync a different file, try to
print an error message containing an unallocated filename (that's how I
noticed the issue in the first place)...

I don't think it's going to be acceptable to issue operations on more or
less random fds, even if that operation is hopefully harmless.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 6:26 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> > Okay, but I think that is the reason why you are worried that it is possible
> > to issue sync_file_range() on a closed file, is that right or am I missing
> > something?
>
> That's one potential issue. You can also fsync a different file, try to
> print an error message containing an unallocated filename (that's how I
> noticed the issue in the first place)...
>
> I don't think it's going to be acceptable to issue operations on more or
> less random fds, even if that operation is hopefully harmless.
>

Right, that won't be acceptable.  However, I think with your latest
proposal [1] we might not need to solve this problem, or do we still
need to address it?  I think that idea will also help mitigate the problem of
backend and bgwriter writes.  For that, can't we do it with the
help of the existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*?  As the flush requests are already
remembered in those structures, we can use them to apply your idea
and issue flush requests.



[1]
"It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes."
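
To make the quoted idea concrete, here is a rough sketch of such
file-independent tracking; all names and the cap are invented for the
sketch, and issue_flush() stands in for whatever sync_file_range() /
posix_fadvise() wrapper would be used:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;            /* stand-in for PostgreSQL's BlockNumber */

#define MAX_PENDING_FLUSHES 32           /* made-up cap on outstanding block flushes */

typedef struct PendingFlush
{
    int         fd;                      /* file containing the written block */
    BlockNumber blkno;                   /* block that was just written */
} PendingFlush;

static PendingFlush pending[MAX_PENDING_FLUSHES];
static int          npending = 0;

extern void issue_flush(int fd, BlockNumber start, BlockNumber end);

static int
pending_cmp(const void *a, const void *b)
{
    const PendingFlush *pa = a;
    const PendingFlush *pb = b;

    if (pa->fd != pb->fd)
        return pa->fd < pb->fd ? -1 : 1;
    if (pa->blkno != pb->blkno)
        return pa->blkno < pb->blkno ? -1 : 1;
    return 0;
}

/* remember one written block; once the cap is hit, sort, merge and flush */
static void
remember_flush(int fd, BlockNumber blkno)
{
    pending[npending].fd = fd;
    pending[npending].blkno = blkno;
    if (++npending < MAX_PENDING_FLUSHES)
        return;

    qsort(pending, npending, sizeof(PendingFlush), pending_cmp);
    for (int i = 0; i < npending;)
    {
        int         j = i;

        while (j + 1 < npending &&
               pending[j + 1].fd == pending[i].fd &&
               pending[j + 1].blkno == pending[j].blkno + 1)
            j++;
        issue_flush(pending[i].fd, pending[i].blkno, pending[j].blkno);
        i = j + 1;
    }
    npending = 0;
}

Whether a remembered fd can go stale before the hint is issued is exactly
the fd-lifetime question discussed above, so in practice the entries would
presumably need to reference fd.c-level files rather than raw kernel fds.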

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> Right that won't be acceptable, however I think with your latest
> proposal [1]

Sure, that'd address that problem.


> [...] think that idea will help to mitigate the problem of backend and
> bgwriter writes as well.  In that, can't we do it with the help of
> existing infrastructure of *pendingOpsTable* and
> *CheckpointerShmem->requests[]*, as already the flush requests are
> remembered in those structures, we can use those to apply your idea to
> issue flush requests.

Hm, that might be possible. But that might have some bigger implications
- we currently can issue thousands of flush requests a second, without
much chance of merging. I'm not sure it's a good idea to overlay that
into the lower frequency pendingOpsTable. Backends having to issue
fsyncs because the pending fsync queue is full is darn expensive. In
contrast to that a 'flush hint' request getting lost doesn't cost that
much.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 21:17:32 +0100, Andres Freund wrote:
> On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
> > Hmmm. What I understood is that the workloads that have some performance
> > regressions (regressions that I have *not* seen in the many tests I ran) are
> > not due to checkpointer IOs, but rather in settings where most of the writes
> > is done by backends or bgwriter.
> 
> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers). That quite drastically alters the performance
> characteristics here, because you suddenly have lots of synchronous read
> IO thrown into the mix.
> 
> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!

Especially on a system that's bottlenecked on IO that means we'll
frequently hold content locks for a noticeable amount of time, while
flushing blocks, without any need to.

Even if that's not the reason for the slowdowns I observed, I think this
fact gives further credence to the current "pending flushes" tracking
residing on the wrong level.
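
Very schematically, the two orderings look like this (pseudocode only: the
lock calls are placeholders, not the real bufmgr.c API, and
remember_flush() is a hypothetical helper that records the flush candidate
for later):

/* current shape: flush hint issued deep in the write path, content lock held */
lock_buffer_content_shared(buf);
XLogFlush(BufferGetLSN(buf));      /* WAL must reach disk before the data page */
smgrwrite(...);                    /* -> mdwrite -> FileWrite -> flush scheduling */
unlock_buffer_content(buf);

/* alternative shape: record the candidate, issue the hint without the lock */
lock_buffer_content_shared(buf);
XLogFlush(BufferGetLSN(buf));
smgrwrite(...);
unlock_buffer_content(buf);
remember_flush(fd, blkno);         /* possibly-blocking sync_file_range happens here */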


Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> Hm. New theory: The current flush interface does the flushing inside
> FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> problem with that is that at that point we (need to) hold a content lock
> on the buffer!

You are worrying that FlushBuffer is holding a lock on a buffer when the 
"sync_file_range" call is issued.

Although I agree that it is not that good, I would be surprised if that 
were the explanation for a performance regression, because sync_file_range 
with the chosen parameters is an async call: it "advises" the OS to write 
out the file range, but it does not wait for the write to be completed.

Moreover, for this issue to have a significant impact, it would require 
that another backend just happens to need this very buffer, but ISTM that 
the performance regression you are describing is on random-IO-bound 
performance, that is a few hundred tps in the best case over very large 
databases, hence a lot of buffers, so the probability of such a collision 
is very small and would not explain a significant regression.

> Especially on a system that's bottlenecked on IO that means we'll
> frequently hold content locks for a noticeable amount of time, while
> flushing blocks, without any need to.

I'm not that sure it is really noticeable, because sync_file_range does 
not wait for completion.

> Even if that's not the reason for the slowdowns I observed, I think this
> fact gives further credence to the current "pending flushes" tracking
> residing on the wrong level.

ISTM that I put the tracking at the level where the information is 
available without having to recompute it several times, as the flush needs 
to know the fd and offset. Doing it differently would mean more code and 
translating buffer to file/offset several times, I think.

Also, maybe you could answer a question I had about the performance 
regression you observed; I could not find the post where you gave the 
detailed information about it, so that I could try reproducing it: what 
are the exact settings and conditions (shared_buffers, pgbench scaling, 
host memory, ...), what is the observed regression (tps? other?), and what 
is the responsiveness of the database under the regression (e.g. % of 
seconds with 0 tps, or something like that)?

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> > Right that won't be acceptable, however I think with your latest
> > proposal [1]
>
> Sure, that'd address that problem.
>
>
> > [...] think that idea will help to mitigate the problem of backend and
> > bgwriter writes as well.  In that, can't we do it with the help of
> > existing infrastructure of *pendingOpsTable* and
> > *CheckpointerShmem->requests[]*, as already the flush requests are
> > remembered in those structures, we can use those to apply your idea to
> > issue flush requests.
>
> Hm, that might be possible. But that might have some bigger implications
> - we currently can issue thousands of flush requests a second, without
> much chance of merging. I'm not sure it's a good idea to overlay that
> into the lower frequency pendingOpsTable.
>

In that case, we can have a unified structure to remember flush requests,
rather than the backends and bgwriter noting that information in
CheckpointerShmem and the checkpointer in pendingOpsTable.  I understand
there are some benefits to having pendingOpsTable, but having a
common structure seems more beneficial, in particular
because it can be used for the purpose of flush hints.

Now, I am sure we can invent a new way of tracking the flush
requests for flush hints, but I think we might want to consider whether
we can have one unified way of tracking the flush requests which
can be used both for *flush* and *flush hints*.

> Backends having to issue
> fsyncs because the pending fsync queue is full is darn expensive. In
> contrast to that a 'flush hint' request getting lost doesn't cost that
> much.
>

In general, I think the cases where backends have to do the fsync themselves
should be rare, as the size of the fsync queue is NBuffers and we take care of
deduplicating fsync requests for the same buffer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Hm. New theory: The current flush interface does the flushing inside
> >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> >problem with that is that at that point we (need to) hold a content lock
> >on the buffer!
> 
> You are worrying that FlushBuffer is holding a lock on a buffer and the
> "sync_file_range" call occurs is issued at that moment.
> 
> Although I agree that it is not that good, I would be surprise if that was
> the explanation for a performance regression, because the sync_file_range
> with the chosen parameters is an async call, it "advises" the OS to send the
> file, but it does not wait for it to be completed.

I frequently see sync_file_range blocking - it waits until it can
submit the writes into the IO queues. On a system bottlenecked on IO
that's not always possible immediately.

> Also, maybe you could answer a question I had about the performance
> regression you observed, I could not find the post where you gave the
> detailed information about it, so that I could try reproducing it: what are
> the exact settings and conditions (shared_buffers, pgbench scaling, host
> memory, ...), what is the observed regression (tps? other?), and what is the
> responsiveness of the database under the regression (eg % of seconds with 0
> tps for instance, or something like that).

I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
        -D /srv/temp/pgdev-dev-800/ \
        -c maintenance_work_mem=2GB \
        -c fsync=on \
        -c synchronous_commit=off \
        -c shared_buffers=2GB \
        -c wal_level=hot_standby \
        -c max_wal_senders=10 \
        -c max_wal_size=100GB \
        -c checkpoint_timeout=30s
 

Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j16 -T 300 -P 1
I get

My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)

In neither case are there periods of 0 tps, but both have times of <
1000 tps with noticeably increased latency.


The end results are similar with a sane checkpoint timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer-level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...

As you can see there's roughly a 30% performance regression on the
slower SSD and a ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).


My working copy of checkpoint sorting & flushing currently results in:
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)

My version of the patch currently addresses various points, which need
to be separated and benchmarked separately:
* Different approach to the background writer, trying to make backends write
  less. While that proves to be beneficial in isolation, on its own it doesn't
  address the performance regression.
* Different flushing API, done outside the lock.

So this partially addresses the performance problems, but not yet
completely.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> > >Hm. New theory: The current flush interface does the flushing inside
> > >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> > >problem with that is that at that point we (need to) hold a content lock
> > >on the buffer!
> > 
> > You are worrying that FlushBuffer is holding a lock on a buffer and the
> > "sync_file_range" call occurs is issued at that moment.
> > 
> > Although I agree that it is not that good, I would be surprise if that was
> > the explanation for a performance regression, because the sync_file_range
> > with the chosen parameters is an async call, it "advises" the OS to send the
> > file, but it does not wait for it to be completed.
> 
> I frequently see sync_file_range blocking - it waits till it could
> submit the writes into the io queues. On a system bottlenecked on IO
> that's not always possible immediately.
> 
> > Also, maybe you could answer a question I had about the performance
> > regression you observed, I could not find the post where you gave the
> > detailed information about it, so that I could try reproducing it: what are
> > the exact settings and conditions (shared_buffers, pgbench scaling, host
> > memory, ...), what is the observed regression (tps? other?), and what is the
> > responsiveness of the database under the regression (eg % of seconds with 0
> > tps for instance, or something like that).
> 
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
> 
> postgres-ckpt14 \
>         -D /srv/temp/pgdev-dev-800/ \
>         -c maintenance_work_mem=2GB \
>         -c fsync=on \
>         -c synchronous_commit=off \
>         -c shared_buffers=2GB \
>         -c wal_level=hot_standby \
>         -c max_wal_senders=10 \
>         -c max_wal_size=100GB \
>         -c checkpoint_timeout=30s
> 
> Using a fresh cluster each time (copied from a "template" to save time)
> and using
> pgbench -M prepared -c 16 -j16 -T 300 -P 1
> I get
> 
> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> master:
> scaling factor: 800
> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 1155733
> latency average: 4.151 ms
> latency stddev: 8.712 ms
> tps = 3851.242965 (including connections establishing)
> tps = 3851.725856 (excluding connections establishing)
> 
> ckpt-14 (flushing by backends disabled):
> scaling factor: 800
> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 855156
> latency average: 5.612 ms
> latency stddev: 7.896 ms
> tps = 2849.876327 (including connections establishing)
> tps = 2849.912015 (excluding connections establishing)

Hm. I think I have an entirely different theory that might explain some
of this behaviour. I instrumented lwlocks to check for additional blocking
and found some. Admittedly not exactly where I thought it might
be. Check out what you can observe when adding/enabling an elog in
FlushBuffer() (and the progress printing from BufferSync()):

(sorry, a bit long, but it's necessary to understand)

[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 0 of relation base/13000/16387
to_scan: 131141, scanned: 6, %processed: 0.00, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D2FD7E0; write 1F/D296000; flush 1F/D296000; insert:
1F/D33B418
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 2 of relation base/13000/16387
to_scan: 131141, scanned: 7, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D3B2E30; write 1F/D33C000; flush 1F/D33C000; insert:
1F/D403198
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 3 of relation base/13000/16387
to_scan: 131141, scanned: 9, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D469990; write 1F/D402000; flush 1F/D402000; insert:
1F/D4FDD00
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 5 of relation base/13000/16387
to_scan: 131141, scanned: 11, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D5663E8; write 1F/D4FC000; flush 1F/D4FC000; insert:
1F/D5D1390
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 7 of relation base/13000/16387
to_scan: 131141, scanned: 14, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D673700; write 1F/D5D0000; flush 1F/D5D0000; insert:
1F/D687E58
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 10 of relation base/13000/16387
to_scan: 131141, scanned: 15, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D76BEC8; write 1F/D686000; flush 1F/D686000; insert:
1F/D7A83A0
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 11 of relation base/13000/16387
to_scan: 131141, scanned: 16, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/D7AE5C0; write 1F/D7A83E8; flush 1F/D7A83E8; insert:
1F/D8B9A88
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 12 of relation base/13000/16387
to_scan: 131141, scanned: 17, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DA08370; write 1F/D963A38; flush 1F/D963A38; insert:
1F/DA0A7D0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 13 of relation base/13000/16387
to_scan: 131141, scanned: 18, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DAC09A0; write 1F/DA92250; flush 1F/DA92250; insert:
1F/DB9AAC8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 14 of relation base/13000/16387
to_scan: 131141, scanned: 21, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DCEFF18; write 1F/DC2AD30; flush 1F/DC2AD30; insert:
1F/DCF25B0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 17 of relation base/13000/16387
to_scan: 131141, scanned: 23, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DD0E9E0; write 1F/DCF25F8; flush 1F/DCF25F8; insert:
1F/DDD6198
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 19 of relation base/13000/16387
to_scan: 131141, scanned: 24, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DED6A20; write 1F/DEC0358; flush 1F/DEC0358; insert:
1F/DFB64C8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 20 of relation base/13000/16387
to_scan: 131141, scanned: 25, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DFDEE90; write 1F/DFB6560; flush 1F/DFB6560; insert:
1F/E073468
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 21 of relation base/13000/16387
to_scan: 131141, scanned: 26, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E295638; write 1F/E10B9F8; flush 1F/E10B9F8; insert:
1F/E2B40E0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 22 of relation base/13000/16387
to_scan: 131141, scanned: 27, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E381688; write 1F/E354BC0; flush 1F/E354BC0; insert:
1F/E459598
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 23 of relation base/13000/16387
to_scan: 131141, scanned: 28, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E56EF70; write 1F/E4C0C98; flush 1F/E4C0C98; insert:
1F/E56F200
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 24 of relation base/13000/16387
to_scan: 131141, scanned: 29, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E67E538; write 1F/E5DC440; flush 1F/E5DC440; insert:
1F/E6F7FF8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 25 of relation base/13000/16387
to_scan: 131141, scanned: 31, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E873DD8; write 1F/E7D81F0; flush 1F/E7D81F0; insert:
1F/E8A1710
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 27 of relation base/13000/16387
to_scan: 131141, scanned: 33, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E9E3948; write 1F/E979610; flush 1F/E979610; insert:
1F/EA27AC0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 29 of relation base/13000/16387
to_scan: 131141, scanned: 35, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EABDDC8; write 1F/EA6DFE0; flush 1F/EA6DFE0; insert:
1F/EB10728
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 31 of relation base/13000/16387
to_scan: 131141, scanned: 37, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EC07328; write 1F/EBAABE0; flush 1F/EBAABE0; insert:
1F/EC9B8A8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 33 of relation base/13000/16387
to_scan: 131141, scanned: 40, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/ED18FF8; write 1F/EC9B8A8; flush 1F/EC9B8A8; insert:
1F/ED8C2F8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 36 of relation base/13000/16387
to_scan: 131141, scanned: 41, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EEED640; write 1F/EE0BAD8; flush 1F/EE0BAD8; insert:
1F/EF35EA8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 37 of relation base/13000/16387
to_scan: 131141, scanned: 42, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EFF20B8; write 1F/EFAAE20; flush 1F/EFAAE20; insert:
1F/F06FAC0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 38 of relation base/13000/16387
to_scan: 131141, scanned: 43, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F1430B0; write 1F/F0DEAB8; flush 1F/F0DEAB8; insert:
1F/F265020
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 39 of relation base/13000/16387
to_scan: 131141, scanned: 45, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F3556C0; write 1F/F268F68; flush 1F/F268F68; insert:
1F/F3682B8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 41 of relation base/13000/16387
to_scan: 131141, scanned: 46, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F5005F8; write 1F/F4376F8; flush 1F/F4376F8; insert:
1F/F523838
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 42 of relation base/13000/16387
to_scan: 131141, scanned: 47, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F6261C0; write 1F/F5A07A0; flush 1F/F5A07A0; insert:
1F/F691288
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 43 of relation base/13000/16387
to_scan: 131141, scanned: 48, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F7CBCD0; write 1F/F719020; flush 1F/F719020; insert:
1F/F80DBB0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 44 of relation base/13000/16387
to_scan: 131141, scanned: 49, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F9359C8; write 1F/F874CB8; flush 1F/F874CB8; insert:
1F/F95AD58
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 45 of relation base/13000/16387
to_scan: 131141, scanned: 50, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FA33F38; write 1F/FA03490; flush 1F/FA03490; insert:
1F/FAD4DF8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 46 of relation base/13000/16387
to_scan: 131141, scanned: 51, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FBDBCD8; write 1F/FB52238; flush 1F/FB52238; insert:
1F/FC54E68
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 47 of relation base/13000/16387
to_scan: 131141, scanned: 52, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FD74B60; write 1F/FD10360; flush 1F/FD10360; insert:
1F/FDB6A88
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 48 of relation base/13000/16387
to_scan: 131141, scanned: 53, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FE4FF60; write 1F/FDB6AD0; flush 1F/FDB6AD0; insert:
1F/FE90028
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 49 of relation base/13000/16387
to_scan: 131141, scanned: 54, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FFD6A78; write 1F/FF223F0; flush 1F/FF223F0; insert:
1F/10022F70
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 50 of relation base/13000/16387
to_scan: 131141, scanned: 55, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10144C98; write 1F/10023000; flush 1F/10023000; insert:
1F/10157730
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 51 of relation base/13000/16387
to_scan: 131141, scanned: 58, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/102AA468; write 1F/1020C600; flush 1F/1020C600; insert:
1F/102C73F0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 54 of relation base/13000/16387
to_scan: 131141, scanned: 60, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10313470; write 1F/102C7460; flush 1F/102C7460; insert:
1F/103D4F38
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 56 of relation base/13000/16387
to_scan: 131141, scanned: 61, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10510CE8; write 1F/104562F0; flush 1F/104562F0; insert:
1F/105171E8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 57 of relation base/13000/16387
to_scan: 131141, scanned: 62, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10596B18; write 1F/105191B0; flush 1F/105191B0; insert:
1F/106076F8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 58 of relation base/13000/16387
to_scan: 131141, scanned: 63, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1073FB28; write 1F/10693638; flush 1F/10693638; insert:
1F/10787D40
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 59 of relation base/13000/16387
to_scan: 131141, scanned: 64, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1088D058; write 1F/107F7068; flush 1F/107F7068; insert:
1F/10920EA0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 60 of relation base/13000/16387
to_scan: 131141, scanned: 67, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/109D9158; write 1F/109A8458; flush 1F/109A8458; insert:
1F/10A8A240
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 63 of relation base/13000/16387
to_scan: 131141, scanned: 68, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10BDAA38; write 1F/10B2AD48; flush 1F/10B2AD48; insert:
1F/10C16768
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 64 of relation base/13000/16387
to_scan: 131141, scanned: 69, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10D824D0; write 1F/10C859A0; flush 1F/10C859A0; insert:
1F/10DCC860
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 65 of relation base/13000/16387
to_scan: 131141, scanned: 70, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10E24CD8; write 1F/10DCC8A8; flush 1F/10DCC8A8; insert:
1F/10EA8588
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 66 of relation base/13000/16387
to_scan: 131141, scanned: 71, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10FD3E90; write 1F/10F57530; flush 1F/10F57530; insert:
1F/11043A58
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 67 of relation base/13000/16387
to_scan: 131141, scanned: 72, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/111CE4A0; write 1F/11043AC8; flush 1F/11043AC8; insert:
1F/111ED470
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 68 of relation base/13000/16387
to_scan: 131141, scanned: 73, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11338080; write 1F/112917C8; flush 1F/112917C8; insert:
1F/1135CF80
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 69 of relation base/13000/16387
to_scan: 131141, scanned: 76, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11369068; write 1F/1135CF80; flush 1F/1135CF80; insert:
1F/1140BE88
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 72 of relation base/13000/16387
to_scan: 131141, scanned: 77, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1146A420; write 1F/1136E000; flush 1F/1136E000; insert:
1F/11483530
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 73 of relation base/13000/16387
to_scan: 131141, scanned: 78, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1157B800; write 1F/11483530; flush 1F/11483530; insert:
1F/11583E20
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 74 of relation base/13000/16387
to_scan: 131141, scanned: 79, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/116368C0; write 1F/11583E20; flush 1F/11583E20; insert:
1F/116661A8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 75 of relation base/13000/16387
to_scan: 131141, scanned: 81, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/116FC598; write 1F/11668178; flush 1F/11668178; insert:
1F/11716758
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 0 of relation base/13000/16393
to_scan: 131141, scanned: 82, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/117DA658; write 1F/117631F0; flush 1F/117631F0; insert:
1F/118206F0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 1 of relation base/13000/16393
to_scan: 131141, scanned: 83, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11956320; write 1F/118E96B8; flush 1F/118E96B8; insert:
1F/1196F000
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 2 of relation base/13000/16393
to_scan: 131141, scanned: 84, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11A09B00; write 1F/1196F090; flush 1F/1196F090; insert:
1F/11A23D38
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 3 of relation base/13000/16393
to_scan: 131141, scanned: 85, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11B43C80; write 1F/11AB2148; flush 1F/11AB2148; insert:
1F/11B502D8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 4 of relation base/13000/16393
to_scan: 131141, scanned: 86, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11BE2610; write 1F/11B503B8; flush 1F/11B503B8; insert:
1F/11BF9068
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 5 of relation base/13000/16393
to_scan: 131141, scanned: 87, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11CB9FD8; write 1F/11BF9168; flush 1F/11BF9168; insert:
1F/11CBE1F8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 6 of relation base/13000/16393
to_scan: 131141, scanned: 88, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11D24E10; write 1F/11CBE268; flush 1F/11CBE268; insert:
1F/11D8BC18
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 7 of relation base/13000/16393
to_scan: 131141, scanned: 89, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11E9B070; write 1F/11DEC840; flush 1F/11DEC840; insert:
1F/11EB7EC0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 8 of relation base/13000/16393
to_scan: 131141, scanned: 90, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11F5C3F0; write 1F/11F3FBD0; flush 1F/11F3FBD0; insert:
1F/11FE1A08
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 9 of relation base/13000/16393
to_scan: 131141, scanned: 91, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/121EDC00; write 1F/1208E838; flush 1F/1208E838; insert:
1F/121F1EF8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 10 of relation base/13000/16393
to_scan: 131141, scanned: 92, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/122E0A70; write 1F/121F1F90; flush 1F/121F1F90; insert:
1F/122E9198
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 11 of relation base/13000/16393
to_scan: 131141, scanned: 93, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1243B698; write 1F/123A7EC8; flush 1F/123A7EC8; insert:
1F/1245E620
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 12 of relation base/13000/16393
to_scan: 131141, scanned: 94, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1258E7B0; write 1F/124BF6B8; flush 1F/124BF6B8; insert:
1F/1259F198
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 13 of relation base/13000/16393
to_scan: 131141, scanned: 95, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/126C8E38; write 1F/12662BA0; flush 1F/12662BA0; insert:
1F/126FE690
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 14 of relation base/13000/16393
to_scan: 131141, scanned: 96, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/127DE810; write 1F/126FE6D8; flush 1F/126FE6D8; insert:
1F/128081B0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 15 of relation base/13000/16393
to_scan: 131141, scanned: 97, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/12980108; write 1F/128A6000; flush 1F/128A6000; insert:
1F/129A8E00
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 16 of relation base/13000/16393
to_scan: 131141, scanned: 98, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12A55978; write 1F/129ACDB8; flush 1F/129ACDB8; insert:
1F/12A6A408
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 17 of relation base/13000/16393
to_scan: 131141, scanned: 99, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12BC1148; write 1F/12B12F40; flush 1F/12B12F40; insert:
1F/12BC15F8
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 18 of relation base/13000/16393
to_scan: 131141, scanned: 100, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12D36E20; write 1F/12C70120; flush 1F/12C70120; insert:
1F/12D4DC08
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 19 of relation base/13000/16393
to_scan: 131141, scanned: 9892, %processed: 7.54, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/13128AF8; write 1F/12DEE670; flush 1F/12DEE670; insert:
1F/1313B7D0
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 101960 of relation base/13000/16396
to_scan: 131141, scanned: 18221, %processed: 13.89, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/13276328; write 1F/1313A000; flush 1F/1313A000; insert:
1F/134E93A8
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 188242 of relation base/13000/16396
to_scan: 131141, scanned: 25857, %processed: 19.72, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13497370; write 1F/1346E000; flush 1F/1346E000; insert:
1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 267003 of relation base/13000/16396
to_scan: 131141, scanned: 26859, %processed: 20.48, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/136B5BB0; write 1F/135D6000; flush 1F/135D6000; insert:
1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 277621 of relation base/13000/16396
to_scan: 131141, scanned: 27582, %processed: 21.03, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/138C6C38; write 1F/1375E900; flush 1F/1375E900; insert:
1F/138D5518
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 285176 of relation base/13000/16396
to_scan: 131141, scanned: 28943, %processed: 22.07, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13A5B768; write 1F/138C8000; flush 1F/138C8000; insert:
1F/13AB61D0
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 300007 of relation base/13000/16396
to_scan: 131141, scanned: 36181, %processed: 27.59, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13C320C8; write 1F/13A8A000; flush 1F/13A8A000; insert:
1F/13DAAB40
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 375983 of relation base/13000/16396
to_scan: 131141, scanned: 40044, %processed: 30.54, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG:  xlog flush request 1F/13E196C8; write 1F/13CBA000; flush 1F/13CBA000; insert:
1F/13F9E6D8
[2016-01-11 20:15:07 CET][14957] CONTEXT:  writing block 416439 of relation base/13000/16396
to_scan: 131141, scanned: 48250, %processed: 36.79, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG:  xlog flush request 1F/143F6160; write 1F/13EE8000; flush 1F/13EE8000; insert:
1F/1461BB08

You can see that initially every buffer triggers a WAL flush. That
causes a slowdown because a) we're doing significantly more WAL flushes
in that time period, slowing down both concurrent IO and
concurrent WAL insertions, and b) due to the many slow flushes we get behind
on the checkpoint schedule, triggering a rapid-fire period of writes
afterwards.
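
The per-buffer WAL flushes come from the WAL-before-data rule; schematically
(not the literal bufmgr.c code):

recptr = BufferGetLSN(buf);   /* LSN of the newest WAL record touching this page */
XLogFlush(recptr);            /* forces a WAL write+flush if that LSN isn't durable yet */
smgrwrite(...);               /* only then may the data page be written out */

With an update-heavy workload and sorted writeout, the hot early pages each
carry a very recent LSN, so nearly every page forces another small WAL
flush, which matches the log above.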

My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used if
there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a row".

Indeed, the behaviour is not visible in a significant manner when using
pgbench -N, where there are far fewer updated pages.

I'm not entirely sure how we can deal with that.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:>
>
> My theory is that this happens due to the sorting: pgbench is an update
> heavy workload, the first few pages are always going to be used if
> there's free space as freespacemap.c essentially prefers those. Due to
> the sorting all a relation's early pages are going to be in "in a row".
>

Not sure what is the best way to tackle this problem, but I think one way
could be to perform sorting at the flush-request level rather than before
writing to OS buffers.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:>
> >
> > My theory is that this happens due to the sorting: pgbench is an update
> > heavy workload, the first few pages are always going to be used if
> > there's free space as freespacemap.c essentially prefers those. Due to
> > the sorting all a relation's early pages are going to be in "in a row".
> >
> 
> Not sure, what is best way to tackle this problem, but I think one way could
> be to perform sorting at flush requests level rather than before writing
> to OS buffers.

I'm not following. If you just sort a couple hundred more or less random
buffers - which is what you get if you look in buf_id order through
shared_buffers - the likelihood of actually finding neighbouring writes
is pretty low.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

Thanks for the details. Many comments and some questions below.

>> Also, maybe you could answer a question I had about the performance
>> regression you observed, I could not find the post where you gave the
>> detailed information about it, so that I could try reproducing it: what are
>> the exact settings and conditions (shared_buffers, pgbench scaling, host
>> memory, ...), what is the observed regression (tps? other?), and what is the
>> responsiveness of the database under the regression (eg % of seconds with 0
>> tps for instance, or something like that).
>
> I measured it in a different number of cases, both on SSDs
> and spinning rust.

Argh! This is a key point: the sort/flush is designed to help HDDs, and 
would have limited effect on SSDs, and it seems that you are showing that 
the effect is in fact negative on SSDs, too bad:-(

The bad news is that I do not have a host with a SSD available for 
reproducing such results.

On SSDs, the Linux IO scheduler works quite well, so this is a place where 
I would consider simply deactivating flushing and/or sorting.

ISTM that I would rather update the documentation to "do not activate on 
SSD" than try to find a miraculous solution which may or may not exist. 
Basically I would use your results to give better advice in the 
documentation, not as a motivation to rewrite the patch from scratch.

> postgres-ckpt14 \
>        -D /srv/temp/pgdev-dev-800/ \
>        -c maintenance_work_mem=2GB \
>        -c fsync=on \
>        -c synchronous_commit=off \

I'm not sure I like this one. I guess the intention is to focus on 
checkpointer writes and reduce the impact of WAL writes. Why not.

>        -c shared_buffers=2GB \
>        -c wal_level=hot_standby \
>        -c max_wal_senders=10 \
>        -c max_wal_size=100GB \
>        -c checkpoint_timeout=30s

That is a very short one, but the point is to exercise the checkpoint, so 
why not.

> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> master:
> scaling factor: 800

The DB is probably about 12GB, so it fits in memory in the end, meaning 
that there should be only write activity after some time? So this is not 
really the case where it does not fit in memory, but it is large enough to 
get mostly random IOs both in read & write, so why not.

> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 1155733

Assuming one buffer accessed per transaction on average, and considering a 
uniform random distribution, this means about 50% of pages actually loaded 
in memory at the end of the run, i.e. 1 - e^(-T/N) with T = 1155733 
transactions and N = 800 * 2048 pages (2048 pages per scale unit).
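
Spelled out, this is the usual estimate for the fraction of N pages touched
at least once after T uniform random accesses:

  T/N = 1155733 / (800 * 2048) ≈ 0.71
  1 - e^(-0.71) ≈ 0.51

i.e. roughly half of the ~1.6 million pages.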

> latency average: 4.151 ms
> latency stddev: 8.712 ms
> tps = 3851.242965 (including connections establishing)
> tps = 3851.725856 (excluding connections establishing)

> ckpt-14 (flushing by backends disabled):

Is this comment referring to "synchronous_commit = off"?
I guess this is the same on master above, even if not written?

> [...] In neither case are there periods of 0 tps, but both have times of
> < 1000 tps with noticeably increased latency.

Ok, but we are talking SSDs, things are not too bad, even if there are ups 
and downs.

> The endresults are similar with a sane checkpoint timeout - the tests
> just take much longer to give meaningful results. Constantly running
> long tests on prosumer level SSDs isn't nice - I've now killed 5 SSDs
> with postgres testing...

Indeed. It wears out and costs, too bad:-(

> As you can see there's roughly a 30% performance regression on the
> slower SSD and a ~9% on the faster one. HDD results are similar (but I
> can't repeat on the laptop right now since the 2nd hdd is now an SSD).

Ok, that is what I would have expected: the larger the database, the 
smaller the impact of sorting & flushing on SSDs. Now I would have hoped 
that flushing would help get a more constant load even in this case, at 
least this is what I measured in my tests. The closest test I ran to your 
setting is scale=660, where sort/flush got 400 tps vs 100 tps without, 
with 30-minute checkpoints, but HDDs do not compare to SSDs...

My overall comment about this SSD regression is that the patch is really 
designed to make a difference for HDDs, so the advice would be not to 
activate it on SSDs if there is a regression in such a case.

Now this is a little disappointing as on paper sorted writes should also 
be slightly better on SSDs, but if the bench says the contrary, I have to 
believe the bench:-)

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jan 12, 2016 at 5:52 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> > On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:>
> > >
> > > My theory is that this happens due to the sorting: pgbench is an update
> > > heavy workload, the first few pages are always going to be used if
> > > there's free space as freespacemap.c essentially prefers those. Due to
> > > the sorting all a relation's early pages are going to be in "in a row".
> > >
> >
> > Not sure, what is best way to tackle this problem, but I think one way could
> > be to perform sorting at flush requests level rather than before writing
> > to OS buffers.
>
> I'm not following. If you just sort a couple hundred more or less random
> buffers - which is what you get if you look in buf_id order through
> shared_buffers - the likelihood of actually finding neighbouring writes
> is pretty low.
>

Why can't we do it at larger intervals (relative to the total amount of writes)?
To explain what I have in mind: let us assume that the checkpoint interval
is longer (10 mins) and in the meantime all the writes are being done
by the bgwriter, which registers them in shared memory so that a later
checkpoint can perform the corresponding fsync's.  Now, when the request
queue reaches a threshold size (let us say 1/3rd full), we can perform
sorting and merging and issue flush hints.  The checkpointer task can
also follow a somewhat similar technique, which means that once it
has written 1/3rd or so of the buffers (which we need to track), it can
issue flush hints after sort+merge.  Now, I think we can also
do it in the checkpointer alone rather than in both bgwriter and checkpointer.
Basically, I think this can lead to less merging of neighbouring
writes, but might not hurt if the sync_file_range() API is cheap.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-12 13:54:21 +0100, Fabien COELHO wrote:
> >I measured it in a different number of cases, both on SSDs
> >and spinning rust.
> 
> Argh! This is a key point: the sort/flush is designed to help HDDs, and
> would have limited effect on SSDs, and it seems that you are showing that
> the effect is in fact negative on SSDs, too bad:-(

As you quoted, I could reproduce the slowdown both with SSDs *and* with
rotating disks.

> On SSDs, the Linux IO scheduler works quite well, so this is a place where I
> would consider simply deactivating flushing and/or sorting.

Not my experience. In different scenarios, primarily with a large
shared_buffers fitting the whole hot working set, the patch
significantly improves performance.

> >postgres-ckpt14 \
> >       -D /srv/temp/pgdev-dev-800/ \
> >       -c maintenance_work_mem=2GB \
> >       -c fsync=on \
> >       -c synchronous_commit=off \
> 
> I'm not sure I like this one. I guess the intention is to focus on
> checkpointer writes and reduce the impact of WAL writes. Why not.

Not sure what you mean? s_c = off is *very* frequent in the field.

> >My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> >master:
> >scaling factor: 800
> 
> The DB is probably about 12GB, so it fits in memory in the end, meaning that
> there should be only write activity after some time? So this is not really
> the case where it does not fit in memory, but it is large enough to get
> mostly random IOs both in read & write, so why not.

Doesn't really fit into ram - shared buffers uses some space (which will
be double buffered) and the xlog will use some more.

> >ckpt-14 (flushing by backends disabled):
> 
> Is this comment refering to "synchronous_commit = off"?
> I guess this is the same on master above, even if not written?

No, what I mean by that is that I didn't activate flushing writes in
backends - something I found hugely effective in reducing jitter in a
number of workloads, but it doesn't help throughput.

> >As you can see there's roughly a 30% performance regression on the
> >slower SSD and a ~9% on the faster one. HDD results are similar (but I
> >can't repeat on the laptop right now since the 2nd hdd is now an SSD).
> 
> Ok, that is what I would have expected: the larger the database, the smaller
> the impact of sorting & flushing on SSDs.

Again: "HDD results are similar". I primarily tested on a 4 disk raid10
of 4 disks, and a raid0 of 20 disks.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
> Why can't we do it at larger intervals (relative to the total amount of writes)?
> To explain what I have in mind, let us assume that the checkpoint interval
> is longer (10 mins) and in the meantime all the writes are being done
> by bgwriter

But that's not the scenario with the regression here, so I'm not sure
why you're bringing it up?

And if we're flushing a significant portion of the writes, how does that
avoid the performance problem pointed out two messages upthread, where
sorting leads to flushing highly contended buffers together, leading to
excessive wal flushing?

But more importantly, unless you also want to delay the writes
themselves, leaving that many dirty buffers in the kernel page cache
will bring back exactly the type of stalls (where the kernel flushes all
the pending dirty data in a short amount of time) we're trying to avoid
with the forced flushing. So doing flushes in large batches is
something we really fundamentally do *not* want!

> which it registers in shared memory so that a later checkpoint
> can perform the corresponding fsync's; now when the request queue
> reaches a threshold size (let us say 1/3rd full), we can perform
> sorting and merging and issue flush hints.

Which means that a significant portion of the writes won't be able to be
collapsed, since only a random 1/3 of the buffers is sorted together.


> Basically, I think this can lead to less merging of neighbouring
> writes, but it might not hurt if the sync_file_range() API is cheap.

The cost of writing out data does correspond heavily with the number of
random writes - which is what you get if you reduce the number of
neighbouring writes.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Tue, Jan 12, 2016 at 7:24 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
> > Why can't we do it at larger intervals (relative to the total amount of writes)?
> > To explain what I have in mind, let us assume that the checkpoint interval
> > is longer (10 mins) and in the meantime all the writes are being done
> > by bgwriter
>
> But that's not the scenario with the regression here, so I'm not sure
> why you're bringing it up?
>
> And if we're flushing a significant portion of the writes, how does that
> avoid the performance problem pointed out two messages upthread, where
> sorting leads to flushing highly contended buffers together, leading to
> excessive wal flushing?
>

I think it will avoid that problem, because what I am suggesting is not to sort
the buffers before writing, but rather to sort the flush requests.  If I remember
correctly, Fabien's initial patch doesn't have sorting at the buffer
level, but he is still able to see the benefits in many cases.
 
>
> But more importantly, unless you also want to delay the writes
> themselves, leaving that many dirty buffers in the kernel page cache
> will bring back exactly the type of stalls (where the kernel flushes all
> the pending dirty data in a short amount of time) we're trying to avoid
> with the forced flushing. So doing flushes in large batches is
> something we really fundamentally do *not* want!
>

Could it be because of the random I/O?

> > which it registers in shared memory so that a later checkpoint
> > can perform the corresponding fsync's; now when the request queue
> > reaches a threshold size (let us say 1/3rd full), we can perform
> > sorting and merging and issue flush hints.
>
> Which means that a significant portion of the writes won't be able to be
> collapsed, since only a random 1/3 of the buffers is sorted together.
>
>
> > Basically, I think this can lead to less merging of neighbouring
> > writes, but it might not hurt if the sync_file_range() API is cheap.
>
> The cost of writing out data does correspond heavily with the number of
> random writes - which is what you get if you reduce the number of
> neighbouring writes.
>

Yeah, that's right, but I am not sure how much difference it would
make if we sort everything in one shot versus doing it in
batches.  In any case, I am just trying to think out loud to see if we
can find some solution to the regression you have seen above
without disabling sorting altogether for certain cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

>> Argh! This is a key point: the sort/flush is designed to help HDDs, and
>> would have limited effect on SSDs, and it seems that you are showing that
>> the effect is in fact negative on SSDs, too bad:-(
>
> As you quoted, I could reproduce the slowdown both with SSDs *and* with
> rotating disks.

Ok, once again I misunderstood. So you have a regression on HDD with the 
settings you pointed out, I can try that.

>> On SSDs, the Linux IO scheduler works quite well, so this is a place where I
>> would consider simply deactivating flushing and/or sorting.
>
> Not my experience. In different scenarios, primarily with a large
> shared_buffers fitting the whole hot working set, the patch
> significantly improves performance.

Good! That would be what I expected, but I have no way to test that.

>>> postgres-ckpt14 \
>>>       -D /srv/temp/pgdev-dev-800/ \
>>>       -c maintenance_work_mem=2GB \
>>>       -c fsync=on \
>>>       -c synchronous_commit=off \
>>
>> I'm not sure I like this one. I guess the intention is to focus on
>> checkpointer writes and reduce the impact of WAL writes. Why not.
>
> Not sure what you mean? s_c = off is *very* frequent in the field.

Too bad, because for me it really deactivates the D of ACID...

I think that this setting does not issue the "sync" calls on the WAL 
file, which means that the impact of WAL writing is somewhat reduced and 
the random writes (more or less one per transaction) are switched to 
sequential writes by the IO scheduler.

>>> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
>>> master:
>>> scaling factor: 800
>>
>> The DB is probably about 12GB, so it fits in memory in the end, meaning that
>> there should be only write activity after some time? So this is not really
>> the case where it does not fit in memory, but it is large enough to get
>> mostly random IOs both in read & write, so why not.
>
> Doesn't really fit into ram - shared buffers uses some space (which will
> be double buffered) and the xlog will use some more.

Hmmm. My understanding is that you are really using about 6GB of shared 
buffer data in a run, plus some write only stuff...

The xlog is flushed/synced constantly and never read again; I would be surprised 
if it had a significant memory impact.

>>> ckpt-14 (flushing by backends disabled):
>>
>> Is this comment refering to "synchronous_commit = off"?
>> I guess this is the same on master above, even if not written?
>
> No, what I mean by that is that I didn't activate flushing writes in
> backends -

I'm not sure that I understand. What is the actual corresponding directive 
in the configuration file?

>>> As you can see there's roughly a 30% performance regression on the
>>> slower SSD and a ~9% on the faster one. HDD results are similar (but I
>>> can't repeat on the laptop right now since the 2nd hdd is now an SSD).
>>
>> Ok, that is what I would have expected: the larger the database, the smaller
>> the impact of sorting & flushing on SSDs.
>
> Again: "HDD results are similar". I primarily tested on a 4 disk raid10
> of 4 disks, and a raid0 of 20 disks.

I guess similar but with a much lower tps. Anyway I can try that.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
Hi Fabien,

On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
> 
> postgres-ckpt14 \
>         -D /srv/temp/pgdev-dev-800/ \
>         -c maintenance_work_mem=2GB \
>         -c fsync=on \
>         -c synchronous_commit=off \
>         -c shared_buffers=2GB \
>         -c wal_level=hot_standby \
>         -c max_wal_senders=10 \
>         -c max_wal_size=100GB \
>         -c checkpoint_timeout=30s

What kernel, filesystem and filesystem options did you measure with?

I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally, data=ordered/writeback makes
a measurable difference too.

Reading kernel sources trying to understand some more of the performance
impact.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> Hi Fabien,

Hello Tomas.

> On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
>> I measured it in a different number of cases, both on SSDs and spinning
>> rust. I just reproduced it with:
>>
>> postgres-ckpt14 \
>>         -D /srv/temp/pgdev-dev-800/ \
>>         -c maintenance_work_mem=2GB \
>>         -c fsync=on \
>>         -c synchronous_commit=off \
>>         -c shared_buffers=2GB \
>>         -c wal_level=hot_standby \
>>         -c max_wal_senders=10 \
>>         -c max_wal_size=100GB \
>>         -c checkpoint_timeout=30s
>
> What kernel, filesystem and filesystem options did you measure with?

Andres did these measurements, not me, so I do not know.

> I was/am using ext4, and it turns out that, when enabling flushing, the
> results are hugely dependent on barriers=on/off, with the latter making
> flushing rather advantageous. Additionally, data=ordered/writeback makes
> a measurable difference too.

These are very interesting tests; I'm looking forward to having a look at 
the results.

The fact that these options change performance is expected. Personally, the 
test I submitted on the thread used ext4 with default mount options plus 
"relatime".

If I had a choice, I would tend to take the safest options, because the 
point of a database is to keep data safe. That's why I'm not fond of the 
"synchronous_commit=off" chosen above.

> Reading kernel sources trying to understand some more of the performance
> impact.

Wow!

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> Hello Tomas.

Ooops, sorry Andres, I mixed up the thread in my head so was not clear who 
was asking the questions to whom.

>> I was/am using ext4, and it turns out that, when enabling flushing, the
>> results are hugely dependent on barriers=on/off, with the latter making
>> flushing rather advantageous. Additionally, data=ordered/writeback makes
>> a measurable difference too.
>
> These are very interesting tests; I'm looking forward to having a look at the 
> results.
>
> The fact that these options change performance is expected. Personally, the 
> test I submitted on the thread used ext4 with default mount options plus 
> "relatime".

I confirm that: nothing special but "relatime" on ext4 on my test host.

> If I had a choice, I would tend to take the safest options, because the point 
> of a database is to keep data safe. That's why I'm not fond of the 
> "synchronous_commit=off" chosen above.

"found" -> "fond". I confirm this opinion. If you have BBU on you 
disk/raid system probably playing with some of these options is safe, 
though. Not the case with my basic hardware.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
Hello Andres,

> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
>
> postgres-ckpt14 \
>        -D /srv/temp/pgdev-dev-800/ \
>        -c maintenance_work_mem=2GB \
>        -c fsync=on \
>        -c synchronous_commit=off \
>        -c shared_buffers=2GB \
>        -c wal_level=hot_standby \
>        -c max_wal_senders=10 \
>        -c max_wal_size=100GB \
>        -c checkpoint_timeout=30s
>
> Using a fresh cluster each time (copied from a "template" to save time)
> and using
> pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I'm running some tests similar to those above...

Do you do some warmup when testing? I guess the answer is "no".

I understand that you have 8 cores/16 threads on your host?

Loading scale 800 data for 300-second tests takes much more than 300 
seconds (init takes ~360 seconds, vacuum & index are slow). With 30-second 
checkpoint cycles and without any warmup, I feel that these tests are really 
on the very short (too short) side, so I'm not sure how much I can trust 
such results as significant. The data I reported were with more 
real-life-like parameters.

Anyway, I'll have some results to show with a setting more or less similar 
to yours.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >I measured it in a different number of cases, both on SSDs and spinning
> >rust. I just reproduced it with:
> >
> >postgres-ckpt14 \
> >       -D /srv/temp/pgdev-dev-800/ \
> >       -c maintenance_work_mem=2GB \
> >       -c fsync=on \
> >       -c synchronous_commit=off \
> >       -c shared_buffers=2GB \
> >       -c wal_level=hot_standby \
> >       -c max_wal_senders=10 \
> >       -c max_wal_size=100GB \
> >       -c checkpoint_timeout=30s
> >
> >Using a fresh cluster each time (copied from a "template" to save time)
> >and using
> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1

So, I've analyzed the problem further, and I think I found something
rather interesting. I'd profiled the kernel, looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed:                             2995 tps
After I kill the wal writer with -STOP:         10887 tps

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3)            = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3)            = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
17:29:02.008822 fdatasync(3)            = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3)            = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
17:29:02.019355 fdatasync(3)            = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small amounts of pages. Roughly 500 times a
second. As soon as the wal writer is stopped, it's much bigger chunks,
on the order of 50-130 pages. And, not that surprisingly, that improves
performance, because there are far fewer cache flushes submitted to the
hardware.


> I'm running some tests similar to those above...

> Do you do some warmup when testing? I guess the answer is "no".

Doesn't make a difference here, I tried both. As long as before/after
benchmarks start from the same state...


> I understand that you have 8 cores/16 threads on your host?

On one of them, 4 cores/8 threads on the laptop.


> Loading scale 800 data for 300 seconds tests takes much more than 300
> seconds (init takes ~360 seconds, vacuum & index are slow). With 30 seconds
> checkpoint cycles and without any warmup, I feel that these tests are really
> on the very short (too short) side, so I'm not sure how much I can trust
> such results as significant. The data I reported were with more real life
> like parameters.

I see exactly the same with 300s or 1000s checkpoint cycles, it just
takes a lot longer to repeat. They're also similar (although obviously
both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.

Andres



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
<Oops, wrong "From" again, resent>

>>> I measured it in a different number of cases, both on SSDs and spinning
>>> rust. I just reproduced it with:
>>>
>>> postgres-ckpt14 \
>>>       -D /srv/temp/pgdev-dev-800/ \
>>>       -c maintenance_work_mem=2GB \
>>>       -c fsync=on \
>>>       -c synchronous_commit=off \
>>>       -c shared_buffers=2GB \
>>>       -c wal_level=hot_standby \
>>>       -c max_wal_senders=10 \
>>>       -c max_wal_size=100GB \
>>>       -c checkpoint_timeout=30s
>>>
>>> Using a fresh cluster each time (copied from a "template" to save time)
>>> and using
>>> pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I must say that I have not succeeded in reproducing any significant 
regression up to now on an HDD. I'm running some more tests because 
I had left out some options above that I thought were non-essential.

I have deep problems with the 30-second checkpoint tests: basically the 
checkpoints take much more than 30 seconds to complete, the system is not 
stable, and the 300-second runs last more than 900 seconds because the 
clients are stuck for a long time. The overall behavior is appalling, as most of 
the time is spent in IO panic at 0 tps.

Also, the performance level is around 160 tps on HDDs, which makes sense to 
me for a 7200 rpm HDD capable of about x00 random writes per second. It 
seems to me that you reported much better performance on HDD, but I cannot 
really see how this would be possible if the data are indeed written to disk. 
Any idea?

Also, what is the exact postgres version & patch used in your 
tests on HDDs?

> both before/after patch are higher) if I disable full_page_writes,
> thereby eliminating a lot of other IO.

Maybe this is an explanation....

-- 
Fabien.




Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-19 10:27:31 +0100, Fabien COELHO wrote:
> Also, the performance level is around 160 tps on HDDs, which makes sense to
> me for a 7200 rpm HDD capable of about x00 random writes per second. It
> seems to me that you reported much better performance on HDD, but I cannot
> really see how this would be possible if the data are indeed written to disk. Any
> idea?

synchronous_commit = off does make a significant difference.



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
> synchronous_commit = off does make a significant difference.

Sure, but I had thought about that and kept this one...

I think I found one possible culprit: I automatically wrote 300 seconds 
for checkpoint_timeout, instead of 30 seconds as in your settings. I'll have 
to rerun the tests with this (unreasonable) figure to check whether I 
really get a regression.

Other tests I ran with "reasonable" settings on a large (scale=800) db 
did not show any significant performance regression, up to now.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-19 13:34:14 +0100, Fabien COELHO wrote:
> 
> >synchronous_commit = off does make a significant difference.
> 
> Sure, but I had thought about that and kept this one...

But why are you then saying this is fundamentally limited to 160
xacts/sec?

> I think I found one possible culprit: I automatically wrote 300 seconds for
> checkpoint_timeout, instead of 30 seconds as in your settings. I'll have to
> rerun the tests with this (unreasonable) figure to check whether I really
> get a regression.

I've not seen meaningful changes in the size of the regression between 30/300s.

> Other tests I ran with "reasonable" settings on a large (scale=800) db did
> not show any significant performance regression, up to now.

Try running it so that the data set nearly, but not entirely, fits into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it is for yours.

That seems to be the point where the effect is the worst.



Re: checkpointer continuous flushing

От
Robert Haas
Дата:
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a different number of cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >       -D /srv/temp/pgdev-dev-800/ \
>> >       -c maintenance_work_mem=2GB \
>> >       -c fsync=on \
>> >       -c synchronous_commit=off \
>> >       -c shared_buffers=2GB \
>> >       -c wal_level=hot_standby \
>> >       -c max_wal_senders=10 \
>> >       -c max_wal_size=100GB \
>> >       -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel, looking at where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed:                             2995 tps
> After I kill the wal writer with -STOP:         10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3)            = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3)            = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
> 17:29:02.008822 fdatasync(3)            = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3)            = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
> 17:29:02.019355 fdatasync(3)            = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small amounts of pages. Roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.

This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
    Wakeup WALWriter as needed for asynchronous commit performance.
    Previously we waited for wal_writer_delay before flushing WAL. Now
    we also wake WALWriter as soon as a WAL buffer page has filled.
    Significant effect observed on performance of asynchronous commits
    by Robert Haas, attributed to the ability to set hint bits on tuples
    earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings).    But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next.  That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent.  I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

От
Fabien COELHO
Дата:
>>> synchronous_commit = off does make a significant difference.
>>
>> Sure, but I had thought about that and kept this one...
>
> But why are you then saying this is fundamentally limited to 160
> xacts/sec?

I'm just saying that the tested load generates mostly random IOs (probably 
on average over 1 page per transaction); random IOs are very slow on an 
HDD, so I do not expect great tps.

>> I think I found one possible culprit: I automatically wrote 300 seconds for
>> checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
>> rerun the tests with this (unreasonnable) figure to check whether I really
>> get a regression.
>
> I've not seen meaningful changes in the size of the regression between 30/300s.

At 300 seconds (5 minutes) the checkpoints of the accumulated writes take 15-25 
minutes, during which the database is mostly offline, and there is no 
clear difference with/without sort+flush.

>> Other tests I ran with "reasonable" settings on a large (scale=800) db did
>> not show any significant performance regression, up to now.
>
> Try running it so that the data set nearly, but not entirely, fits into
> the OS page cache, while definitely not fitting into shared_buffers. The
> scale=800 just worked for that on my hardware, no idea how it is for yours.
> That seems to be the point where the effect is the worst.

I have 16GB memory on the tested host, same as your hardware I think, so I 
use scale 800 => 12GB at the beginning of the run. Not sure it fits the 
bill as I think it fits in memory, so the load is mostly write and no/very 
few reads. I'll also try with scale 1000.

-- 
Fabien.



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> This seems like a problem with the WAL writer quite independent of
> anything else.  It seems likely to be inadvertent fallout from this
> patch:
> 
> Author: Simon Riggs <simon@2ndQuadrant.com>
> Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> 
>     Wakeup WALWriter as needed for asynchronous commit performance.
>     Previously we waited for wal_writer_delay before flushing WAL. Now
>     we also wake WALWriter as soon as a WAL buffer page has filled.
>     Significant effect observed on performance of asynchronous commits
>     by Robert Haas, attributed to the ability to set hint bits on tuples
>     earlier and so reducing contention caused by clog lookups.

In addition to that the "powersaving" effort also plays a role - without
the latch we'd not wake up at any meaningful rate at all atm.


> If I understand correctly, prior to that commit, WAL writer woke up 5
> times per second and flushed just that often (unless you changed the
> default settings).    But as the commit message explained, that turned
> out to suck - you could make performance go up very significantly by
> radically decreasing wal_writer_delay.  This commit basically lets it
> flush at maximum velocity - as fast as we finish one flush, we can
> start the next.  That must have seemed like a win at the time from the
> way the commit message was written, but you seem to now be seeing the
> opposite effect, where performance is suffering because flushes are
> too frequent rather than too infrequent.  I wonder if there's an ideal
> flush rate and what it is, and how much it depends on what hardware
> you have got.

I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in too granular a fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.

Now the question what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.

By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.
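
A minimal sketch of such a rule, with invented names and an assumed 10%
threshold (not the actual patch), might look like this:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    typedef struct WalWriterState
    {
        XLogRecPtr  write_pos;      /* how far WAL has been write()n */
        XLogRecPtr  flush_pos;      /* how far WAL has been fdatasync()ed */
        int64_t     last_flush_ms;  /* time of the last flush */
    } WalWriterState;

    static bool
    wal_writer_should_flush(const WalWriterState *st, int64_t now_ms,
                            int wal_writer_delay_ms,
                            uint64_t wal_buffers_bytes)
    {
        uint64_t    unflushed = st->write_pos - st->flush_pos;
        uint64_t    flush_after = wal_buffers_bytes / 10;  /* assumed 10% */

        if (unflushed == 0)
            return false;       /* nothing new to flush */

        /* minimum flush rate under activity: at least once per delay */
        if (now_ms - st->last_flush_ms >= wal_writer_delay_ms)
            return true;

        /* minimum batch size: no cache flush for every couple of pages */
        return unflushed >= flush_after;
    }

The write()s themselves could keep happening at the same rate as today;
only the decision when to issue the fdatasync() would follow this rule.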

Andres



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else.  It seems likely to be inadvertent fallout from this
> > patch:
> > 
> > Author: Simon Riggs <simon@2ndQuadrant.com>
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> > 
> >     Wakeup WALWriter as needed for asynchronous commit performance.
> >     Previously we waited for wal_writer_delay before flushing WAL. Now
> >     we also wake WALWriter as soon as a WAL buffer page has filled.
> >     Significant effect observed on performance of asynchronous commits
> >     by Robert Haas, attributed to the ability to set hint bits on tuples
> >     earlier and so reducing contention caused by clog lookups.
> 
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of the wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because
the super-frequent fdatasync()s by the wal writer slow down concurrent
writes (bgwriter, checkpointer, individual backend writes)
dramatically. To the point that SIGSTOPing the wal writer gets a pgbench
workload from 2995 to 10887 tps.  The reason fdatasyncs cause a slowdown
is that they prevent real use of queuing to the storage devices.


On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings).    But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay.  This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next.  That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent.  I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
> 
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in too granular a fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once exceeded, a
> minimum number of pages per flush. I think we even could continue to
> write() the data at the same rate as today, we just would need to reduce
> the number of fdatasync()s we issue. And possibly could make the
> eventual fdatasync()s cheaper by hinting the kernel to write them out
> earlier.
> 
> Now the question what the minimum number of pages we want to flush for
> (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
> 
> By tying it to both a minimum rate under activity (ensuring things go to
> disk fast) and a minimum number of pages to sync (ensuring a reasonable
> number of cache flush operations) we should be able to mostly accommodate
> the different types of workloads. I think.

This unfortunately leaves out part of the reasoning for the above
commit: We want WAL to be flushed fast, so we can immediately set hint
bits.

One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but use rules like those suggested above to
guide the actual flushing, additionally using operations like
sync_file_range() (and equivalents on other OSs).  Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written, but not flushed.
That has the potential to be bad in a number of other cases though :(

Andres



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in too granular a fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we even could continue to
> > write() the data at the same rate as today, we just would need to reduce
> > the number of fdatasync()s we issue. And possibly could make the
> > eventual fdatasync()s cheaper by hinting the kernel to write them out
> > earlier.
> >
> > Now the question what the minimum number of pages we want to flush for
> > (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> > simple model would be to statically tie it to the size of wal_buffers;
> > say, don't flush unless at least 10% of XLogBuffers have been written
> > since the last flush. More complex approaches would be to measure the
> > continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we can immediately set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like those suggested above to
> guide the actual flushing, additionally using operations like
> sync_file_range() (and equivalents on other OSs).  Then, to address the
> regression of SetHintBits() having to bail out more often, actually
> trigger a WAL flush whenever WAL is already written, but not flushed.
> That has the potential to be bad in a number of other cases though :(

Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But it's actually fairly common
that the page's LSN is already newer than the commitLSN - in which case
we, afaics, can just go ahead and set the hint bit, no?

So, instead of

    if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
        return;                 /* not flushed yet, so don't set hint */

we do

    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
        && BufferGetLSNAtomic(buffer) < commitLSN)
        return;                 /* not flushed yet, so don't set hint */

In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
a large portion of the hint writes that we currently skip.

Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal-writer                              12827 tps, 95 % IO util, 93 % CPU
no flushing in wal writer *                     13185 tps, 46 % IO util, 93 % CPU
no flushing in wal writer & above change        16366 tps, 41 % IO util, 95 % CPU
flushing in wal writer & above change:          14812 tps, 94 % IO util, 95 % CPU

* sometimes the results initially were much lower, with lots of lock
  contention. Can't figure out why that's only sometimes the case. In those
  cases the results were more like 8967 tps.
 

these aren't meant as thorough benchmarks, just to provide some
orientation.


Now that solution won't improve every situation, e.g. for a workload
that inserts a lot of rows in one transaction, and only does inserts, it
probably won't do all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above we should be ok.

Andres



Re: checkpointer continuous flushing

От
Alvaro Herrera
Дата:
Andres Freund wrote:

> The relevant thread is at
> http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> what I didn't remember is that I voiced concern back then about exactly this:
> http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> ;)

Interesting.  If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > The relevant thread is at
> > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > what I didn't remember is that I voiced concern back then about exactly this:
> > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > ;)
> 
> Interesting.  If we consider for a minute that part of the cause for the
> slowdown is slowness in pg_clog, maybe we should reconsider the initial
> decision to flush as quickly as possible (i.e. adopt a strategy where
> walwriter sleeps a bit between two flushes) in light of the group-update
> feature for CLOG being proposed by Amit Kapila in another thread -- it
> seems that these things might go hand-in-hand.

I don't think it's strongly related - the contention here is on read
access to the clog, not on write access. While Amit's patch will reduce
the impact of that a bit, I don't see it making a fundamental
difference.

Andres



Re: checkpointer continuous flushing

От
Amit Kapila
Дата:
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> >
> > > The relevant thread is at
> > > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > > what I didn't remember is that I voiced concern back then about exactly this:
> > > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > > ;)
> >
> > Interesting.  If we consider for a minute that part of the cause for the
> > slowdown is slowness in pg_clog, maybe we should reconsider the initial
> > decision to flush as quickly as possible (i.e. adopt a strategy where
> > walwriter sleeps a bit between two flushes) in light of the group-update
> > feature for CLOG being proposed by Amit Kapila in another thread -- it
> > seems that these things might go hand-in-hand.
>
> I don't think it's strongly related - the contention here is on read
> access to the clog, not on write access.

Don't reads on clog contend with parallel writes to clog?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

От
Andres Freund
Дата:
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
> On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's strongly related - the contention here is on read
> > access to the clog, not on write access.
> 
> Don't reads on clog contend with parallel writes to clog?

Sure. But you're not going to beat "no access to the clog" due to hint
bits, by making parallel writes a bit better citizens.



Re: checkpointer continuous flushing

От
Robert Haas
Дата:
On Wed, Jan 20, 2016 at 9:02 AM, Andres Freund <andres@anarazel.de> wrote:
> Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
> SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
> because we can't easily set the LSN. But it's actually fairly common
> that the page's LSN is already newer than the commitLSN - in which case
> we, afaics, can just go ahead and set the hint bit, no?
>
> So, instead of
>                 if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
>                         return;                         /* not flushed yet, so don't set hint */
> we do
>                 if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
>                         && BufferGetLSNAtomic(buffer) < commitLSN)
>                         return;                         /* not flushed yet, so don't set hint */
>
> In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
> a large portion of the hint writes that we currently skip.

Dang.  That's a really good idea.  Although I think you'd probably
better revise the comment, since it will otherwise be false.
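
Something along these lines, perhaps (just a sketch of a possible wording,
not a final proposal):

    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
        && BufferGetLSNAtomic(buffer) < commitLSN)
        return;     /* WAL not flushed yet and the page LSN is older than
                     * the commit LSN, so don't set the hint */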

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

От
Alvaro Herrera
Дата:
This patch got its fair share of reviewer attention this commitfest.
Moving to the next one.  Andres, if you want to commit ahead of time
you're of course encouraged to do so.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by  changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer  tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable  by individual GUCs. Without that I still saw
thespiked in a lot of circumstances.
 

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.
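
To give an idea of what point 2 above amounts to, here is a very rough
sketch of a flushing context that remembers buffer tags as they are written
and drains them as kernel writeback hints once a small batch has
accumulated (all names and the batch size are invented for this sketch; the
actual API in the branch differs):

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define BLCKSZ      8192
    #define FLUSH_AFTER 32      /* issue hints every 32 writes (256kB) */

    typedef struct PendingWrite
    {
        int     fd;             /* file the buffer was written to */
        long    blocknum;       /* block number within that file */
    } PendingWrite;

    typedef struct WritebackSketch
    {
        int          npending;
        PendingWrite pending[FLUSH_AFTER];
    } WritebackSketch;

    /* Ask the kernel to start writeback of everything remembered so far. */
    static void
    drain_writebacks(WritebackSketch *wb)
    {
        for (int i = 0; i < wb->npending; i++)
            (void) sync_file_range(wb->pending[i].fd,
                                   (off_t) wb->pending[i].blocknum * BLCKSZ,
                                   BLCKSZ, SYNC_FILE_RANGE_WRITE);
        wb->npending = 0;
    }

    /* To be called after each buffer the checkpointer write()s. */
    static void
    schedule_writeback(WritebackSketch *wb, int fd, long blocknum)
    {
        wb->pending[wb->npending].fd = fd;
        wb->pending[wb->npending].blocknum = blocknum;
        if (++wb->npending >= FLUSH_AFTER)
            drain_writebacks(wb);   /* IO proceeds while we keep writing */
    }

Since the checkpointer already writes the buffers in sorted order, the
hinted ranges end up mostly sequential, which is the whole point.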

Regards,

Andres 



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
> 
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>    changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>    tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>    by individual GUCs. Without that I still saw the spikes in a lot of circumstances.
> 
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...

Andres



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
Hello Andres,

> Any comments before I spend more time polishing this?

I'm running tests on various settings, I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.

> I'm currently updating docs and comments to actually describe the 
> current state...

I did notice the mismatched documentation.

I think I would appreciate comments to understand why/how the ringbuffer 
is used, and more comments in general, so it is fine if you improve this 
part.

Minor details:

"typedefs.list" should be updated to WritebackContext.

"WritebackContext" is a typedef, "struct" is not needed.


I'll look at the code more deeply probably over next weekend.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer is
> used, and more comments in general, so it is fine if you improve this part.

I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.

Thanks,

Andres



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
>> I think I would appreciate comments to understand why/how the 
>> ringbuffer is used, and more comments in general, so it is fine if you 
>> improve this part.
>
> I'd suggest to leave out the ringbuffer/new bgwriter parts.

Ok, so the patch would only include the checkpointer stuff.

I'll look at this part in detail.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>>> I think I would appreciate comments to understand why/how the 
>>> ringbuffer is used, and more comments in general, so it is fine if
>you 
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
>Ok, so the patch would only include the checkpointer stuff.
>
>I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new bgwriter, but it's not really worthwhile
until sorting and flushing are in...
 

Andres 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently.  The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
      potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
      flushing every wal_writer_delay ms or wal_writer_flush_after
      bytes.

Greetings,

Andres Freund


Re: checkpointer continuous flushing - V16

От
Robert Haas
Дата:
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I previously reviewed 0001 and I think it's fine.  I haven't reviewed
0002 in detail, but I like the concept.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
Hello Andres,

> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>      potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>      flushing every wal_writer_delay ms or wal_writer_flush_after
>      bytes.

I've looked at these patches, especially the whole bunch of explanations 
and comments, which is a good source for understanding what is going on in 
the WAL writer, a part of pg I'm not familiar with.

When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:
 0: the thing is written in some internal buffer

 1: the buffer is advised to be passed to the OS (hint bits?)

 2: the buffer is actually passed to the OS (write, flush)

 3: the OS is advised to send the written data to the io subsystem
    (sync_file_range with SYNC_FILE_RANGE_WRITE)

 4: the OS is required to send the written data to the disk
    (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

It is not clear when reading the text which level is discussed. In 
particular, I'm not sure that "flush" refers to level 2, which is 
misleading. When reading the description, I'm rather under the impression 
that it is about level 4, but then if actual fsyncs are performed every 200 
ms then the tps would be very low...

After more consideration, my final understanding is that this behavior 
only occurs with "asynchronous commit", aka a situation where COMMIT does 
not wait for data to be really fsynced, but the fsync is to occur within 
some delay so it will not be too far away, some kind of compromise for 
performance where commits can be lost.

Now all this is somewhat alien to me because the whole point of committing 
is having the data on disk, and I would not consider a database to be safe 
if commit does not imply fsync, but I understand that people may have to 
compromise for performance.

Is my understanding right?

-- 
Fabien.



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
> 
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I've pushed these after some more polishing, now working on the next
two.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bench of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> 
> When reading the patch 0002 explanations, I had the following comments:
> 
> AFAICS, there are several levels of actions when writing things in pg:
> 
>  0: the thing is written in some internal buffer
> 
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.


>  2: the buffer is actually passed to the OS (write, flush)
> 
>  3: the OS is advised to send the written data to the io subsystem
>     (sync_file_range with SYNC_FILE_RANGE_WRITE)
> 
>  4: the OS is required to send the written data to the disk
>     (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.


0002 is about something different, it's about the WAL writer. Which
writes WAL to disk, so individual backends don't have to. It does so in
the background every wal_writer_delay or whenever a transaction
asynchronously commits.  The reason this interacts with checkpoint
flushing is that, when we flush writes on a regular pace, the writes by
the checkpointer happen inbetween the very frequent writes/fdatasync()
by the WAL writer. That means the disk's caches are flushed every
fdatasync() - which causes considerable slowdowns.  On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
the regular sync_file_range calls slowed down things too much.

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.


> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.


> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable for every scenario, but in a *lot* of
real-world scenario a sub-second loss window doesn't have any actual
negative implications.


Andres



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below the results of a lot of tests with pgbench to exercise checkpoints 
on the above version when fetched.

Overall comments:
 - sorting & flushing is basically always a winner
 - benchmarking with short runs on large databases is a bad idea
   the results are very different if a longer run is used
   (see andres00b vs andres00c)

# HOST/SOFT
  16 GB 2 cpu 8 cores
  200 GB RAID1 HDD, ext4 FS
  Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

  tps: the "excluding connection" time tps, the higher the better

  1-sec tps: average of the measured per-second tps
    note - it should be the same as the previous one, but due to various
           hazards in the trace, especially when things go badly and pg gets
           stuck, it may be different. Such hazards also explain why there
           may be some non-integer tps reported for some seconds.

  stddev: standard deviation, the lower the better

  the five figures in brackets give a feel of the distribution:
   - min: minimal per-second tps seen in the trace
   - q1: first quartile per-second tps seen in the trace
   - med: median per-second tps seen in the trace
   - q3: third quartile per-second tps seen in the trace
   - max: maximal per-second tps seen in the trace

  the last percentage, dubbed "<=10.0", is the percent of seconds where
  performance is below 10 tps: this measures how unresponsive pg was
  during the run

###### TINY2

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 10 (~ 200 MB)

 postgresql.conf:
   shared_buffers = 1GB
   max_wal_size = 1GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
      1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
      2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

    0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
      1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
      2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

   32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
      1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
      2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

   64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
      1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
      2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?

Pretty small setup, all data fit in buffers. Good tps performance all around
(best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction
in tps stddev.

###### SMALL

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 120 (~ 2 GB)

 postgresql.conf:
   shared_buffers = 2GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
      1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
      2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
      3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
      4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%

    0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
      1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
      2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
      3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
      4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%

   32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
      1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
      2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
      3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
      4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%

   64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
      1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
      2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
      3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
      4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%

Sorting brings * 3.3 tps, flushing significantly reduces tps stddev.
Pg goes from 80% unresponsive to nearly always responsive.

###### MEDIUM

 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
   with scale = 250 (~ 3.8 GB)

 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 4GB
   checkpoint_timeout = 15min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 |  214.8 /  211.8 ± 513.7 [0.0,   1.0,    4.0,    5.0, 2344.0] 82.4%
      1 |  219.2 /  215.0 ± 524.1 [0.0,   0.0,    4.0,    5.0, 2316.0] 82.2%
      2 |  240.9 /  234.6 ± 550.8 [0.0,   0.0,    4.0,    6.0, 2320.2] 81.0%

    0 0 | 1064.7 / 1065.3 ± 888.2 [0.0,  11.0, 1089.0, 2017.7, 2461.9] 24.7%
      1 | 1060.2 / 1061.2 ± 889.9 [0.0,  10.0, 1056.7, 2022.0, 2444.9] 25.1%
      2 | 1060.2 / 1061.4 ± 889.1 [0.0,   9.0, 1085.8, 2002.8, 2473.0] 25.6%

   32 0 | 1059.4 / 1059.4 ± 476.3 [3.0, 804.9,  980.0, 1123.0, 2448.1]  0.1%
      1 | 1062.5 / 1062.6 ± 475.6 [0.0, 807.0,  988.0, 1132.0, 2441.0]  0.1%
      2 | 1063.7 / 1063.7 ± 475.4 [0.0, 814.0,  987.0, 1131.2, 2432.1]  0.1%

   64 0 | 1052.6 / 1052.6 ± 475.3 [0.0, 793.0,  974.0, 1118.1, 2445.1]  0.1%
      1 | 1059.8 / 1059.8 ± 475.1 [0.0, 799.0,  987.5, 1131.0, 2457.1]  0.1%
      2 | 1058.5 / 1058.5 ± 472.8 [0.0, 807.0,  985.0, 1127.7, 2442.0]  0.1%

Sorting brings * 4.8 tps, flushing significantly reduces tps stddev.
Pg goes from over 80% unresponsive to nearly always responsive.

Performance is significantly better than "small" above, probably thanks to
the longer checkpoint timeout.


###### LARGE

 pgbench -M prepared -N -P 1 -T 7500 -j 2 -c 4
   with scale = 1000 (~ 15 GB)

 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 2GB
   checkpoint_timeout = 40min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 |  68.7 /  65.3 ± 78.6 [0.0,  3.0,   6.0, 136.0, 291.0] 53.1%
      1 |  70.6 /  70.3 ± 80.1 [0.0,  4.0,  10.0, 151.0, 282.0] 50.1%
      2 |  74.3 /  75.8 ± 84.9 [0.0,  4.0,   9.0, 162.0, 311.2] 50.3%

    0 0 | 117.2 / 116.9 ± 83.8 [0.0, 14.0, 139.0, 193.0, 372.4] 24.0%
      1 | 117.3 / 117.8 ± 83.8 [0.0, 16.0, 140.0, 193.0, 279.0] 23.9%
      2 | 117.6 / 118.2 ± 84.1 [0.0, 16.0, 141.0, 194.0, 297.8] 23.7%

   32 0 | 114.2 / 114.2 ± 45.7 [0.0, 84.0, 100.0, 131.0, 613.6]  0.4%
      1 | 112.5 / 112.6 ± 44.0 [0.0, 83.0,  98.0, 130.0, 293.0]  0.2%
      2 | 108.0 / 108.0 ± 44.7 [0.0, 79.0,  94.0, 124.0, 303.6]  0.3%

   64 0 | 113.0 / 113.0 ± 45.5 [0.0, 83.0,  99.0, 131.0, 289.0]  0.4%
      1 |  80.0 /  80.3 ± 39.1 [0.0, 56.0,  72.0,  95.0, 281.0]  0.8% ??
      2 | 112.2 / 112.3 ± 44.5 [0.0, 82.0,  99.0, 129.0, 282.0]  0.3%

Data do not fit in the available memory, so plenty of read accesses.
Sorting still has some impact on tps performance (* 1.6), flushing
greatly improves responsiveness.


###### ANDRES00

 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
   with scale = 800 (~ 13 GB)

 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 328.7 / 329.9 ± 716.9 [0.0, 0.0,   0.0,    0.0, 3221.2] 77.7%
      1 | 338.2 / 338.7 ± 728.6 [0.0, 0.0,   0.0,   17.0, 3296.3] 75.0%
      2 | 304.5 / 304.3 ± 705.5 [0.0, 0.0,   0.0,    0.0, 3463.4] 79.3%

    0 0 | 425.6 / 464.0 ± 724.0 [0.0, 0.0,   0.0, 1000.6, 3363.7] 61.0%
      1 | 461.5 / 463.1 ± 735.8 [0.0, 0.0,   0.0, 1011.2, 3490.9] 58.7%
      2 | 452.4 / 452.6 ± 744.3 [0.0, 0.0,   0.0, 1078.9, 3631.9] 63.3%

   32 0 | 514.4 / 515.8 ± 651.8 [0.0, 0.0, 337.4,  808.3, 2876.0] 40.7%
      1 | 512.0 / 514.6 ± 661.6 [0.0, 0.0, 317.6,  690.8, 3315.8] 35.0%
      2 | 529.5 / 530.3 ± 673.0 [0.0, 0.0, 321.1,  906.4, 3360.8] 40.3%

   64 0 | 529.6 / 530.9 ± 668.2 [0.0, 0.0, 322.1,  786.1, 3538.0] 33.3%
      1 | 496.4 / 498.0 ± 606.6 [0.0, 0.0, 321.4,  746.0, 2629.6] 36.3%
      2 | 521.0 / 521.7 ± 657.0 [0.0, 0.0, 328.4,  737.9, 3262.9] 34.3%

Data just about holds in memory, maybe. The run is very short and the settings
are low; this is not representative of a sane installation, it is for testing
a lot of checkpoints in a difficult situation. Sorting and flushing do bring
significant benefits.


###### ANDRES00b (same as ANDRES00 but scale 800->1000)

 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
   with scale = 1000 (~ 15 GB)

 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 150.2 / 150.3 ± 401.6 [0.0,   0.0,   0.0,   0.0, 2199.4] 75.1%
      1 | 139.2 / 139.2 ± 372.2 [0.0,   0.0,   0.0,   0.0, 2111.4] 78.3% ***
      2 | 127.3 / 127.1 ± 341.2 [0.0,   0.0,   0.0,  53.0, 2144.3] 74.7% ***

    0 0 | 199.0 / 209.2 ± 400.4 [0.0,   0.0,   0.0, 243.6, 1846.0] 65.7%
      1 | 220.4 / 226.7 ± 423.2 [0.0,   0.0,   0.0, 264.0, 1777.0] 63.5% *
      2 | 195.5 / 205.3 ± 337.9 [0.0,   0.0, 123.0, 212.0, 1721.9] 43.2%

   32 0 | 362.3 / 359.0 ± 308.4 [0.0, 200.0, 265.0, 416.4, 1816.6]  5.0%
      1 | 323.6 / 321.2 ± 327.1 [0.0, 142.9, 210.0, 353.4, 1907.0]  4.0%
      2 | 309.0 / 310.7 ± 381.3 [0.0, 122.0, 175.5, 298.0, 2090.4]  5.0%

   64 0 | 342.7 / 343.6 ± 331.1 [0.0, 143.0, 239.5, 409.9, 1623.6]  5.3%
      1 | 333.8 / 328.2 ± 356.3 [0.0, 132.9, 211.5, 358.1, 1629.1] 10.7% ??
      2 | 352.0 / 352.0 ± 332.3 [0.0, 163.5, 239.9, 400.1, 1643.4]  5.3%

A little bit larger than the previous one, so that it does not really fit in
memory. The performance impact is significant compared to the previous run.
Sorting and flushing bring * 2 tps, and unresponsiveness drops from 75% to a
much better 5%.

###### ANDRES00c (same as ANDRES00b but time 300 -> 4000)

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 115.2 / 114.3 ± 256.4 [0.0,   0.0,  75.0, 131.1, 3389.0] 46.5%
      1 | 118.4 / 117.9 ± 248.3 [0.0,   0.0,  87.0, 151.0, 3603.6] 46.7%
      2 | 120.1 / 119.2 ± 254.4 [0.0,   0.0,  91.0, 143.0, 3307.8] 43.8%

    0 0 | 217.4 / 211.0 ± 237.1 [0.0, 139.0, 193.0, 239.0, 3115.4] 16.8%
      1 | 216.2 / 209.6 ± 244.9 [0.0, 138.9, 188.0, 231.0, 3331.3] 16.3%
      2 | 218.6 / 213.8 ± 246.7 [0.0, 137.0, 187.0, 232.0, 3229.6] 16.2%

   32 0 | 146.6 / 142.5 ± 234.5 [0.0,  59.0,  93.0, 151.1, 3294.7] 17.5%
      1 | 148.0 / 142.6 ± 239.2 [0.0,  64.0,  95.9, 144.0, 3361.8] 16.0%
      2 | 147.6 / 140.4 ± 233.2 [0.0,  59.4,  94.0, 148.0, 3108.4] 18.0%

   64 0 | 145.3 / 140.5 ± 233.6 [0.0,  61.0,  93.0, 147.7, 3212.6] 16.5%
      1 | 145.6 / 140.3 ± 233.3 [0.0,  58.0,  93.0, 146.0, 3351.8] 17.3%
      2 | 147.7 / 142.2 ± 233.2 [0.0,  61.0,  97.0, 148.4, 3616.3] 17.0%

The only difference between ANDRES00B and ANDRES00C is the duration, from
5 minutes to 66 minutes. This shows that short runs can be widely misleading:
in particular, the longer run shows less than half the tps for some settings,
and the relative comparison of head vs sort vs sort+flush is different.

###### ANDRES00d (same as ANDRES00b but wal_level hot_standby->minimal)

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 191.6 / 195.1 ± 439.3 [0.0,   0.0,   0.0,   0.0, 2540.2] 76.3%
      1 | 211.3 / 213.6 ± 461.9 [0.0,   0.0,   0.0,  13.0, 3203.7] 75.0%
      2 | 152.4 / 154.9 ± 217.6 [0.0,   0.0,  58.0, 235.6,  995.9] 39.3% ???

    0 0 | 247.2 / 251.7 ± 454.0 [0.0,   0.0,   0.0, 375.3, 2592.4] 67.7%
      1 | 215.4 / 232.7 ± 446.5 [0.0,   0.0,   0.0, 103.0, 3046.7] 72.3%
      2 | 160.6 / 160.8 ± 222.1 [0.0,   0.0,  80.0, 209.6,  885.3] 42.0% ???

   32 0 | 399.9 / 397.0 ± 356.6 [0.0,  67.0, 348.0, 572.8, 2604.2] 21.0%
      1 | 391.8 / 392.5 ± 371.7 [0.0,  85.5, 314.4, 549.3, 2590.3] 20.7%
      2 | 406.1 / 404.8 ± 380.6 [0.0,  95.0, 348.5, 569.0, 3383.7] 21.3%

   64 0 | 395.9 / 396.1 ± 352.4 [0.0,  89.5, 342.5, 556.0, 2366.9] 17.7%
      1 | 355.1 / 351.9 ± 296.7 [0.0, 172.5, 306.1, 468.1, 1663.5] 16.0%
      2 | 403.6 / 401.8 ± 390.5 [0.0,   0.0, 337.0, 636.1, 2591.3] 26.7% ???

###### ANDRES00e (same as ANDRES00b but maintenance_work_mem=2GB->64MB)

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 153.5 / 161.3 ± 401.3 [0.0,   0.0,   0.0,   0.0, 2546.0] 82.0%
      1 | 170.7 / 175.9 ± 399.9 [0.0,   0.0,   0.0,  14.0, 2537.4] 74.7%
      2 | 184.7 / 190.4 ± 389.2 [0.0,   0.0,   0.0, 158.5, 2544.6] 69.3%

    0 0 | 211.2 / 227.8 ± 418.8 [0.0,   0.0,   0.0, 334.6, 2589.3] 65.7%
      1 | 221.7 / 226.0 ± 415.7 [0.0,   0.0,   0.0, 276.8, 2588.2] 68.4%
      2 | 232.5 / 233.2 ± 403.5 [0.0,   0.0,   0.0, 377.0, 2260.2] 62.0%

   32 0 | 373.2 / 374.4 ± 309.2 [0.0, 180.6, 321.8, 475.2, 2596.5] 11.3%
      1 | 348.7 / 348.1 ± 328.4 [0.0, 127.0, 284.1, 451.9, 2595.1] 17.3%
      2 | 376.3 / 375.3 ± 315.5 [0.0, 186.5, 329.6, 487.1, 2365.4] 15.3%

   64 0 | 388.9 / 387.8 ± 348.7 [0.0, 164.0, 305.9, 546.5, 2587.2] 15.0%
      1 | 380.3 / 378.7 ± 338.8 [0.0, 171.1, 317.4, 524.8, 2592.4] 16.7%
      2 | 369.8 / 367.4 ± 340.5 [0.0,  77.4, 320.6, 525.5, 2484.7] 20.7%
 

Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not too consistent between settings, probably
because as the memory is used to its limit the performance is quite
sensitive to the available memory.

-- 
Fabien.

Re: checkpointer continuous flushing - V16

От
Andres Freund
Дата:
Hi,

On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below the results of a lot of tests with pgbench to exercise checkpoints on
> the above version when fetched.

Wow, that's a great test series.


> Overall comments:
>  - sorting & flushing is basically always a winner
>  - benchmarking with short runs on large databases is a bad idea
>    the results are very different if a longer run is used
>    (see andres00b vs andres00c)

Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.

I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.


>  16 GB 2 cpu 8 cores
>  200 GB RAID1 HDD, ext4 FS
>  Ubuntu 12.04 LTS (precise)

That's with 12.04's standard kernel?



>  postgresql.conf:
>    shared_buffers = 1GB
>    max_wal_size = 1GB
>    checkpoint_timeout = 300s
>    checkpoint_completion_target = 0.8
>    checkpoint_flush_after = { none, 0, 32, 64 }

Did you re-initdb between the runs?


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.

That's probably because of differing behaviour of autovacuum/vacuum,
which sometime will have to do several scans of the tables if there are
too many dead tuples.


Regards,

Andres



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
Hello.

> Based on these results I think 32 will be a good default for
> checkpoint_flush_after? There's a few cases where 64 showed to be
> beneficial, and some where 32 is better. I've seen 64 perform a bit
> better in some cases here, but the differences were not too big.

Yes, these many runs show that 32 is basically as good or better than 64.

I'll do some runs with 16/48 to have some more data.

> I gather that you didn't play with 
> backend_flush_after/bgwriter_flush_after, i.e. you left them at their 
> default values? Especially backend_flush_after can have a significant 
> positive and negative performance impact.

Indeed, non reported configuration options have their default values. 
There were also minor changes in the default options for logging (prefix, 
checkpoint, ...), but nothing significant, and always the same for all 
runs.

>>  [...] Ubuntu 12.04 LTS (precise)
>
> That's with 12.04's standard kernel?

Yes.

>>    checkpoint_flush_after = { none, 0, 32, 64 }
>
> Did you re-initdb between the runs?

Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).

> I've seen massively varying performance differences due to autovacuum
> triggered analyzes. It's not completely deterministic when those run,
> and on bigger scale clusters analyze can take ages, while holding a
> snapshot.

Yes, I agree that probably the performance changes on long vs short runs 
(andres00c vs andres00b) is due to autovacuum.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

От
Patric Bechtel
Дата:

Hi Fabien,

Fabien COELHO schrieb am 19.02.2016 um 16:04:
> 
>>> [...] Ubuntu 12.04 LTS (precise)
>> 
>> That's with 12.04's standard kernel?
> 
> Yes.

Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO somehow. The difference
to 3.13 (the latest LTS kernel for 12.04) is huge.


https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

/Patric



Re: checkpointer continuous flushing - V16

От
Fabien COELHO
Дата:
Hallo Patric,

> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify 
> IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is 
> huge.
>
>
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

Interesting! To summarize it, 25% performance degradation from best kernel 
(2.6.32) to worst (3.2.0), that is indeed significant.

> You might consider upgrading your kernel to 3.13 LTS. It's quite easy 
> [...]

There are other stuff running on the hardware that I do not wish to touch, 
so upgrading the particular host is currently not an option, otherwise I 
would have switched to trusty.

Thanks for the pointer.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Andres Freund
Дата:
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Hi,
>
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

I've updated the git tree.

Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.

For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
  delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
  understand. Some language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes.
* Currently *_flush_after can be set to a nonzero value, even if there's
  no support for flushing on that platform. Imo that's ok, but perhaps
  other people's opinion differ.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
  it doesn't take a lot of time.
* Minor comment polishing.

I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
  inclined to leave it off till then.


Regards,

Andres

Вложения

Re: checkpointer continuous flushing - V18

От
Fabien COELHO
Дата:
Hello Andres,

> Here's the next two (the most important) patches of the series:
> 0001: Allow to trigger kernel writeback after a configurable number of writes.
> 0002: Checkpoint sorting and balancing.

I will look into these two in depth.

Note that I would have ordered them in reverse because sorting is nearly 
always very beneficial, and "writeback" (formerly called flushing) is then 
nearly always very beneficial on sorted buffers.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Andres Freund
Дата:
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Here's the next two (the most important) patches of the series:
> >0001: Allow to trigger kernel writeback after a configurable number of writes.
> >0002: Checkpoint sorting and balancing.
> 
> I will look into these two in depth.
> 
> Note that I would have ordered them in reverse because sorting is nearly
> always very beneficial, and "writeback" (formely called flushing) is then
> nearly always very beneficial on sorted buffers.

I had it that way earlier. I actually saw pretty large regressions from
sorting alone in some cases as well, apparently because the kernel
submits much larger IOs to disk; although that probably only shows on
SSDs.  This way the modifications imo look a trifle better ;). I'm
intending to commit both at the same time, keep them separate only
because they're easier to understand separately.

Andres



Re: checkpointer continuous flushing - V16

От
Michael Paquier
Дата:
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>>
>>
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
>
> Interesting! To summarize it, 25% performance degradation from best kernel
> (2.6.32) to worst (3.2.0), that is indeed significant.

As far as I recall, the OS cache eviction is very aggressive in 3.2,
so it would be possible that data from the FS cache that was just read
could be evicted even if it was not used yet. This represents a large
difference when the database does not fit in RAM.
-- 
Michael



Re: checkpointer continuous flushing - V18

От
Fabien COELHO
Дата:
Hello Andres,

> For 0001 I've recently changed:
> * Don't schedule writeback after smgrextend() - that defeats linux
>  delayed allocation mechanism, increasing fragmentation noticeably.
> * Add docs for the new GUC variables
> * comment polishing
> * BackendWritebackContext now isn't dynamically allocated anymore
>
>
> I think this patch primarily needs:
> * review of the docs, not sure if they're easy enough to
>  understand. Some language polishing might also be needed.

Yep, see below.

> * review of the writeback API, combined with the smgr/md.c changes.

See various comments below.

> * Currently *_flush_after can be set to a nonzero value, even if there's
>  no support for flushing on that platform. Imo that's ok, but perhaps
>  other people's opinion differ.

In some previous version I think a warning was shown if the feature was
requested but not available.


Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation 
issues on other systems, see below.

When pages are written by a process (checkpointer, bgwriter, backend), a list
of recently written pages is kept, and every so often an advisory fsync
(sync_file_range on Linux, other options for other systems) is issued so that
the data is sent to the IO subsystem without relying on more or less
(un)controllable OS policy.
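
As a reader's digest, the mechanism is something like the following sketch
(illustration only, with made-up names, not the patch's actual API):

  #define _GNU_SOURCE
  #include <fcntl.h>

  #define MAX_PENDING 32                   /* cf. checkpoint_flush_after */

  typedef struct { int fd; off_t offset; size_t nbytes; } PendingWrite;
  typedef struct { int npending; PendingWrite pending[MAX_PENDING]; } FlushContext;

  /* remember a write; once enough have accumulated, hint the kernel to
   * start writing them back, without waiting for completion */
  static void
  remember_write(FlushContext *ctx, int fd, off_t offset, size_t nbytes)
  {
      ctx->pending[ctx->npending++] = (PendingWrite){ fd, offset, nbytes };
      if (ctx->npending == MAX_PENDING)
      {
          for (int i = 0; i < ctx->npending; i++)
              (void) sync_file_range(ctx->pending[i].fd, ctx->pending[i].offset,
                                     ctx->pending[i].nbytes, SYNC_FILE_RANGE_WRITE);
          ctx->npending = 0;
      }
  }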

The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything look kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kinds
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine to have the flushing logic there. I'm wondering why you
chose 16 & 64 as defaults for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neighboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?
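
Just to check that I read it correctly, the merge rule is essentially the
following (my reading, expressed as a self-contained C sketch with stand-in
types, not a quote of the patch):

  typedef struct { int file_id; unsigned first_block; unsigned nblocks; } Range;

  /* extend the pending range only when the new block is the immediate
   * successor within the same file; otherwise the caller issues one
   * sync_file_range() for the range and starts a new one */
  static int
  try_merge(Range *cur, int file_id, unsigned blocknum)
  {
      if (cur->file_id == file_id &&
          blocknum == cur->first_block + cur->nblocks)
      {
          cur->nblocks++;
          return 1;
      }
      return 0;
  }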

mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back one day if it is needed, which I rather doubt.

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggests that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost if several tablespaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think, it is already checked in the initialization and when setting
gucs.

SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You chose not to take advantage
of this behavior, why?

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Andres Freund
Дата:
Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> >* Currently *_flush_after can be set to a nonzero value, even if there's
> > no support for flushing on that platform. Imo that's ok, but perhaps
> > other people's opinion differ.
> 
> In some previous version I think a warning was shown of the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
> 
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid
mistake.


> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything look kind of
> fuzzy and raises doubt for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.


> The sgml documentation about "*_flush_after" configuration parameter talks
> about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
it depends on compilation settings how much data it actually is and
such.
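
I.e., with the default 8kB block size, these two spellings are equivalent
(an illustration only, not a recommendation of the value):

   checkpoint_flush_after = 256kB   # 256kB / 8kB = 32 buffers
   checkpoint_flush_after = 32      # same setting, given directly in pages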

> Also, the maximum value (128 ?) should appear in the text. \

Right.


> In the discussion in the wal section, I'm not sure about the effect of
> setting writebacks on SSD, but I think that you have made some tests
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


> A good point of the whole approach is that it is available to all kind
> of pg processes.

Exactly.


> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means doing some
> compromise in the LRU/LFU logic...

The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


> well, all this is best kept for later, and I'm fine to have the logic
> flushing logic there. I'm wondering why you choose 16 & 64 as default
> for backends & bgwriter, though.

I chose a small value for backends because there often are a large
number of backends, and thus the amount of dirty data of each adds up. I
used a larger value for bgwriter because I saw that ending up using
bigger IOs.


> IssuePendingWritebacks: you merge only strictly neightboring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.


> mdwriteback: all variables could be declared within the while, I do not
> understand why some are in and some are out.

Right.


> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO
works as well?


> struct WritebackContext: keeping a pointer to guc variables is a kind of
> trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.


> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.



> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


> AFAICR I used a "flush context" for each table space in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per table space*, which suggest that there should be as
> many write backs contexts as table spaces, otherwise the positive effect
> may going to be totally lost of tables spaces are used. Any thoughts?

Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


> SyncOneBuffer: I'm wonder why you copy the tag after releasing the lock.
> I guess it is okay because it is still pinned.

Don't do things while holding a lock that don't require said lock. A pin
prevents a buffer changing its identity.


> For the checkpointer, a key aspect is that the scheduling process goes
> to sleep from time to time, and this sleep time looked like a great
> opportunity to do this kind of flushing. You choose not to take advantage
> of the behavior, why?

Several reasons: Most importantly there's absolutely no guarantee that
you'll ever end up sleeping, it's quite common to happen only
seldomly. If you're bottlenecked on IO, you can end up being behind all
the time. But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice.  It'd make the writeback infrastructure less generic.  I also
don't really believe it helps that much, although that's a complex
argument to make.


Thanks for the review!

Andres



Re: checkpointer continuous flushing - V18

От
Robert Haas
Дата:
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund <andres@anarazel.de> wrote:
>> The documentation seems to use "flush" but the code talks about "writeback"
>> or "flush", depending. I think one vocabulary, whichever it is, should be
>> chosen and everything should stick to it, otherwise everything look kind of
>> fuzzy and raises doubt for the reader (is it the same thing? is it something
>> else?). I initially used "flush", but it seems a bad idea because it has
>> nothing to do with the flush function, so I'm fine with writeback or anything
>> else, I just think that *one* word should be chosen and used everywhere.
>
> Hm.

I think there might be a semantic distinction between these two terms.
Doesn't writeback mean writing pages to disk, and flushing mean making
sure that they are durably on disk?  So for example when the Linux
kernel thinks there is too much dirty data, it initiates writeback,
not a flush; on the other hand, at transaction commit, we initiate a
flush, not writeback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V18

От
Fabien COELHO
Дата:
Hallo Andres,

>> In some previous version I think a warning was shown if the feature was
>> requested but not available.
>
> I think we should either silently ignore it, or error out. Warnings
> somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do 
not behave as expected that is where I'll be looking. I do not think that 
it should error out.

>> The sgml documentation about "*_flush_after" configuration parameter 
>> talks about bytes, but the actual unit should be buffers.
>
> The unit actually is buffers, but you can configure it using
> bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
> ...). Refering to bytes is easier because you don't have to explain that
> it depends on compilation settings how many data it actually is and
> such.

So I understand that it works with kb as well. Now I do not think that it 
would need a lot of explanations if you say that it is a number of pages, 
and I think that a number of pages is significant because it is a number 
of IO requests to be coalesced, eventually.

>> In the discussion in the wal section, I'm not sure about the effect of
>> setting writebacks on SSD, [...]
>
> Yea, that paragraph needs some editing. I think we should basically
> remove that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive 
impact on SSD in your tests?

>> However it does not address the point that bgwriter and backends 
>> basically issue random writes, [...]
>
> The benefit is primarily that you don't collect large amounts of dirty
> buffers in the kernel page cache. In most cases the kernel will not be
> able to coalesce these writes either...  I've measured *massive*
> performance latency differences for workloads that are bigger than
> shared buffers - because suddenly bgwriter / backends do the majority of
> the writes. Flushing in the checkpoint quite possibly makes nearly no
> difference in such cases.

So I understand that there is a positive impact under some load. Good!

>> Maybe the merging strategy could be more aggressive than just strict
>> neighbors?
>
> I don't think so. If you flush more than neighbouring writes you'll
> often end up flushing buffers dirtied by another backend, causing
> additional stalls.

Ok. Maybe the neighbor definition could be relaxed just a little bit so
that small holes are jumped over, but not large holes? If there are only a few
pages in between, even if written by another process, then writing them
together should be better? Well, this can wait for a clear case, because
hopefully the OS will recoalesce them behind the scenes anyway.
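
For the record, the relaxation I have in mind would be something like this,
extending the earlier sketch (hypothetical threshold, illustration only):

  #define MAX_HOLE 4    /* allow merging across a gap of at most 4 blocks */

  static int
  try_merge_relaxed(Range *cur, int file_id, unsigned blocknum)
  {
      unsigned next = cur->first_block + cur->nblocks;

      if (cur->file_id == file_id &&
          blocknum >= next && blocknum <= next + MAX_HOLE)
      {
          /* the sync_file_range() call then simply covers the small hole */
          cur->nblocks = blocknum - cur->first_block + 1;
          return 1;
      }
      return 0;
  }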

>> struct WritebackContext: keeping a pointer to guc variables is a kind of
>> trick, I think it deserves a comment.
>
> It has, it's just in WritebackContextInit(). Can duplicate it.

I missed it, I expected something in the struct definition. Do not 
duplicate, but cross reference it?

>> IssuePendingWritebacks: I understand that qsort is needed "again"
>> because when balancing writes over tablespaces they may be intermixed.
>
> Also because the infrastructure is used for more than checkpoint
> writes. There's absolutely no ordering guarantees there.

Yep, but not much benefit to expect from a few dozen random pages either.

>> [...] I do think that this whole writeback logic really does make sense 
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other.  For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per
device as well (otherwise what is the point?), so you should want to coalesce
and "writeback" pages per device as well. sync_file_range calls on distinct
devices can probably be issued more or less randomly, and should not
interfere with one another.

If you use just one context, the more table spaces there are the smaller the
performance gains, because there is less and less aggregation and thus fewer
sequential writes per device.

So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, 
which would not be that bad, and I think that it is what is needed.
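
Since there are typically very few tablespaces, even a small array would do;
roughly (rough sketch with made-up names, assuming the patch's
WritebackContext struct and the usual Oid typedef):

  #define MAX_TABLESPACES 64

  typedef struct { Oid tablespace; WritebackContext wb; } TsWriteback;

  static TsWriteback ts_writebacks[MAX_TABLESPACES];
  static int n_ts_writebacks;

  /* return the writeback context of a tablespace, creating it on first use */
  static WritebackContext *
  get_ts_writeback(Oid tablespace)
  {
      for (int i = 0; i < n_ts_writebacks; i++)
          if (ts_writebacks[i].tablespace == tablespace)
              return &ts_writebacks[i].wb;
      ts_writebacks[n_ts_writebacks].tablespace = tablespace;
      return &ts_writebacks[n_ts_writebacks++].wb;
  }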

>> For the checkpointer, a key aspect is that the scheduling process goes
>> to sleep from time to time, and this sleep time looked like a great
>> opportunity to do this kind of flushing. You choose not to take advantage
>> of the behavior, why?
>
> Several reasons: Most importantly there's absolutely no guarantee that 
> you'll ever end up sleeping, it's quite common to happen only seldomly.

Well, that would be under a situation when pg is completely unresponsive. 
More so, this behavior *makes* pg unresponsive.

> If you're bottlenecked on IO, you can end up being behind all the time.

Hopefully sorting & flushing should improve this situation a lot.

> But even then you don't want to cause massive latency spikes
> due to gigabytes of dirty data - a slower checkpoint is a much better
> choice.  It'd make the writeback infrastructure less generic.

Sure. It would be sufficient to have a call to ask for writebacks 
independently of the number of writebacks accumulated in the queue, it 
does not need to change the infrastructure.

Also, I think that such a call would make sense at the end of the 
checkpoint.

> I also don't really believe it helps that much, although that's a 
> complex argument to make.

Yep. My thinking is that doing things in the sleeping interval does not 
interfere with the checkpointer scheduling, so it is less likely to go 
wrong and fall behind.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Fabien COELHO
Дата:
Hallo Andres,

>>> [...] I do think that this whole writeback logic really does make sense 
>>> *per table space*,
>> 
>> Leads to less regular IO, because if your tablespaces are evenly sized
>> (somewhat common) you'll sometimes end up issuing sync_file_range's
>> shortly after each other.  For latency outside checkpoints it's
>> important to control the total amount of dirty buffers, and that's
>> obviously independent of tablespaces.
>
> I do not understand/buy this argument.
>
> The underlying IO queue is per device, and table spaces should be per device 
> as well (otherwise what the point?), so you should want to coalesce and 
> "writeback" pages per device as wel. Calling sync_file_range on distinct 
> devices should probably be issued more or less randomly, and should not 
> interfere one with the other.
>
> If you use just one context, the more table spaces the less performance 
> gains, because there is less and less aggregation thus sequential writes per 
> device.
>
> So for me there should really be one context per tablespace. That would 
> suggest a hashtable or some other structure to keep and retrieve them, which 
> would not be that bad, and I think that it is what is needed.

Note: I think that an easy way to do that in the "checkpoint sort" patch
is simply to keep a WritebackContext in the CkptTsStatus structure, which is
per tablespace in the checkpointer.
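
I.e. roughly this shape (a sketch from memory, field names partly made up,
just to show where the context would live):

  typedef struct CkptTsStatus
  {
      Oid              tsId;            /* tablespace oid */
      int              num_to_scan;     /* dirty buffers of this tablespace */
      int              num_scanned;     /* progress so far */
      float8           progress;
      float8           progress_slice;
      WritebackContext wb_context;      /* added: per-tablespace writeback state */
  } CkptTsStatus;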

For bgwriter & backends it can wait, there is little "writeback" coalescing
because IO should be pretty random, so it does not matter much.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Fabien COELHO
Дата:
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that
>  it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache performance
issue when sorting on just buffer indexes. As it is always small, I agree
that it is not needed. Well, it could still be in seconds on a very
large shared buffers setting with a very large checkpoint, but then the
checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little
on the heavy side. Some compression can be achieved: sizeof(ForkNumber) is
4 bytes to hold 4 values, and could be one byte or even 2 bits somewhere.
Also, there are very few tablespaces, so they could be given a small number
and this number could be used instead of the Oid; the space requirement
could then be reduced to say 16 bytes per buffer by combining space & fork
in 2 shorts, keeping 4-byte alignment and also getting 8-byte alignment...
If this is too much, I have shown that it can work with only 4 bytes per
buffer, as the sorting is really just a performance optimisation and is not
broken if some stuff changes between sorting & writeback, but you did not
like the idea. If the amount of shared memory required is a significant
concern, it could be resurrected, though.
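
For instance, the 20 bytes could be brought down to 16 with a layout along
these lines (a rough sketch, not a worked-out proposal; tsNum would be a
small per-checkpoint tablespace number assigned while collecting buffers):

  typedef struct CkptSortItem
  {
      uint16      tsNum;        /* small tablespace number instead of the Oid */
      uint16      forkNum;      /* only 4 fork values exist, a short is plenty */
      Oid         relNode;
      BlockNumber blockNum;
      int         buf_id;
  } CkptSortItem;               /* 16 bytes instead of 20 */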

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8 
precision is okay for the purpose. I think it is quite certainly fine in 
the worst case with 32 bits buffer_ids, but it would not be if this size 
is changed someday.

* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then
those buffers are swept again to compute some per-tablespace data and
organise a heap.

ISTM that nearly all of the collected data on the second sweep could be 
collected on the first sweep, so that this second sweep could be avoided 
altogether. The only missing data is the index of the first buffer in the 
array, which can be computed by considering tablespaces only, sweeping 
over buffers is not needed. That would suggest creating the heap or using 
a hash in the initial buffer sweep to keep this information. This would 
also provide a point where to number tablespaces for compressing the 
CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of writes would make sense. This remark is independent of 
this patch. Probably it works fine because after a sleep the checkpointer 
is behind enough so that it will write a bunch of buffers before sleeping
again.

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There is some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
>  inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible to
force using it with an "#if 0" to skip sync_file_range and check whether
it does some good there. Idem for the "posix_fadvise" stuff. I can try to
do that, but it takes time to do so; if someone can test on another OS it
would be much better. I think that if it works it should be kept in, so it
is just a matter of testing it.
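
For reference, the mmap()/msync(MS_ASYNC) fallback being discussed looks,
in outline, like this (my simplified sketch, not the actual pg_flush_data
code; the offset is assumed page-aligned):

  #include <sys/types.h>
  #include <sys/mman.h>

  /* hint the kernel to start writing back [off, off+len) of fd, without
   * waiting, on platforms that lack sync_file_range() */
  static void
  flush_range_msync(int fd, off_t off, size_t len)
  {
      void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

      if (p != MAP_FAILED)
      {
          (void) msync(p, len, MS_ASYNC);   /* schedule writeback only */
          (void) munmap(p, len);
      }
  }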

-- 
Fabien.



Re: checkpointer continuous flushing - V18

От
Andres Freund
Дата:
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
> >>In the discussion in the wal section, I'm not sure about the effect of
> >>setting writebacks on SSD, [...]
> >
> >Yea, that paragraph needs some editing. I think we should basically
> >remove that last sentence.
> 
> Ok, fine with me. Does that mean that flushing as a significant positive
> impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with an
internal block size bigger than 8k, you'd even see larger differences.

> >>Maybe the merging strategy could be more aggressive than just strict
> >>neighbors?
> >
> >I don't think so. If you flush more than neighbouring writes you'll
> >often end up flushing buffers dirtied by another backend, causing
> >additional stalls.
> 
> Ok. Maybe the neightbor definition could be relaxed just a little bit so
> that small holes are overtake, but not large holes? If there is only a few
> pages in between, even if written by another process, then writing them
> together should be better? Well, this can wait for a clear case, because
> hopefully the OS will recoalesce them behind anyway.

I'm against doing so without clear measurements of a benefit.

> >Also because the infrastructure is used for more than checkpoint
> >writes. There's absolutely no ordering guarantees there.
> 
> Yep, but not much benefit to expect from a few dozens random pages either.

Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?

> >>[...] I do think that this whole writeback logic really does make
> >>sense *per table space*,
> >
> >Leads to less regular IO, because if your tablespaces are evenly sized
> >(somewhat common) you'll sometimes end up issuing sync_file_range's
> >shortly after each other.  For latency outside checkpoints it's
> >important to control the total amount of dirty buffers, and that's
> >obviously independent of tablespaces.
> 
> I do not understand/buy this argument.
> 
> The underlying IO queue is per device, and table spaces should be per device
> as well (otherwise what the point?), so you should want to coalesce and
> "writeback" pages per device as wel. Calling sync_file_range on distinct
> devices should probably be issued more or less randomly, and should not
> interfere one with the other.

The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


> If you use just one context, the more table spaces the less performance
> gains, because there is less and less aggregation thus sequential writes per
> device.
> 
> So for me there should really be one context per tablespace. That would
> suggest a hashtable or some other structure to keep and retrieve them, which
> would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


> >>For the checkpointer, a key aspect is that the scheduling process goes
> >>to sleep from time to time, and this sleep time looked like a great
> >>opportunity to do this kind of flushing. You choose not to take advantage
> >>of the behavior, why?
> >
> >Several reasons: Most importantly there's absolutely no guarantee that
> >you'll ever end up sleeping, it's quite common to happen only seldomly.
> 
> Well, that would be under a situation when pg is completely unresponsive.
> More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly. It'll just sometimes block in
sync_file_range(), but the IO queues will have enough space to
frequently give way to other backends, particularly to synchronous reads
(most pg reads) and synchronous writes (fdatasync()).  So a single
checkpoint will take a bit longer, but otherwise the system will mostly
keep up the work in a regular manner.  Without the sync_file_range()
calls the kernel will amass dirty buffers until global dirty limits are
reached, which then will bring the whole system to a standstill.

It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out, in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.
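
To make that concrete, here is a minimal standalone C sketch (not PostgreSQL
code; the file name, sizes and the 256kB threshold are only illustrative) of
the behaviour described above: keep writing, but ask the kernel to begin
writeback every few hundred kilobytes, so dirty data never accumulates up to
the global dirty limits and the final fsync() has little left to do.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const size_t blcksz = 8192;
    const off_t  flush_after = 32 * 8192;  /* 256kB, checkpoint_flush_after-like */
    char         buf[8192];
    off_t        written = 0, flushed = 0;
    int          fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 100000; i++)
    {
        if (write(fd, buf, blcksz) != (ssize_t) blcksz)
            return 1;
        written += blcksz;
        if (written - flushed >= flush_after)
        {
            /* start asynchronous writeback of the range just written */
            sync_file_range(fd, flushed, written - flushed,
                            SYNC_FILE_RANGE_WRITE);
            flushed = written;
        }
    }
    fsync(fd);      /* little dirty data is left for the final fsync */
    close(fd);
    return 0;
}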

> >I also don't really believe it helps that much, although that's a complex
> >argument to make.
> 
> Yep. My thinking is that doing things in the sleeping interval does not
> interfere with the checkpointer scheduling, so it is less likely to go wrong
> and falling behind.

I don't really see why that's the case. Triggering writeback every N
writes doesn't really influence the scheduling in a bad way - the
flushing is done *before* computing the sleep time. Triggering the
writeback *after* computing the sleep time, and then sleep for that
long, in addition of the time for sync_file_range, skews things more.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
Hi,

On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CkptSortItem:
> 
> I think that allocating 20 bytes per buffer in shared memory is a little on
> the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4 bytes
> to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
> are very few tablespaces, they could be given a small number and this number
> could be used instead of the Oid, so the space requirement could be reduced
> to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
> 4 bytes alignment and also getting 8 byte alignment... If this is too
> much, I have shown that it can work with only 4 bytes per buffer, as the
> sorting is really just a performance optimisation and is not broken if some
> stuff changes between sorting & writeback, but you did not like the idea. If
> the amount of shared memory required is a significant concern, it could be
> resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far.  And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.


> ISTM that "progress" and "progress_slice" only depend on num_scanned and
> per-tablespace num_to_scan and total num_to_scan, so they are somehow
> redundant and the progress could be recomputed from the initial figures
> when needed.

They don't cause much space usage, and we access the values
frequently. So why not store them?


> If these fields are kept, I think that a comment should justify why float8
> precision is okay for the purpose. I think it is quite certainly fine in the
> worst case with 32 bits buffer_ids, but it would not be if this size is
> changed someday.

That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be
> collected on the first sweep, so that this second sweep could be avoided
> altogether. The only missing data is the index of the first buffer in the
> array, which can be computed by considering tablespaces only, sweeping over
> buffers is not needed. That would suggest creating the heap or using a hash
> in the initial buffer sweep to keep this information. This would also
> provide a point where to number tablespaces for compressing the CkptSortItem
> struct.

Doesn't seem worth the complexity to me.


> I'm wondering about calling CheckpointWriteDelay on each round, maybe
> a minimum amount of write would make sense.

Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this
> looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...


> >I think this patch primarily needs:
> >* Benchmarking on FreeBSD/OSX to see whether we should enable the
> > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> > inclined to leave it off till then.
> 
> I do not have that. As "msync" seems available on Linux, it is possible to
> force using it with a "ifdef 0" to skip sync_file_range and check whether it
> does some good there.

Unfortunately it doesn't work well on linux:
     * On many OSs msync() on a mmap'ed file triggers writeback. On linux
     * it only does so when MS_SYNC is specified, but then it does the
     * writeback synchronously. Luckily all common linux systems have
     * sync_file_range().  This is preferrable over FADV_DONTNEED because
     * it doesn't flush out clean data.

I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable on freebsd...
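
For reference, such a demo can be as small as the sketch below (file name
and size are arbitrary; whether the MS_ASYNC call actually starts I/O,
rather than being a no-op, is precisely the platform-dependent behaviour
to observe, e.g. with iostat):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    const size_t len = 8192;
    int          fd = open("msync_demo.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
    char        *p;

    if (fd < 0 || ftruncate(fd, len) < 0)
        return 1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 'x', len);                /* dirty the mapped page */
    if (msync(p, len, MS_ASYNC) != 0)   /* request asynchronous writeback */
        perror("msync");

    munmap(p, len);
    close(fd);
    return 0;
}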


> Idem for the "posix_fadvise" stuff. I can try to do
> that, but it takes time to do so, if someone can test on other OS it would
> be much better. I think that if it works it should be kept in, so it is just
> a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadvise
available...

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>>>> [...] I do think that this whole writeback logic really does make
>>>> sense *per table space*,
>>>
>>> Leads to less regular IO, because if your tablespaces are evenly sized
>>> (somewhat common) you'll sometimes end up issuing sync_file_range's
>>> shortly after each other.  For latency outside checkpoints it's
>>> important to control the total amount of dirty buffers, and that's
>>> obviously independent of tablespaces.
>>
>> I do not understand/buy this argument.
>>
>> The underlying IO queue is per device, and table spaces should be per device
>> as well (otherwise what the point?), so you should want to coalesce and
>> "writeback" pages per device as wel. Calling sync_file_range on distinct
>> devices should probably be issued more or less randomly, and should not
>> interfere one with the other.
>
> The kernel's dirty buffer accounting is global, not per block device.

Sure, but this is not my point. My point is that "sync_file_range" moves 
buffers to the device io queues, which are per device. If there is one 
queue in pg and many queues on many devices, the whole point of coalescing 
to get sequential writes is somehow lost.

> It's also actually rather common to have multiple tablespaces on a
> single block device. Especially if SANs and such are involved; where you
> don't even know which partitions are on which disks.

Ok, some people would not benefit if they use many tablespaces on one 
device; too bad, but that does not look like a very useful setting anyway, 
and I do not think it would harm much in this case.

>> If you use just one context, the more table spaces the less performance
>> gains, because there is less and less aggregation thus sequential writes per
>> device.
>>
>> So for me there should really be one context per tablespace. That would
>> suggest a hashtable or some other structure to keep and retrieve them, which
>> would not be that bad, and I think that it is what is needed.
>
> That'd be much easier to do by just keeping the context in the
> per-tablespace struct. But anyway, I'm really doubtful about going for
> that; I had it that way earlier, and observing IO showed it not being
> beneficial.

ISTM that you would need a significant number of tablespaces to see the 
benefit. If you do not do that, the more table spaces the more random the 
IOs, which is disappointing. Also, "the cost is marginal", so I do not see 
any good argument not to do it.
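
As a minimal standalone sketch of the idea (not the actual patch; PostgreSQL 
would more likely hang the context off its existing per-tablespace checkpoint 
state or a proper hash table, and the struct/field names below are made up), 
the lookup could be as simple as:

#include <stdio.h>

typedef unsigned int Oid;

#define MAX_TABLESPACES 64

typedef struct TablespaceWritebackSketch
{
    Oid tblspc;       /* tablespace this context belongs to */
    int nr_pending;   /* pages queued but not yet passed to sync_file_range */
} TablespaceWritebackSketch;

static TablespaceWritebackSketch contexts[MAX_TABLESPACES];
static int ncontexts = 0;

/* return the writeback context for a tablespace, creating it on first use */
static TablespaceWritebackSketch *
writeback_context_for(Oid tblspc)
{
    for (int i = 0; i < ncontexts; i++)
        if (contexts[i].tblspc == tblspc)
            return &contexts[i];
    contexts[ncontexts].tblspc = tblspc;
    contexts[ncontexts].nr_pending = 0;
    return &contexts[ncontexts++];
}

int
main(void)
{
    /* buffers from two tablespaces land in two independent contexts,
     * so each flush only covers files of a single tablespace/device */
    writeback_context_for(16385)->nr_pending++;
    writeback_context_for(16386)->nr_pending++;
    writeback_context_for(16385)->nr_pending++;
    printf("%d contexts in use\n", ncontexts);   /* prints 2 */
    return 0;
}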

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> ISTM that "progress" and "progress_slice" only depend on num_scanned and
>> per-tablespace num_to_scan and total num_to_scan, so they are somehow
>> redundant and the progress could be recomputed from the initial figures
>> when needed.
>
> They don't cause much space usage, and we access the values frequently. 
> So why not store them?

The same question would work the other way around: these values are one 
division away, why not compute them when needed? No big deal.
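
For illustration, "one division away" can look like the sketch below; the 
field names and the exact formula are mine, not necessarily the patch's, the 
point being only that the values are derivable from the raw counters:

#include <stdio.h>

typedef struct TablespaceCountsSketch
{
    int num_to_scan;    /* checkpoint buffers belonging to this tablespace */
    int num_scanned;    /* buffers of this tablespace already written */
} TablespaceCountsSketch;

int
main(void)
{
    TablespaceCountsSketch ts = {1000, 250};
    int     total_to_scan = 4000;   /* buffers in the whole checkpoint */

    /* recomputed on demand instead of being stored and updated */
    double  progress_slice = (double) total_to_scan / ts.num_to_scan;
    double  progress = ts.num_scanned * progress_slice;

    printf("slice = %.2f, progress = %.2f\n", progress_slice, progress);
    return 0;
}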

> [...] Given realistic amounts of memory the max potential "skew" seems 
> fairly small with float8. If we ever flush one buffer "too much" for a 
> tablespace it's pretty much harmless.

I do agree. I'm suggesting that a comment should be added to justify why 
float8 accuracy is okay.

>> I see a binary_heap_allocate but no corresponding deallocation, this
>> looks like a memory leak... or is there some magic involved?
>
> Hm. I think we really should use a memory context for all of this - we
> could after all error out somewhere in the middle...

I'm not sure that a memory context is justified here, there are only two 
mallocs and the checkpointer works for very long times. I think that it is 
simpler to just get the malloc/free right.

> [...] I'm not arguing for ripping it out, what I mean is that we don't 
> set a nondefault value for the GUCs on platforms with just 
> posix_fadivise available...

Ok with that.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hallo Andres,

>> AFAICR I used a "flush context" for each table space in some version
>> I submitted, because I do think that this whole writeback logic really
>> does make sense *per table space*, which suggest that there should be as
>> many write backs contexts as table spaces, otherwise the positive effect
>> may be totally lost if table spaces are used. Any thoughts?
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other.  For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30

(1) with 16 tablespaces (1 per table, but same disk):
    tps = 1100, 27% time under 100 tps

(2) with 1 tablespace:
    tps = 1200,  3% time under 100 tps

This result is logical: with one writeback context shared between 
tablespaces the sync_file_range is issued on a few buffers per file at a 
time across the 16 files, so no coalescing occurs there and this results in 
random IOs, while with one tablespace all writes are aggregated per file.

ISTM that this quick test shows that per-tablespace writeback contexts are 
relevant, as I expected.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
> I did a quick & small test with random updates on 16 tables with 
> checkpoint_flush_after=16 checkpoint_timeout=30

Another run with more "normal" settings and over 1000 seconds, so less 
"quick & small" than the previous one.

  checkpoint_flush_after = 16
  checkpoint_timeout = 5min   # default
  shared_buffers = 2GB        # 1/8 of available memory

Random updates on 16 tables which total to 1.1GB of data, so this is in 
buffer, no significant "read" traffic.

(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
    per second avg, stddev [ min q1 median d3 max ] <=300tps
    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
    per second avg, stddev [ min q1 median d3 max ] <=300tps
    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

-- 
Fabien.

Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
> 
> >I did a quick & small test with random updates on 16 tables with
> >checkpoint_flush_after=16 checkpoint_timeout=30
> 
> Another run with more "normal" settings and over 1000 seconds, so less
> "quick & small" that the previous one.
> 
>  checkpoint_flush_after = 16
>  checkpoint_timeout = 5min # default
>  shared_buffers = 2GB # 1/8 of available memory
> 
> Random updates on 16 tables which total to 1.1GB of data, so this is in
> buffer, no significant "read" traffic.
> 
> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>     679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> 
> (2) with 1 tablespace on 1 disk : 956.0 tps
>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>     956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.

Hmm ... that kernel commit is less than 4 months old.  Would it be
reflected in *any* production kernels yet?
        regards, tom lane



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Interesting. That doesn't reflect my own tests, even on rotating media,
> > at all. I wonder if it's related to:
> >
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> 
> > If you use your 12.04 kernel, that'd not be fixed. Which might be a
> > reason to do it as you suggest.
> 
> Hmm ... that kernel commit is less than 4 months old.  Would it be
> reflected in *any* production kernels yet?

Probably not - so far I thought it mainly has some performance benefits
on relatively extreme workloads; where without the patch, flushing still
is better performancewise than not flushing. But in the scenario Fabien
has brought up it seems quite possible that sync_file_range emitting
"storage cache flush" instructions could explain the rather large
performance difference between his and my experiments.

Regards,

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> Random updates on 16 tables which total to 1.1GB of data, so this is in
>> buffer, no significant "read" traffic.
>>
>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>>     679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>
>> (2) with 1 tablespace on 1 disk : 956.0 tps
>>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>>     956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>
> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.
>
> Could you share the exact details of that workload?

See attached scripts (sh to create the 16 tables in the default or 16 
table spaces, small sql bench script, stat computation script).

The per-second stats were computed with:
  grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300

Host is 8 cpu 16 GB, 2 HDD in RAID 1.

-- 
Fabien.

Re: checkpointer continuous flushing - V16

From
Tomas Vondra
Date:
Hi,

On 02/18/2016 11:31 AM, Andres Freund wrote:
> On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
>> The first two commits of the series are pretty close to being ready. I'd
>> welcome review of those, and I plan to commit them independently of the
>> rest as they're beneficial independently.  The most important bits are
>> the comments and docs of 0002 - they weren't particularly good
>> beforehand, so I had to rewrite a fair bit.
>>
>> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>>        potential regressions of 0002
>> 0002: Fix the overaggressive flushing by the wal writer, by only
>>        flushing every wal_writer_delay ms or wal_writer_flush_after
>>        bytes.
>
> I've pushed these after some more polishing, now working on the next
> two.

I've finally had time to do some benchmarks on those two (already
committed) pieces. I've promised to do more testing while discussing the
patches with Andres some time ago, so here we go.

I do have two machines I use for this kind of benchmarks

1) HP DL380 G5 (old rack server)
- 2x Xeon E5450, 16GB RAM (8 cores)
- 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
- RedHat 6
- shared_buffers = 4GB
- min_wal_size = 2GB
- max_wal_size = 6GB

2) workstation with i5 CPU
- 1x i5-2500k, 8GB RAM
- 6x Intel S3700 100GB (in RAID0 for this benchmark)
- Gentoo
- shared_buffers = 2GB
- min_wal_size = 1GB
- max_wal_size = 8GB

Both machines were using the same kernel version 4.4.2 and default io
scheduler (cfq).

The test procedure was quite simple - pgbench with three different
scales, for each scale three runs, 1h per run (and 30 minutes of warmup
before each run).

Due to the difference in amount of RAM, each machine used different
scales - the goal is to have small, ~50% RAM, >200% RAM sizes:

1) Xeon: 100, 400, 6000
2) i5: 50, 200, 3000

The commits actually tested are

    cfafd8be  (right before the first patch)
    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

For the Xeon, the total tps for each run looks like this:

     scale     commit          1        2         3
----------------------------------------------------
     100     cfafd8be       5136     5132      5144
             7975c5e0       5172     5148      5164
             db76b1ef       5131     5139      5131

     400     cfafd8be       3049     3042      2880
             7975c5e0       3038     3026      3027
             db76b1ef       2946     2940      2933

     6000    cfafd8be        394      389       391
             7975c5e0        391      479       467
             db76b1ef        443      416       481

So I'd say not much difference, except for the largest data set where
the improvement is visible (although it's a bit too noisy and additional
runs would be useful).

On the i5 workstation with SSDs, the results look like this:

    scale    commit          1         2         3
   ------------------------------------------------
     50      cfafd8be     5478      5486      5485
             7975c5e0     5473      5468      5436
             db76b1ef     5484      5453      5452

     200     cfafd8be     5169      5176      5167
             7975c5e0     5144      5151      5148
             db76b1ef     5162      5131      5131

     3000    cfafd8be     2392      2367      2359
             7975c5e0     2301      2340      2347
             db76b1ef     2277      2348      2342

So pretty much no difference, or perhaps maybe a slight slowdown.

One of the goals of this thread (as I understand it) was to make the
overall behavior smoother - eliminate sudden drops in transaction rate
due to bursts of random I/O etc.

One way to look at this is in terms of how much the tps fluctuates, so
let's see some charts. I've collected per-second tps measurements (using
the aggregation built into pgbench) but looking at that directly is
pretty pointless because it's very difficult to compare two noisy lines
jumping up and down.

So instead let's see CDF of the per-second tps measurements. I.e. we
have 3600 tps measurements, and given a tps value the question is what
percentage of the measurements is below this value.

     y = Probability(tps <= x)

We prefer higher values, and the ideal behavior would be that we get
exactly the same tps every second. Thus an ideal CDF line would be a
step line. Of course, that's rarely the case in practice. But comparing
two CDF curves is easy - the line more to the right is better, at least
for tps measurements, where we prefer higher values.
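
For concreteness, the empirical CDF used here can be computed from the
per-second tps samples with something as small as the sketch below (the
sample values are made up; in practice they come from the aggregated
pgbench log):

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double da = *(const double *) a;
    double db = *(const double *) b;

    return (da > db) - (da < db);
}

int
main(void)
{
    double tps[] = {5130, 4980, 2100, 5210, 4875, 3050, 5190, 5005};
    int    n = sizeof(tps) / sizeof(tps[0]);

    qsort(tps, n, sizeof(double), cmp_double);
    for (int i = 0; i < n; i++)
        printf("x = %7.1f tps   P(tps <= x) = %.3f\n",
               tps[i], (double) (i + 1) / n);
    return 0;
}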


1) tps-xeon.png

The original behavior (red lines) is quite consistent. The two patches
generally seem to improve the performance, although sadly it seems that
the variability of the performance actually increased quite a bit, as
the CDFs are much wider (but generally to the right of the old ones).
I'm not sure what exactly causes the volatility.


2) maxlat-xeon.png

Another view at the per-second data, this time using "max latency" from
the pgbench aggregated log. Of course, this time "lower is better" so
we'd like to move the CDF to the left (to get lower max latencies).
Sadly, it mostly changes in the other direction, i.e. the max latency
slightly increases (but the differences are not as significant as for
the tps rate, discussed in the previous paragraph). But apparently the
average latency actually improves (which gives us better tps).

Note: In this chart, x-axis is logarithmic.


3) tps-i5.png

Same chart with CDF of tps, but for the i5 workstation. This actually
shows the consistent slowdown due to the two patches, the tps
consistently shifts to the lower end (~2000tps).


I do have some more data, but those are the most interesting charts. The
rest usually shows about the same thing (or nothing).

Overall, I'm not quite sure the patches actually achieve the intended
goals. On the 10k SAS drives I got better performance, but apparently
much more variable behavior. On SSDs, I get a bit worse results.

Also, I really wonder what will happen with non-default io schedulers. I
believe all the testing so far was done with cfq, so what happens on
machines that use e.g. "deadline" (as many DB machines actually do)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello Tomas,

> One of the goals of this thread (as I understand it) was to make the overall 
> behavior smoother - eliminate sudden drops in transaction rate due to bursts 
> of random I/O etc.
>
> One way to look at this is in terms of how much the tps fluctuates, so let's 
> see some charts. I've collected per-second tps measurements (using the 
> aggregation built into pgbench) but looking at that directly is pretty 
> pointless because it's very difficult to compare two noisy lines jumping up 
> and down.
>
> So instead let's see CDF of the per-second tps measurements. I.e. we have 
> 3600 tps measurements, and given a tps value the question is what percentage 
> of the measurements is below this value.
>
>    y = Probability(tps <= x)
>
> We prefer higher values, and the ideal behavior would be that we get exactly 
> the same tps every second. Thus an ideal CDF line would be a step line. Of 
> course, that's rarely the case in practice. But comparing two CDF curves is 
> easy - the line more to the right is better, at least for tps measurements, 
> where we prefer higher values.

Very nice and interesting graphs!

Alas not easy to interpret for the HDD, as there are better/worse 
variations all along the distribution and the lines cross one another, so 
how it fares overall is unclear.

Maybe a simple indication would be to compute the standard deviation of 
the per-second tps? The median may be interesting as well.

> I do have some more data, but those are the most interesting charts. The rest 
> usually shows about the same thing (or nothing).
>
> Overall, I'm not quite sure the patches actually achieve the intended goals. 
> On the 10k SAS drives I got better performance, but apparently much more 
> variable behavior. On SSDs, I get a bit worse results.

Indeed.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote:
> 1) HP DL380 G5 (old rack server)
> - 2x Xeon E5450, 16GB RAM (8 cores)
> - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
> - RedHat 6
> - shared_buffers = 4GB
> - min_wal_size = 2GB
> - max_wal_size = 6GB
> 
> 2) workstation with i5 CPU
> - 1x i5-2500k, 8GB RAM
> - 6x Intel S3700 100GB (in RAID0 for this benchmark)
> - Gentoo
> - shared_buffers = 2GB
> - min_wal_size = 1GB
> - max_wal_size = 8GB


Thinking about it, with that hardware I'm not surprised if you're only seeing
small benefits. The amount of RAM limits the amount of dirty data, and
you have plenty of on-storage buffering in comparison to that.


> Both machines were using the same kernel version 4.4.2 and default io
> scheduler (cfq). The
> 
> The test procedure was quite simple - pgbench with three different scales,
> for each scale three runs, 1h per run (and 30 minutes of warmup before each
> run).
> 
> Due to the difference in amount of RAM, each machine used different scales -
> the goal is to have small, ~50% RAM, >200% RAM sizes:
> 
> 1) Xeon: 100, 400, 6000
> 2) i5: 50, 200, 3000
> 
> The commits actually tested are
> 
>    cfafd8be  (right before the first patch)
>    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
>    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

Huh, now I'm a bit confused. These are the commits you tested? Those
aren't the ones doing sorting and flushing?


> Also, I really wonder what will happen with non-default io schedulers. I
> believe all the testing so far was done with cfq, so what happens on
> machines that use e.g. "deadline" (as many DB machines actually do)?

deadline and noop showed slightly bigger benefits in my testing.


Greetings,

Andres Freund



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> > 
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> > 
> > The commits actually tested are
> > 
> >    cfafd8be  (right before the first patch)
> >    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
> >    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...
> 
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?

To clarify: The reason we'd not expect to see much difference here is
that the above commits really only have any effect above noise if you
use synchronous_commit=off. Without async commit it's just one
additional gettimeofday() call and a few additional branches in the wal
writer every wal_writer_delay.
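
As a rough standalone illustration of the flushing rule those commits
introduce (flush only once wal_writer_delay has elapsed or
wal_writer_flush_after bytes have accumulated), with made-up values and a
made-up helper name, not the actual walwriter.c code:

#include <stdbool.h>
#include <stdio.h>

static bool
wal_flush_due(long now_ms, long last_flush_ms, long bytes_pending,
              long delay_ms, long flush_after_bytes)
{
    return (now_ms - last_flush_ms >= delay_ms) ||
           (bytes_pending >= flush_after_bytes);
}

int
main(void)
{
    /* 200ms delay and 1MB threshold, purely illustrative */
    printf("%d\n", wal_flush_due(1000, 900, 64 * 1024, 200, 1024 * 1024)); /* 0 */
    printf("%d\n", wal_flush_due(1200, 900, 64 * 1024, 200, 1024 * 1024)); /* 1 */
    return 0;
}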

Andres



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote:
> 
> >>Random updates on 16 tables which total to 1.1GB of data, so this is in
> >>buffer, no significant "read" traffic.
> >>
> >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> >>    per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> >>
> >>(2) with 1 tablespace on 1 disk : 956.0 tps
> >>    per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
> >
> >Interesting. That doesn't reflect my own tests, even on rotating media,
> >at all. I wonder if it's related to:
> >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> >
> >If you use your 12.04 kernel, that'd not be fixed. Which might be a
> >reason to do it as you suggest.
> >
> >Could you share the exact details of that workload?
> 
> See attached scripts (sh to create the 16 tables in the default or 16 table
> spaces, small sql bench script, stat computation script).
> 
> The per-second stats were computed with:
> 
>   grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300
> 
> Host is 8 cpu 16 GB, 2 HDD in RAID 1.

Well, that's not a particularly meaningful workload. You considerably
increased the number of flushes issued to the same number of disks. For a
meaningful comparison you'd have to compare using one writeback context
for N tablespaces on N separate disks/raids, and using N writeback
contexts for the same.

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>>>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>>>>    per second avg, stddev [ min q1 median d3 max ] <=300tps
>>>>    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>>>
>>>> (2) with 1 tablespace on 1 disk : 956.0 tps
>>>>    per second avg, stddev [ min q1 median d3 max ] <=300tps
>>>>    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Well, that's not a particularly meaningful workload. You increased the 
> number of flushed to the same number of disks considerably.

It is just a simple workload designed to emphasize the effect of having 
one context shared for all table spaces instead of one per tablespace, 
without rewriting the patch and without a large host with multiple disks.

> For a meaningful comparison you'd have to compare using one writeback 
> context for N tablespaces on N separate disks/raids, and using N 
> writeback contexts for the same.

Sure, it would be better to do that, but that would require (1) rewriting 
the patch, which is a small amount of work, and also (2) having access to a 
machine with a number of disks/raids, which I do NOT have available.


What happens in the 16-tablespace workload is that much smaller flushes are 
performed on the 16 files written in parallel, so the tps performance is 
significantly degraded, despite the writes being sorted in each file. With 
one tablespace, all buffers flushed are in the same file, so flushes are 
much more effective.

When the context is shared and checkpointer buffer writes are balanced 
across table spaces, then when the limit is reached each flush only gets a 
few buffers per tablespace, which limits sequential writes to a few buffers, 
hence the performance degradation.

So I can explain the performance degradation as being *because* the flush 
context is shared between the table spaces; this is a logical argument 
backed by experimental data, so it is better than handwaving. Given the 
available hardware, this is the best proof I can offer that the context 
should be per table space.

Now I cannot see how having one context per table space would have a 
significant negative performance impact.

So the logical conclusion for me is that without further experimental data 
it is better to have one context per table space.

If you have a hardware with plenty disks available for testing, that would 
provide better data, obviously.

-- 
Fabien.

Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote:
> Now I cannot see how having one context per table space would have a
> significant negative performance impact.

The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases. Thus you're more likely to see stalls by
the kernel moving pages into writeback.

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>> Now I cannot see how having one context per table space would have a
>> significant negative performance impact.
>
> The 'dirty data' etc. limits are global, not per block device. By having
> several contexts with unflushed dirty data the total amount of dirty
> data in the kernel increases.

Possibly, but how much?  Do you have experimental data to back up that 
this is really an issue?

We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB 
of dirty buffers to manage for 16 table spaces, I do not see that as a 
major issue for the kernel.

> Thus you're more likely to see stalls by the kernel moving pages into 
> writeback.

I do not see the above data having a 30% negative impact on tps, given the 
quite small amount of data under discussion, and switching to random IOs 
costs so much that it must really be avoided.

Without further experimental data, I still think that one context per 
table space is the reasonable choice.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>>> Now I cannot see how having one context per table space would have a
>>> significant negative performance impact.
>> 
>> The 'dirty data' etc. limits are global, not per block device. By having
>> several contexts with unflushed dirty data the total amount of dirty
>> data in the kernel increases.
>
> Possibly, but how much?  Do you have experimental data to back up that this 
> is really an issue?
>
> We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of 
> dirty buffers to manage for 16 table spaces, I do not see that as a major 
> issue for the kernel.

More thoughts about your theoretical argument:

To complete the argument, the 4MB is just a worst case scenario; in 
reality flushing the different contexts would be randomized over time, so 
the frequency of flushing a context would be exactly the same in both 
cases (shared or per-tablespace context) if the checkpoints are the same 
size, just that with a shared context each flush potentially targets all 
tablespaces with a few pages each, while with the other version each flush 
targets one table space only.

So my handwaving analysis is that the flow of dirty buffers is the same 
with both approaches, but with the shared version buffers are more equally 
distributed across table spaces, hence reducing sequential write 
effectiveness, while with the other the dirty buffers are grouped more 
clearly per table space, so it should get better sequential write 
performance.


-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote:
> 
> >>>Now I cannot see how having one context per table space would have a
> >>>significant negative performance impact.
> >>
> >>The 'dirty data' etc. limits are global, not per block device. By having
> >>several contexts with unflushed dirty data the total amount of dirty
> >>data in the kernel increases.
> >
> >Possibly, but how much?  Do you have experimental data to back up that
> >this is really an issue?
> >
> >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB
> >of dirty buffers to manage for 16 table spaces, I do not see that as a
> >major issue for the kernel.

We flush in those increments, that doesn't mean there's only that much
dirty data. I regularly see one order of magnitude more being dirty.


I had originally kept it with one context per tablespace after
refactoring this, but found that it gave worse results in rate limited
loads even over only two tablespaces. That's on SSDs though.


> To complete the argument, the 4MB is just a worst case scenario, in reality
> flushing the different context would be randomized over time, so the
> frequency of flushing a context would be exactly the same in both cases
> (shared or per table space context) if the checkpoints are the same size,
> just that with shared table space each flushing potentially targets all
> tablespace with a few pages, while with the other version each flushing
> targets one table space only.

The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.


My current plan is to commit this with the current behaviour (as in this
week[end]), and then do some actual benchmarking on this specific
part. It's imo a relatively minor detail.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> I think there might be a semantic distinction between these two terms.
> Doesn't writeback mean writing pages to disk, and flushing mean making
> sure that they are durably on disk?  So for example when the Linux
> kernel thinks there is too much dirty data, it initiates writeback,
> not a flush; on the other hand, at transaction commit, we initiate a
> flush, not writeback.

I don't think terminology is sufficiently clear to make such a
distinction. Take e.g. our FlushBuffer()...



Re: checkpointer continuous flushing - V18

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>> I think there might be a semantic distinction between these two terms.
>> Doesn't writeback mean writing pages to disk, and flushing mean making
>> sure that they are durably on disk?  So for example when the Linux
>> kernel thinks there is too much dirty data, it initiates writeback,
>> not a flush; on the other hand, at transaction commit, we initiate a
>> flush, not writeback.
>
> I don't think terminology is sufficiently clear to make such a
> distinction. Take e.g. our FlushBuffer()...

Well then we should clarify it!

:-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 17:33:33 -0500, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> >> I think there might be a semantic distinction between these two terms.
> >> Doesn't writeback mean writing pages to disk, and flushing mean making
> >> sure that they are durably on disk?  So for example when the Linux
> >> kernel thinks there is too much dirty data, it initiates writeback,
> >> not a flush; on the other hand, at transaction commit, we initiate a
> >> flush, not writeback.
> >
> > I don't think terminology is sufficiently clear to make such a
> > distinction. Take e.g. our FlushBuffer()...
> 
> Well then we should clarify it!

Trying that as we speak, err, write. How about:

       <para>
        Whenever more than <varname>bgwriter_flush_after</varname> bytes have
        been written by the bgwriter, attempt to force the OS to issue these
        writes to the underlying storage.  Doing so will limit the amount of
        dirty data in the kernel's page cache, reducing the likelihood of
        stalls when an fsync is issued at the end of a checkpoint, or when
        the OS writes data back in larger batches in the background.  Often
        that will result in greatly reduced transaction latency, but there
        also are some cases, especially with workloads that are bigger than
        <xref linkend="guc-shared-buffers">, but smaller than the OS's page
        cache, where performance might degrade.  This setting may have no
        effect on some platforms.  <literal>0</literal> disables controlled
        writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
        otherwise. This parameter can only be set in the
        <filename>postgresql.conf</> file or on the server command line.
       </para>

(plus adjustments for the other gucs)



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
[...]

> I had originally kept it with one context per tablespace after 
> refactoring this, but found that it gave worse results in rate limited 
> loads even over only two tablespaces. That's on SSDs though.

Might just mean that a smaller context size is better on SSD, and it could 
still be better per table space.

> The number of pages still in writeback (i.e. for which sync_file_range
> has been issued, but which haven't finished running yet) at the end of
> the checkpoint matters for the latency hit incurred by the fsync()s from
> smgrsync(); at least by my measurement.

I'm not sure I've seen these performance figures... If you have hard 
evidence, please feel free to share it.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote:
> I'm not sure I've seen these performance... If you have hard evidence,
> please feel free to share it.

Man, are you intentionally trying to be hard to work with?  To quote the
email you responded to:

> My current plan is to commit this with the current behaviour (as in this
> week[end]), and then do some actual benchmarking on this specific
> part. It's imo a relatively minor detail.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>        <para>
>         Whenever more than <varname>bgwriter_flush_after</varname> bytes have
>         been written by the bgwriter, attempt to force the OS to issue these
>         writes to the underlying storage.  Doing so will limit the amount of
>         dirty data in the kernel's page cache, reducing the likelihood of
>         stalls when an fsync is issued at the end of a checkpoint, or when
>         the OS writes data back  in larger batches in the background.  Often
>         that will result in greatly reduced transaction latency, but there
>         also are some cases, especially with workloads that are bigger than
>         <xref linkend="guc-shared-buffers">, but smaller than the OS's page
>         cache, where performance might degrade.  This setting may have no
>         effect on some platforms.  <literal>0</literal> disables controlled
>         writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
>         otherwise. This parameter can only be set in the
>         <filename>postgresql.conf</> file or on the server command line.
>        </para>
>
> (plus adjustments for the other gucs)

Some suggestions:

What about the maximum value?

If the default is in pages, maybe you could state it and afterwards 
translate it in size.

"The default is 64 pages on Linux (usually 256Kb)..."

The text could say something about sequential write performance because 
pages are sorted... but also that it is lost for large bases and/or short 
checkpoints?

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote:
> 
> >       <para>
> >        Whenever more than <varname>bgwriter_flush_after</varname> bytes have
> >        been written by the bgwriter, attempt to force the OS to issue these
> >        writes to the underlying storage.  Doing so will limit the amount of
> >        dirty data in the kernel's page cache, reducing the likelihood of
> >        stalls when an fsync is issued at the end of a checkpoint, or when
> >        the OS writes data back  in larger batches in the background.  Often
> >        that will result in greatly reduced transaction latency, but there
> >        also are some cases, especially with workloads that are bigger than
> >        <xref linkend="guc-shared-buffers">, but smaller than the OS's page
> >        cache, where performance might degrade.  This setting may have no
> >        effect on some platforms.  <literal>0</literal> disables controlled
> >        writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
> >        otherwise. This parameter can only be set in the
> >        <filename>postgresql.conf</> file or on the server command line.
> >       </para>
> >
> >(plus adjustments for the other gucs)

> What about the maximum value?

Added.
     <varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">
<term><varname>bgwriter_flush_after</varname>(<type>int</type>)      <indexterm>
<primary><varname>bgwriter_flush_after</>configuration parameter</primary>      </indexterm>      </term>
<listitem>      <para>        Whenever more than <varname>bgwriter_flush_after</varname> bytes have        been written
bythe bgwriter, attempt to force the OS to issue these        writes to the underlying storage.  Doing so will limit
theamount of        dirty data in the kernel's page cache, reducing the likelihood of        stalls when an fsync is
issuedat the end of a checkpoint, or when        the OS writes data back in larger batches in the background.  Often
   that will result in greatly reduced transaction latency, but there        also are some cases, especially with
workloadsthat are bigger than        <xref linkend="guc-shared-buffers">, but smaller than the OS's page        cache,
whereperformance might degrade.  This setting may have no        effect on some platforms.  The valid range is between
     <literal>0</literal>, which disables controlled writeback, and        <literal>2MB</literal>.  The default is
<literal>256Kb</>on Linux,        <literal>0</> elsewhere.  (Non-default values of        <symbol>BLCKSZ</symbol>
changethe default and maximum.)        This parameter can only be set in the <filename>postgresql.conf</>        file
oron the server command line.       </para>      </listitem>     </varlistentry>    </variablelist>
 


> If the default is in pages, maybe you could state it and afterwards
> translate it in size.

Hm, I think that's more complicated for users than it's worth.


> The text could say something about sequential writes performance because
> pages are sorted.., but that it is lost for large bases and/or short
> checkpoints ?

I think that's an implementation detail.


- Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>> I'm not sure I've seen these performance... If you have hard evidence,
>> please feel free to share it.
>
> Man, are you intentionally trying to be hard to work with?

Sorry, I do not understand this remark.

You were referring to some latency measures in your answer, and I was just 
stating that I was interested in seeing these figures which were used to 
justify your choice to keep a shared writeback context.

I did not intend this wish to be an issue, I was expressing an interest.

> To quote the email you responded to:
>
>> My current plan is to commit this with the current behaviour (as in 
>> this week[end]), and then do some actual benchmarking on this specific 
>> part. It's imo a relatively minor detail.

Good.

From the evidence in the thread, I would have given the per tablespace 
context the preference, but this is just a personal opinion and I agree 
that it can work the other way around.

I look forward to see these benchmarks later on, when you have them.

So all is well, and hopefully will be even better later on.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
[...]

>> If the default is in pages, maybe you could state it and afterwards
>> translate it in size.
>
> Hm, I think that's more complicated for users than it's worth.

As you wish. I liked the number of pages you used initially because it 
really gives a hint of how many random IOs are avoided when they are 
contiguous, and I just do not have the same intuition with sizes. Also it 
is related to the IO queue length managed by the OS.

>> The text could say something about sequential writes performance because
>> pages are sorted.., but that it is lost for large bases and/or short
>> checkpoints ?
>
> I think that's an implementation detail.

As you wish. I thought that understanding the underlying performance model 
with sequential writes written in chunks is important for the admin, and 
as this guc would have an impact on performance it should be hinted about, 
including the limits of its effect where large bases will converge to 
random io performance. But maybe that is not the right place.

-- 
Fabien



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote:
> As you wish. I thought that understanding the underlying performance model
> with sequential writes written in chunks is important for the admin, and as
> this guc would have an impact on performance it should be hinted about,
> including the limits of its effect where large bases will converge to random
> io performance. But maybe that is not the right place.

I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


- Andres



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

I just pushed the two major remaining patches in this thread. Let's see
what the buildfarm has to say; I'd not be surprised if there's some
lingering portability problem in the flushing code.

There's one remaining issue we definitely want to resolve before the
next release:  Right now we always use one writeback context across all
tablespaces in a checkpoint, but Fabien's testing shows that that's
likely to hurt in a number of cases. I've some data suggesting the
contrary in others.

Things that'd be good:
* Some benchmarking. Right now controlled flushing is enabled by default
  on linux, but disabled by default on other operating systems. Somebody
  running benchmarks on e.g. freebsd or OSX might be good.

* If somebody has the energy to provide a windows implementation for
  flush control, that might be worthwhile. There's several places that
  could benefit from that.

* The default values are basically based on benchmarking by me and Fabien.

Regards,

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> As you wish. I thought that understanding the underlying performance model
>> with sequential writes written in chunks is important for the admin, and as
>> this guc would have an impact on performance it should be hinted about,
>> including the limits of its effect where large bases will converge to random
>> io performance. But maybe that is not the right place.
>
> I do agree that that's something interesting to document somewhere. But
> I don't think any of the current places in the documentation are a good
> fit, and it's a topic much more general than the feature we're debating
> here.  I'm not volunteering, but a good discussion of storage and the
> interactions with postgres surely would be a significant improvement to
> the postgres docs.

I can only concur!

The "Performance Tips" chapter (II.14) is more user/query oriented. The 
"Server Administration" bool (III) does not discuss this much.

There is a wiki about performance tuning, but it is not integrated into 
the documentation. It could be a first documentation source.

Also the READMEs in some development directories are very interesting, 
although they contain too many details about the implementation.

There has been a lot of presentations over the years, and blog posts.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> I just pushed the two major remaining patches in this thread.

Hurray! Nine months to get this baby out :-)

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The
> "Server Administration" bool (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard
several complaints about this myself. I think we've been hesitant to
do more in part because the docs must always be categorically correct,
and must not use weasel words. I think it's hard to talk about
performance while maintaining the general tone of the documentation. I
don't know what can be done about that.

-- 
Peter Geoghegan



Re: checkpointer continuous flushing - V18

From
Jeff Janes
Date:
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> I can only concur!
>>
>> The "Performance Tips" chapter (II.14) is more user/query oriented. The
>> "Server Administration" bool (III) does not discuss this much.
>
> That's definitely one area in which the docs are lacking -- I've heard
> several complaints about this myself. I think we've been hesitant to
> do more in part because the docs must always be categorically correct,
> and must not use weasel words. I think it's hard to talk about
> performance while maintaining the general tone of the documentation. I
> don't know what can be done about that.

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.

Cheers,

Jeff



Re: checkpointer continuous flushing - V18

From
Peter Geoghegan
Date:
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Would the wiki be a good place for such tips?  Not as formal as the
> documentation, and more centralized (and editable) than a collection
> of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.

-- 
Peter Geoghegan



Re: checkpointer continuous flushing - V18

From
Jim Nasby
Date:
On 3/13/16 6:30 PM, Peter Geoghegan wrote:
> On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Would the wiki be a good place for such tips?  Not as formal as the
>> documentation, and more centralized (and editable) than a collection
>> of blog posts.
>
> That general direction makes sense, but I'm not sure if the Wiki is
> something that this will work for. I fear that it could become
> something like the TODO list page: a page that contains theoretically
> accurate information, but isn't very helpful. The TODO list needs to
> be heavily pruned, but that seems like something that will never
> happen.
>
> A centralized location for performance tips will probably only work
> well if there are still high standards that are actively enforced.
> There still needs to be tight editorial control.

I think there's ways to significantly restrict who can edit a page, so 
this could probably still be done via the wiki. IMO we should also be 
encouraging users to test various tips and provide feedback, so maybe a 
wiki page with a big fat request at the top asking users to submit any 
feedback about the page to -performance.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/11/2016 02:34 AM, Andres Freund wrote:
> Hi,
>
> I just pushed the two major remaining patches in this thread. Let's see
> what the buildfarm has to say; I'd not be surprised if there's some
> lingering portability problem in the flushing code.
>
> There's one remaining issue we definitely want to resolve before the
> next release:  Right now we always use one writeback context across all
> tablespaces in a checkpoint, but Fabien's testing shows that that's
> likely to hurt in a number of cases. I've some data suggesting the
> contrary in others.
>
> Things that'd be good:
> * Some benchmarking. Right now controlled flushing is enabled by default
>   on linux, but disabled by default on other operating systems. Somebody
>   running benchmarks on e.g. freebsd or OSX might be good.

So I've done some benchmarks of this, and I think the results are very
good. I've compared a298a1e06 and 23a27b039d (so the two patches
mentioned here are in-between those two), and I've done a few long
pgbench runs - 24h each:

1) master (a298a1e06), regular pgbench
2) master (a298a1e06), throttled to 5000 tps
3) patched (23a27b039), regular pgbench
4) patched (23a27b039), throttled to 5000 tps

All of this was done on a quite large machine:

* 4 x CPU E5-4620 (2.2GHz)
* 256GB of RAM
* 24x SSD on LSI 2208 controller (with 1GB BBWC)

The page cache was using the default config, although in production
setups we'd probably lower the limits (particularly the background
threshold):

* vm.dirty_background_ratio = 10
* vm.dirty_ratio = 20

The main PostgreSQL configuration changes are these:

* shared_buffers=64GB
* bgwriter_delay = 10ms
* bgwriter_lru_maxpages = 1000
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* min_wal_size = 32GB

I haven't touched the flush_after values, so those are at default. Full
config in the github repo, along with all the results and scripts used
to generate the charts etc:

     https://github.com/tvondra/flushing-benchmark
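
For reference, with the flush settings at their defaults that should mean
roughly this (a sketch assuming the 9.6-era default values - the exact
config is in the repo):

    checkpoint_flush_after = 256kB    # Linux default, 0 disables flushing
    bgwriter_flush_after = 512kB
    backend_flush_after = 0           # disabled by default
    wal_writer_flush_after = 1MB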

I'd like to see some benchmarks on machines with regular rotational
storage, but I don't have a suitable system at hand.

The pgbench runs used scale 60000, so ~750GB of data on disk, and were
executed either like this (the "default"):

pgbench -c 32 -j 8 -T 86400 -l --aggregate-interval=1 pgbench

or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run
production databases 100% saturated, so it'd be sad to improve the 100%
saturated case and hurt the common case by increasing latency. The
machine does ~8000 tps, so 5000 tps is ~60% of that.

It's difficult to judge based on a single run (although a long one), but
it seems the throughput increased a tiny bit from 7725 to 8000. That's
~4% difference, but I guess more runs would be needed to see if this is
noise or actual improvement.

Now, let's see at the per-second results, i.e. how much the performance
fluctuates over time (due to checkpoints etc.). That's where the
aggregated log (per-second) gets useful, as it's used for generating the
various charts for tps, max latency, stddev of latency etc.

All those charts are CDF, i.e. cumulative distribution function, i.e.
they plot a metric on x-axis, and probability P(X <= x) on y-axis.

In general the steeper the curve the better (more consistent behavior
over time). It also allows comparing two curves - e.g. for tps metric
the "lower" curve is better, as it means higher values are more likely.

default (non-throttled) pgbench runs
------------------------------------

Let's see the regular (non-throttled) pgbench runs first:

* regular-tps.png (per-second TPS)

Clearly, the patched version is much more consistent - firstly it's much
less "wobbly" and it's considerably steeper, which means the per-second
throughput fluctuates much less. That's good.

We already know the total throughput is almost exactly the same (just 4%
difference), this also shows that the medians are almost exactly the
same (the curves intersect at pretty much exactly 50%).

* regular-max-lat.png (per-second maximum latency)
* regular-stddev-lat.png (per-second latency stddev)

Apparently the additional processing slightly increases both the maximum
latency and standard deviation, as the green line (patched) is
consistently below the pink one (unpatched).

Notice however that x-axis is using log scale, so the differences are
actually very small, and we also know that the total throughput slightly
increased. So while those two metrics slightly increased, the overall
impact on latency has to be positive.

throttled pgbench runs
----------------------

* throttled-tps.png (per-second TPS)

OK, this is great - the chart shows that the performance is way more
consistent. Originally ~10% of the samples were at or below ~2000 tps, but
with the flushing you'd have to go up to ~4600 tps to cover the same 10%.
It's actually pretty difficult to determine this from the chart, because the
curve got so steep and I had to check the data used to generate the charts.

Similarly for the upper end, but I assume that's a consequence of the
throttling not having to compensate for the "slow" seconds anymore.

* throttled-max-lat.png (per-second maximum latency)
* throttled-stddev-lat.png (per-second latency stddev)

This time the stddev/max latency charts are actually in favor of the
patched code. It's actually a bit worse for the low latencies (the green
line is below the pink one, so there are fewer low values), but then it
starts winning for higher values. And that's what counts when it comes
to consistency.

Again, notice that the x-axis is log scale, so the differences for large
values are actually way more significant than it might look.


So, good work I guess!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

Thanks for these great measures.

> * 4 x CPU E5-4620 (2.2GHz)

4*8 = 32 cores / 64 threads.

> * 256GB of RAM

Wow!

> * 24x SSD on LSI 2208 controller (with 1GB BBWC)

Wow! RAID configuration? The patch is designed to fix very big issues on 
HDDs, but it is good to see that the impact is positive on SSDs as well.

Is it possible to run tests with distinct tablespaces spread across that 
many disks?

> * shared_buffers=64GB

1/4 of the available memory.

> The pgbench was scale 60000, so ~750GB of data on disk,

3x the available memory, so the data is mostly on disk.

> or like this ("throttled"):
>
> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>
> The reason for the throttling is that people generally don't run production 
> databases 100% saturated, so it'd be sad to improve the 100% saturated case 
> and hurt the common case by increasing latency.

Sure.

> The machine does ~8000 tps, so 5000 tps is ~60% of that.

Ok.

I would have suggested using the --latency-limit option to filter out very 
slow transactions, otherwise if the system is stuck it may catch up later, 
but then this is not representative of "sustainable" performance.

When pgbench is running under a target rate, in both runs the transaction 
distribution is expected to be the same, around 5000 tps, and the green 
run looks pretty ok with respect to that. The magenta one shows that about 
25% of the time, things are not good at all, and the higher figures just 
show the catching up, which is not really interesting if you asked for a 
web page and it is finally delivered one minute later.

> * regular-tps.png (per-second TPS) [...]

Great curves!

> consistent. Originally there was ~10% of samples with ~2000 tps, but with the 
> flushing you'd have to go to ~4600 tps. It's actually pretty difficult to 
> determine this from the chart, because the curve got so steep and I had to 
> check the data used to generate the charts.
>
> Similarly for the upper end, but I assume that's a consequence of the 
> throttling not having to compensate for the "slow" seconds anymore.

Yep, but they should be filtered out ("sorry, too late"), so that would 
count as unresponsiveness, at least for a large class of applications.

Thanks a lot for these interesting tests!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/17/2016 06:36 PM, Fabien COELHO wrote:
>
> Hello Tomas,
>
> Thanks for these great measures.
>
>> * 4 x CPU E5-4620 (2.2GHz)
>
> 4*8 = 32 cores / 64 threads.

Yep. I only used 32 clients though, to keep some of the CPU available 
for the rest of the system (also, HT does not really double the number 
of cores).

>
>> * 256GB of RAM
>
> Wow!
>
>> * 24x SSD on LSI 2208 controller (with 1GB BBWC)
>
> Wow! RAID configuration ? The patch is designed to fix very big issues
> on HDD, but it is good to see that the impact is good on SSD as well.

Yep, RAID-10. I agree that doing the test on a HDD-based system would be 
useful, however (a) I don't have a comparable system at hand at the 
moment, and (b) I was a bit worried that it'll hurt performance on SSDs, 
but thankfully that's not the case.

I will do the test on a much smaller system with HDDs in a few days.

>
> Is it possible to run tests with distinct table spaces on those many disks?

Nope, that'd require reconfiguring the system (and then back), and I 
don't have access to that system (just SSH). Also, I don't quite see 
what that would tell us?

>> * shared_buffers=64GB
>
> 1/4 of the available memory.
>
>> The pgbench was scale 60000, so ~750GB of data on disk,
>
> *3 available memory, mostly on disk.
>
>> or like this ("throttled"):
>>
>> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>>
>> The reason for the throttling is that people generally don't run
>> production databases 100% saturated, so it'd be sad to improve the
>> 100% saturated case and hurt the common case by increasing latency.
>
> Sure.
>
>> The machine does ~8000 tps, so 5000 tps is ~60% of that.
>
> Ok.
>
> I would have suggested using the --latency-limit option to filter out
> very slow queries, otherwise if the system is stuck it may catch up
> later, but then this is not representative of "sustainable" performance.
>
> When pgbench is running under a target rate, in both runs the
> transaction distribution is expected to be the same, around 5000 tps,
> and the green run looks pretty ok with respect to that. The magenta one
> shows that about 25% of the time, things are not good at all, and the
> higher figures just show the catching up, which is not really
> interesting if you asked for a web page and it is finally delivered 1
> minutes later.

Maybe. But that'd only increase the stress on the system, possibly 
causing more issues, no? And the magenta line is the old code, thus it 
would only increase the improvement of the new code.

Notice the max latency is in microseconds (as logged by pgbench), so 
according to the "max latency" charts the latencies are below 10 seconds 
(old) and 1 second (new) about 99% of the time. So I don't think this 
would make any measurable difference in practice.


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Is it possible to run tests with distinct table spaces on those many disks?
>
> Nope, that'd require reconfiguring the system (and then back), and I don't 
> have access to that system (just SSH).

Ok.

> Also, I don't quite see what would that tell us?

Currently the flushing context is shared across tablespaces, but I think 
that it should be per tablespace. My tests did not manage to convince 
Andres, so getting some more figures would be great. That will be for 
another time!

>> I would have suggested using the --latency-limit option to filter out
>> very slow queries, otherwise if the system is stuck it may catch up
>> later, but then this is not representative of "sustainable" performance.
>> 
>> When pgbench is running under a target rate, in both runs the
>> transaction distribution is expected to be the same, around 5000 tps,
>> and the green run looks pretty ok with respect to that. The magenta one
>> shows that about 25% of the time, things are not good at all, and the
>> higher figures just show the catching up, which is not really
>> interesting if you asked for a web page and it is finally delivered 1
>> minutes later.
>
> Maybe. But that'd only increase the stress on the system, possibly causing 
> more issues, no? And the magenta line is the old code, thus it would only 
> increase the improvement of the new code.

Yes and no. I agree that it stresses the system a little more, but the 
fact that you get 5000 tps in the end does not show that you can really 
sustain 5000 tps with reasonable latency. I find this latter information 
more interesting than knowing that you can get 5000 tps on average, 
thanks to some catching up. Moreover, the non-throttled runs already showed 
that the system could do 8000 tps, so the bandwidth is already there.

> Notice the max latency is in microseconds (as logged by pgbench), so 
> according to the "max latency" charts the latencies are below 10 seconds 
> (old) and 1 second (new) about 99% of the time.

AFAICS, the max latency is aggregated by second, but then it does not say 
much about the distribution of individual latencies within the interval, 
that is whether they were all close to the max or not. Having the same 
chart with the median or average might help. Also, the percentiles of the 
stddev chart do not correspond to those of the latency one, so it may be 
that the latency is high but the stddev is low, i.e. all transactions are 
equally bad in the interval, or not.

So I must admit that I'm not at all clear on how to interpret the max 
latency & stddev charts you provided.

> So I don't think this would make any measurable difference in practice.

I think that it may show that 25% of the time the system could not match 
the target tps, even if it can handle much more on average, so the tps 
achieved when discarding late transactions would be under 4000 tps.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/17/2016 10:14 PM, Fabien COELHO wrote:
>
...
>>> I would have suggested using the --latency-limit option to filter out
>>> very slow queries, otherwise if the system is stuck it may catch up
>>> later, but then this is not representative of "sustainable" performance.
>>>
>>> When pgbench is running under a target rate, in both runs the
>>> transaction distribution is expected to be the same, around 5000 tps,
>>> and the green run looks pretty ok with respect to that. The magenta one
>>> shows that about 25% of the time, things are not good at all, and the
>>> higher figures just show the catching up, which is not really
>>> interesting if you asked for a web page and it is finally delivered 1
>>> minutes later.
>>
>> Maybe. But that'd only increase the stress on the system, possibly
>> causing more issues, no? And the magenta line is the old code, thus it
>> would only increase the improvement of the new code.
>
> Yes and no. I agree that it stresses the system a little more, but
> the fact that you have 5000 tps in the end does not show that you can
> really sustain 5000 tps with reasonnable latency. I find this later
> information more interesting than knowing that you can get 5000 tps
> on average, thanks to some catching up. Moreover the non throttled
> runs already shown that the system could do 8000 tps, so the
> bandwidth is already  there.

Sure, but thanks to the tps charts we *do know* that for the vast majority 
of the intervals (each second) the number of completed transactions is 
very close to 5000. And that wouldn't be possible if a large part of the 
latencies were close to the maximums.

With 5000 tps and 32 clients, that means the average latency should be 
less than 6ms, otherwise the clients couldn't make ~160 tps each. But we 
do see that the maximum latency for most intervals is way higher. Only 
~10% of the intervals have max latency below 10ms, for example.
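
(Back-of-the-envelope, assuming the load is spread evenly: 5000 tps / 32 
clients ~ 156 tps per client, i.e. at most 1/156 s ~ 6.4 ms per transaction 
on average.)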

>
>> Notice the max latency is in microseconds (as logged by pgbench),
>> so according to the "max latency" charts the latencies are below
>> 10 seconds (old) and 1 second (new) about 99% of the time.
>
> AFAICS, the max latency is aggregated by second, but then it does
> not say much about the distribution of individuals latencies in the
> interval, that is whether they were all close to the max or not,
> Having the same chart with median or average might help. Also, with
> the stddev chart, the percent do not correspond with the latency one,
> so it may be that the latency is high but the stddev is low, i.e. all
> transactions are equally bad on the interval, or not.
>
> So I must admit that I'm not clear at all how to interpret the max
> latency & stddev charts you provided.

You're right, those charts are not describing distributions of the 
latencies but rather distributions of those aggregated metrics. And it's 
not particularly simple to deduce information about the source statistics, 
for example because all the intervals have the same "weight" although the 
number of transactions that completed in each interval may be different.

But I do think it's a very useful tool when it comes to measuring the 
consistency of behavior over time, assuming you're asking questions 
about the intervals and not the original transactions.

For example, had there been intervals with vastly different transaction 
rates, we'd see that on the tps charts (i.e. the chart would be much 
more gradual or wobbly, just like the "unpatched" one). Or if there were 
intervals with much higher variance of latencies, we'd see that on the 
STDDEV chart.

I'll consider repeating the benchmark and logging some reasonable sample 
of transactions - for the 24h run the unthrottled benchmark did ~670M 
transactions. Assuming ~30B per line, that's ~20GB, so 5% sample should 
be ~1GB of data, which I think is enough.

But of course, that's useful for answering questions about distribution 
of the individual latencies in global, not about consistency over time.

>
>> So I don't think this would make any measurable difference in practice.
>
> I think that it may show that 25% of the time the system could not
> match the target tps, even if it can handle much more on average, so
> the tps achieved when discarding late transactions would be under
> 4000 tps.

You mean the 'throttled-tps' chart? Yes, that one shows that without the 
patches, there's a lot of intervals where the tps was much lower - 
presumably due to a lot of slow transactions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

> But I do think it's a very useful tool when it comes to measuring the 
> consistency of behavior over time, assuming you're asking questions 
> about the intervals and not the original transactions.

For a throttled run, I think it is better to check whether or not the 
system could handle the load "as expected", i.e. with reasonable latency, 
so somehow I'm interested in the "original transactions" as scheduled by 
the client, and whether they were processed efficiently; but then it must 
be aggregated by interval to get some statistics.

> For example, had there been intervals with vastly different transaction 
> rates, we'd see that on the tps charts (i.e. the chart would be much more 
> gradual or wobbly, just like the "unpatched" one). Or if there were intervals 
> with much higher variance of latencies, we'd see that on the STDDEV chart.

On HDDs what happens is that transactions are blocked/frozen, the tps is 
very low and the latency very high, but then with few transactions (even 1 
or 0 at a time) and all latencies very bad yet close to one another, in a 
bad way, the resulting stddev may be quite small anyway.

> I'll consider repeating the benchmark and logging some reasonable sample of 
> transactions

Beware that this measure is skewed, because on HDDs when the system is 
stuck, it is stuck on very few transactions which are waiting, but they
would seldom show up in the statistics as there are very few of them. That 
is why I'm interested in those that could not make it, hence my interest in 
the --latency-limit option, which measures exactly that.

>>> So I don't think this would make any measurable difference in practice.
>> 
>> I think that it may show that 25% of the time the system could not
>> match the target tps, even if it can handle much more on average, so
>> the tps achieved when discarding late transactions would be under
>> 4000 tps.
>
> You mean the 'throttled-tps' chart?

Yes.

> Yes, that one shows that without the patches, there's a lot of intervals 
> where the tps was much lower - presumably due to a lot of slow 
> transactions.

Yep. That is what is measured with the latency limit option, by counting 
the dropped transactions that were not processed in a timely manner.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

I've repeated the tests, but this time logged details for 5% of the
transactions (instead of aggregating the data for each second). I've also
made the tests shorter - just 12 hours instead of 24, to reduce the time
needed to complete the benchmark.

Overall, this means ~300M transactions in total for the un-throttled
case, so a sample of ~15M transactions was available when computing the
following charts.

I've used the same commits as during the previous testing, i.e. a298a1e0
(before patches) and 23a27b03 (with patches).

One interesting difference is that while the "patched" version resulted
in slightly better performance (8122 vs. 8000 tps), the "unpatched"
version got considerably slower (6790 vs. 7725 tps) - that's ~13%
difference, so not negligible. Not sure what's the cause - the
configuration was exactly the same, there's nothing in the log and the
machine was dedicated to the testing. The only explanation I have is
that the unpatched code is a bit more unstable when it comes to this
type of stress testing.

The results (including scripts for generating the charts) are here:

     https://github.com/tvondra/flushing-benchmark-2

Attached are three charts - again, those are using CDF to illustrate the
distributions and compare them easily:

1) regular-latency.png

The two curves intersect at ~4ms, where both CDFs reach ~85%. For the
shorter transactions, the old code is slightly faster (i.e. apparently
there's some per-transaction overhead). For higher latencies though, the
patched code is clearly winning - there are far fewer transactions over
6ms, which makes a huge difference. (Notice the x-axis is actually
log-scale, so the tail on the old code is actually much longer than it
might appear.)

2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized, so
it's more representative of actual production use), the difference is
quite clearly in favor of the new code.

3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how much
later the transactions were scheduled. Again, the new code is winning.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

Thanks again for these interesting benches.

> Overall, this means ~300M transactions in total for the un-throttled case, so 
> sample with ~15M transactions available when computing the following charts.

Still a very sizable run!

> There results (including scripts for generating the charts) are here:
>
>    https://github.com/tvondra/flushing-benchmark-2

This repository seems empty.

> 1) regular-latency.png

I'm wondering whether it would be clearer if the percentiles were relative 
to the largest sample rather than to each sample itself, so that the 
figures from the largest one would still be between 0 and 1, but the other 
(unpatched) one would go between 0 and 0.85, that is, it would be cut short 
proportionally to the actual performance.

> The two curves intersect at ~4ms, where both CDF reach ~85%. For the 
> shorter transactions, the old code is slightly faster (i.e. apparently 
> there's some per-transaction overhead).

I'm not sure how meaningful the crossing is, because the two curves do not 
reflect the same performance. I think that they may not cross at all if 
the normalization uses the same reference, i.e. the better run.

> 2) throttled-latency.png
>
> In the throttled case (i.e. when the system is not 100% utilized, so it's 
> more representative of actual production use), the difference is quite 
> clearly in favor of the new code.

Indeed, it is a no brainer.

> 3) throttled-schedule-lag.png
>
> Mostly just an alternative view on the previous chart, showing how much later 
> the transactions were scheduled. Again, the new code is winning.

No brainer again. I infer from this figure that with the initial version 
60% of transactions have trouble being processed on time, while this is 
maybe about 35% with the new version.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 07:35 AM, Fabien COELHO wrote:
>
> Hello Tomas,
>
> Thanks again for these interesting benches.
>
>> Overall, this means ~300M transactions in total for the un-throttled
>> case, so sample with ~15M transactions available when computing the
>> following charts.
>
> Still a very sizable run!
>
>> There results (including scripts for generating the charts) are here:
>>
>>    https://github.com/tvondra/flushing-benchmark-2
>
> This repository seems empty.

Strange. Apparently I forgot to push, or maybe it did not complete 
before I closed the terminal. Anyway, pushing now (it'll take a bit more 
time to complete).

>
>> 1) regular-latency.png
>
> I'm wondering whether it would be clearer if the percentiles where
> relative to the largest sample, not to itself, so that the figures
> from the largest one would still be between 0 and 1, but the other
> (unpatched) one would go between 0 and 0.85, that is would be cut
> short proportionnaly to the actual performance.
>

I'm not sure what you mean by 'relative to largest sample'?

>> The two curves intersect at ~4ms, where both CDF reach ~85%. For
>> the shorter transactions, the old code is slightly faster (i.e.
>> apparently there's some per-transaction overhead).
>
> I'm not sure how meaningfull is the crossing, because both curves do
> not reflect the same performance. I think that they may not cross at
> all if the normalization is with the same reference, i.e. the better
> run.

Well, I think the curves illustrate exactly the performance difference, 
because with the old code the percentiles after p=0.85 get much higher. 
Which is the point of the crossing, although I agree the exact point 
does not have a particular meaning.

>> 2) throttled-latency.png
>>
>> In the throttled case (i.e. when the system is not 100% utilized,
>> so it's more representative of actual production use), the
>> difference is quite clearly in favor of the new code.
>
> Indeed, it is a no brainer.

Yep.

>
>> 3) throttled-schedule-lag.png
>>
>> Mostly just an alternative view on the previous chart, showing how
>> much later the transactions were scheduled. Again, the new code is
>> winning.
>
> No brainer again. I infer from this figure that with the initial
> version 60% of transactions have trouble being processed on time,
> while this is maybe about 35% with the new version.

Yep.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2016-03-21 18:46:58 +0100, Tomas Vondra wrote:
> I've repeated the tests, but this time logged details for 5% of the
> transaction (instead of aggregating the data for each second). I've also
> made the tests shorter - just 12 hours instead of 24, to reduce the time
> needed to complete the benchmark.
> 
> Overall, this means ~300M transactions in total for the un-throttled case,
> so sample with ~15M transactions available when computing the following
> charts.
> 
> I've used the same commits as during the previous testing, i.e. a298a1e0
> (before patches) and 23a27b03 (with patches).
> 
> One interesting difference is that while the "patched" version resulted in
> slightly better performance (8122 vs. 8000 tps), the "unpatched" version got
> considerably slower (6790 vs. 7725 tps) - that's ~13% difference, so not
> negligible. Not sure what's the cause - the configuration was exactly the
> same, there's nothing in the log and the machine was dedicated to the
> testing. The only explanation I have is that the unpatched code is a bit
> more unstable when it comes to this type of stress testing.
> 
> There results (including scripts for generating the charts) are here:
> 
>     https://github.com/tvondra/flushing-benchmark-2
> 
> Attached are three charts - again, those are using CDF to illustrate the
> distributions and compare them easily:
> 
> 1) regular-latency.png
> 
> The two curves intersect at ~4ms, where both CDF reach ~85%. For the shorter
> transactions, the old code is slightly faster (i.e. apparently there's some
> per-transaction overhead). For higher latencies though, the patched code is
> clearly winning - there are far fewer transactions over 6ms, which makes a
> huge difference. (Notice the x-axis is actually log-scale, so the tail on
> the old code is actually much longer than it might appear.)
> 
> 2) throttled-latency.png
> 
> In the throttled case (i.e. when the system is not 100% utilized, so it's
> more representative of actual production use), the difference is quite
> clearly in favor of the new code.
> 
> 3) throttled-schedule-lag.png
> 
> Mostly just an alternative view on the previous chart, showing how much
> later the transactions were scheduled. Again, the new code is winning.

Thanks for running these tests!

I think this shows that we're in a good shape, and that the commits
succeeded in what they were attempting. Very glad to hear that.


WRT tablespaces: What I'm planning to do, unless somebody has a better
proposal, is to basically rent two big amazon instances, and run pgbench
in parallel over N tablespaces. Once with local SSD and once with local
HDD storage.
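
(Roughly, a sketch of the setup, assuming one mount point per disk and one 
pgbench database per tablespace, e.g. for tablespace ts1:

  psql -c "CREATE TABLESPACE ts1 LOCATION '/mnt/disk1'"
  pgbench -i -s 1000 --tablespace=ts1 --index-tablespace=ts1 pgbench_ts1

and then N such pgbench runs in parallel, one per database/disk.)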

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:

>>> 1) regular-latency.png
>> 
>> I'm wondering whether it would be clearer if the percentiles where
>> relative to the largest sample, not to itself, so that the figures
>> from the largest one would still be between 0 and 1, but the other
>> (unpatched) one would go between 0 and 0.85, that is would be cut
>> short proportionnaly to the actual performance.
>
> I'm not sure what you mean by 'relative to largest sample'?

You took 5% of the tx on two 12-hour runs, totaling say 85M tx on one 
and 100M tx on the other, so you get 4.25M tx from the first and 5M from 
the second.

I'm saying that the percentile should be computed on the largest one (5M), 
so that you get a curve like the following, with both curves having the 
same transaction density on the y axis; the second one then does not go up 
to the top, reflecting that in this case fewer transactions were 
processed.
  A
  +    ____----- # up to 100%
  |   /  ___---- # cut short
  |   | /
  |   | |
  | _/ /
  |/__/
  +------------->
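
In code, the idea is something like this (a sketch, assuming two files 
with one latency per line; the file names are made up):

    # sketch: curves normalized by the size of the *largest* sample,
    # so the smaller (unpatched) run does not reach the top
    def load(path):
        with open(path) as f:
            return sorted(float(line) for line in f if line.strip())

    patched = load("patched-latencies.txt")      # hypothetical file names
    unpatched = load("unpatched-latencies.txt")
    n_max = max(len(patched), len(unpatched))

    for name, lats in (("patched", patched), ("unpatched", unpatched)):
        for i, lat in enumerate(lats, start=1):
            print(name, lat, i / n_max)           # y relative to largest sample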

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 10:44 AM, Fabien COELHO wrote:
>
>
>>>> 1) regular-latency.png
>>>
>>> I'm wondering whether it would be clearer if the percentiles
>>> where relative to the largest sample, not to itself, so that the
>>> figures from the largest one would still be between 0 and 1, but
>>> the other (unpatched) one would go between 0 and 0.85, that is
>>> would be cut short proportionnaly to the actual performance.
>>
>> I'm not sure what you mean by 'relative to largest sample'?
>
> You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
> one and 100M tx on the other, so you get 4.25M tx from the first and
> 5M from the second.

OK

> I'm saying that the percentile should be computed on the largest one
> (5M), so that you get a curve like the following, with both curve
> having the same transaction density on the y axis, so the second one
> does not go up to the top, reflecting that in this case less
> transactions where processed.

Huh, that seems weird. That's not how percentiles or CDFs work, and I 
don't quite understand what that would tell us.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> WRT tablespaces: What I'm planning to do, unless somebody has a better
> proposal, is to basically rent two big amazon instances, and run pgbench
> in parallel over N tablespaces. Once with local SSD and once with local
> HDD storage.

Ok.

Not sure how to ensure that tablespaces are actually on distinct 
dedicated disks with VMs, but this is the idea.

To emphasize potential bad effects without having to build too large a 
host and involve too many tablespaces, I would suggest significantly 
reducing the "checkpoint_flush_after" setting while running these 
tests.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-03-22 10:48:20 +0100, Tomas Vondra wrote:
> Hi,
> 
> On 03/22/2016 10:44 AM, Fabien COELHO wrote:
> >
> >
> >>>>1) regular-latency.png
> >>>
> >>>I'm wondering whether it would be clearer if the percentiles
> >>>where relative to the largest sample, not to itself, so that the
> >>>figures from the largest one would still be between 0 and 1, but
> >>>the other (unpatched) one would go between 0 and 0.85, that is
> >>>would be cut short proportionnaly to the actual performance.
> >>
> >>I'm not sure what you mean by 'relative to largest sample'?
> >
> >You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
> >one and 100M tx on the other, so you get 4.25M tx from the first and
> >5M from the second.
> 
> OK
> 
> >I'm saying that the percentile should be computed on the largest one
> >(5M), so that you get a curve like the following, with both curve
> >having the same transaction density on the y axis, so the second one
> >does not go up to the top, reflecting that in this case less
> >transactions where processed.
> 
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't
> quite understand what would that tell us.

My impression is that we actually know what we need to know anyway?



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-03-22 10:52:55 +0100, Fabien COELHO wrote:
> To emphasize potential bad effects without having to build too large a host
> and involve too many table spaces, I would suggest to reduce significantly
> the "checkpoint_flush_after" setting while running these tests.

Meh, that completely distorts the test.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
>> one and 100M tx on the other, so you get 4.25M tx from the first and
>> 5M from the second.
>
> OK
>
>> I'm saying that the percentile should be computed on the largest one
>> (5M), so that you get a curve like the following, with both curve
>> having the same transaction density on the y axis, so the second one
>> does not go up to the top, reflecting that in this case less
>> transactions where processed.
>
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't 
> quite understand what would that tell us.

It would tell us, for a given transaction number (in the latency-ordered 
list), whether its latency is above or below that of the other run.

I think it would probably show that the latency is always better for the 
patched version, by getting rid of the crossing, which has no meaning and 
wrongly seems to suggest that in some cases the other run is better than 
the first. As the y axes of the two curves are not in the same unit (not 
the same transaction density), the crossing is just an illusion caused by 
a misplaced normalization.

So I'm basically saying that the y axis should be just the transaction 
number, not a percentage.

Anyway, these are just details; your figures show that the patch is a very 
significant win on SSDs, so all is well!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> My impression is that we actually know what we need to know anyway?

Sure, the overall summary is "it is much better with the patch" on this 
large SSD test, which is good news because the patch was really designed 
to help with HDDs.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> To emphasize potential bad effects without having to build too large a host
>> and involve too many table spaces, I would suggest to reduce significantly
>> the "checkpoint_flush_after" setting while running these tests.
>
> Meh, that completely distorts the test.

Yep, I agree.

The point would be to show whether there is a significant impact, or not, 
with less hardware & cost involved in the test.

Now if you can put 16 disks with 16 tablespaces and 16 buffers per 
bucket, that is good, fine with me! I'm just trying to point out that you 
could probably get comparable relative results with 4 disks, 4 
tablespaces and 4 buffers per bucket, so it is an alternative and less 
expensive testing strategy.

This just shows that I usually work on a tight (negligible?) budget :-)

-- 
Fabien.