Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
Recently I've been dismissing a lot of suggested changes to checkpoint 
fsync timing without suggesting an alternative.  I have a simple one in 
mind that captures the biggest problem I see:  that the number of 
backend and checkpoint writes to a file are not connected at all.

We know that a 1GB relation segment can take a really long time to write 
out.  That could include up to 128 changed 8K pages, and we allow all of 
them to get dirty before any are forced to disk with fsync.

Rather than second guess the I/O scheduling, I'd like to take this on 
directly by recognizing that the size of the problem is proportional to 
the number of writes to a segment.  If you turned off fsync absorption 
altogether, you'd be at an extreme that allows only 1 write before 
fsync.  That's low latency for each write, but terrible throughput.  The 
maximum throughput case of 128 writes has the terrible latency we get 
reports about.  But what if that trade-off was just a straight, linear 
slider going from 1 to 128?  Just move it to the latency vs. throughput 
position you want, and see how that works out.

The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb.  Its 
default value is -1 (or 0), which corresponds to ignoring this new behavior.

-Whenever the background writer absorbs a fsync call for a relation 
that's already in the queue, increment the absorption count.

-When max_segment_absorb > 0, have the background writer scan for relations 
where absorption_count > max_segment_absorb.  When it finds one, call 
fsync on that segment.
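
A minimal standalone sketch of the counter scheme above, under stated assumptions: the structure and function names (PendingFsync, absorb_fsync_request, maybe_force_fsyncs) are hypothetical stand-ins, not the actual PostgreSQL fsync-queue code, and the threshold value is picked arbitrarily for the demo:

    /* Illustrative sketch only: hypothetical structures and names, not the
     * actual PostgreSQL fsync-queue code.  Simulates the proposed counter. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PENDING 16

    typedef struct PendingFsync
    {
        char    path[64];         /* which 1GB segment file */
        int     absorption_count; /* writes absorbed since last fsync */
        int     in_use;
    } PendingFsync;

    static PendingFsync queue[MAX_PENDING];
    static int max_segment_absorb = 4;   /* the proposed GUC; <= 0 disables */

    /* Called when an fsync request for a segment is absorbed into the queue. */
    static void
    absorb_fsync_request(const char *path)
    {
        int i;

        for (i = 0; i < MAX_PENDING; i++)
        {
            if (queue[i].in_use && strcmp(queue[i].path, path) == 0)
            {
                queue[i].absorption_count++;   /* already queued: just count it */
                return;
            }
        }
        for (i = 0; i < MAX_PENDING; i++)
        {
            if (!queue[i].in_use)
            {
                queue[i].in_use = 1;
                queue[i].absorption_count = 1;
                snprintf(queue[i].path, sizeof(queue[i].path), "%s", path);
                return;
            }
        }
    }

    /* Background writer pass: fsync any segment that has absorbed too much. */
    static void
    maybe_force_fsyncs(void)
    {
        int i;

        if (max_segment_absorb <= 0)
            return;                       /* behavior disabled, as today */

        for (i = 0; i < MAX_PENDING; i++)
        {
            if (queue[i].in_use && queue[i].absorption_count > max_segment_absorb)
            {
                printf("early fsync of %s after %d absorbed writes\n",
                       queue[i].path, queue[i].absorption_count);
                queue[i].absorption_count = 0;   /* stays queued for checkpoint */
            }
        }
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 10; i++)
            absorb_fsync_request("base/16384/16385.1");
        maybe_force_fsyncs();
        return 0;
    }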

Note that it's possible for this simple scheme to be fooled when writes 
are actually touching a small number of pages.  A process that 
constantly overwrites the same page is the worst case here.  Overwrite 
it 128 times, and this method would assume you've dirtied every page, 
while only 1 will actually go to disk when you call fsync.  It's 
possible to track this better.  The count mechanism could be replaced 
with a bitmap of the 128 blocks, so that absorbs set a bit instead of 
incrementing a count.  My gut feel is that this is more complexity than 
is really necessary here.  If in fact the fsync is smaller than 
expected, paying for it too often isn't the worst problem to have here.
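
For comparison, a sketch of the bitmap variant just described (again with hypothetical names); it counts distinct dirty blocks rather than raw absorbed writes, at a cost of roughly 16KB of bitmap per 1GB segment:

    /* Sketch of the bitmap alternative (hypothetical names).  Tracks which of
     * the 131,072 blocks in a 1GB segment have been written, so repeatedly
     * overwriting one page no longer inflates the count. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCKS_PER_SEGMENT 131072             /* 1GB / 8KB */
    #define BITMAP_BYTES (BLOCKS_PER_SEGMENT / 8) /* 16KB per segment */

    typedef struct SegmentWriteMap
    {
        unsigned char bits[BITMAP_BYTES];
        int           distinct_blocks;   /* what the slider compares against */
    } SegmentWriteMap;

    static void
    note_block_write(SegmentWriteMap *map, int blkno)
    {
        unsigned char mask = (unsigned char) (1 << (blkno % 8));

        if ((map->bits[blkno / 8] & mask) == 0)
        {
            map->bits[blkno / 8] |= mask;
            map->distinct_blocks++;      /* only first write to a block counts */
        }
    }

    int
    main(void)
    {
        SegmentWriteMap map;
        int i;

        memset(&map, 0, sizeof(map));
        for (i = 0; i < 128; i++)
            note_block_write(&map, 0);   /* overwrite the same page 128 times */
        printf("distinct dirty blocks: %d\n", map.distinct_blocks);  /* 1 */
        return 0;
    }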

I'd like to build this myself, but if someone else wants to take a shot 
at it I won't mind.  Just be aware the review is the big part here.  I 
should be honest about one thing: I have zero incentive to actually work 
on this.  The moderate amount of sponsorship money I've raised for 9.4 
so far isn't getting anywhere near this work.  The checkpoint patch 
review I have been doing recently is coming out of my weekend volunteer 
time.

And I can't get too excited about making this as my volunteer effort 
when I consider what the resulting credit will look like.  Coding is by 
far the smallest part of work like this, first behind coming up with the 
design in the first place.  And both of those are way, way behind how 
long review benchmarking takes on something like this.  The way credit 
is distributed for this sort of feature puts coding first, design not 
credited at all, and maybe you'll see some small review credit for 
benchmarks.  That's completely backwards from the actual work ratio.  If 
all I'm getting out of something is credit, I'd at least like it to be 
an appropriate amount of it.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
Peter Geoghegan
Date:
On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> And I can't get too excited about making this as my volunteer effort when I
> consider what the resulting credit will look like.  Coding is by far the
> smallest part of work like this, first behind coming up with the design in
> the first place.  And both of those are way, way behind how long review
> benchmarking takes on something like this.  The way credit is distributed
> for this sort of feature puts coding first, design not credited at all, and
> maybe you'll see some small review credit for benchmarks.  That's completely
> backwards from the actual work ratio.  If all I'm getting out of something
> is credit, I'd at least like it to be an appropriate amount of it.

FWIW, I think that's a reasonable request.


-- 
Peter Geoghegan



Re: Design proposal: fsync absorb linear slider

From:
Robert Haas
Date:
On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Recently I've been dismissing a lot of suggested changes to checkpoint fsync
> timing without suggesting an alternative.  I have a simple one in mind that
> captures the biggest problem I see:  that the number of backend and
> checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write
> out.  That could include up to 128 changed 8K pages, and we allow all of
> them to get dirty before any are forced to disk with fsync.

By my count, it can include up to 131,072 changed 8K pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/23/13 10:56 AM, Robert Haas wrote:
> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> We know that a 1GB relation segment can take a really long time to write
>> out.  That could include up to 128 changed 8K pages, and we allow all of
>> them to get dirty before any are forced to disk with fsync.
>
> By my count, it can include up to 131,072 changed 8K pages.

Even better!  I can pinpoint exactly what time last night I got tired 
enough to start making trivial mistakes.  Everywhere I said 128 it's 
actually 131,072, which just changes the range of the GUC I proposed.

Getting the number right really highlights just how bad the current 
situation is.  Would you expect the database to dump up to 128K writes 
into a file and then have low latency when it's flushed to disk with 
fsync?  Of course not.  But that's the job the checkpointer process is 
trying to do right now.  And it's doing it blind--it has no idea how 
many dirty pages might have accumulated before it started.

I'm not exactly sure how best to use the information collected.  fsync 
every N writes is one approach.  Another is to use accumulated writes to 
predict how long fsync on that relation should take.  Whenever I tried 
to spread fsync calls out before, the scale of the piled up writes from 
backends was the input I really wanted available.  The segment write 
count gives an alternate way to sort the blocks too, you might start 
with the heaviest hit ones.
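
A small illustration of the 'heaviest hit first' ordering, assuming a per-segment write counter like the one proposed upthread; the paths and counts here are made up:

    /* Sketch (hypothetical names): order pending segments so the most heavily
     * written ones are fsync'd first, using per-segment write counts. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct PendingSegment
    {
        const char *path;
        int         write_count;   /* writes absorbed since last fsync */
    } PendingSegment;

    static int
    by_write_count_desc(const void *a, const void *b)
    {
        const PendingSegment *pa = a;
        const PendingSegment *pb = b;

        return pb->write_count - pa->write_count;
    }

    int
    main(void)
    {
        PendingSegment pending[] = {
            {"base/16384/16385.1", 1200},
            {"base/16384/16990.0", 40000},
            {"base/16384/17001.2", 7},
        };
        int n = sizeof(pending) / sizeof(pending[0]);
        int i;

        qsort(pending, n, sizeof(PendingSegment), by_write_count_desc);
        for (i = 0; i < n; i++)
            printf("fsync %s (%d writes)\n",
                   pending[i].path, pending[i].write_count);
        return 0;
    }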

In all these cases, the fundamental issue I keep coming back to is wanting to 
cue off past write statistics.  If you want to predict relative I/O 
delay times with any hope of accuracy, you have to start the checkpoint 
knowing something about the backend and background writer activity since 
the last one.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
Robert Haas
Date:
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>>
>>> We know that a 1GB relation segment can take a really long time to write
>>> out.  That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better!  I can pinpoint exactly what time last night I got tired enough
> to start making trivial mistakes.  Everywhere I said 128 it's actually
> 131,072, which just changes the range of the GUC I proposed.
>
> Getting the number right really highlights just how bad the current
> situation is.  Would you expect the database to dump up to 128K writes into
> a file and then have low latency when it's flushed to disk with fsync?  Of
> course not.  But that's the job the checkpointer process is trying to do
> right now.  And it's doing it blind--it has no idea how many dirty pages
> might have accumulated before it started.
>
> I'm not exactly sure how best to use the information collected.  fsync every
> N writes is one approach.  Another is to use accumulated writes to predict
> how long fsync on that relation should take.  Whenever I tried to spread
> fsync calls out before, the scale of the piled up writes from backends was
> the input I really wanted available.  The segment write count gives an
> alternate way to sort the blocks too, you might start with the heaviest hit
> ones.
>
> In all these cases, the fundamental issue I keep coming back to is wanting to cue
> off past write statistics.  If you want to predict relative I/O delay times
> with any hope of accuracy, you have to start the checkpoint knowing
> something about the backend and background writer activity since the last
> one.

So, I don't think this is a bad idea; in fact, I think it'd be a good
thing to explore.  The hard part is likely to be convincing ourselves
of anything about how well or poorly it works on arbitrary hardware
under arbitrary workloads, but we've got to keep trying things until
we find something that works well, so why not this?

One general observation is that there are two bad things that happen
when we checkpoint.  One is that we force all of the data in RAM out
to disk, and the other is that we start doing lots of FPIs.  Both of
these things harm throughput.  Your proposal allows the user to make
the first of those behaviors more frequent without making the second
one more frequent.  That idea seems promising, and it also seems to
admit of many variations.  For example, instead of issuing an fsync
after N OS writes to a particular file, we could fsync the file 
with the most writes every K seconds.  That way, if the system has
busy and idle periods, we'll effectively "catch up on our fsyncs" when
the system isn't that busy, and we won't bunch them up too much if
there's a sudden surge of activity.
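
A rough sketch of that variation, assuming per-file write counters are already maintained somewhere; the file names, counts, and the K-second interval are illustration values only, and a real version would live in the checkpointer rather than a toy loop:

    /* Rough sketch of the variation above (hypothetical names): every K
     * seconds, fsync only the single file with the most absorbed writes
     * since its last fsync, so quiet periods are used to catch up. */
    #include <stdio.h>
    #include <unistd.h>

    #define NFILES 3

    static const char *paths[NFILES] =
        {"base/16384/16385.1", "base/16384/16990.0", "base/16384/17001.2"};
    static int write_counts[NFILES] = {1200, 40000, 7};
    static int fsync_interval = 5;     /* the "K seconds" knob */

    int
    main(void)
    {
        int round;

        for (round = 0; round < 2; round++)
        {
            int i, hottest = 0;

            sleep(fsync_interval);               /* wait K seconds */
            for (i = 1; i < NFILES; i++)
                if (write_counts[i] > write_counts[hottest])
                    hottest = i;
            if (write_counts[hottest] > 0)
            {
                printf("fsync %s (%d writes)\n",
                       paths[hottest], write_counts[hottest]);
                write_counts[hottest] = 0;       /* counter resets after fsync */
            }
        }
        return 0;
    }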

Now that's just a shot in the dark and there might be reasons why it's
terrible, but I just generally offer it as food for thought that the
triggering event for the extra fsyncs could be chosen via a multitude
of different algorithms, and as you hack through this it might be
worth trying a few different possibilities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Design proposal: fsync absorb linear slider

From:
didier
Date:
Hi


On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative.  I have a simple one in mind that captures the biggest problem I see:  that the number of backend and checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write out.  That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.

It was surely already discussed but why isn't postgresql writing sequentially its cache in a temporary file? With storage random speed at least five to ten times slower it could help a lot.
Thanks

Didier

Re: Design proposal: fsync absorb linear slider

From:
Robert Haas
Date:
On Thu, Jul 25, 2013 at 6:02 PM, didier <did447@gmail.com> wrote:
> It was surely already discussed but why isn't postgresql writing
> sequentially its cache in a temporary file? With storage random speed at
> least five to ten times slower it could help a lot.
> Thanks

Sure, that's what the WAL does.  But you still have to checkpoint eventually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Design proposal: fsync absorb linear slider

From:
didier
Date:
Hi,


> Sure, that's what the WAL does.  But you still have to checkpoint eventually.

Sure, when you run pg_ctl stop.
Unlike the WAL, it only needs two files of shared_buffers size.
 
I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1 in BufferSync(), and simulating a checkpoint with a periodic dd if=/dev/zero of=foo conv=fsync.

On saturated storage with %usage locked solid at 100%, I got up to a 30% speed improvement and fsync latency down by an order of magnitude; some fsyncs were still slow, of course, if buffers were already in the OS cache.

But that's the upper bound; it was done on slow storage with bad ratios: (OS cache write)/(disk sequential write) around 50, (sequential write)/(effective random write) in the 10 range, and a proper implementation would have a 'little' more work to do... (only the checkpoint task can write BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier

Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/25/13 6:02 PM, didier wrote:
> It was surely already discussed but why isn't postgresql writing
> sequentially its cache in a temporary file?

If you do that, reads of the data will have to traverse that temporary 
file to assemble their data.  You'll make every later reader pay the 
random I/O penalty that's being avoided right now.  Checkpoints are 
already postponing these random writes as long as possible. You have to 
take care of them eventually though.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
Hannu Krosing
Date:
On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed but why isn't postgresql writing
>> sequentially its cache in a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data.  You'll make every later reader pay the
> random I/O penalty that's being avoided right now.  Checkpoints are
> already postponing these random writes as long as possible. You have
> to take care of them eventually though.
>
Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
"random" fs pages on one large disk page and having an extra index layer
for resolving random-to-sequential ordering.

I would not dismiss the idea without more tests and discussion.

We could have a system where "checkpoint" does sequential writes of dirty
wal buffers to alternating synced holding files (a "checkpoint log" :) )
and only the background writer does random writes, with no forced sync at all.


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: Design proposal: fsync absorb linear slider

From:
Hannu Krosing
Date:
On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed but why isn't postgresql writing
>> sequentially its cache in a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data.  
In case of crash recovery, a sequential read of this file could be
performed as a first step.

This should work fairly well in most cases, at least when shared_buffers
during recovery is not smaller than the latest run of checkpoint-written
dirty buffers.




-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/26/13 5:59 AM, Hannu Krosing wrote:
> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
> "random"
> fs pages on one large disk page and having an extra index layer for
> resolving
> random-to-sequential ordering.

If your solution to avoiding random writes now is to do sequential ones 
into a buffer, you'll pay for it by having more expensive random reads 
later.  In the SSD write buffer case, that works only because those 
random reads are very cheap.  Do the same thing on a regular drive, and 
you'll be paying a painful penalty *every* time you read in return for 
saving work *once* when you write.  That only makes sense when your 
workload is near write-only.

It's possible to improve on this situation by having some sort of 
background process that goes back and cleans up the random data, 
converting it back into sequentially ordered writes again.  SSD 
controllers also have firmware that does this sort of work, and Postgres 
might do it as part of vacuum cleanup.  But note that such work faces 
exactly the same problems as writing the data out in the first place.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
Tom Lane
Date:
Greg Smith <greg@2ndQuadrant.com> writes:
> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>> "random"
>> fs pages on one large disk page and having an extra index layer for
>> resolving
>> random-to-sequential ordering.

> If your solution to avoiding random writes now is to do sequential ones 
> into a buffer, you'll pay for it by having more expensive random reads 
> later.

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes.  But sooner or
later you have to put the data where it belongs.
        regards, tom lane



Re: Design proposal: fsync absorb linear slider

From:
didier
Date:
Hi,


On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed but why isn't postgresql writing
>> sequentially its cache in a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary file to assemble their data.  You'll make every later reader pay the random I/O penalty that's being avoided right now.  Checkpoints are already postponing these random writes as long as possible. You have to take care of them eventually though.


No, the log file is only used at recovery time.

In the checkpoint code:
- loop over the cache, marking dirty buffers with BM_CHECKPOINT_NEEDED as in the current code
- other workers can't write and evict these marked buffers to disk; there's a race with fsync.
- checkpoint fsyncs now or after the next step.
- checkpoint loops again and saves these buffers to the log, clearing BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers will be written again, as they are when a checkpoint isn't running.
- checkpoint done.

During recovery you have to load the log in cache first before applying WAL.

Didier

Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/26/13 9:14 AM, didier wrote:
> During recovery you have to load the log in cache first before applying WAL.

Checkpoints exist to bound recovery time after a crash.  That is their 
only purpose.  What you're suggesting moves a lot of work into the 
recovery path, which will slow down how long it takes to process.

More work at recovery time means someone who uses the default of 
checkpoint_timeout='5 minutes', expecting that crash recovery won't take 
very long, will discover it does take a longer time now.  They'll be 
forced to shrink the value to get the same recovery time as they do 
currently.  You might need to make checkpoint_timeout 3 minutes instead, 
if crash recovery now has all this extra work to deal with.  And when 
the time between checkpoints drops, it will slow the fundamental 
efficiency of checkpoint processing down.  You will end up writing out 
more data in the end.

The interval between checkpoints and recovery time are all related.  If 
you let any one side of the current requirements slip, it makes the rest 
easier to deal with.  Those are all trade-offs though, not improvements.  And this particular one is already an
option.

If you want less checkpoint I/O per capita and don't care about recovery 
time, you don't need a code change to get it.  Just make 
checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you 
only do a checkpoint per hour, because instead of random writes you're 
getting sequential ones to the WAL.  But when you crash, expect to be 
down for a significant chunk of an hour, as you go back to sort out all 
of the work postponed before.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/26/13 8:32 AM, Tom Lane wrote:
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes.  But sooner or
> later you have to put the data where it belongs.

Hannu was observing that SSD often doesn't do that at all.  They can 
maintain logical -> physical translation tables that decode where each 
block was written to forever.  When read seeks are really inexpensive, 
the only pressure to reorder blocks is wear leveling.

That doesn't really help with regular drives though, where the low seek 
time assumption doesn't play out so well.  The whole idea of writing 
things sequentially and then sorting them out later was the rage in 2001 
for ext3 on Linux, as part of the "data=journal" mount option.  You can 
go back and see that people are confused but excited about the 
performance at 
http://www.ibm.com/developerworks/linux/library/l-fs8/index.html

Spoiler:  if you use a workload that has checkpoint issues, it doesn't 
help PostgreSQL latency.  Just like using a large write cache, you gain 
some burst performance, but eventually you pay for it with extra latency 
somewhere.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Design proposal: fsync absorb linear slider

From:
didier
Date:
Hi,


On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/26/13 9:14 AM, didier wrote:
>> During recovery you have to load the log in cache first before applying WAL.
>
> Checkpoints exist to bound recovery time after a crash.  That is their only purpose.  What you're suggesting moves a lot of work into the recovery path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most the size of your buffer cache; moreover, it takes constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty WAL.

Now recovery is very fast, but you have to repopulate your cache with random reads as requests come in.

With the snapshot it's slower, but you read, sequentially again, a lot of hot cache data you will need later when the db starts serving requests.

Of course the worst case is if it crashes just before a checkpoint: most of the snapshot data is stale and will be overwritten by WAL ops.

But if WAL recovery is CPU bound, loading from the snapshot may be done concurrently with replaying the WAL.

> More work at recovery time means someone who uses the default of checkpoint_timeout='5 minutes', expecting that crash recovery won't take very long, will discover it does take a longer time now.  They'll be forced to shrink the value to get the same recovery time as they do currently.  You might need to make checkpoint_timeout 3 minutes instead, if crash recovery now has all this extra work to deal with.  And when the time between checkpoints drops, it will slow the fundamental efficiency of checkpoint processing down.  You will end up writing out more data in the end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every time; with the log you're paying only once, at recovery.

> The interval between checkpoints and recovery time are all related.  If you let any one side of the current requirements slip, it makes the rest easier to deal with.  Those are all trade-offs though, not improvements.  And this particular one is already an option.

> If you want less checkpoint I/O per capita and don't care about recovery time, you don't need a code change to get it.  Just make checkpoint_timeout huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint per hour, because instead of random writes you're getting sequential ones to the WAL.  But when you crash, expect to be down for a significant chunk of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot, saved and loaded in constant time, unlike the WAL log.

Didier

Re: Design proposal: fsync absorb linear slider

From:
KONDO Mitsumasa
Date:
(2013/07/24 1:13), Greg Smith wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> We know that a 1GB relation segment can take a really long time to write
>>> out.  That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better!  I can pinpoint exactly what time last night I got tired enough to
> start making trivial mistakes.  Everywhere I said 128 it's actually 131,072,
> which just changes the range of the GUC I proposed.
I think that it is almost the same as a small dirty_background_ratio or 
dirty_background_bytes.
This method will give very bad performance, and many fsync() calls may cause the long-fsync
situation which you described in the past.  My colleagues who are kernel experts say that
while fsync() is executing, if other processes write to the same file a lot, the fsync
call occasionally does not return.  So too many fsyncs on a large file are very dangerous.
Moreover, fsync() also writes metadata, which is the worst thing for performance.

The essential improvement is not the dirty page size in fsync() but the scheduling of 
the fsync phase.
I can't understand why postgres does not consider scheduling of the fsync phase.  When
dirty_background_ratio is big, the write phase does not write to disk at all;
therefore, fsync() is too heavy in the fsync phase.


> Getting the number right really highlights just how bad the current situation
> is.  Would you expect the database to dump up to 128K writes into a file and then
> have low latency when it's flushed to disk with fsync?  Of course not.
I think this problem will be improved by sync_file_range() in the fsync phase,
and by adding checkpoint scheduling to the fsync phase: execute sync_file_range()
over small ranges and sleep, and execute fsync() at the end.  I think it is better than your proposal.
If a system does not support the sync_file_range() system call, it only executes fsync
and sleeps, which is the same as our method (which you and I posted in the past).
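
For reference, a minimal Linux-only sketch of the sync_file_range()-plus-sleep idea; the 8MB range size and 100ms nap are arbitrary illustration values, and a real implementation would integrate with the checkpointer's own scheduling:

    /* Sketch of the idea under discussion (Linux-only, illustration values):
     * push a file to disk in small ranges with pauses in between, then finish
     * with one fsync that should have little left to do. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "base/16384/16385.1";
        const off_t chunk = 8 * 1024 * 1024;     /* 8MB ranges, for example */
        struct stat st;
        off_t offset;
        int fd = open(path, O_RDWR);

        if (fd < 0 || fstat(fd, &st) < 0)
        {
            perror(path);
            return 1;
        }
        for (offset = 0; offset < st.st_size; offset += chunk)
        {
            /* start writeback of this range and wait for it to complete */
            sync_file_range(fd, offset, chunk,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
            usleep(100 * 1000);                  /* nap between ranges */
        }
        if (fsync(fd) != 0)                      /* final fsync covers metadata */
            perror("fsync");
        close(fd);
        return 0;
    }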

Taken together, my checkpoint proposal method:

* write phase
  - Almost the same, but considering the fsync phase schedule.
  - Considering the case of background writes in the OS, sort buffers before starting
    the checkpoint write.

* fsync phase
  - Considering the checkpoint schedule and the write-phase schedule.
  - Executing separated sync_file_range() calls and sleeps, with fsync() at the end.

And, if I can, a method that does not write buffers into a file on which fsync() is being
called.  I think it may be quite difficult.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center





Re: Design proposal: fsync absorb linear slider

From:
Jim Nasby
Date:
On 7/26/13 7:32 AM, Tom Lane wrote:
> Greg Smith <greg@2ndQuadrant.com> writes:
>> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>>> "random"
>>> fs pages on one large disk page and having an extra index layer for
>>> resolving
>>> random-to-sequential ordering.
>
>> If your solution to avoiding random writes now is to do sequential ones
>> into a buffer, you'll pay for it by having more expensive random reads
>> later.
>
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes.  But sooner or
> later you have to put the data where it belongs.

FWIW, at RICon East there was someone from Seagate that gave a presentation.  One of his points is that even spinning
rust is moving to the point where the drive itself has to do some kind of write log.  He notes that modern filesystems do
the same thing, and the overlap is probably stupid (I pointed out that the most degenerate case is the logging database
on the logging filesystem on the logging drive...)

It'd be interesting for Postgres to work with drive manufacturers to study ways to get rid of the extra layers of
stupid...
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Design proposal: fsync absorb linear slider

From:
Greg Smith
Date:
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be 
to a single 1GB relation chunk.  The odds are better that multiple 
writes will combine, and that the I/O will involve a lower than average 
amount of random seeking.  Whereas shrinking the size of the write cache 
always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase.
> I can't understand why postgres does not consider scheduling of fsync
> phase.

Because it cannot get the sort of latency improvements I think people 
want.  I proved to myself it's impossible during the last 9.2 CF when I 
submitted several fsync scheduling change submissions.

By the time you get to the fsync sync phase, on a system that's always 
writing heavily there is way too much backlog to possibly cope with by 
then.  There just isn't enough time left before the checkpoint should 
end to write everything out.  You have to force writes to actual disk to 
start happening earlier to keep a predictable schedule.  Basically, the 
longer you go without issuing a fsync, the more uncertainty there is 
around how long it might take to fire.  My proposal lets someone keep 
all I/O from ever reaching the point where the uncertainty is that high.

In the simplest to explain case, imagine that a checkpoint includes a 
1GB relation segment that is completely dirty in shared_buffers.  When a 
checkpoint hits this, it will have 1GB of I/O to push out.

If you have waited this long to fsync the segment, the problem is now 
too big to fix by checkpoint time.  Even if the 1GB of writes are 
themselves nicely ordered and grouped on disk, the concurrent background 
activity is going to chop the combination up into more random I/O than 
the ideal.

Regular consumer disks have a worst case random I/O throughput of less 
than 2MB/s.  My observed progress rates for such systems show you're 
lucky to get 10MB/s of writes out of them.  So how long will the dirty 
1GB in the segment take to write?  1GB @ 10MB/s = 102.4 *seconds*.  And 
that's exactly what I saw whenever I tried to play with checkpoint sync 
scheduling.  No matter what you do there, periodically you'll hit a 
segment that has over a minute of dirty data accumulated, and >60 second 
latency pauses result.  By the point you've reached checkpoint, you're 
dead when you call fsync on that relation.  You *must* hit that segment 
with fsync more often than once per checkpoint to achieve reasonable 
latency.

With this "linear slider" idea, I might tune such that no segment will 
ever get more than 256MB of writes before hitting a fsync instead.  I 
can't guarantee that will work usefully, but the shape of the idea seems 
to match the problem.

> Taken together my checkpoint proposal method,
> * write phase
>    - Almost same, but considering fsync phase schedule.
>    - Considering case of background-write in OS, sort buffer before
> starting checkpoint write.

This cannot work for the reasons I've outlined here.  I guarantee you I 
will easily find a test workload where it performs worse than what's 
happening right now.  If you want to play with this to learn more about 
the trade-offs involved, that's fine, but expect me to vote against 
accepting any change of this form.  I would prefer you to not submit 
them because it will waste a large amount of reviewer time to reach that 
conclusion yet again.  And I'm not going to be that reviewer.

> * fsync phase
>    - Considering checkpoint schedule and write-phase schedule
>    - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine tune how much 
fsync is happening at any time, that would be useful on all the 
platforms that support it.  I haven't tried it just because that looked 
to me like a large job refactoring the entire fsync absorb mechanism, 
and I've never had enough funding to take it on.  That approach has a 
lot of good properties, if it could be made to work without a lot of 
code changes.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com