Re: checkpointer continuous flushing
| From | Fabien COELHO |
|---|---|
| Subject | Re: checkpointer continuous flushing |
| Date | |
| Msg-id | alpine.DEB.2.10.1506200817400.31742@sto |
| In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>) |
| Responses | Re: checkpointer continuous flushing, Re: checkpointer continuous flushing |
| List | pgsql-hackers |
Hello Andres,
>>> - Move fsync as early as possible, suggested by Andres Freund?
>>>
>>> My opinion is that this should be left out for the nonce.
>
> "for the nonce" - what does that mean?
 Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ... {for the nonce}, i.e. for the present time.
> I'm doubtful that it's a good idea to separate this out, if you did.
Actually I did, because, as explained in another mail, the fsync time 
reported in the logs when the other options are activated is essentially 
zero, so moving the fsync earlier would not bring significant improvements 
on these runs. The patch also changes enough things as it is.
So this is an evidence-based decision.
I also agree that it seems interesting in principle and should be 
beneficial in some cases, but I would rather keep it on a TODO list, 
together with trying to do better things in the bgwriter, and focus 
on the current proposal, which already significantly changes the 
checkpointer throttling logic.
>>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>>    Also, sorting is done by chunks of 131072 pages in the current version,
>>    with a guc to change this value.
>
> I think it's a really bad idea to do this in chunks.
The small problem I see is that for a very large setting there could be 
several seconds or even minutes of sorting, which may or may not be 
desirable, so having some control over that seems a good idea.
Another argument is that Tom said he wanted that:-)
In practice the guc can be set to a high value so that the buffers are 
nearly always sorted in one go. Maybe the value "0" could be made special, 
used to trigger this behavior systematically, and be the default.
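
To make the trade-off concrete, here is a minimal sketch of chunked sorting. It is not the patch's code: the function names, the `(file, block)` page representation, and the "0 means sort everything" convention are illustrative assumptions based on the suggestion above.

```python
from typing import List, Tuple

# A dirty page is identified here by (file id, block number).
# This representation is an assumption made for illustration only.
Page = Tuple[int, int]


def checkpoint_write_order(dirty: List[Page], chunk_size: int = 0) -> List[Page]:
    """Return the write order obtained by sorting the pages chunk by chunk.

    chunk_size == 0 stands for the hypothetical "sort everything in one go"
    setting; any other value sorts each run of chunk_size pages separately.
    """
    if chunk_size < 0:
        raise ValueError("chunk_size must be >= 0")
    if chunk_size == 0:
        chunk_size = max(len(dirty), 1)
    order: List[Page] = []
    for start in range(0, len(dirty), chunk_size):
        order.extend(sorted(dirty[start:start + chunk_size]))
    return order


def file_switches(order: List[Page]) -> int:
    """Count how often two consecutive writes target different files."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])


# 12 dirty pages spread over 3 files, in interleaved (unsorted) order.
dirty = [(f, b) for b in range(4) for f in (2, 0, 1)]

full = checkpoint_write_order(dirty, chunk_size=0)   # one global sort
small = checkpoint_write_order(dirty, chunk_size=3)  # four small sorted chunks

print(file_switches(full))   # 2: each file is written in one pass
print(file_switches(small))  # 11: every chunk revisits every file
```

With a chunk at least as large as the number of dirty pages the result is identical to the full sort, which is the sense in which a high setting is "nearly always sorted in one go"; a small chunk revisits each file once per chunk.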
> That'll mean we'll frequently uselessly cause repetitive random IO,
This is not an issue if the chunks are large enough, and anyway the guc 
allows changing the behavior as desired. As I said, keeping some control 
seems a good idea, and "full sorting" can be made the default 
behavior.
> often interleaved. That pattern is horrible for SSDs too. We should 
> always try to do this at once, and only fail back to using less memory 
> if we couldn't allocate everything.
The memory is needed anyway, in order to avoid a duplicated or 
significantly heavier implementation of the throttling loop. It is 
allocated once, on the first checkpoint. The allocation could be moved to 
the checkpointer initialization if this is a concern. The memory needed is 
one int per buffer, which is less than in the 2007 patch.
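
As a rough order of magnitude for "one int per buffer", the arithmetic can be sketched as follows. The 8 kB page size is the PostgreSQL default build setting and the 4-byte int is an assumption about the platform; the helper name is made up for this example.

```python
BLCKSZ = 8192   # default PostgreSQL page size in bytes (assumes a default build)
INT_BYTES = 4   # assumed size of a C int on the target platform
GIB = 1024 ** 3


def sort_array_bytes(shared_buffers_bytes: int,
                     blcksz: int = BLCKSZ,
                     int_bytes: int = INT_BYTES) -> int:
    """Memory for one int per shared buffer (hypothetical helper)."""
    n_buffers = shared_buffers_bytes // blcksz
    return n_buffers * int_bytes


n_buffers = GIB // BLCKSZ            # buffers in shared_buffers=1GB
array_size = sort_array_bytes(GIB)   # bytes needed for the int array

print(n_buffers)           # 131072
print(array_size // 1024)  # 512 (KiB)
```

Under these assumptions, shared_buffers=1GB corresponds to 131072 buffers, which matches the chunk size quoted above, and the array costs about 512 KiB, i.e. roughly 0.05% of shared_buffers.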
>>  . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>
> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.
The point of tiny was to trigger a lot of checkpoints. The size is pretty 
ridiculous anyway, as "tiny" implies. I think I did some tests with other 
versions of the patch and a longer checkpoint_timeout on a pretty small 
database, which showed a smaller benefit from the options, as one would 
expect. I'll try to re-run some.
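
For a sense of scale, a back-of-the-envelope count of timeout-driven checkpoints in the "tiny" run is sketched below. It assumes every checkpoint is triggered by checkpoint_timeout alone (none by WAL volume) and that each one completes within its interval; the function name is made up for this example.

```python
def timed_checkpoints(run_seconds: int, checkpoint_timeout_seconds: int) -> int:
    """Number of whole checkpoint intervals that fit in the run."""
    return run_seconds // checkpoint_timeout_seconds


tiny = timed_checkpoints(6400, 30)        # settings of the "tiny" run
default = timed_checkpoints(6400, 300)    # PostgreSQL default timeout of 5min

print(tiny)     # 213
print(default)  # 21
```

So the 30s setting yields roughly ten times as many checkpoints over the same 6400s as the default timeout would.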
> So you've not run things at more serious concurrency, that'd be
> interesting to see.
I do not have a box available for "serious concurrency".
> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.
> It's also curious that sorting is detrimental for full speed 'tiny'.
Yep.
>> With SSD probably both options would probably have limited benefit.
>
> I doubt that. Small random writes have bad consequences for wear
> leveling. You might not notice that with a short tests - again, I doubt
> it - but it'll definitely become visible over time.
Possibly. Testing such effects does not seem easy, though. At least I have 
not seen "write stalls" on SSDs, and those are my primary concern.
-- 
Fabien.
		