Re: WAL prefetch

From Konstantin Knizhnik
Subject Re: WAL prefetch
Date
Msg-id a7253b8a-f0cf-c173-212f-6747cdc81585@postgrespro.ru
In response to Re: WAL prefetch  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On 17.06.2018 03:00, Andres Freund wrote:
> On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:
>>
>> On 16.06.2018 22:02, Andres Freund wrote:
>>> On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
>>>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>>>> <k.knizhnik@postgrespro.ru> wrote:
>>>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>>>>> references in WAL records
>>>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>>>> Hi Konstantin,
>>>>>>>
>>>>>>> Why stop at the page cache...  what about shared buffers?
>>>>>>>
>>>>>> It is good question. I thought a lot about prefetching directly to shared
>>>>>> buffers.
>>>>> I think that's definitely how this should work.  I'm pretty strongly
>>>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>>>
>>>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>>>> has advantages, but I suppose prefetching into page cache would be much
>>>> easier to implement).
>>> I think there's a number of issues with just issuing prefetch requests
>>> via fadvise etc:
>>>
>>> - it leads to guaranteed double buffering, in a way that's just about
>>>     guaranteed to *never* be useful. Because we'd only prefetch whenever
>>>     there's an upcoming write, there's simply no benefit in the page
>>>     staying in the page cache - we'll write out the whole page back to the
>>>     OS.
>> Sorry, I do not completely understand this.
>> Prefetch is only needed for partial update of a page - in this case we need
>> to first read page from the disk
> Yes.
>
>
>> before being able to perform the update. So before "we'll write out the whole
>> page back to the OS" we have to read this page.
>> And if the page is in the OS cache (prefetched), then it can be done much faster.
> Yes.
>
>
>> Please notice that at the moment of prefetch there is no double
>> buffering.
> Sure, but as soon as it's read there is.
>
>
>> As long as the page was not accessed before, it is not present in shared buffers.
>> And once the page is updated, there is really no need to keep it in shared
>> buffers. We can use cyclic buffers (as in the case of a sequential scan or
>> bulk update) to prevent the redo process from throwing useful pages out of
>> shared buffers. So once again there will be no double buffering.
> That's a terrible idea. There's a *lot* of spatial locality of further
> WAL records arriving for the same blocks.

In some cases that is true, in others it is not. In a typical OLTP system, if
a record is updated, there is a high probability that it will be accessed
again soon. So if in such a system we perform write requests on the master
and read-only queries on replicas, keeping updated pages in shared buffers
on the replica can be very helpful.

But if the replica is used mostly for analytic queries while the master
performs some updates, then it is more useful to keep indexes and the most
frequently accessed pages in the replica's cache, rather than recent
updates from the master.

So it seems reasonable at least to have such a parameter and let the DBA
choose the caching policy on replicas.


>
>
>> I am not so familiar with the current implementation of the full page writes
>> mechanism in Postgres.
>> So maybe the idea explained below is stupid or already implemented (but I
>> failed to find any traces of this).
>> Prefetch is needed only for WAL records performing a partial update. A full
>> page write doesn't require prefetch.
>> A full page write has to be performed when the page is updated for the first
>> time after a checkpoint.
>> But what if we slightly extend this rule and perform a full page write also
>> when the distance from the previous full page write exceeds some delta
>> (which is somehow related to the size of the OS cache)?
>>
>> In this case even if checkpoint interval is larger than OS cache size, we
>> still can expect that updated pages are present in OS cache.
>> And no WAL prefetch is needed at all!
> We could do so, but I suspect the WAL volume penalty would be
> prohibitive in many cases. Worthwhile to try though.

Well, the typical size of a server's memory is now several hundred
gigabytes.
Certainly some of this memory is used for shared buffers, backends' work
memory, ...
But still there are hundreds of gigabytes of free memory which can be
used by the OS for caching.
Let's assume that the full page write threshold is 100 GB. That is one extra
8 kB page image per 100 GB of WAL!
Certainly this estimate is for a single page only, and it is more realistic
to expect that we would have to force full page writes for most of the
updated pages. But I still do not believe that it will cause significant
growth of the log size.
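
To make the estimate above concrete, here is a minimal sketch in C of the
proposed rule and of the resulting WAL overhead. It is not PostgreSQL code:
WalPos, needs_full_page_write(), extra_fpw_fraction() and the idea of
remembering the WAL position of each page's last full page image are
assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position: bytes from the start of WAL. */
typedef uint64_t WalPos;

/* Default PostgreSQL block size (8 kB). */
#define PAGE_SIZE_BYTES 8192ULL

/*
 * Sketch of the extended rule (an assumption, not the existing code path).
 * A full page image is logged when
 *   1. the page has no full page image since the last checkpoint
 *      (the rule that exists today), or
 *   2. more than max_distance bytes of WAL were written since the page's
 *      last full page image (the proposed extension).
 * Assumes current >= last_fpw.
 */
bool
needs_full_page_write(WalPos last_fpw, WalPos checkpoint_redo,
                      WalPos current, uint64_t max_distance)
{
    if (last_fpw < checkpoint_redo)
        return true;
    return (current - last_fpw) > max_distance;
}

/*
 * Rough upper bound on the extra WAL caused by rule 2, as a fraction of
 * max_distance: every one of distinct_pages gets one additional full page
 * image per max_distance bytes of WAL.
 */
double
extra_fpw_fraction(uint64_t distinct_pages, uint64_t max_distance)
{
    return (double) (distinct_pages * PAGE_SIZE_BYTES) / (double) max_distance;
}
```

With a 100 GB threshold, extra_fpw_fraction(1, 100ULL << 30) is about
7.6e-8. Under this simple model, even one million distinct pages forced to
a full page write per window would add roughly 7.6% to the WAL written in
that window.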

Another question is why we would choose such a large checkpoint interval:
more than a hundred gigabytes.
Certainly frequent checkpoints have a negative impact on performance. But
100 GB is not "too frequent" in any case...



