Thread: Use fadvise in wal replay


Use fadvise in wal replay

From
Kirill Reshke
Date:
Hi hackers!

Recently we faced a problem with one of our production clusters. We use a cascade replication setup in this cluster, that is: master, standby (r1), and cascade standby (r2). From time to time, the replication lag on r1 used to grow, while on r2 it did not. Analysis showed that the r1 startup process was spending a lot of time reading WAL from disk. Increasing /sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helps in this case. Maybe we can add an fadvise call in PostgreSQL startup, so it would not be necessary to change settings on the hypervisor?
Attachments

Re: Use fadvise in wal replay

From
Amit Kapila
Date:
On Tue, Jun 21, 2022 at 1:07 PM Kirill Reshke <reshke@double.cloud> wrote:
>
> Recently we faced a problem with one of our production clusters. We use a cascade replication setup in this cluster,
> that is: master, standby (r1), and cascade standby (r2). From time to time, the replication lag on r1 used to grow,
> while on r2 it did not. Analysis showed that the r1 startup process was spending a lot of time reading WAL from disk.
> Increasing /sys/block/md2/queue/read_ahead_kb to 16384 (from 0) helps in this case. Maybe we can add an fadvise call in
> PostgreSQL startup, so it would not be necessary to change settings on the hypervisor?
>

I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
help your case?

[1] - https://www.postgresql.org/docs/devel/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY

--
With Regards,
Amit Kapila.



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> help your case?

AFAICS recovery_prefetch tries to prefetch the main fork, but does not try to prefetch the WAL itself before reading it. Kirill is trying to solve the problem of reading WAL segments that are out of the OS page cache.

Best regards, Andrey Borodin.


RE: Use fadvise in wal replay

From
Jakub Wartak
Date:
>> > On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
>> > help your case?
>>
>> AFAICS recovery_prefetch tries to prefetch main fork, but does not try to
>> prefetch WAL itself before reading it. Kirill is trying to solve the problem of
>> reading WAL segments that are out of OS page cache.

It seems that it is set to 128 (kB) by default; another thing is that having (default) 16MB WAL segments might also hinder the readahead heuristics compared to having a bigger WAL segment size configured.

Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

-J.



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 21 Jun 2022, at 13:20, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>
> Maybe the important question is why the readahead mechanism would be disabled in the first place via /sys | blockdev?

Because the database should know better than the OS which data needs to be prefetched and which does not. Big OS readahead affects index scan performance.

Best regards, Andrey Borodin.


RE: Use fadvise in wal replay

From
Jakub Wartak
Date:
> > Maybe the important question is why would be readahead mechanism be
> disabled in the first place via /sys | blockdev ?
>
> Because database should know better than OS which data needs to be
> prefetched and which should not. Big OS readahead affects index scan
> performance.

OK, fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ, which is not cheap either. The code is already hot and there is an example from the past where syscalls were limiting performance [1]. Maybe it could be prefetching in larger batches (128kB? 1MB? 16MB?)?

-J.

[1] - https://commitfest.postgresql.org/28/2606/





Re: Use fadvise in wal replay

From
Thomas Munro
Date:
On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> > > Maybe the important question is why would be readahead mechanism be
> > disabled in the first place via /sys | blockdev ?
> >
> > Because database should know better than OS which data needs to be
> > prefetched and which should not. Big OS readahead affects index scan
> > performance.
>
> OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ which is not cheap either. The code is
> already hot and there is example from the past where syscalls were limiting the performance [1]. Maybe it could be
> prefetching in larger batches (128kB? 1MB? 16MB?)  ?

I've always thought we'd want to tell it about the *next* segment
file, to smooth the transition from one file to the next, something
like the attached (not tested).

Attachments

Re: Use fadvise in wal replay

From
Bharath Rupireddy
Date:
On Tue, Jun 21, 2022 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> > > > Maybe the important question is why would be readahead mechanism be
> > > disabled in the first place via /sys | blockdev ?
> > >
> > > Because database should know better than OS which data needs to be
> > > prefetched and which should not. Big OS readahead affects index scan
> > > performance.
> >
> > OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ which is not cheap either. The code is
> > already hot and there is example from the past where syscalls were limiting the performance [1]. Maybe it could be
> > prefetching in larger batches (128kB? 1MB? 16MB?)  ?
>
> I've always thought we'd want to tell it about the *next* segment
> file, to smooth the transition from one file to the next, something
> like the attached (not tested).

Yes, it makes sense to prefetch the "future" WAL files that "may be" needed for recovery (crash recovery, archive or PITR recovery, standby recovery), not the current WAL file. Having said that, it's not a great idea (IMO) to make the WAL readers do the prefetching; instead, WAL prefetching can be delegated to a new background worker, or to the existing bgwriter or checkpointer, which gets started during recovery.

Also, it's a good idea to measure the benefits with and without WAL prefetching for all recovery types - crash recovery, archive or PITR recovery, standby recovery. For standby recovery, the WAL files may be in the OS cache if there wasn't a huge apply lag.

Regards,
Bharath Rupireddy.



Re: Use fadvise in wal replay

From
Amit Kapila
Date:
On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > help your case?
>
> AFAICS recovery_prefetch tries to prefetch main fork, but does not try to prefetch WAL itself before reading it.
> Kirill is trying to solve the problem of reading WAL segments that are out of OS page cache.
>

Okay, but normally the WAL written by walreceiver is read by the startup process soon after it's written, as indicated in the code comments (get_sync_bit()). So, what is causing the delay here that makes the startup process perform physical reads?

-- 
With Regards,
Amit Kapila.



Re: Use fadvise in wal replay

From
Bharath Rupireddy
Date:
On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > > On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > help your case?
> >
> > AFAICS recovery_prefetch tries to prefetch main fork, but does not try to prefetch WAL itself before reading it.
> > Kirill is trying to solve the problem of reading WAL segments that are out of OS page cache.
> >
>
> Okay, but normally the WAL written by walreceiver is read by the
> startup process soon after it's written as indicated in code comments
> (get_sync_bit()). So, what is causing the delay here which makes the
> startup process perform physical reads?

That's not always true. If there's a huge apply lag, and/or restartpoints are infrequent/frequent, or there are many reads on the standby - in all of these cases the OS cache can evict the WAL, causing the startup process to hit the disk for WAL reading.

Regards,
Bharath Rupireddy.



Re: Use fadvise in wal replay

From
Amit Kapila
Date:
On Tue, Jun 21, 2022 at 5:41 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Jun 21, 2022 at 4:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 21, 2022 at 3:18 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > >
> > > > On 21 Jun 2022, at 12:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > I wonder if the newly introduced "recovery_prefetch" [1] for PG-15 can
> > > > help your case?
> > >
> > > AFAICS recovery_prefetch tries to prefetch main fork, but does not try to prefetch WAL itself before reading it.
> > > Kirill is trying to solve the problem of reading WAL segments that are out of OS page cache.
> > >
> >
> > Okay, but normally the WAL written by walreceiver is read by the
> > startup process soon after it's written as indicated in code comments
> > (get_sync_bit()). So, what is causing the delay here which makes the
> > startup process perform physical reads?
>
> That's not always true. If there's a huge apply lag and/or
> restartpoint is infrequent/frequent or there are many reads on the
> standby - in all of these cases the OS cache can replace the WAL from
> it causing the startup process to hit the disk for WAL reading.
>

It is possible that due to one or more of these reasons the startup process has to physically read the WAL. I think it is better to find out what is going on for the OP. AFAICS, there is no mention of any other kind of reads on the problematic standby. As per the analysis shared in the initial email, the replication lag is due to disk reads, but there doesn't seem to be a very clear theory as to why the OP is seeing them.

-- 
With Regards,
Amit Kapila.



RE: Use fadvise in wal replay

From
Jakub Wartak
Date:
> On Tue, Jun 21, 2022 at 10:33 PM Jakub Wartak <Jakub.Wartak@tomtom.com>
> wrote:
> > > > Maybe the important question is why would be readahead mechanism
> > > > be
> > > disabled in the first place via /sys | blockdev ?
> > >
> > > Because database should know better than OS which data needs to be
> > > prefetched and which should not. Big OS readahead affects index scan
> > > performance.
> >
> > OK fair point, however the patch here is adding 1 syscall per XLOG_BLCKSZ
> which is not cheap either. The code is already hot and there is example from the
> past where syscalls were limiting the performance [1]. Maybe it could be
> prefetching in larger batches (128kB? 1MB? 16MB?)  ?
> 
> I've always thought we'd want to tell it about the *next* segment file, to
> smooth the transition from one file to the next, something like the attached (not
> tested).

Hey Thomas! 

Apparently that theory is false. Redo-bench [1] results (1st column is total recovery time in seconds); 3.1GB pgdata, out of which 2.6GB is pg_wal (166 files). Redo-bench was slightly hacked to always drop fs caches after copying, so that there is nothing in the fs cache (neither pgdata nor pg_wal; shared fs). maintenance_io_concurrency is at its default (10), recovery_prefetch likewise (try; on by default).

master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master, blockdev --setra 0 /dev/nvme0n1: 
53.151, 0.603
58.329, 0.525
52.435, 0.536

master, with your patch (readaheads disabled) -- double-checked, calls to fadvise64(offset=0 len=0) were there
58.063, 0.593
51.369, 0.574
51.716, 0.59

master, with Kirill's original patch (readaheads disabled)
38.25, 1.134
36.563, 0.582
37.711, 0.584

I've also noted that in both cases POSIX_FADV_SEQUENTIAL is being used instead of WILLNEED (?).
I haven't quantified the tradeoff of master vs Kirill's with readahead, but I think that 1 additional syscall is not going to be cheap just for non-standard OS configurations (?)

-J.

[1] - https://github.com/macdice/redo-bench

Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
Oh, wow, your benchmarks show really impressive improvement.

> I think that 1 additional syscall is not going to be cheap just for non-standard OS configurations
Also we can reduce the number of syscalls with something like

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
    if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
        posix_fadvise(readFile, readOff + XLOG_BLCKSZ, XLOG_BLCKSZ * 8, POSIX_FADV_WILLNEED);
#endif

and maybe define/reuse some GUC to control the number of pages prefetched at once.

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Pavel Borisov
Date:
> On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
Oh, wow, your benchmarks show really impressive improvement.

FWIW I was trying to speed up long sequential file reads in Postgres using fadvise hints. I found no detectable improvements.
Then I wrote a 1Mb - 1Gb sequential read test with both fadvise POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL on Linux. The only improvements I found were:

1. when the size of the read was around several Mb and the fadvise len was also around several Mb.
2. when there was a delay between the fadvise and the first read (which was supposedly used by the OS for reading into the prefetch buffer).
3. if I read sequential blocks, I saw a speedup only on the first ones. The overall read speed of, say, a 1Gb file remained unchanged no matter what.

I became convinced that if I read something long, the OS does the necessary speedups automatically (which is also in agreement with the fadvise manual/code comments).
Could you please elaborate on how you got results with that big a difference? (Though I'm not against fadvise usage; at worst it is expected to be useless.)

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 21 Jun 2022, at 20:52, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> > On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
> Oh, wow, your benchmarks show really impressive improvement.
>
> FWIW I was trying to speedup long sequential file reads in Postgres using fadvise hints. I've found no detectable
> improvements.
> Then I've written 1Mb - 1Gb sequential read test with both fadvise POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in
> Linux.
Did you drop caches?

> The only improvement I've found was
>
> 1. when the size of read was around several Mb and fadvise len also around several Mb.
> 2. when before fadvise and the first read there was a delay (which was supposedly used by OS for reading into
> prefetch buffer)
That's the case of startup process: you read a xlog page, then redo records from this page.
> 3. If I read sequential blocks I saw speedup only on first ones. Overall read speed of say 1Gb file remained
> unchanged no matter what.
>
> I became convinced that if I read something long, OS does necessary speedups automatically (which is also in
> agreement with fadvise manual/code comments).
> Could you please elaborate how you got the results with that big difference? (Though I'm not against fadvise
> usage, at worst it is expected to be useless).

FWIW Kirill and I observed a drastically reduced lag on a production server when running the patched version. Fadvise surely works :) The question is how to use it optimally.

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Pavel Borisov
Date:
On Wed, Jun 22, 2022 at 2:07 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:


> On 21 Jun 2022, at 20:52, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> > On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
> Oh, wow, your benchmarks show really impressive improvement.
>
> FWIW I was trying to speedup long sequential file reads in Postgres using fadvise hints. I've found no detectable improvements.
> Then I've written 1Mb - 1Gb sequential read test with both fadvise POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL in Linux.
Did you drop caches?
Yes. I saw that nothing changed the speed of a long file (50Mb+) read.
> The only improvement I've found was
>
> 1. when the size of read was around several Mb and fadvise len also around several Mb.
> 2. when before fadvise and the first read there was a delay (which was supposedly used by OS for reading into prefetch buffer)
That's the case of startup process: you read a xlog page, then redo records from this page.
Then I'd guess that your speedup comes from speeding up the first several Mb of the many files opened (and the delay for kernel prefetch is due to some other reason). That differs from the case where I tried to measure a speedup, and could be the cause of the speedup in your case.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 22 Jun 2022, at 13:26, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Then I'd guess that your speedup is due to speeding up the first several Mb's in many files opened
I think in this case Thomas' aproach of prefetching next WAL segment would do better. But Jakub observed opposite
results.

Best regards, Andrey Borodin.


RE: Use fadvise in wal replay

From
Jakub Wartak
Date:
>> > On 21 Jun 2022, at 16:59, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
>> Oh, wow, your benchmarks show really impressive improvement.
>>
>> > I think that 1 additional syscall is not going to be cheap just for
>> > non-standard OS configurations
>> Also we can reduce number of syscalls by something like
>>
>> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
>>     if ((readOff % (8 * XLOG_BLCKSZ)) == 0)
>>         posix_fadvise(readFile, readOff + XLOG_BLCKSZ, XLOG_BLCKSZ * 8,
>> POSIX_FADV_WILLNEED); #endif
>>
>> and maybe define\reuse the some GUC to control number of prefetched pages
>> at once.

Hi, I was thinking the same, so I got the patch (attached) to the point where it gets identical performance with and without readahead enabled:

baseline, master, default Linux readahead (128kb):
33.979, 0.478
35.137, 0.504
34.649, 0.518

master+patched, readahead disabled:
34.338, 0.528
34.568, 0.575
34.007, 1.136

master+patched, readahead enabled (as default):
33.935, 0.523
34.109, 0.501
33.408, 0.557

Thoughts?

Notes:
- no GUC, as the default/identical value seems to be the best
- POSIX_FADV_SEQUENTIAL is apparently much slower and doesn't seem to have any effect from xlogreader.c at all, while _WILLNEED does (testing again contradicts "common wisdom"?)

-J.

Attachments

Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 23 Jun 2022, at 13:50, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>
> Thoughts?
The patch leaves the first 128KB chunk unprefetched. Is it worth adding an extra branch for 120KB after the first block, when readOff == 0?
Or maybe do
+        posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);
instead of
+        posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
?

> Notes:
> - no GUC, as the default/identical value seems to be the best
I think adding this performance boost on most systems is definitely worth 1 syscall per 16 pages. And I believe 128KB to be optimal for most storage. And having no GUCs sounds great.

But storage systems might be different, far beyond benchmarks.
All in all, I don't have a strong opinion on having 1 or 0 GUCs to configure this.

I've added patch to the CF.

Thanks!

Best regards, Andrey Borodin.


RE: Use fadvise in wal replay

From
Jakub Wartak
Date:
Hey Andrey,

> > On 23 Jun 2022, at 13:50, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> >
> > Thoughts?
> The patch leaves 1st 128KB chunk unprefetched. Does it worth to add and extra
> branch for 120KB after 1st block when readOff==0?
> Or maybe do
> +        posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK,
> POSIX_FADV_WILLNEED);
> instead of
> +        posix_fadvise(readFile, readOff + RACHUNK    , RACHUNK,
> POSIX_FADV_WILLNEED);
> ?

> > Notes:
> > - no GUC, as the default/identical value seems to be the best
> I think adding this performance boost on most systems definitely worth 1 syscall
> per 16 pages. And I believe 128KB to be optimal for most storages. And having
> no GUCs sounds great.
> 
> But storage systems might be different, far beyond benchmarks.
> All in all, I don't have strong opinion on having 1 or 0 GUCs to configure this.
> 
> I've added patch to the CF.

Cool. As for a GUC, I'm afraid there's going to be resistance to adding yet another GUC (to avoid many knobs). Ideally it would be nice if we had some advanced/deep/hidden parameters, but there is no such thing.
Maybe another option would be to use (N * maintenance_io_concurrency * XLOG_BLCKSZ), so N=1 is 80kB and N=2 160kB (pretty close to the default value, and it can still be tweaked by the end user). Let's wait for what others say.

-J.

Re: Use fadvise in wal replay

From
Justin Pryzby
Date:
On Thu, Jun 23, 2022 at 09:49:31AM +0000, Jakub Wartak wrote:
> it would be nice if we had some advanced/deep/hidden parameters , but there isn't such thing.

There are DEVELOPER_OPTIONS GUCs, although I don't know if this is a good fit for that.

-- 
Justin



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 23 Jun 2022, at 12:50, Jakub Wartak <jakub.wartak@tomtom.com> wrote:
>
> Thoughts?

I've looked into the patch one more time, and I propose to change this line
+        posix_fadvise(readFile, readOff + RACHUNK, RACHUNK, POSIX_FADV_WILLNEED);
to
+        posix_fadvise(readFile, readOff + XLOG_BLCKSZ, RACHUNK, POSIX_FADV_WILLNEED);

Currently the first 128Kb of the file are not prefetched, but I expect this change to produce similar performance results. I propose it only for consistency, so that we prefetch all the data that we have not prefetched yet and are going to read. What do you think?

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> Cool. As for GUC I'm afraid there's going to be resistance of adding yet another GUC (to avoid many knobs). Ideally
> it would be nice if we had some advanced/deep/hidden parameters, but there isn't such thing.
> Maybe another option would be to use (N * maintenance_io_concurrency * XLOG_BLCKSZ), so N=1 that's 80kB and N=2 160kB
> (pretty close to default value, and still can be tweaked by enduser). Let's wait what others say?

I don't think adding more parameters is a problem intrinsically. A
good question to ask, though, is how the user is supposed to know what
value they should configure. If we don't have any idea what value is
likely to be optimal, odds are users won't either.

It's not very clear to me that we have any kind of agreement on what
the basic approach should be here, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 18 Jul 2022, at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>> Cool. As for GUC I'm afraid there's going to be resistance of adding yet another GUC (to avoid many knobs). Ideally
>> it would be nice if we had some advanced/deep/hidden parameters, but there isn't such thing.
>> Maybe another option would be to use (N * maintenance_io_concurrency * XLOG_BLCKSZ), so N=1 that's 80kB and N=2
>> 160kB (pretty close to default value, and still can be tweaked by enduser). Let's wait what others say?
>
> I don't think adding more parameters is a problem intrinsically. A
> good question to ask, though, is how the user is supposed to know what
> value they should configure. If we don't have any idea what value is
> likely to be optimal, odds are users won't either.
We know that 128KB is optimal on some representative configuration and that changing the value won't really affect performance much. 128KB is marginally better than 8KB and removes some theoretical concerns about extra syscalls.

> It's not very clear to me that we have any kind of agreement on what
> the basic approach should be here, though.

Actually, the only question is the offset from the current read position: should it be 1 block or a full readahead chunk? Again, this does not change anything; it's just a matter of choice.


Thanks!

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Bharath Rupireddy
Date:
On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 18 Jul 2022, at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:

I have a fundamental question on the overall idea - how beneficial will it be if the process that's reading the current WAL page is the only one that does (or at least attempts) the prefetching of future WAL pages? Won't the benefit be higher if "some" other background process does the prefetching?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:
Hi Bharath,

thank you for the suggestion.

> On 5 Aug 2022, at 16:02, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>
>>> On 18 Jul 2022, at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>
> I have a fundamental question on the overall idea - How beneficial it
> will be if the process that's reading the current WAL page only does
> (at least attempts) the prefetching of future WAL pages? Won't the
> benefit be higher if "some" other background process does prefetching?

IMO prefetching from another process would have a negative effect.
The fadvise() call is non-blocking; the startup process won't do I/O. It just tells the kernel to schedule an asynchronous page read.
On the other hand, synchronization with another process might cost more than fadvise().

Anyway, the cost of calling fadvise() once per 16 page reads is negligible.

Thank you!

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Bharath Rupireddy
Date:
On Sat, Aug 6, 2022 at 10:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Bharath,
>
> thank you for the suggestion.
>
> > On 5 Aug 2022, at 16:02, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Thu, Aug 4, 2022 at 9:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >>
> >>> On 18 Jul 2022, at 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
> >>>
> >>> On Thu, Jun 23, 2022 at 5:49 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
> >
> > I have a fundamental question on the overall idea - How beneficial it
> > will be if the process that's reading the current WAL page only does
> > (at least attempts) the prefetching of future WAL pages? Won't the
> > benefit be higher if "some" other background process does prefetching?
>
> IMO prefetching from other thread would have negative effect.
> fadvise() call is non-blocking, startup process won't do IO. It just informs kernel to schedule asynchronous page
> read.
> On the other hand synchronization with other process might cost more than fadvise().

Hm, POSIX_FADV_WILLNEED flag makes fadvise() non-blocking.

> Anyway cost of calling fadise() once per 16 page reads is neglectable.

Agree. Why can't we just prefetch the entire WAL file once, whenever it is opened for the first time? Does the OS have any limitations on the max size to prefetch at once? It may sound aggressive, but it avoids repeated fadvise() system calls; this will be especially useful if there are many WAL files to recover (crash, PITR or standby recovery) - eventually we would want the whole WAL file to be prefetched.

If prefetching the entire WAL file is okay, we could further do this: 1) prefetch in XLogFileOpen() and all of the segment_open callbacks, 2) release in XLogFileClose (it's being done right now) and all of the segment_close callbacks - do this perhaps optionally.

Also, can't we use an existing function FilePrefetch()? That way,
there is no need for a new wait event type.

Thoughts?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



Re: Use fadvise in wal replay

From
Andrey Borodin
Date:

> On 7 Aug 2022, at 06:39, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Agree. Why can't we just prefetch the entire WAL file once whenever it
> is opened for the first time? Does the OS have any limitations on max
> size to prefetch at once? It may sound aggressive, but it avoids
> fadvise() system calls, this will be especially useful if there are
> many WAL files to recover (crash, PITR or standby recovery),
> eventually we would want the total WAL file to be prefetched.
>
> If prefetching the entire WAL file is okay, we could further do this:
> 1) prefetch in XLogFileOpen() and all of segment_open callbacks, 2)
> release in XLogFileClose (it's being dong right now) and all of
> segment_close callbacks - do this perhaps optionally.
>
> Also, can't we use an existing function FilePrefetch()? That way,
> there is no need for a new wait event type.
>
> Thoughts?

Thomas expressed this idea upthread. Benchmarks done by Jakub showed that this approach had no significant improvement over the existing master code.
The same benchmarks showed almost a 1.5x improvement from readahead in 8Kb or 128Kb chunks.

Best regards, Andrey Borodin.


Re: Use fadvise in wal replay

From
Andrey Borodin
Date:
On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>

Hi everyone. The patch is 16 lines, looks harmless, and has proven benefits. I'm moving this to RfC.

Thanks!

Best regards, Andrey Borodin.



Re: Use fadvise in wal replay

From
Pavel Borisov
Date:
Hi, hackers!

On Sun, 13 Nov 2022 at 02:02, Andrey Borodin <amborodin86@gmail.com> wrote:
>
> On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
>
> Hi everyone. The patch is 16 lines, looks harmless and with proven
> benefits. I'm moving this into RfC.

As I've written in the thread, we cannot gain much from this optimization. The results of Jakub show around a 2% difference:

>baseline, master, default Linux readahead (128kb):
>33.979, 0.478
>35.137, 0.504
>34.649, 0.518

>master+patched, readahead disabled:
>34.338, 0.528
>34.568, 0.575
>34.007, 1.136

>master+patched, readahead enabled (as default):
>33.935, 0.523
>34.109, 0.501
>33.408, 0.557

On the other hand, the patch indeed is tiny and I don't think the
proposed advise can ever make things bad. So, I've looked through the
patch again and I agree it can be committed in the current state.

Kind regards,
Pavel Borisov,
Supabase



Re: Use fadvise in wal replay

From
Pavel Borisov
Date:
On Sat, 26 Nov 2022 at 01:10, Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Hi, hackers!
>
> On Sun, 13 Nov 2022 at 02:02, Andrey Borodin <amborodin86@gmail.com> wrote:
> >
> > On Sun, Aug 7, 2022 at 9:41 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > >
> >
> > Hi everyone. The patch is 16 lines, looks harmless and with proven
> > benefits. I'm moving this into RfC.
>
> As I've written up in the thread we can not gain much from this
> optimization. The results of Jakub shows around 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>
> On the other hand, the patch indeed is tiny and I don't think the
> proposed advise can ever make things bad. So, I've looked through the
> patch again and I agree it can be committed in the current state.

My mailer corrected my previous message. The right one is:
"On the other hand, the patch indeed is tiny and I don't think the
proposed _fadvise_ can ever make things bad".

Pavel.



Re: Use fadvise in wal replay

От
Andrey Borodin
Дата:
On Fri, Nov 25, 2022 at 1:12 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> As I've written up in the thread we can not gain much from this
> optimization. The results of Jakub shows around 2% difference:
>
> >baseline, master, default Linux readahead (128kb):
> >33.979, 0.478
> >35.137, 0.504
> >34.649, 0.518
>
> >master+patched, readahead disabled:
> >34.338, 0.528
> >34.568, 0.575
> >34.007, 1.136
>
> >master+patched, readahead enabled (as default):
> >33.935, 0.523
> >34.109, 0.501
> >33.408, 0.557
>

The performance benefit shows up only when readahead is disabled. And
on many workloads readahead brings unneeded data into the page cache,
so disabling it is the preferred configuration there.
In this particular case, time to apply WAL decreases from 53s to 33s.

Thanks!

Best Regards, Andrey Borodin.



Re: Use fadvise in wal replay

От
Tomas Vondra
Дата:
Hi,

I looked at this patch today. The change is fairly simple, so I decided
to do a benchmark. To prepare, I created a cluster with a 1GB database,
created a backup, and ran 1h UPDATE workload with WAL archiving. Then,
the actual benchmark does this:

1. restore the datadir backup
2. copy the WAL from archive
3. drop caches
4. start the cluster, measure time until end of recovery

I did this with master/patched, and with/without Linux readahead. And I
did this on two different machines - both have SSD storage, one (i5) has
a RAID of SATA SSD devices, the other one (xeon) has a single NVMe SSD.

The results (in seconds) look like this (the amount of WAL is different
on each machine, so the timings are not comparable).

     host     ra      master      patched
    --------------------------------------
       i5      0         615          513
             256         392          396
    --------------------------------------
     xeon      0        2113         1436
             256        1487         1460

On the i5 (the smaller machine with a RAID of 6 x SATA SSDs), with
read-ahead enabled it takes ~390 seconds, and the patch makes no
difference. Without read-ahead it takes ~615 seconds, and the patch
does help a bit, but it's hardly competitive with read-ahead.

Note: 256 is read-ahead per physical device, the read-ahead value for
the whole RAID is 6x that, i.e. 1536. I was speculating that maybe the
hard-coded 128kB RACHUNK is not sufficient, so I tried with 512kB. But
that actually made it worse, and the timing deteriorated to ~640s (that
is, slower than master without read-ahead).

On the xeon (with NVMe SSD), it's different - the patch seems about as
effective as regular read-ahead. So that's good.


So I'm a bit unsure about this patch. It doesn't seem like it can
perform better than read-ahead (although perhaps it does, on a
different storage system).

With disabled read-ahead it helps (at least a bit), although I'm not
really convinced there are good reasons to run without read-ahead. The
reason for doing that was described like this:

> Because database should know better than OS which data needs to be
> prefetched and which should not. Big OS readahead affects index scan
> performance.

I don't recall seeing such an issue, and I can't find anything like
that in our mailing list archives either. Sure, that doesn't mean it
can't happen, and read-ahead is a heuristic, so it can do weird stuff.
But in my experience it tends to work fairly well. The issues I've
seen are generally in the opposite direction, i.e. read-ahead not
kicking in.


Anyway, it's not my intent to prevent this patch getting committed, if
someone wishes to do that. But I'm not quite convinced it actually helps
with a practical issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Use fadvise in wal replay

От
Andres Freund
Дата:
Hi,

On 2023-01-19 22:19:10 +0100, Tomas Vondra wrote:
> So I'm a bit unsure about this patch. I doesn't seem like it can perform
> better than read-ahead (although perhaps it does, on a different storage
> system).

I really don't see the point of the patch as-is. It's not going to help OSs
without useful readahead, because those don't have posix_fadvise either. And a
hardcoded readahead distance isn't useful - on e.g. cloud storage, 128kB won't
be enough to overcome network latency.

I could see a *tad* more point in using posix_fadvise(fd, 0, 0,
POSIX_FADV_SEQUENTIAL) when opening WAL files, as that'll increase the kernel
readahead window over what's configured.
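
For reference, that variant would be a one-liner at WAL segment open time -
something like the sketch below (hypothetical function name, not actual
xlog.c code). On Linux, POSIX_FADV_SEQUENTIAL doubles the kernel's
readahead window for that file descriptor:

```c
/* Sketch (not PostgreSQL source): hint sequential access when opening a
 * WAL segment, letting the kernel scale its readahead window on its own
 * instead of hardcoding a prefetch distance in the server. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

int
open_walsegment(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

#if defined(POSIX_FADV_SEQUENTIAL)
    /* Advisory only: a failure here is harmless, so the result is ignored. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
    return fd;
}
```

Unlike the WILLNEED-per-chunk approach, this delegates the prefetch
distance decision to the kernel, which already adapts it to the device.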


> With disabled read-ahead it helps (at least a bit), although I'm not
> really convinced there are good reasons to run without read-ahead. The
> reason for doing that was described like this:

Agreed. Postgres currently totally crashes and burns without OS
readahead. Buffered IO without readahead makes no sense. So I just don't see
the point of this patch.


> > Because database should know better than OS which data needs to be
> > prefetched and which should not. Big OS readahead affects index scan
> > performance.

I don't disagree fundamentally. But that doesn't make this patch a useful
starting point.

Greetings,

Andres Freund



Re: Use fadvise in wal replay

От
"Gregory Stark (as CFM)"
Дата:
On Thu, 19 Jan 2023 at 18:19, Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-01-19 22:19:10 +0100, Tomas Vondra wrote:
>
> > So I'm a bit unsure about this patch. I doesn't seem like it can perform
> > better than read-ahead (although perhaps it does, on a different storage
> > system).
>
> I really don't see the point of the patch as-is.
...
> I don't disagree fundamentally. But that doesn't make this patch a useful
> starting point.

It sounds like this patch has gotten off on the wrong foot and is not
worth moving forward to the next commitfest. Hopefully starting over
with a different approach might target I/O that is more amenable to
fadvise. I'll mark it RwF.

-- 
Gregory Stark
As Commitfest Manager