Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Masahiko Sawada
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date:
Msg-id: CAD21AoDqYH6mqsZf5x3v4rLU8jA0FkKUskKnpBgKDy-enYftNw@mail.gmail.com
In reply to: Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses: Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > > Please review the attached v2 patch further.
> > > >
> > > > I'm still unclear on the performance goals of this patch. I see that it
> > > > will reduce syscalls, which sounds good, but to what end?
> > > >
> > > > Does it allow a greater number of walsenders? Lower replication
> > > > latency? Less IO bandwidth? All of the above?
> > >
> > > One benefit would be that it'd make it more realistic to use direct IO for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less clearly a win.
> >
> > Satya's email just now reminded me of another important reason:
> >
> > Eventually we should add the ability to stream out WAL *before* it has locally
> > been written out and flushed. Obviously the relevant positions would have to
> > be noted in the relevant message in the streaming protocol, and we couldn't
> > generally allow standbys to apply that data yet.
> >
> > That'd allow us to significantly reduce the overhead of synchronous
> > replication, because instead of commonly needing to send out all the pending
> > WAL at commit, we'd just need to send out the updated flush position. The
> > reason this would lower the overhead is that:
> >
> > a) The reduced amount of data to be transferred reduces latency - it's easy to
> >    accumulate a few TCP packets worth of data even in a single small OLTP
> >    transaction
> > b) The remote side can start to write out data earlier
> >
> >
> > Of course this would require additional infrastructure on the receiver
> > side. E.g. some persistent state indicating up to where WAL is allowed to be
> > applied, to avoid the standby getting ahead of the primary, in case the
> > primary crash-restarts (or has more severe issues).
> >
> >
> > With a bit of work we could perform WAL replay on standby without waiting for
> > the fdatasync of the received WAL - that only needs to happen when a) we need
> > to confirm a flush position to the primary b) when we need to write back pages
> > from the buffer pool (and some other things).
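
A minimal C sketch of the idea quoted above - an illustration only, with
hypothetical names (UnflushedWALMessage, StandbyMayApplyUpTo); nothing here
exists in PostgreSQL today. The point is that each streaming message would
carry the range of WAL bytes being sent plus the primary's current flush
position, and the standby would keep replay behind that flush position:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtr;        /* WAL position, as in PostgreSQL */

    /* Hypothetical payload of a "send-before-flush" streaming message. */
    typedef struct UnflushedWALMessage
    {
        XLogRecPtr  dataStart;          /* first byte of WAL carried here */
        XLogRecPtr  dataEnd;            /* end of the WAL carried here */
        XLogRecPtr  primaryFlushPtr;    /* how far the primary has durably flushed */
        /* ... WAL payload follows ... */
    } UnflushedWALMessage;

    /*
     * On the standby: WAL received beyond primaryFlushPtr may be written
     * out, but replay must not pass the primary's flush position, or a
     * crash-restart of the primary could leave the standby ahead of it.
     */
    static bool
    StandbyMayApplyUpTo(XLogRecPtr replayTarget, XLogRecPtr primaryFlushPtr)
    {
        return replayTarget <= primaryFlushPtr;
    }

Under those assumptions, the WAL data can already be on the wire before the
primary's fdatasync completes, while a persisted "apply-allowed" position
protects against the primary crash-restarting behind the standby.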
>
> Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
> is right; the eventual plan is to do a bunch of other stuff as
> described above, and we've discussed this in another thread (see
> below). I would like to once again clarify the motivation behind this
> feature:
>
> 1. It enables WAL readers (callers of WALRead() - wal senders,
> pg_walinspect etc.) to use WAL buffers as a first-level cache, which
> might reduce the number of IOPS at peak load, especially when pread()
> results in a disk read (i.e., the WAL isn't available in the OS page
> cache). I had earlier presented the buffer hit ratio and the reduction
> in pread() system calls for wal senders in the first email of this
> thread (95% of the time wal senders are able to read from WAL buffers
> without impacting anybody). Now, here are the results with the WAL DIO
> patch [1] - where WAL pread() turns into a disk read - see the results
> [2] and the attached graph.
>
> 2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
> page cache, using WAL buffers as a read cache helps a lot. This is
> clearly evident from my experiment with the WAL DIO patch [1]; see the
> results [2] and the attached graph. As expected, WAL DIO brings down
> the TPS, whereas reading from WAL buffers (i.e. this patch) brings it
> back up.
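
To make point 1 above concrete, here is a rough C sketch - not the patch
itself; WALBuffersCopy() and the function shape are assumptions for
illustration only. The read path first tries to copy the requested range out
of the in-memory WAL buffers and only falls back to pread() on the segment
file for whatever could not be served from memory:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <unistd.h>

    typedef uint64_t XLogRecPtr;

    /*
     * Stub standing in for the real lookup: an actual implementation would
     * consult the shared WAL buffer pages (and verify they still cover
     * 'startptr') under the appropriate locks.  Returning 0 means "nothing
     * could be served from memory".
     */
    static size_t
    WALBuffersCopy(char *buf, XLogRecPtr startptr, size_t count)
    {
        (void) buf;
        (void) startptr;
        (void) count;
        return 0;
    }

    static bool
    WALReadSketch(int fd, char *buf, XLogRecPtr startptr, size_t count,
                  off_t segoff)
    {
        size_t  nread = WALBuffersCopy(buf, startptr, count);

        if (nread == count)
            return true;        /* fully served from WAL buffers, no syscall */

        /* Fall back to the WAL segment file for the remainder. */
        ssize_t rest = pread(fd, buf + nread, count - nread,
                             segoff + (off_t) nread);

        return rest >= 0 && (size_t) rest == count - nread;
    }

When the hit rate in WAL buffers is high (the 95% figure mentioned above),
most reads return before reaching pread() at all, which is where the
reduction in read IOPS comes from.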
>
> [2] The test case is an insert pgbench workload; the numbers below are TPS.
> clients    HEAD      WAL DIO    WAL DIO & WAL BUFFERS READ    WAL BUFFERS READ
> 1          1404      1070       1424                          1375
> 2          1487      796        1454                          1517
> 4          3064      1743       3011                          3019
> 8          6114      3556       6026                          5954
> 16         11560     7051       12216                         12132
> 32         23181     13079      23449                         23561
> 64         43607     26983      43997                         45636
> 128        80723     45169      81515                         81911
> 256        110925    90185      107332                        114046
> 512        119354    109817     110287                        117506
> 768        112435    105795     106853                        111605
> 1024       107554    105541     105942                        109370
> 2048       88552     79024      80699                         90555
> 4096       61323     54814      58704                         61743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WAL DIO vs. WAL DIO & WAL
BUFFERS READ), but there seems to be no visible performance gain with
only your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that
your patch should be included in the WAL DIO patch rather than applied
on its own. Am I missing something?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


