Discussion: Improving replay of XLOG_BTREE_VACUUM records


Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:
Hi all.

There are situations in which vacuuming a big btree index stalls WAL replay on hot standby servers for quite a long time. I’ve described the problem in more detail in this thread [0]. Later in that thread Kevin Grittner proposed a good way of improving btree scans so that the btree vacuuming logic could be seriously simplified. Since I don’t know when that may happen, I’ve written a patch that makes some improvement right now. If Kevin or someone else expands [1] to handle all types of btree scans, I suppose my patch could be thrown away and the vacuuming logic rewritten substantially.

The idea of the patch is to not read pages from disk just to make sure they are unpinned (which is what btree_xlog_vacuum does right now). This is done by adding a new ReadBufferMode that returns a locked buffer without setting the BM_VALID flag on it, as sketched below. I don’t know if that is the right way of doing it, but it seems to work well.
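
For orientation, here is roughly where such a mode would sit in the ReadBufferMode enum in src/include/storage/bufmgr.h. This is a sketch only: the existing values are as in 9.5, and the comment on the new entry paraphrases the patch:

    typedef enum
    {
        RBM_NORMAL,                 /* Normal read */
        RBM_ZERO_AND_LOCK,          /* Don't read from disk, caller will
                                     * initialize. Also locks the page. */
        RBM_ZERO_AND_CLEANUP_LOCK,  /* Like RBM_ZERO_AND_LOCK, but locks the
                                     * page in "cleanup" mode */
        RBM_ZERO_ON_ERROR,          /* Read, but return an all-zeros page on
                                     * error */
        RBM_NORMAL_NO_LOG,          /* Don't log page as invalid during WAL
                                     * replay; otherwise same as RBM_NORMAL */
        RBM_ZERO_NO_BM_VALID        /* Proposed: like RBM_ZERO_AND_LOCK, but
                                     * leave BM_VALID unset so that no one
                                     * else can pin the buffer */
    } ReadBufferMode;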

Testing in my environment gives a good win [2]: green is the unpatched replica, blue is the replica with the patch, and the two spikes are the results of manually vacuuming a big table. Since the picture may be unavailable, I’ll note here that:
1. replication delay was reduced from ~1250 MB to 200 MB of replay_location lag,
2. the patched replica caught up with the master in less than a minute, against 12 minutes for the unpatched replica,
3. physical I/O load on the patched replica didn’t change compared with the normal workload, while the unpatched replica did lots of reads from PGDATA during the spikes.

There is still a stall in WAL replay, but it is much smaller than it is now. Also, this change doesn’t seem to affect any other scenarios.

I’ll add it to 2015-06 commitfest.



--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Jim Nasby
Date:
On 5/1/15 11:19 AM, Vladimir Borodin wrote:
> There are situations in which vacuuming a big btree index stalls WAL
> replay on hot standby servers for quite a long time. I’ve described
> the problem in more detail in this thread [0]. Later in that thread
> Kevin Grittner proposed a good way of improving btree scans so that
> the btree vacuuming logic could be seriously simplified. Since I don’t
> know when that may happen, I’ve written a patch that makes some
> improvement right now. If Kevin or someone else expands [1] to handle
> all types of btree scans, I suppose my patch could be thrown away and
> the vacuuming logic rewritten substantially.

This looks like a good way to address this until the more significant 
work can be done.

I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? 
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on 
the code; I see the logic to NO_BM_VALID...

+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that noone could pin it.

It would be better to explain why we want that mode. How about:

RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set 
BM_VALID before returning the buffer. This is done to ensure that no one 
can pin the buffer without actually reading the buffer contents in. This 
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

+        if (mode == RBM_ZERO_NO_BM_VALID)
+            TerminateBufferIO(bufHdr, false, 0);
+        else
+            TerminateBufferIO(bufHdr, false, BM_VALID);

Simply passing in a 0 seems a bit odd to me; is there anywhere else we 
do that?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Improving replay of XLOG_BTREE_VACUUM records

От
Vladimir Borodin
Дата:
Hi, Jim.

Thanks for the review.

On 2 May 2015, at 02:10, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

On 5/1/15 11:19 AM, Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index stalls WAL
replay on hot standby servers for quite a long time. I’ve described the
problem in more detail in this thread [0]. Later in that thread Kevin
Grittner proposed a good way of improving btree scans so that the btree
vacuuming logic could be seriously simplified. Since I don’t know when
that may happen, I’ve written a patch that makes some improvement right
now. If Kevin or someone else expands [1] to handle all types of btree
scans, I suppose my patch could be thrown away and the vacuuming logic
rewritten substantially.

This looks like a good way to address this until the more significant work can be done.

I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on the code; I see the logic to NO_BM_VALID…

Perhaps RBM_ZERO_NO_BM_VALID is not so good (it makes it more difficult to grep for BM_VALID in the code), but I don’t actually like BM_INVALID or BM_NOT_VALID either, sorry :( I don’t insist on NO_BM_VALID though; any other suggestions?


+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that noone could pin it.

It would be better to explain why we want that mode. How about:

RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set BM_VALID before returning the buffer. This is done to ensure that no one can pin the buffer without actually reading the buffer contents in. This is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

Good point, fixed in attached patch.


+        if (mode == RBM_ZERO_NO_BM_VALID)
+            TerminateBufferIO(bufHdr, false, 0);
+        else
+            TerminateBufferIO(bufHdr, false, BM_VALID);

Simply passing in a 0 seems a bit odd to me; is there anywhere else we do that?

Yes, it is done the same way in the FlushBuffer function [0]. Also, the comments before TerminateBufferIO say that 0 is an expected value for the third argument.
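
For reference, this is the tail of FlushBuffer() in src/backend/storage/buffer/bufmgr.c (9.5 sources), which likewise passes 0 to mean "clear BM_IO_IN_PROGRESS, set no extra flag bits":

    /* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) */
    TerminateBufferIO(buf, true, 0);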



--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Heikki Linnakangas
Date:
On 05/02/2015 02:10 AM, Jim Nasby wrote:
> This looks like a good way to address this until the more significant
> work can be done.
>
> I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
> or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
> the code; I see the logic to NO_BM_VALID...
>
> + * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
> + * BM_VALID bit before returning buffer so that noone could pin it.
>
> It would be better to explain why we want that mode. How about:
>
> RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
> BM_VALID before returning the buffer. This is done to ensure that no one
> can pin the buffer without actually reading the buffer contents in. This
> is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

That's a rather strange mode. The BM_VALID flag is internal to the 
buffer manager - if you don't know how the buffer manager works, you 
cannot understand that description. I couldn't quite understand what 
that means myself. What can you or can you not do with a buffer that's not 
marked as BM_VALID? I'd suggest a mode like this instead:

In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer 
cache already. If it's not, it is not read from disk, and InvalidBuffer 
is returned instead.
- Heikki




Re: Improving replay of XLOG_BTREE_VACUUM records

From: Jim Nasby
Date:
On 7/24/15 1:53 AM, Heikki Linnakangas wrote:
> On 05/02/2015 02:10 AM, Jim Nasby wrote:
>> This looks like a good way to address this until the more significant
>> work can be done.
>>
>> I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
>> or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
>> the code; I see the logic to NO_BM_VALID...
>>
>> + * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does
>> not set
>> + * BM_VALID bit before returning buffer so that noone could pin it.
>>
>> It would be better to explain why we want that mode. How about:
>>
>> RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
>> BM_VALID before returning the buffer. This is done to ensure that no one
>> can pin the buffer without actually reading the buffer contents in. This
>> is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
>
> That's a rather strange mode. The BM_VALID flag is internal to the
> buffer manager - if you don't know how the buffer manager works, you
> cannot understand that description. I couldn't quite understand what
> that means myself. What can you or can you not do with a buffer that's not
> marked as BM_VALID? I'd suggest a mode like this instead:
>
> In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer
> cache already. If it's not, it is not read from disk, and InvalidBuffer
> is returned instead.

+1, though I think it's still worth mentioning why we're doing this (so 
we know no one else can pin the buffer).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Andres Freund
Date:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
> On 05/02/2015 02:10 AM, Jim Nasby wrote:
> >This looks like a good way to address this until the more significant
> >work can be done.
> >
> >I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
> >or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
> >the code; I see the logic to NO_BM_VALID...
> >
> >+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
> >+ * BM_VALID bit before returning buffer so that noone could pin it.
> >
> >It would be better to explain why we want that mode. How about:
> >
> >RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
> >BM_VALID before returning the buffer. This is done to ensure that no one
> >can pin the buffer without actually reading the buffer contents in. This
> >is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
>
> That's a rather strange mode. The BM_VALID flag is internal to the buffer
> manager - if you don't know how the buffer manager works, you cannot
> understand that description. I couldn't quite understand what that means
> myself. What can you or can you not do with a buffer that's not marked as
> BM_VALID? I'd suggest a mode like this instead:
>
> In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer cache
> already. If it's not, it is not read from disk, and InvalidBuffer is
> returned instead.

To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.
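
Filled in, the XXX part would presumably be the same dance that PrefetchBuffer() already does. A sketch, assuming we pass a RelFileNode rather than a Relation, since that is what recovery code has at hand:

    static bool
    BufferInCache(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferTag   tag;            /* identity of requested block */
        uint32      hash;           /* hash value for the tag */
        LWLock     *partitionLock;  /* buffer-mapping partition lock */
        int         buf_id;

        /* create the tag for a buffer hash table lookup */
        INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);

        /* determine its hash code and partition lock ID */
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* see if the block is in the buffer pool already */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        /* -1 means not found; the answer can of course go stale at once */
        return buf_id >= 0;
    }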


Andres



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
> On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
>     /* XXX: setup tag, hash, partition */
>
>     LWLockAcquire(newPartitionLock, LW_SHARED);
>     buf_id = BufTableLookup(&newTag, newHash);
>     LWLockRelease(newPartitionLock);
>     return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.

Patch marked as returned with feedback, as input from the author has
been awaited for some time now.
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 25 Aug 2015, at 16:03, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
   /* XXX: setup tag, hash, partition */

   LWLockAcquire(newPartitionLock, LW_SHARED);
   buf_id = BufTableLookup(&newTag, newHash);
   LWLockRelease(newPartitionLock);
   return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.

Yep, sounds good. Patch with implementation attached.
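
In rough outline (a sketch of the idea, not the patch text itself), the replay loop in btree_xlog_vacuum() then only does the read-and-cleanup-lock dance for blocks that are actually resident, using BufferInCache() as suggested by Andres above:

    for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
    {
        /*
         * A block that is not in shared_buffers cannot be pinned by any
         * backend, so there is nothing to wait for; the answer may go
         * stale immediately, which is fine here (see the interlock
         * discussion below).
         */
        if (!BufferInCache(thisrnode, MAIN_FORKNUM, blkno))
            continue;

        buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL_NO_LOG);
        if (BufferIsValid(buffer))
        {
            LockBufferForCleanup(buffer);
            UnlockReleaseBuffer(buffer);
        }
    }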


Patch marked as returned with feedback, as input from the author has
been awaited for some time now.

Sorry for the delay; I will link it to the current commitfest.


--
Michael


--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Greg Stark
Date:
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
>     /* XXX: setup tag, hash, partition */
>
>     LWLockAcquire(newPartitionLock, LW_SHARED);
>     buf_id = BufTableLookup(&newTag, newHash);
>     LWLockRelease(newPartitionLock);
>     return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.


I guess I'm missing something, but how is this API useful? As soon as
it's returned, the result might already be invalid: before you actually
make use of the return value, another process could come along, read in,
and pin the page in question. Is this part of some interlock where you
only have to know it was unpinned at some point in time since some
other event?

-- 
greg



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Andres Freund
Date:
On 2015-09-02 22:46:59 +0100, Greg Stark wrote:
> On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> > To me it sounds like this shouldn't go through the full ReadBuffer()
> > rigamarole. That code is already complex enough, and here it's really
> > not needed. I think it'll be much easier to review - and actually faster
> > in many cases to simply have something like
> >
> > bool
> > BufferInCache(Relation, ForkNumber, BlockNumber)
> > {
> >     /* XXX: setup tag, hash, partition */
> >
> >     LWLockAcquire(newPartitionLock, LW_SHARED);
> >     buf_id = BufTableLookup(&newTag, newHash);
> >     LWLockRelease(newPartitionLock);
> >     return buf_id != -1;
> > }
> >
> > and then fall back to the normal ReadBuffer() when it's in cache.
> 
> 
> I guess I'm missing something, but how is this API useful? As soon as
> it's returned, the result might already be invalid: before you actually
> make use of the return value, another process could come along, read in,
> and pin the page in question. Is this part of some interlock where you
> only have to know it was unpinned at some point in time since some
> other event?

Yes. We're talking about this block:

    for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
    {
        /*
         * We use RBM_NORMAL_NO_LOG mode because it's not an error
         * condition to see all-zero pages.  The original btvacuumpage
         * scan would have skipped over all-zero pages, noting them in FSM
         * but not bothering to initialize them just yet; so we mustn't
         * throw an error here.  (We could skip acquiring the cleanup lock
         * if PageIsNew, but it's probably not worth the cycles to test.)
         *
         * XXX we don't actually need to read the block, we just need to
         * confirm it is unpinned. If we had a special call into the
         * buffer manager we could optimise this so that if the block is
         * not in shared_buffers we confirm it as unpinned.
         */
        buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL_NO_LOG);
        if (BufferIsValid(buffer))
        {
            LockBufferForCleanup(buffer);
            UnlockReleaseBuffer(buffer);
        }
    }
}


Greetings,

Andres Freund



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Sep 3, 2015 at 6:10 AM, Vladimir Borodin wrote:
> Patch with implementation attached.
> Sorry for delay, I will link it to the current commitfest.

This patch looks correct, and is doing what it claims to do. It is
also not really worth refactoring the bit of code in PrefetchBuffer
that does a similar operation, so I am marking it as ready for
committer.
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Vladimir Borodin wrote:

> There are situations in which vacuuming a big btree index stalls WAL
> replay on hot standby servers for quite a long time. I’ve described
> the problem in more detail in this thread [0]. Later in that thread
> Kevin Grittner proposed a good way of improving btree scans so that
> the btree vacuuming logic could be seriously simplified. Since I don’t
> know when that may happen, I’ve written a patch that makes some
> improvement right now. If Kevin or someone else expands [1] to handle
> all types of btree scans, I suppose my patch could be thrown away and
> the vacuuming logic rewritten substantially.

You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.

Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
(Note there's an updated patch a few emails down the thread.)

If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon's.

Thanks!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Vladimir Borodin wrote:
>
>> There are situations in which vacuuming a big btree index stalls WAL
>> replay on hot standby servers for quite a long time. I’ve described
>> the problem in more detail in this thread [0]. Later in that thread
>> Kevin Grittner proposed a good way of improving btree scans so that
>> the btree vacuuming logic could be seriously simplified. Since I don’t
>> know when that may happen, I’ve written a patch that makes some
>> improvement right now. If Kevin or someone else expands [1] to handle
>> all types of btree scans, I suppose my patch could be thrown away and
>> the vacuuming logic rewritten substantially.
>
> You submitted this patch in May 2015 -- and 7 months later, Simon came
> up with another patch that's supposed to fix the underlying problem, so
> that this shouldn't be a problem anymore.
>
> Would you please have a look at Simon's patch, in particular verify
> whether it solves the performance dip in your testing environment?
> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> (Note there's an updated patch a few emails down the thread.)
>
> If it seems to fix the problem for you, I think we should mark yours
> rejected and just apply Simon's.

Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?
--
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:

There are situations in which vacuuming a big btree index stalls WAL
replay on hot standby servers for quite a long time. I’ve described
the problem in more detail in this thread [0]. Later in that thread
Kevin Grittner proposed a good way of improving btree scans so that
the btree vacuuming logic could be seriously simplified. Since I don’t
know when that may happen, I’ve written a patch that makes some
improvement right now. If Kevin or someone else expands [1] to handle
all types of btree scans, I suppose my patch could be thrown away and
the vacuuming logic rewritten substantially.

You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.

Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
(Note there's an updated patch a few emails down the thread.)

If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon’s.

Ok, I’ll try this patch with my use case. It’s not so easy now, since I’ve partitioned that big table to avoid such problems, but there is a way to reproduce it once again. If it helps, I agree that mine should be rejected in favor of Simon’s patch, because my patch just reduces replication lag while Simon’s seems to remove the lag entirely.


Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?

I suppose, in the case of _non_-toast indexes. But yes, it seems that my patch should help in that case too.

-- 
Michael


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Vladimir Borodin wrote:
> 
> > On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:
> > 
> > On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:

> >> Would you please have a look at Simon's patch, in particular verify
> >> whether it solves the performance dip in your testing environment?
> >> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> >> (Note there's an updated patch a few emails down the thread.)
> >> 
> >> If it seems to fix the problem for you, I think we should mark yours
> >> rejected and just apply Simon’s.
> 
> Ok, I’ll try this patch with my use case. It’s not so easy now, since
> I’ve partitioned that big table to avoid such problems, but there is a
> way to reproduce it once again. If it helps, I agree that mine should
> be rejected in favor of Simon’s patch, because my patch just reduces
> replication lag while Simon’s seems to remove the lag entirely.

I would agree except for the observation on toast indexes.  I think
that's an important enough use case that perhaps we should have both.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:
>
> > On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:
> >
> > On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:

> >> Would you please have a look at Simon's patch, in particular verify
> >> whether it solves the performance dip in your testing environment?
> >> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> >> (Note there's an updated patch a few emails down the thread.)
> >>
> >> If it seems to fix the problem for you, I think we should mark yours
> >> rejected and just apply Simon’s.
>
> Ok, I’ll try this patch with my use case. It’s not so easy now, since
> I’ve partitioned that big table to avoid such problems, but there is a
> way to reproduce it once again. If it helps, I agree that mine should
> be rejected in favor of Simon’s patch, because my patch just reduces
> replication lag while Simon’s seems to remove the lag entirely.

I would agree except for the observation on toast indexes.  I think
that's an important enough use case that perhaps we should have both.

The exclusion of toast indexes is something we can also remove, I have recently discovered. When we access toast data we ignore MVCC, but we still have the toast pointer and chunk id to use for rechecking our scan results. So a later patch will add some rechecks.
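
A hypothetical sketch of such a recheck (an illustration, not the later patch itself), reusing the existing variables of toast_fetch_datum() in tuptoaster.c (ttup, toasttupDesc, valueid, nextidx):

    bool        isnull;
    Oid         chunk_id;
    int32       chunk_seq;

    /* chunk_id and chunk_seq are attributes 1 and 2 of a toast chunk */
    chunk_id = DatumGetObjectId(fastgetattr(ttup, 1, toasttupDesc, &isnull));
    chunk_seq = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));

    /* without MVCC we cannot trust the index scan blindly, so recheck */
    if (chunk_id != valueid || chunk_seq != nextidx)
        elog(ERROR, "unexpected chunk %d for toast value %u",
             chunk_seq, valueid);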

So I don't think it is worth applying this patch now. I should add that I also had a patch that did this, posted earlier IIRC. That is not the reason to reject this; I'm just pointing out that I'm effectively rejecting my own earlier patch as well.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Simon Riggs wrote:
> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > I would agree except for the observation on toast indexes.  I think
> > that's an important enough use case that perhaps we should have both.
> 
> The exclusion of toast indexes is something we can remove also, I have
> recently discovered. When we access toast data we ignore MVCC, but we still
> have the toast pointer and chunkid to use for rechecking our scan results.
> So a later patch will add some rechecks.

Ah, interesting, glad to hear.  I take it you're pushing your patch
soon, then?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: David Steele
Date:
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
> Simon Riggs wrote:
>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> I would agree except for the observation on toast indexes.  I think
>>> that's an important enough use case that perhaps we should have both.
>> The exclusion of toast indexes is something we can remove also, I have
>> recently discovered. When we access toast data we ignore MVCC, but we still
>> have the toast pointer and chunkid to use for rechecking our scan results.
>> So a later patch will add some rechecks.
> Ah, interesting, glad to hear.  I take it you're pushing your patch
> soon, then?

ISTM that this patch should be "returned with feedback" or "rejected" 
based on the thread.  I'm marking it "waiting for author" for the time 
being.

Thanks,

-- 
-David
david@pgmasters.net




Re: Improving replay of XLOG_BTREE_VACUUM records

От
Michael Paquier
Дата:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?

Yes, I have done the research, so I think the patch should be rejected now.

Thanks to everyone for their input. It's good to have alternate approaches.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 10 March 2016, at 11:50, Simon Riggs <simon@2ndquadrant.com> wrote:

On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?

Yes, I have done the research, so I think the patch should be rejected now.

Let’s do that immediately after you send a new version of your patch? Or even better, after testing your patch? Don’t get me wrong, but rejecting my patch without tangible work on your patch may lead to forgetting about the problem before the 9.6 freeze.


Thanks to everyone for their input. It's good to have alternate approaches.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
--
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

avoid_pin_scan_always.v1.patch 
Takes the approach that we generate the same WAL records as in 9.5, we just choose not to do anything with that information. This is possible because we don't care anymore whether it is toast or other relations. So it effectively reverts parts of the earlier patch.
This could also be back-patched easily.

toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but here it is. No problems found with it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 11:38, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

I'll wait for review, so setting status back to Needs Review.

Assuming review, I will commit the week w/c 21 March.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 10 March 2016, at 14:38, Simon Riggs <simon@2ndquadrant.com> wrote:

On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

avoid_pin_scan_always.v1.patch 
Takes the approach that we generate the same WAL records as in 9.5, we just choose not to do anything with that information. This is possible because we don't care anymore whether it is toast or other relations. So it effectively reverts parts of the earlier patch.
This could also be back-patched easily.

toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but here it is. No problems found with it.

JFYI, I’m preparing a test environment to reproduce the initial problem and I hope to finish testing this week.


--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments: avoid_pin_scan_always.v1.patch, toast_recheck.v1.patch


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: David Steele
Date:
Hi Vladimir,

On 3/14/16 2:15 PM, Vladimir Borodin wrote:

> JFYI, I’m preparing a test environment to reproduce the initial problem
> and I hope to finish testing this week.

Do you know when you'll have the results from the testing you were going 
to do?  It seems this patch is currently waiting on that to be finished.

Thanks,
-- 
-David
david@pgmasters.net



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 25 March 2016, at 19:11, David Steele <david@pgmasters.net> wrote:

Hi Vladimir,

On 3/14/16 2:15 PM, Vladimir Borodin wrote:

JFYI, I’m preparing a test environment to reproduce the initial problem
and I hope to finish testing this week.

Do you know when you'll have the results from the testing you were going to do?  It seems this patch is currently waiting on that to be finished.

I couldn’t reproduce the problem on a pgbench database with scale factor 50000 last week. The test case was quite simple:
1. On master I was adding data to pgbench_accounts table.
2. On standby I was doing the following:
postgres@pgload01d ~ $ cat /tmp/query
\set naccounts 100000 * :scale
SELECT aid FROM pgbench_accounts WHERE aid = :naccounts;
postgres@pgload01d ~ $ /usr/pgsql-9.6/bin/pgbench -M prepared -f /tmp/query -c 1 -j 1 -T 3600 -P 10 -S -n pgbench
3. On master I was sometimes calling VACUUM pgbench_accounts.

Without applying the patches there weren’t huge replication lags on standbys. It seems I’m doing something wrong… I’m doing my best right now to find the reason, but I can’t give you any time estimate :(
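
In case someone wants to reproduce this: the replay_location lag mentioned earlier in the thread can be watched on the standby with the 9.6-era functions, for example:

    -- bytes of WAL received but not yet replayed on the standby
    SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                                 pg_last_xlog_replay_location());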


Thanks,
--
-David
david@pgmasters.net




--
May the force be with you…