Discussion: Improving replay of XLOG_BTREE_VACUUM records


Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:
Hi all.

There are situations in which vacuuming a big btree index stalls WAL replay on hot standby servers for quite a long time. I’ve described the problem in more detail in this thread [0]. Later in that thread Kevin Grittner proposed a good way of improving btree scans so that the btree vacuuming logic could be seriously simplified. Since I don’t know when that may happen, I’ve written a patch that makes some improvement right now. If Kevin or someone else expands [1] to handle all types of btree scans, I suppose my patch could be thrown away and the vacuuming logic rewritten substantially.

The idea of the patch is to not read pages from disk just to make sure they are unpinned (which is what btree_xlog_vacuum does right now). This is done by adding a new ReadBufferMode that returns a locked buffer without setting the BM_VALID flag on it, as sketched below. I don’t know if that is the right way of doing it, but it seems to work well.
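
For orientation, here is roughly where such a mode would sit in the ReadBufferMode enum in src/include/storage/bufmgr.h. This is a sketch only: the existing values are as in 9.5, and the comment on the new entry paraphrases the patch:

    typedef enum
    {
        RBM_NORMAL,                 /* Normal read */
        RBM_ZERO_AND_LOCK,          /* Don't read from disk, caller will
                                     * initialize. Also locks the page. */
        RBM_ZERO_AND_CLEANUP_LOCK,  /* Like RBM_ZERO_AND_LOCK, but locks the
                                     * page in "cleanup" mode */
        RBM_ZERO_ON_ERROR,          /* Read, but return an all-zeros page on
                                     * error */
        RBM_NORMAL_NO_LOG,          /* Don't log page as invalid during WAL
                                     * replay; otherwise same as RBM_NORMAL */
        RBM_ZERO_NO_BM_VALID        /* Proposed: like RBM_ZERO_AND_LOCK, but
                                     * leave BM_VALID unset so that no one
                                     * else can pin the buffer */
    } ReadBufferMode;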

Testing in my environment gives a good win [2]: green is the unpatched replica, blue is the replica with the patch, and the two spikes are the results of manually vacuuming a big table. Since the picture may be unavailable, I’ll note here that:
1. replication delay was reduced from ~1250 MB to 200 MB of replay_location lag,
2. the patched replica caught up with the master in less than a minute, against 12 minutes for the unpatched replica,
3. physical I/O load on the patched replica didn’t change compared with the normal workload, while the unpatched replica did lots of reads from PGDATA during the spikes.

There is still a stall in WAL replay, but it is much smaller than it is now. Also, this change doesn’t seem to affect any other scenarios.

I’ll add it to 2015-06 commitfest.



--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Jim Nasby
Date:
On 5/1/15 11:19 AM, Vladimir Borodin wrote:
> There are situations in which vacuuming a big btree index stalls WAL
> replay on hot standby servers for quite a long time. I’ve described
> the problem in more detail in this thread [0]. Later in that thread
> Kevin Grittner proposed a good way of improving btree scans so that
> the btree vacuuming logic could be seriously simplified. Since I don’t
> know when that may happen, I’ve written a patch that makes some
> improvement right now. If Kevin or someone else expands [1] to handle
> all types of btree scans, I suppose my patch could be thrown away and
> the vacuuming logic rewritten substantially.

This looks like a good way to address this until the more significant 
work can be done.

I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? 
or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on 
the code; I see the logic to NO_BM_VALID...

+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that noone could pin it.

It would be better to explain why we want that mode. How about:

RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set 
BM_VALID before returning the buffer. This is done to ensure that no one 
can pin the buffer without actually reading the buffer contents in. This 
is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

+        if (mode == RBM_ZERO_NO_BM_VALID)
+            TerminateBufferIO(bufHdr, false, 0);
+        else
+            TerminateBufferIO(bufHdr, false, BM_VALID);

Simply passing in a 0 seems a bit odd to me; is there anywhere else we 
do that?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Improving replay of XLOG_BTREE_VACUUM records

От
Vladimir Borodin
Дата:
Hi, Jim.

Thanks for the review.

On 2 May 2015, at 02:10, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote:

On 5/1/15 11:19 AM, Vladimir Borodin wrote:
There are situations in which vacuuming a big btree index stalls WAL
replay on hot standby servers for quite a long time. I’ve described the
problem in more detail in this thread [0]. Later in that thread Kevin
Grittner proposed a good way of improving btree scans so that the btree
vacuuming logic could be seriously simplified. Since I don’t know when
that may happen, I’ve written a patch that makes some improvement right
now. If Kevin or someone else expands [1] to handle all types of btree
scans, I suppose my patch could be thrown away and the vacuuming logic
rewritten substantially.

This looks like a good way to address this until the more significant work can be done.

I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on the code; I see the logic to NO_BM_VALID…

Perhaps RBM_ZERO_NO_BM_VALID is not so good (it makes it more difficult to grep for BM_VALID in the code), but I don’t actually like BM_INVALID or BM_NOT_VALID either, sorry :( I don’t insist on NO_BM_VALID though; any other suggestions?


+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
+ * BM_VALID bit before returning buffer so that noone could pin it.

It would be better to explain why we want that mode. How about:

RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set BM_VALID before returning the buffer. This is done to ensure that no one can pin the buffer without actually reading the buffer contents in. This is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

Good point, fixed in attached patch.


+        if (mode == RBM_ZERO_NO_BM_VALID)
+            TerminateBufferIO(bufHdr, false, 0);
+        else
+            TerminateBufferIO(bufHdr, false, BM_VALID);

Simply passing in a 0 seems a bit odd to me; is there anywhere else we do that?

Yes, it is done the same way in the FlushBuffer function [0]. Also, the comments before TerminateBufferIO say that 0 is an expected value for the third argument.
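
For reference, this is the tail of FlushBuffer() in src/backend/storage/buffer/bufmgr.c (9.5 sources), which likewise passes 0 to mean "clear BM_IO_IN_PROGRESS, set no extra flag bits":

    /* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) */
    TerminateBufferIO(buf, true, 0);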



--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Heikki Linnakangas
Date:
On 05/02/2015 02:10 AM, Jim Nasby wrote:
> This looks like a good way to address this until the more significant
> work can be done.
>
> I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
> or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
> the code; I see the logic to NO_BM_VALID...
>
> + * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
> + * BM_VALID bit before returning buffer so that noone could pin it.
>
> It would be better to explain why we want that mode. How about:
>
> RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
> BM_VALID before returning the buffer. This is done to ensure that no one
> can pin the buffer without actually reading the buffer contents in. This
> is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.

That's a rather strange mode. The BM_VALID flag is internal to the 
buffer manager - if you don't know how the buffer manager works, you 
cannot understand that description. I couldn't quite understand what 
that means myself. What can you or can you not do with a buffer that's not 
marked as BM_VALID? I'd suggest a mode like this instead:

In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer 
cache already. If it's not, it is not read from disk, and InvalidBuffer 
is returned instead.
- Heikki




Re: Improving replay of XLOG_BTREE_VACUUM records

From: Jim Nasby
Date:
On 7/24/15 1:53 AM, Heikki Linnakangas wrote:
> On 05/02/2015 02:10 AM, Jim Nasby wrote:
>> This looks like a good way to address this until the more significant
>> work can be done.
>>
>> I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
>> or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
>> the code; I see the logic to NO_BM_VALID...
>>
>> + * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does
>> not set
>> + * BM_VALID bit before returning buffer so that noone could pin it.
>>
>> It would be better to explain why we want that mode. How about:
>>
>> RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
>> BM_VALID before returning the buffer. This is done to ensure that no one
>> can pin the buffer without actually reading the buffer contents in. This
>> is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
>
> That's a rather strange mode. The BM_VALID flag is internal to the
> buffer manager - if you don't know how the buffer manager works, you
> cannot understand that description. I couldn't quite understand what
> that means myself. What can you or can you not do with a buffer that's not
> marked as BM_VALID? I'd suggest a mode like this instead:
>
> In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer
> cache already. If it's not, it is not read from disk, and InvalidBuffer
> is returned instead.

+1, though I think it's still worth mentioning why we're doing this (so 
we know no one else can pin the buffer).
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Andres Freund
Date:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
> On 05/02/2015 02:10 AM, Jim Nasby wrote:
> >This looks like a good way to address this until the more significant
> >work can be done.
> >
> >I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
> >or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
> >the code; I see the logic to NO_BM_VALID...
> >
> >+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
> >+ * BM_VALID bit before returning buffer so that noone could pin it.
> >
> >It would be better to explain why we want that mode. How about:
> >
> >RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
> >BM_VALID before returning the buffer. This is done to ensure that no one
> >can pin the buffer without actually reading the buffer contents in. This
> >is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
>
> That's a rather strange mode. The BM_VALID flag is internal to the buffer
> manager - if you don't know how the buffer manager works, you cannot
> understand that description. I couldn't quite understand what that means
> myself. What can you or can you not do with a buffer that's not marked as
> BM_VALID? I'd suggest a mode like this instead:
>
> In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer cache
> already. If it's not, it is not read from disk, and InvalidBuffer is
> returned instead.

To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
    /* XXX: setup tag, hash, partition */

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.
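
Filled in, the XXX part would presumably be the same dance that PrefetchBuffer() already does. A sketch, assuming we pass a RelFileNode rather than a Relation, since that is what recovery code has at hand:

    static bool
    BufferInCache(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferTag   tag;            /* identity of requested block */
        uint32      hash;           /* hash value for the tag */
        LWLock     *partitionLock;  /* buffer-mapping partition lock */
        int         buf_id;

        /* create the tag for a buffer hash table lookup */
        INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);

        /* determine its hash code and partition lock ID */
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* see if the block is in the buffer pool already */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        /* -1 means not found; the answer can of course go stale at once */
        return buf_id >= 0;
    }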


Andres



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
> On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
>     /* XXX: setup tag, hash, partition */
>
>     LWLockAcquire(newPartitionLock, LW_SHARED);
>     buf_id = BufTableLookup(&newTag, newHash);
>     LWLockRelease(newPartitionLock);
>     return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.

Patch marked as returned with feedback, as input from the author has
been awaited for some time now.
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 25 Aug 2015, at 16:03, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review - and actually faster
in many cases to simply have something like

bool
BufferInCache(Relation, ForkNumber, BlockNumber)
{
   /* XXX: setup tag, hash, partition */

   LWLockAcquire(newPartitionLock, LW_SHARED);
   buf_id = BufTableLookup(&newTag, newHash);
   LWLockRelease(newPartitionLock);
   return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.

Yep, sounds good. Patch with implementation attached.
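
In rough outline (a sketch of the idea, not the patch text itself), the replay loop in btree_xlog_vacuum() then only does the read-and-cleanup-lock dance for blocks that are actually resident, using BufferInCache() as suggested by Andres above:

    for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
    {
        /*
         * A block that is not in shared_buffers cannot be pinned by any
         * backend, so there is nothing to wait for; the answer may go
         * stale immediately, which is fine here (see the interlock
         * discussion below).
         */
        if (!BufferInCache(thisrnode, MAIN_FORKNUM, blkno))
            continue;

        buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL_NO_LOG);
        if (BufferIsValid(buffer))
        {
            LockBufferForCleanup(buffer);
            UnlockReleaseBuffer(buffer);
        }
    }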


Patch marked as returned with feedback, as input from the author has
been awaited for some time now.

Sorry for the delay; I will link it to the current commitfest.


--
Michael


--
May the force be with you…

Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Greg Stark
Date:
On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> To me it sounds like this shouldn't go through the full ReadBuffer()
> rigamarole. That code is already complex enough, and here it's really
> not needed. I think it'll be much easier to review - and actually faster
> in many cases to simply have something like
>
> bool
> BufferInCache(Relation, ForkNumber, BlockNumber)
> {
>     /* XXX: setup tag, hash, partition */
>
>     LWLockAcquire(newPartitionLock, LW_SHARED);
>     buf_id = BufTableLookup(&newTag, newHash);
>     LWLockRelease(newPartitionLock);
>     return buf_id != -1;
> }
>
> and then fall back to the normal ReadBuffer() when it's in cache.


I guess I'm missing something, but how is this API useful? As soon as
it's returned, the result might already be invalid: before you actually
make use of the return value, another process could come along, read in,
and pin the page in question. Is this part of some interlock where you
only have to know it was unpinned at some point in time since some
other event?

-- 
greg



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Andres Freund
Date:
On 2015-09-02 22:46:59 +0100, Greg Stark wrote:
> On Sun, Jul 26, 2015 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> > To me it sounds like this shouldn't go through the full ReadBuffer()
> > rigamarole. That code is already complex enough, and here it's really
> > not needed. I think it'll be much easier to review - and actually faster
> > in many cases to simply have something like
> >
> > bool
> > BufferInCache(Relation, ForkNumber, BlockNumber)
> > {
> >     /* XXX: setup tag, hash, partition */
> >
> >     LWLockAcquire(newPartitionLock, LW_SHARED);
> >     buf_id = BufTableLookup(&newTag, newHash);
> >     LWLockRelease(newPartitionLock);
> >     return buf_id != -1;
> > }
> >
> > and then fall back to the normal ReadBuffer() when it's in cache.
> 
> 
> I guess I'm missing something, but how is this API useful? As soon as
> it's returned, the result might already be invalid: before you actually
> make use of the return value, another process could come along, read in,
> and pin the page in question. Is this part of some interlock where you
> only have to know it was unpinned at some point in time since some
> other event?

Yes. We're talking about this block:

    for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
    {
        /*
         * We use RBM_NORMAL_NO_LOG mode because it's not an error
         * condition to see all-zero pages.  The original btvacuumpage
         * scan would have skipped over all-zero pages, noting them in FSM
         * but not bothering to initialize them just yet; so we mustn't
         * throw an error here.  (We could skip acquiring the cleanup lock
         * if PageIsNew, but it's probably not worth the cycles to test.)
         *
         * XXX we don't actually need to read the block, we just need to
         * confirm it is unpinned. If we had a special call into the
         * buffer manager we could optimise this so that if the block is
         * not in shared_buffers we confirm it as unpinned.
         */
        buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL_NO_LOG);
        if (BufferIsValid(buffer))
        {
            LockBufferForCleanup(buffer);
            UnlockReleaseBuffer(buffer);
        }
    }
}


Greetings,

Andres Freund



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Sep 3, 2015 at 6:10 AM, Vladimir Borodin wrote:
> Patch with implementation attached.
> Sorry for delay, I will link it to the current commitfest.

This patch looks correct, and is doing what it claims to do. It is
also not really worth refactoring the bit of code in PrefetchBuffer
that does a similar operation, so I am marking it as ready for
committer.
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Vladimir Borodin wrote:

> There are situations in which vacuuming a big btree index stalls WAL
> replay on hot standby servers for quite a long time. I’ve described
> the problem in more detail in this thread [0]. Later in that thread
> Kevin Grittner proposed a good way of improving btree scans so that
> the btree vacuuming logic could be seriously simplified. Since I don’t
> know when that may happen, I’ve written a patch that makes some
> improvement right now. If Kevin or someone else expands [1] to handle
> all types of btree scans, I suppose my patch could be thrown away and
> the vacuuming logic rewritten substantially.

You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.

Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
(Note there's an updated patch a few emails down the thread.)

If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon's.

Thanks!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Vladimir Borodin wrote:
>
>> There are situations in which vacuuming a big btree index stalls WAL
>> replay on hot standby servers for quite a long time. I’ve described
>> the problem in more detail in this thread [0]. Later in that thread
>> Kevin Grittner proposed a good way of improving btree scans so that
>> the btree vacuuming logic could be seriously simplified. Since I don’t
>> know when that may happen, I’ve written a patch that makes some
>> improvement right now. If Kevin or someone else expands [1] to handle
>> all types of btree scans, I suppose my patch could be thrown away and
>> the vacuuming logic rewritten substantially.
>
> You submitted this patch in May 2015 -- and 7 months later, Simon came
> up with another patch that's supposed to fix the underlying problem, so
> that this shouldn't be a problem anymore.
>
> Would you please have a look at Simon's patch, in particular verify
> whether it solves the performance dip in your testing environment?
> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> (Note there's an updated patch a few emails down the thread.)
>
> If it seems to fix the problem for you, I think we should mark yours
> rejected and just apply Simon's.

Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?
--
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:

There are situations in which vacuuming a big btree index stalls WAL
replay on hot standby servers for quite a long time. I’ve described
the problem in more detail in this thread [0]. Later in that thread
Kevin Grittner proposed a good way of improving btree scans so that
the btree vacuuming logic could be seriously simplified. Since I don’t
know when that may happen, I’ve written a patch that makes some
improvement right now. If Kevin or someone else expands [1] to handle
all types of btree scans, I suppose my patch could be thrown away and
the vacuuming logic rewritten substantially.

You submitted this patch in May 2015 -- and 7 months later, Simon came
up with another patch that's supposed to fix the underlying problem, so
that this shouldn't be a problem anymore.

Would you please have a look at Simon's patch, in particular verify
whether it solves the performance dip in your testing environment?
https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
(Note there's an updated patch a few emails down the thread.)

If it seems to fix the problem for you, I think we should mark yours
rejected and just apply Simon’s.

Ok, I’ll try this patch with my use case. It’s not so easy now, since I’ve partitioned that big table to avoid such problems, but there is a way to reproduce it once again. If it helps, I agree that mine should be rejected in favor of Simon’s patch, because my patch just reduces replication lag while Simon’s seems to remove the lag entirely.


Simon's patch (justly) does not update lastBlockVacuumed in the case
of toast indexes, but Vladimir's patch would still optimize this case,
no?

I suppose, in the case of _non_-toast indexes. But yes, it seems that my patch should help in that case too.

-- 
Michael


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Vladimir Borodin wrote:
> 
> > On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:
> > 
> > On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:

> >> Would you please have a look at Simon's patch, in particular verify
> >> whether it solves the performance dip in your testing environment?
> >> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> >> (Note there's an updated patch a few emails down the thread.)
> >> 
> >> If it seems to fix the problem for you, I think we should mark yours
> >> rejected and just apply Simon’s.
> 
> Ok, I’ll try this patch with my use case. It’s not so easy now, since
> I’ve partitioned that big table to avoid such problems, but there is a
> way to reproduce it once again. If it helps, I agree that mine should
> be rejected in favor of Simon’s patch, because my patch just reduces
> replication lag while Simon’s seems to remove the lag entirely.

I would agree except for the observation on toast indexes.  I think
that's an important enough use case that perhaps we should have both.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Vladimir Borodin wrote:
>
> > On 7 Jan 2016, at 05:26, Michael Paquier <michael.paquier@gmail.com> wrote:
> >
> > On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:

> >> Would you please have a look at Simon's patch, in particular verify
> >> whether it solves the performance dip in your testing environment?
> >> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
> >> (Note there's an updated patch a few emails down the thread.)
> >>
> >> If it seems to fix the problem for you, I think we should mark yours
> >> rejected and just apply Simon’s.
>
> Ok, I’ll try this patch with my use case. It’s not so easy now, since
> I’ve partitioned that big table to avoid such problems, but there is a
> way to reproduce it once again. If it helps, I agree that mine should
> be rejected in favor of Simon’s patch, because my patch just reduces
> replication lag while Simon’s seems to remove the lag entirely.

I would agree except for the observation on toast indexes.  I think
that's an important enough use case that perhaps we should have both.

The exclusion of toast indexes is something we can also remove, I have recently discovered. When we access toast data we ignore MVCC, but we still have the toast pointer and chunk id to use for rechecking our scan results. So a later patch will add some rechecks.
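
A hypothetical sketch of such a recheck (an illustration, not the later patch itself), reusing the existing variables of toast_fetch_datum() in tuptoaster.c (ttup, toasttupDesc, valueid, nextidx):

    bool        isnull;
    Oid         chunk_id;
    int32       chunk_seq;

    /* chunk_id and chunk_seq are attributes 1 and 2 of a toast chunk */
    chunk_id = DatumGetObjectId(fastgetattr(ttup, 1, toasttupDesc, &isnull));
    chunk_seq = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));

    /* without MVCC we cannot trust the index scan blindly, so recheck */
    if (chunk_id != valueid || chunk_seq != nextidx)
        elog(ERROR, "unexpected chunk %d for toast value %u",
             chunk_seq, valueid);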

So I don't think it is worth applying this patch now. I should add that I also had a patch that did this, posted earlier IIRC. That is not the reason to reject this; I'm just pointing out that I'm effectively rejecting my own earlier patch as well.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Alvaro Herrera
Date:
Simon Riggs wrote:
> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > I would agree except for the observation on toast indexes.  I think
> > that's an important enough use case that perhaps we should have both.
> 
> The exclusion of toast indexes is something we can remove also, I have
> recently discovered. When we access toast data we ignore MVCC, but we still
> have the toast pointer and chunkid to use for rechecking our scan results.
> So a later patch will add some rechecks.

Ah, interesting, glad to hear.  I take it you're pushing your patch
soon, then?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improving replay of XLOG_BTREE_VACUUM records

From: David Steele
Date:
On 1/8/16 9:34 AM, Alvaro Herrera wrote:
> Simon Riggs wrote:
>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> I would agree except for the observation on toast indexes.  I think
>>> that's an important enough use case that perhaps we should have both.
>> The exclusion of toast indexes is something we can remove also, I have
>> recently discovered. When we access toast data we ignore MVCC, but we still
>> have the toast pointer and chunkid to use for rechecking our scan results.
>> So a later patch will add some rechecks.
> Ah, interesting, glad to hear.  I take it you're pushing your patch
> soon, then?

ISTM that this patch should be "returned with feedback" or "rejected" 
based on the thread.  I'm marking it "waiting for author" for the time 
being.

Thanks,

-- 
-David
david@pgmasters.net




Re: Improving replay of XLOG_BTREE_VACUUM records

От
Michael Paquier
Дата:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?
-- 
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?

Yes, I have done the research, so I think the patch should be rejected now.

Thanks to everyone for their input. It's good to have alternate approaches.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 10 March 2016, at 11:50, Simon Riggs <simon@2ndquadrant.com> wrote:

On 10 March 2016 at 06:27, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 1:29 AM, David Steele <david@pgmasters.net> wrote:
> On 1/8/16 9:34 AM, Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>>
>>> On 8 January 2016 at 13:36, Alvaro Herrera <alvherre@2ndquadrant.com>
>>> wrote:
>>>>
>>>> I would agree except for the observation on toast indexes.  I think
>>>> that's an important enough use case that perhaps we should have both.
>>>
>>> The exclusion of toast indexes is something we can remove also, I have
>>> recently discovered. When we access toast data we ignore MVCC, but we
>>> still
>>> have the toast pointer and chunkid to use for rechecking our scan
>>> results.
>>> So a later patch will add some rechecks.
>>
>> Ah, interesting, glad to hear.  I take it you're pushing your patch
>> soon, then?
>
> ISTM that this patch should be "returned with feedback" or "rejected" based
> on the thread.  I'm marking it "waiting for author" for the time being.

I think that we are still waiting for some input from Simon here...
Simon, are you going to finish wrapping up your other patch?

Yes, I have done the research, so I think the patch should be rejected now.

Let’s do that immediately after you send a new version of your patch? Or even better, after testing your patch? Don’t get me wrong, but rejecting my patch without tangible work on your patch may lead to forgetting about the problem before the 9.6 freeze.


Thanks to everyone for their input. It's good to have alternate approaches.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Michael Paquier
Date:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.
--
Michael



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

avoid_pin_scan_always.v1.patch 
Takes the approach that we generate the same WAL records as in 9.5, we just choose not to do anything with that information. This is possible because we don't care anymore whether it is toast or other relations. So it effectively reverts parts of the earlier patch.
This could also be back-patched easily.

toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but here it is. No problems found with it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Simon Riggs
Date:
On 10 March 2016 at 11:38, Simon Riggs <simon@2ndquadrant.com> wrote:
On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

I'll wait for review, so setting status back to Needs Review.

Assuming review, I will commit the week w/c 21 March.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 10 March 2016, at 14:38, Simon Riggs <simon@2ndquadrant.com> wrote:

On 10 March 2016 at 09:22, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin <root@simply.name> wrote:
> Let’s do that immediately after you send a new version of your patch?
> Or even better, after testing your patch? Don’t get me wrong, but
> rejecting my patch without tangible work on your patch may lead to
> forgetting about the problem before the 9.6 freeze.

This makes sense. Let's not reject this patch yet if the alternative
approach is not committed.

I attach 2 patches.

avoid_pin_scan_always.v1.patch 
Takes the approach that we generate the same WAL records as in 9.5, we just choose not to do anything with that information. This is possible because we don't care anymore whether it is toast or other relations. So it effectively reverts parts of the earlier patch.
This could also be back-patched easily.

toast_recheck.v1.patch
Adds recheck code for toast access. I'm not certain this is necessary, but here it is. No problems found with it.

JFYI, I’m preparing a test environment to reproduce the initial problem and I hope to finish testing this week.


--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments: avoid_pin_scan_always.v1.patch, toast_recheck.v1.patch


--
May the force be with you…

Re: Improving replay of XLOG_BTREE_VACUUM records

From: David Steele
Date:
Hi Vladimir,

On 3/14/16 2:15 PM, Vladimir Borodin wrote:

> JFYI, I’m preparing a test environment to reproduce the initial problem
> and I hope to finish testing this week.

Do you know when you'll have the results from the testing you were going 
to do?  It seems this patch is currently waiting on that to be finished.

Thanks,
-- 
-David
david@pgmasters.net



Re: Improving replay of XLOG_BTREE_VACUUM records

From: Vladimir Borodin
Date:

On 25 March 2016, at 19:11, David Steele <david@pgmasters.net> wrote:

Hi Vladimir,

On 3/14/16 2:15 PM, Vladimir Borodin wrote:

JFYI, I’m preparing a test environment to reproduce the initial problem
and I hope to finish testing this week.

Do you know when you'll have the results from the testing you were going to do?  It seems this patch is currently waiting on that to be finished.

I couldn’t reproduce the problem on a pgbench database with scale factor 50000 last week. The test case was quite simple:
1. On master I was adding data to pgbench_accounts table.
2. On standby I was doing the following:
postgres@pgload01d ~ $ cat /tmp/query
\set naccounts 100000 * :scale
SELECT aid FROM pgbench_accounts WHERE aid = :naccounts;
postgres@pgload01d ~ $ /usr/pgsql-9.6/bin/pgbench -M prepared -f /tmp/query -c 1 -j 1 -T 3600 -P 10 -S -n pgbench
3. On master I was sometimes calling VACUUM pgbench_accounts.

Without applying the patches there weren’t huge replication lags on standbys. It seems I’m doing something wrong… I’m doing my best right now to find the reason, but I can’t give you any time estimate :(
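
In case someone wants to reproduce this: the replay_location lag mentioned earlier in the thread can be watched on the standby with the 9.6-era functions, for example:

    -- bytes of WAL received but not yet replayed on the standby
    SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                                 pg_last_xlog_replay_location());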


Thanks,
--
-David
david@pgmasters.net




--
May the force be with you…