Re: Speed up Clog Access by increasing CLOG buffers
От | Amit Kapila |
---|---|
Тема | Re: Speed up Clog Access by increasing CLOG buffers |
Дата | |
Msg-id | CAA4eK1+8gQTyGSZLe1Rb7jeM1Beh4FqA4VNjtpZcmvwizDQ0hw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Speed up Clog Access by increasing CLOG buffers (Andres Freund <andres@anarazel.de>) |
Ответы |
Re: Speed up Clog Access by increasing CLOG buffers
(Amit Kapila <amit.kapila16@gmail.com>)
|
Список | pgsql-hackers |
On Tue, Mar 22, 2016 at 4:22 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> > * Record the final state of transaction entries in the commit log for
> > * all entries on a single page. Atomic only on this page.
> > *
> > + * Group the status update for transactions. This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it. To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases. This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now. We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update. We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>
Okay, I have simplified the sentences in the comment.
>
>
> > * Otherwise API is same as TransactionIdSetTreeStatus()
> > */
> > static void
> > TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> > TransactionId *subxids, XidStatus status,
> > - XLogRecPtr lsn, int pageno)
> > + XLogRecPtr lsn, int pageno,
> > + bool all_xact_same_page)
> > +{
> > + /*
> > + * If we can immediately acquire CLogControlLock, we update the status
> > + * of our own XID and release the lock. If not, use group XID status
> > + * update to improve efficiency and if still not able to update, then
> > + * acquire CLogControlLock and update it.
> > + */
> > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > + {
> > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > + LWLockRelease(CLogControlLock);
> > + }
> > + else if (!all_xact_same_page ||
> > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > + IsGXactActive() ||
> > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > + {
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > + LWLockRelease(CLogControlLock);
> > + }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> lock acquire first can be rather expensive.
> b) I'd rather see an explicit fallback for the
> !TransactionGroupUpdateXidStatus case, this way it's too hard to
> understand. It's also harder to add probes to detect whether that
>
>
>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> > + * on behalf of all group members. This avoids a great deal of contention
> > + * around CLogControlLock when many processes are trying to commit at once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > + XLogRecPtr lsn, int pageno)
> > +{
> > + volatile PROC_HDR *procglobal = ProcGlobal;
> > + PGPROC *proc = MyProc;
> > + uint32 nextidx;
> > + uint32 wakeidx;
> > + int extraWaits = -1;
> > +
> > + /* We should definitely have an XID whose status needs to be updated. */
> > + Assert(TransactionIdIsValid(xid));
> > +
> > + /*
> > + * Add ourselves to the list of processes needing a group XID status
> > + * update.
> > + */
> > + proc->clogGroupMember = true;
> > + proc->clogGroupMemberXid = xid;
> > + proc->clogGroupMemberXidStatus = status;
> > + proc->clogGroupMemberPage = pageno;
> > + proc->clogGroupMemberLsn = lsn;
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > + /*
> > + * Add the proc to list, if the clog page where we need to update the
> > + * current transaction status is same as group leader's clog page.
> > + * There is a race condition here such that after doing the below
> > + * check and before adding this proc's clog update to a group, if the
> > + * group leader already finishes the group update for this page and
> > + * becomes group leader of another group which updates different clog
> > + * page, then it will lead to a situation where a single group can
> > + * have different clog page updates. Now the chances of such a race
> > + * condition are less and even if it happens, the only downside is
> > + * that it could lead to serial access of clog pages from disk if
> > + * those pages are not in memory. Tests doesn't indicate any
> > + * performance hit due to different clog page updates in same group,
> > + * however in future, if we want to improve the situation, then we can
> > + * detect the non-group leader transactions that tries to update the
> > + * different CLOG page after acquiring CLogControlLock and then mark
> > + * these transactions such that after waking they need to perform CLOG
> > + * update via normal path.
> > + */
>
> Needs a good portion of polishing.
>
>
> > + if (nextidx != INVALID_PGPROCNO &&
> > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > + return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>
>
> > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > + &nextidx,
> > + (uint32) proc->pgprocno))
> > + break;
> > + }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>
>
> FWIW, You can move the nextidx = part of out the loop,
> pgatomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
>
>
> > + /*
> > + * If the list was not empty, the leader will update the status of our
> > + * XID. It is impossible to have followers without a leader because the
> > + * first process that has added itself to the list will always have
> > + * nextidx as INVALID_PGPROCNO.
> > + */
> > + if (nextidx != INVALID_PGPROCNO)
> > + {
> > + /* Sleep until the leader updates our XID status. */
> > + for (;;)
> > + {
> > + /* acts as a read barrier */
> > + PGSemaphoreLock(&proc->sem);
> > + if (!proc->clogGroupMember)
> > + break;
> > + extraWaits++;
> > + }
> > +
> > + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > + /* Fix semaphore count for any absorbed wakeups */
> > + while (extraWaits-- > 0)
> > + PGSemaphoreUnlock(&proc->sem);
> > + return true;
> > + }
> > +
> > + /* We are the leader. Acquire the lock on behalf of everyone. */
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + /*
> > + * Now that we've got the lock, clear the list of processes waiting for
> > + * group XID status update, saving a pointer to the head of the list.
> > + * Trying to pop elements one at a time could lead to an ABA problem.
> > + */
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > + &nextidx,
> > + INVALID_PGPROCNO))
> > + break;
> > + }
>
> Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>
>
> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>
Will do some tests based on above test and share results.
>
> Hi,
>
> On 2016-03-15 10:47:12 +0530, Amit Kapila wrote:
> > @@ -248,12 +256,67 @@ set_status_by_pages(int nsubxids, TransactionId *subxids,
> > * Record the final state of transaction entries in the commit log for
> > * all entries on a single page. Atomic only on this page.
> > *
> > + * Group the status update for transactions. This improves the efficiency
> > + * of the transaction status update by reducing the number of lock
> > + * acquisitions required for it. To achieve the group transaction status
> > + * update, we need to populate the transaction status related information
> > + * in shared memory and doing it for overflowed sub-transactions would need
> > + * a big chunk of shared memory, so we are not doing this optimization for
> > + * such cases. This optimization is only applicable if the transaction and
> > + * all child sub-transactions belong to same page which we presume to be the
> > + * most common case, we might be able to apply this when they are not on same
> > + * page, but that needs us to map sub-transactions in proc's XidCache based
> > + * on pageno for which each time a group leader needs to set the transaction
> > + * status and that can lead to some performance penalty as well because it
> > + * needs to be done after acquiring CLogControlLock, so let's leave that
> > + * case for now. We don't do this optimization for prepared transactions
> > + * as the dummy proc associated with such transactions doesn't have a
> > + * semaphore associated with it and the same is required for group status
> > + * update. We choose not to create a semaphore for dummy procs for this
> > + * purpose as the advantage of using this optimization for prepared transactions
> > + * is not clear.
> > + *
>
> I think you should try to break up some of the sentences, one of them
> spans 7 lines.
>
Okay, I have simplified the sentences in the comment.
>
>
> > * Otherwise API is same as TransactionIdSetTreeStatus()
> > */
> > static void
> > TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
> > TransactionId *subxids, XidStatus status,
> > - XLogRecPtr lsn, int pageno)
> > + XLogRecPtr lsn, int pageno,
> > + bool all_xact_same_page)
> > +{
> > + /*
> > + * If we can immediately acquire CLogControlLock, we update the status
> > + * of our own XID and release the lock. If not, use group XID status
> > + * update to improve efficiency and if still not able to update, then
> > + * acquire CLogControlLock and update it.
> > + */
> > + if (LWLockConditionalAcquire(CLogControlLock, LW_EXCLUSIVE))
> > + {
> > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > + LWLockRelease(CLogControlLock);
> > + }
> > + else if (!all_xact_same_page ||
> > + nsubxids > PGPROC_MAX_CACHED_SUBXIDS ||
> > + IsGXactActive() ||
> > + !TransactionGroupUpdateXidStatus(xid, status, lsn, pageno))
> > + {
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + TransactionIdSetPageStatusInternal(xid, nsubxids, subxids, status, lsn, pageno);
> > +
> > + LWLockRelease(CLogControlLock);
> > + }
> > +}
> >
>
> This code is a bit arcane. I think it should be restructured to
> a) Directly go for LWLockAcquire if !all_xact_same_page || nsubxids >
> PGPROC_MAX_CACHED_SUBXIDS || IsGXactActive(). Going for a conditional
> lock acquire first can be rather expensive.
> b) I'd rather see an explicit fallback for the
> !TransactionGroupUpdateXidStatus case, this way it's too hard to
> understand. It's also harder to add probes to detect whether that
>
Changed.
>
>
> > The first process to add itself to the list will acquire
> > + * CLogControlLock in exclusive mode and perform TransactionIdSetPageStatusInternal
> > + * on behalf of all group members. This avoids a great deal of contention
> > + * around CLogControlLock when many processes are trying to commit at once,
> > + * since the lock need not be repeatedly handed off from one committing
> > + * process to the next.
> > + *
> > + * Returns true, if transaction status is updated in clog page, else return
> > + * false.
> > + */
> > +static bool
> > +TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
> > + XLogRecPtr lsn, int pageno)
> > +{
> > + volatile PROC_HDR *procglobal = ProcGlobal;
> > + PGPROC *proc = MyProc;
> > + uint32 nextidx;
> > + uint32 wakeidx;
> > + int extraWaits = -1;
> > +
> > + /* We should definitely have an XID whose status needs to be updated. */
> > + Assert(TransactionIdIsValid(xid));
> > +
> > + /*
> > + * Add ourselves to the list of processes needing a group XID status
> > + * update.
> > + */
> > + proc->clogGroupMember = true;
> > + proc->clogGroupMemberXid = xid;
> > + proc->clogGroupMemberXidStatus = status;
> > + proc->clogGroupMemberPage = pageno;
> > + proc->clogGroupMemberLsn = lsn;
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > +
> > + /*
> > + * Add the proc to list, if the clog page where we need to update the
> > + * current transaction status is same as group leader's clog page.
> > + * There is a race condition here such that after doing the below
> > + * check and before adding this proc's clog update to a group, if the
> > + * group leader already finishes the group update for this page and
> > + * becomes group leader of another group which updates different clog
> > + * page, then it will lead to a situation where a single group can
> > + * have different clog page updates. Now the chances of such a race
> > + * condition are less and even if it happens, the only downside is
> > + * that it could lead to serial access of clog pages from disk if
> > + * those pages are not in memory. Tests doesn't indicate any
> > + * performance hit due to different clog page updates in same group,
> > + * however in future, if we want to improve the situation, then we can
> > + * detect the non-group leader transactions that tries to update the
> > + * different CLOG page after acquiring CLogControlLock and then mark
> > + * these transactions such that after waking they need to perform CLOG
> > + * update via normal path.
> > + */
>
> Needs a good portion of polishing.
>
Okay, I have tried to simplify the comment as well.
>
> > + if (nextidx != INVALID_PGPROCNO &&
> > + ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
> > + return false;
>
> I think we're returning with clogGroupMember = true - that doesn't look
> right.
>
Changed as per suggestion.
>
> > + pg_atomic_write_u32(&proc->clogGroupNext, nextidx);
> > +
> > + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > + &nextidx,
> > + (uint32) proc->pgprocno))
> > + break;
> > + }
>
> So this indeed has ABA type problems. And you appear to be arguing above
> that that's ok. Need to ponder that for a bit.
>
> So, we enqueue ourselves as the *head* of the wait list, if there's
> other waiters. Seems like it could lead to the first element after the
> leader to be delayed longer than the others.
>
>
> FWIW, You can move the nextidx = part of out the loop,
> pgatomic_compare_exchange will update the nextidx value from memory; no
> need for another load afterwards.
>
Changed as per suggestion.
>
> > + /*
> > + * If the list was not empty, the leader will update the status of our
> > + * XID. It is impossible to have followers without a leader because the
> > + * first process that has added itself to the list will always have
> > + * nextidx as INVALID_PGPROCNO.
> > + */
> > + if (nextidx != INVALID_PGPROCNO)
> > + {
> > + /* Sleep until the leader updates our XID status. */
> > + for (;;)
> > + {
> > + /* acts as a read barrier */
> > + PGSemaphoreLock(&proc->sem);
> > + if (!proc->clogGroupMember)
> > + break;
> > + extraWaits++;
> > + }
> > +
> > + Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
> > +
> > + /* Fix semaphore count for any absorbed wakeups */
> > + while (extraWaits-- > 0)
> > + PGSemaphoreUnlock(&proc->sem);
> > + return true;
> > + }
> > +
> > + /* We are the leader. Acquire the lock on behalf of everyone. */
> > + LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
> > +
> > + /*
> > + * Now that we've got the lock, clear the list of processes waiting for
> > + * group XID status update, saving a pointer to the head of the list.
> > + * Trying to pop elements one at a time could lead to an ABA problem.
> > + */
> > + while (true)
> > + {
> > + nextidx = pg_atomic_read_u32(&procglobal->clogGroupFirst);
> > + if (pg_atomic_compare_exchange_u32(&procglobal->clogGroupFirst,
> > + &nextidx,
> > + INVALID_PGPROCNO))
> > + break;
> > + }
>
> Hm. It seems like you should should simply use pg_atomic_exchange_u32(),
> rather than compare_exchange?
>
Changed as per suggestion.
>
> I think it's worthwhile to create a benchmark that does something like
> BEGIN;SELECT ... FOR UPDATE; SELECT pg_sleep(random_time);
> INSERT;COMMIT; you'd find that if random is a bit larger (say 20-200ms,
> completely realistic values for network RTT + application computation),
> the success rate of group updates shrinks noticeably.
>
Will do some tests based on above test and share results.
Attached patch contains all the changes suggested by you. Let me know if I have missed anything or you want it differently.
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Pavan DeolaseeДата:
Сообщение: pg_xlogdump fails to handle WAL file with multi-page XLP_FIRST_IS_CONTRECORD data