Re: [HACKERS] POC: Cache data in GetSnapshotData()

From: Mithun Cy
Subject: Re: [HACKERS] POC: Cache data in GetSnapshotData()
Msg-id: CAD__OugfQeL-8irv2OavhGAYJvU7VQkqe-WtT5r_X4TKxBU-kQ@mail.gmail.com
In reply to: Re: [HACKERS] POC: Cache data in GetSnapshotData()  (Mithun Cy <mithun.cy@enterprisedb.com>)
List: pgsql-hackers
Hi all, please ignore this mail; it is incomplete. I have more information to
add, and I accidentally pressed the send button while replying. Sorry.


On Tue, Aug 29, 2017 at 11:27 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Cache the SnapshotData for reuse:
> ===========================
> In one of our perf analyses, the perf tool showed GetSnapshotData taking a
> very high percentage of CPU cycles on a read-only workload with a very high
> number of concurrent connections (>= 64).
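>
> For reference, a typical way to collect such a profile (not necessarily the
> exact invocation used here) is:
>
>     perf record -e cache-misses -a -- sleep 30
>     perf report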
>
> Machine: cthulhu, an 8-node machine.
> ----------------------------------------------
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                128
> On-line CPU(s) list:   0-127
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
> Stepping:              2
> CPU MHz:               2128.000
> BogoMIPS:              4266.63
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     0,65-71,96-103
> NUMA node1 CPU(s):     72-79,104-111
> NUMA node2 CPU(s):     80-87,112-119
> NUMA node3 CPU(s):     88-95,120-127
> NUMA node4 CPU(s):     1-8,33-40
> NUMA node5 CPU(s):     9-16,41-48
> NUMA node6 CPU(s):     17-24,49-56
> NUMA node7 CPU(s):     25-32,57-64
>
> Perf CPU cycle 128 clients.
>
> On further analysis, this appears to be mainly due to cache misses while
> iterating through the large proc array to compute the SnapshotData. Also, on
> a read-only (read-heavy) workload the SnapshotData mostly remains the same,
> since transactions rarely commit or roll back. Taking advantage of this, we
> can save the previously computed SnapshotData in shared memory, and in
> GetSnapshotData we can use the saved SnapshotData instead of recomputing it,
> as long as it is still valid. A saved SnapshotData remains valid until some
> transaction ends.
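>
> For illustration, the idea is roughly the following (a sketch only; the
> helper names here are placeholders, not the exact patch):
>
> /* In GetSnapshotData(), under ProcArrayLock: */
> if (ProcGlobal->cached_snapshot_valid)
> {
>     /* Fast path: copy the previously computed snapshot from shared memory. */
>     CopySnapshotFromCache(snapshot);
> }
> else
> {
>     /* Slow path: scan the proc array as before ... */
>     BuildSnapshotFromProcArray(snapshot);
>     /* ... and let one backend publish it for others to reuse. */
>     TryCacheSnapshot(snapshot);
> }
>
> /* At every transaction end (ProcArrayEndTransaction/ProcArrayRemove): */
> ProcGlobal->cached_snapshot_valid = false;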
>
> [Perf analysis Base]
>
> Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
> Overhead  Command   Shared Object  Symbol
>   18.63%  postgres  postgres       [.] GetSnapshotData
>   11.98%  postgres  postgres       [.] _bt_compare
>   10.20%  postgres  postgres       [.] PinBuffer
>    8.63%  postgres  postgres       [.] LWLockAttemptLock
>    6.50%  postgres  postgres       [.] LWLockRelease
>
>
> [Perf analysis with Patch]
>
>
> Server configuration:
> ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
> wal_buffers=256MB &
>
> pgbench configuration:
> scale_factor = 300
> ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  postgres
>
> The machine has 64 cores, and with this patch I can see the server start to
> improve after 64 clients. I have tested up to 256 clients, which shows a
> performance improvement of nearly 39% at best [1]. This is the best case for
> the patch, where the once-computed SnapshotData is reused again and again.
>
> The worst case seems to be small, quick write transactions, which frequently
> invalidate the cached SnapshotData before it is reused by any subsequent
> GetSnapshotData call. As of now, I have tested simple update runs
> (pgbench -M prepared -N) on the same machine with the above server
> configuration. I do not see much change in TPS numbers.
>
> All TPS numbers are the median of 3 runs.
>
> Clients   TPS with Patch 05   TPS Base        %Diff
> 1                752.461117      755.186777   -0.3%
> 64             32171.296537    31202.153576   +3.1%
> 128            41059.660769    40061.929658   +2.49%
>
> I will do some profiling to find out why this case does not cost us any
> noticeable performance due to the caching overhead.
>
>
> []
>
> On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> I think this patch should have a "cover letter" explaining what it's
>> trying to achieve, how it does so and why it's safe/correct.  I think
>> it'd also be good to try to show some of the worst cases of this patch
>> (i.e. where the caching only adds overhead, but never gets used), and
>> some of the best cases (where the cache gets used quite often, and
>> reduces contention significantly).
>>
>> I wonder if there's a way to avoid copying the snapshot into the cache
>> in situations where we're barely ever going to need it. But right now
>> the only way I can think of is to look at the length of the
>> ProcArrayLock wait queue and count the exclusive locks therein - which'd
>> be expensive and intrusive...
>>
>>
>> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>>> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
>>> index a7e8cf2..4d7debe 100644
>>> --- a/src/backend/storage/ipc/procarray.c
>>> +++ b/src/backend/storage/ipc/procarray.c
>>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>>                                       (arrayP->numProcs - index - 1) * sizeof(int));
>>>                       arrayP->pgprocnos[arrayP->numProcs - 1] = -1;   /* for debugging */
>>>                       arrayP->numProcs--;
>>> +                     ProcGlobal->is_snapshot_valid = false;
>>>                       LWLockRelease(ProcArrayLock);
>>>                       return;
>>>               }
>>>       }
>>>
>>> +     ProcGlobal->is_snapshot_valid = false;
>>>       /* Oops */
>>>       LWLockRelease(ProcArrayLock);
>>
>> Why do we need to do anything here? And if we need to, why not in
>> ProcArrayAdd?
>
> In ProcArrayRemove I can see that ShmemVariableCache->latestCompletedXid is
> set to latestXid, which is going to change xmax in GetSnapshotData, hence I
> invalidated the snapshot there. ProcArrayAdd does not seem to affect the
> SnapshotData. I have now modified it to set cached_snapshot_valid to false
> only just below the statement ShmemVariableCache->latestCompletedXid =
> latestXid:
>
> +if (TransactionIdIsValid(latestXid))
> +{
> +     Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
> +     /* Advance global latestCompletedXid while holding the lock */
> +     if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> +                               latestXid))
> +             ShmemVariableCache->latestCompletedXid = latestXid;
> +
> +     /* xmax may have advanced, so the cached snapshot is now stale. */
> +     ProcGlobal->cached_snapshot_valid = false;
> +}
>
>
>
>>> @@ -1552,6 +1557,8 @@ GetSnapshotData(Snapshot snapshot)
>>>                                        errmsg("out of memory")));
>>>       }
>>>
>>> +     snapshot->takenDuringRecovery = RecoveryInProgress();
>>> +
>>>       /*
>>>        * It is sufficient to get shared lock on ProcArrayLock, even if we are
>>>        * going to set MyPgXact->xmin.
>>> @@ -1566,100 +1573,200 @@ GetSnapshotData(Snapshot snapshot)
>>>       /* initialize xmin calculation with xmax */
>>>       globalxmin = xmin = xmax;
>>>
>>> -     snapshot->takenDuringRecovery = RecoveryInProgress();
>>> -
>>
>> This movement of code seems fairly unrelated?
>
> -- Yes, it is not related. It was part of an early POC patch, so I had
> retained it as is. I have now reverted those changes and moved them back
> under the ProcArrayLock.
>
>>>       if (!snapshot->takenDuringRecovery)
>>>       {
>>
>> Do we really need to restrict this to !recovery snapshots? It'd be nicer
>> if we could generalize it - I don't immediately see anything !recovery
>> specific in the logic here.
>
> Agreed. I will add one more patch on top of this to include the standby
> (recovery) case as well, to unify all the cases where we can cache the
> snapshot. Before that, let me see
>
>
>>> -             int                *pgprocnos = arrayP->pgprocnos;
>>> -             int                     numProcs;
>>> +             if (ProcGlobal->is_snapshot_valid)
>>> +             {
>>> +                     Snapshot        csnap = &ProcGlobal->cached_snapshot;
>>> +                     TransactionId *saved_xip;
>>> +                     TransactionId *saved_subxip;
>>>
>>> -             /*
>>> -              * Spin over procArray checking xid, xmin, and subxids.  The goal is
>>> -              * to gather all active xids, find the lowest xmin, and try to record
>>> -              * subxids.
>>> -              */
>>> -             numProcs = arrayP->numProcs;
>>> -             for (index = 0; index < numProcs; index++)
>>> +                     saved_xip = snapshot->xip;
>>> +                     saved_subxip = snapshot->subxip;
>>> +
>>> +                     memcpy(snapshot, csnap, sizeof(SnapshotData));
>>> +
>>> +                     snapshot->xip = saved_xip;
>>> +                     snapshot->subxip = saved_subxip;
>>> +
>>> +                     memcpy(snapshot->xip, csnap->xip,
>>> +                                sizeof(TransactionId) * csnap->xcnt);
>>> +                     memcpy(snapshot->subxip, csnap->subxip,
>>> +                                sizeof(TransactionId) * csnap->subxcnt);
>>> +                     globalxmin = ProcGlobal->cached_snapshot_globalxmin;
>>> +                     xmin = csnap->xmin;
>>> +
>>> +                     count = csnap->xcnt;
>>> +                     subcount = csnap->subxcnt;
>>> +                     suboverflowed = csnap->suboverflowed;
>>> +
>>> +                     Assert(TransactionIdIsValid(globalxmin));
>>> +                     Assert(TransactionIdIsValid(xmin));
>>> +             }
>>
>> Can we move this to a separate function? In fact, could you check how
>> much effort it'd be to split cached, !recovery, recovery cases into
>> three static inline helper functions? This is starting to be hard to
>> read, the indentation added in this patch doesn't help... Consider doing
>> recovery, !recovery cases in a preliminary separate patch.
>
> In this patch, I have moved the code into two functions: one to cache the
> snapshot data and another to get the snapshot data from the cache. In the
> next patch, I will try to unify these with the recovery (standby) case. For
> the recovery case, we copy the xids from a simple xid array,
> KnownAssignedXids[]; I am not completely sure whether caching them brings
> similar performance benefits as above, so I will also try to get a perf
> report for that.
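>
> Roughly like this (a simplified sketch of the two helpers, not the exact
> patch; the reader side is condensed from the hunk quoted above):
>
> /* Publish a freshly computed snapshot in shared memory (sketch). */
> static void
> CacheSnapshotData(Snapshot snapshot, TransactionId globalxmin);
>
> /* Fill *snapshot from the cache; returns false if the cache is invalid. */
> static bool
> GetSnapshotDataFromCache(Snapshot snapshot, TransactionId *globalxmin)
> {
>     Snapshot        csnap = &ProcGlobal->cached_snapshot;
>     TransactionId  *saved_xip = snapshot->xip;
>     TransactionId  *saved_subxip = snapshot->subxip;
>
>     if (!ProcGlobal->cached_snapshot_valid)
>         return false;
>
>     /* Copy the scalar fields, then restore the caller's own xid arrays. */
>     memcpy(snapshot, csnap, sizeof(SnapshotData));
>     snapshot->xip = saved_xip;
>     snapshot->subxip = saved_subxip;
>
>     memcpy(snapshot->xip, csnap->xip, sizeof(TransactionId) * csnap->xcnt);
>     memcpy(snapshot->subxip, csnap->subxip,
>            sizeof(TransactionId) * csnap->subxcnt);
>
>     *globalxmin = ProcGlobal->cached_snapshot_globalxmin;
>     return true;
> }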
>
>>
>>> +                      * Let only one of the caller cache the computed snapshot, and
>>> +                      * others can continue as before.
>>>                        */
>>> -                     if (!suboverflowed)
>>> +                     if (!ProcGlobal->is_snapshot_valid &&
>>> +                             LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock,
>>> +                                                                              LW_EXCLUSIVE))
>>>                       {
>>
>> I'd also move this to a helper function (the bit inside the lock).
>
> Fixed as suggested.
>
>>> +
>>> +                             /*
>>> +                              * The memory barrier has be to be placed here to ensure that
>>> +                              * is_snapshot_valid is set only after snapshot is cached.
>>> +                              */
>>> +                             pg_write_barrier();
>>> +                             ProcGlobal->is_snapshot_valid = true;
>>> +                             LWLockRelease(&ProcGlobal->CachedSnapshotLock);
>>
>> LWLockRelease() is a full memory barrier, so this shouldn't be required.
>
> Okay. Sorry, just to check my understanding: do you want me to set
> ProcGlobal->is_snapshot_valid after
> LWLockRelease(&ProcGlobal->CachedSnapshotLock)?
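>
> I.e. something like this (a sketch only, to confirm my understanding;
> CacheSnapshotData() stands in for whatever copies the snapshot into shared
> memory):
>
> if (!ProcGlobal->is_snapshot_valid &&
>     LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock, LW_EXCLUSIVE))
> {
>     /* Copy the freshly built snapshot into shared memory. */
>     CacheSnapshotData(snapshot, globalxmin);
>
>     /*
>      * No explicit pg_write_barrier(): LWLockRelease() acts as a full
>      * memory barrier, so the copy above is ordered before the flag store
>      * below.
>      */
>     LWLockRelease(&ProcGlobal->CachedSnapshotLock);
>     ProcGlobal->is_snapshot_valid = true;
> }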
>
>>
>>>  /*
>>> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
>>> index 7dbaa81..7d06904 100644
>>> --- a/src/include/storage/proc.h
>>> +++ b/src/include/storage/proc.h
>>> @@ -16,6 +16,7 @@
>>>
>>>  #include "access/xlogdefs.h"
>>>  #include "lib/ilist.h"
>>> +#include "utils/snapshot.h"
>>>  #include "storage/latch.h"
>>>  #include "storage/lock.h"
>>>  #include "storage/pg_sema.h"
>>> @@ -253,6 +254,22 @@ typedef struct PROC_HDR
>>>       int                     startupProcPid;
>>>       /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
>>>       int                     startupBufferPinWaitBufId;
>>> +
>>> +     /*
>>> +      * In GetSnapshotData we can reuse the previously computed snapshot if no
>>> +      * new transaction has committed or rolled back, saving a good amount of
>>> +      * computation in GetSnapshotData, where we otherwise need to iterate
>>> +      * through the procArray every time we take a snapshot. The members below
>>> +      * save the previously computed snapshot in global shared memory, and any
>>> +      * process that wants the current snapshot can copy directly from them if
>>> +      * the cached snapshot is still valid.
>>> +      */
>>> +     SnapshotData cached_snapshot;   /* Previously saved snapshot */
>>> +     volatile bool is_snapshot_valid;        /* is above snapshot valid */
>>> +     TransactionId cached_snapshot_globalxmin;       /* globalxmin when above
>>
>>
>> I'd rename is_snapshot_valid to 'cached_snapshot_valid' or such, it's
>> not clear from the name what it refers to.
>
> -- Renamed to cached_snapshot_valid.
>
>>
>>> +     LWLock          CachedSnapshotLock; /* A try lock to make sure only one
>>> +                                                                      * process write to cached_snapshot */
>>>  } PROC_HDR;
>>>
>>
>> Hm, I'd not add the lock to the struct, we normally don't do that for
>> the "in core" lwlocks that don't exist in a "configurable" number like
>> the buffer locks and such.
>>
>
> -- I am not really sure about this. Can you please help me understand what
> exactly I should do here to get this corrected? Do I have to add it to
> lwlocknames.txt as a new LWLock?
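>
> If I understand the convention correctly, that would mean dropping the
> LWLock member from PROC_HDR and instead adding a built-in named lock,
> roughly like this (the number is just whatever the next unused slot in the
> file is):
>
> # src/include/storage/lwlocknames.txt
> CachedSnapshotLock        <next unused number>
>
> and then in procarray.c using it directly:
>
> if (!ProcGlobal->cached_snapshot_valid &&
>     LWLockConditionalAcquire(CachedSnapshotLock, LW_EXCLUSIVE))
> ...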
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


