Re: [HACKERS] [POC] Faster processing at Gather node

From	Andres Freund
Subject	Re: [HACKERS] [POC] Faster processing at Gather node
Date
Msg-id	20171017213957.hylzfplumptbfvro@alap3.anarazel.de
In response to	[HACKERS] [POC] Faster processing at Gather node  (Rafia Sabih <rafia.sabih@enterprisedb.com>)
Responses	Re: [HACKERS] [POC] Faster processing at Gather node  (Andres Freund <andres@anarazel.de>)
Re: [HACKERS] [POC] Faster processing at Gather node  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] [POC] Faster processing at Gather node  (Amit Kapila <amit.kapila16@gmail.com>)
Re: [HACKERS] [POC] Faster processing at Gather node  (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers
Hi Everyone,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
> that the time taken by gather node is significant. On investigation, as per
> the current method it copies each tuple to the shared queue and notifies
> the receiver. Since, this copying is done in shared queue, a lot of locking
> and latching overhead is there.
>
> So, in this POC patch I tried to copy all the tuples in a local queue thus
> avoiding all the locks and latches. Once, the local queue is filled as per
> it's capacity, tuples are transferred to the shared queue. Once, all the
> tuples are transferred the receiver is sent the notification about the same.
>
> With this patch I could see significant improvement in performance for
> simple queries,

I've spent some time looking into this, and I'm not quite convinced this
is the right approach.  Let me start by describing where I see the
current performance problems around gather stemming from.

The observations here are made using
select * from t where i < 30000000 offset 29999999 - 1;
with Rafia's data. That avoids slowdowns on the leader due to too many
rows printed out. Sometimes I'll also use
SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
on tpch data to show the effects on wider tables.

The precise query doesn't really matter, the observations here are more
general, I hope.

1) nodeGather.c re-projects every row from workers. As far as I can tell
   that's now always exactly the same targetlist as it's coming from the
   worker. Projection capability was added in 8538a6307049590 (without
   checking whether it's needed, afaict), but I think it was in turn often
   obsoleted by 992b5ba30dcafdc222341505b072a6b009b248a7.  My measurement
   shows that removing the projection yields quite massive speedups in
   queries like yours, which is not too surprising.

   I suspect this just needs a tlist_matches_tupdesc check + an if
   around ExecProject(). And a test - right now tests pass unless
   force_parallel_mode is used, even if the projection is just commented
   out unconditionally.
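   Roughly what I have in mind, as a sketch only (untested; whether
   OUTER_VAR is the right varno to pass, and where exactly the check
   should live, needs verification - tlist_matches_tupdesc() is currently
   static in execScan.c and would have to be exported or duplicated, and
   ideally the test would be done once at ExecInitGather() time rather
   than per row as shown here):

/*
 * Hedged sketch, not a tested patch: in ExecGather(), skip the per-row
 * projection when the worker tuple already matches the node's targetlist.
 */
slot = gather_getnext(node);
if (TupIsNull(slot))
    return NULL;

if (node->ps.ps_ProjInfo == NULL ||
    tlist_matches_tupdesc(&node->ps,
                          node->ps.plan->targetlist,
                          OUTER_VAR,
                          slot->tts_tupleDescriptor))
    return slot;                /* worker tuple already has the right shape */

/* otherwise keep the existing projection path */
econtext = node->ps.ps_ExprContext;
econtext->ecxt_outertuple = slot;
return ExecProject(node->ps.ps_ProjInfo);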
 
  before commenting out nodeGather projection:
  rafia time: 8283.583  rafia profile:
+   30.62%  postgres  postgres             [.] shm_mq_receive
+   18.49%  postgres  postgres             [.] s_lock
+   10.08%  postgres  postgres             [.] SetLatch
-    7.02%  postgres  postgres             [.] slot_deform_tuple
   - slot_deform_tuple
      - 88.01% slot_getsomeattrs
           ExecInterpExpr
           ExecGather
           ExecLimit
 
  lineitem time: 8448.468  lineitem profile:
+   24.63%  postgres  postgres             [.] slot_deform_tuple
+   24.43%  postgres  postgres             [.] shm_mq_receive
+   17.36%  postgres  postgres             [.] ExecInterpExpr
+    7.41%  postgres  postgres             [.] s_lock
+    5.73%  postgres  postgres             [.] SetLatch

after:  rafia time: 6660.224  rafia profile:
+   36.77%  postgres  postgres              [.] shm_mq_receive
+   19.33%  postgres  postgres              [.] s_lock
+   13.14%  postgres  postgres              [.] SetLatch
+    9.22%  postgres  postgres              [.] AllocSetReset
+    4.27%  postgres  postgres              [.] ExecGather
+    2.79%  postgres  postgres              [.] AllocSetAlloc
  lineitem time: 4507.416  lineitem profile:
+   34.81%  postgres  postgres            [.] shm_mq_receive
+   15.45%  postgres  postgres            [.] s_lock
+   13.38%  postgres  postgres            [.] SetLatch
+    9.87%  postgres  postgres            [.] AllocSetReset
+    5.82%  postgres  postgres            [.] ExecGather
  As is quite clearly visible, avoiding the projection yields some major
  speedups.
  The following analysis here has the projection removed.

2) The spinlocks on both the sending and the receiving side are quite hot:
  rafia query leader:
+   36.16%  postgres  postgres            [.] shm_mq_receive
+   19.49%  postgres  postgres            [.] s_lock
+   13.24%  postgres  postgres            [.] SetLatch
  The presence of s_lock shows us that we're clearly often contending
  on spinlocks, given that it's the slow path for SpinLockAcquire(). In
  shm_mq_receive the instruction profile shows:
 
      │               SpinLockAcquire(&mq->mq_mutex);
      │1 5ab:   mov    $0xa9b580,%ecx
      │         mov    $0x4a4,%edx
      │         mov    $0xa9b538,%esi
      │         mov    %r15,%rdi
      │       → callq  s_lock
      │       ↑ jmpq   2a1
      │       tas():
      │1 5c7:   mov    $0x1,%eax
32.83 │         lock   xchg %al,(%r15)
      │       shm_mq_inc_bytes_read():
      │               SpinLockAcquire(&mq->mq_mutex);

   and

 0.01 │         pop    %r15
 0.04 │       ← retq
      │         nop
      │       tas():
      │1 338:   mov    $0x1,%eax
17.59 │         lock   xchg %al,(%r15)
      │       shm_mq_get_bytes_written():
      │               SpinLockAcquire(&mq->mq_mutex);
 0.05 │         test   %al,%al
 0.01 │       ↓ jne    448
      │               v = mq->mq_bytes_written;
   rafia query worker:
+   33.00%  postgres  postgres            [.] shm_mq_send_bytes
+    9.90%  postgres  postgres            [.] s_lock
+    7.74%  postgres  postgres            [.] shm_mq_send
+    5.40%  postgres  postgres            [.] ExecInterpExpr
+    5.34%  postgres  postgres            [.] SetLatch
  Again, we have strong indicators for a lot of spinlock contention.
  The instruction profiles show the same:

  shm_mq_send_bytes:
      │                               shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
      │         and    $0xfffffffffffffff8,%r15
      │       tas():
 0.08 │         mov    %ebp,%eax
31.07 │         lock   xchg %al,(%r14)
      │       shm_mq_inc_bytes_written():
      │        * Increment the number of bytes written.
      │        */

   and

      │3  98:   cmp    %r13,%rbx
 0.70 │       ↓ jae    430
      │       tas():
 0.12 │1  a1:   mov    %ebp,%eax
28.53 │         lock   xchg %al,(%r14)
      │       shm_mq_get_bytes_read():
      │               SpinLockAcquire(&mq->mq_mutex);
      │         test   %al,%al
      │       ↓ jne    298
      │               v = mq->mq_bytes_read;

  shm_mq_send:
      │       tas():
50.73 │         lock   xchg %al,0x0(%r13)
      │       shm_mq_notify_receiver():
      │       shm_mq_notify_receiver(volatile shm_mq *mq)
      │       {
      │               PGPROC     *receiver;
      │               bool            detached;
 

  My interpretation here is that it's not just the effect of the
  locking causing the slowdown, but to a significant degree the effect
  of the implied bus lock.

  To test that theory, here are the timings for
  1) spinlocks present
     time: 6593.045
  2) spinlock acquisition replaced by *full* memory barriers, which on
     x86 is a lock; addl $0,0(%%rsp)
     time: 5159.306
  3) spinlocks replaced by read/write barriers as appropriate.
     time: 4687.610

  By my understanding of shm_mq.c's logic, something like 3) ought to
  be doable in a correct manner. There should, in normal circumstances,
  only be one end modifying each of the protected variables. Doing so
  instead of using full blown atomics with locked instructions is very
  likely to yield considerably better performance.
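  To illustrate what I mean by 3), a sketch of how the two ends could
  pair explicit barriers instead of taking mq_mutex on this path. It's
  modeled on the existing static helpers in shm_mq.c, but untested, and
  the detached handling is simplified:

/*
 * Sketch only: mq_bytes_written is only ever modified by the sender and
 * only read by the receiver, so pairing a write barrier on the store side
 * with a read barrier on the load side should be enough - no mq_mutex,
 * and hence no bus-locked xchg, on this path.
 */
static uint64
shm_mq_get_bytes_written(volatile shm_mq *mq, bool *detached)
{
    uint64      v;

    v = mq->mq_bytes_written;
    *detached = mq->mq_detached;

    /*
     * Pair with the sender's write barrier: the counter has to be read
     * before the ring contents the caller is about to copy out.
     */
    pg_read_barrier();

    return v;
}

static void
shm_mq_inc_bytes_written(volatile shm_mq *mq, Size n)
{
    /* data written into the ring must be visible before the counter moves */
    pg_write_barrier();

    mq->mq_bytes_written += n;
}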
 
  The top-level profile after 3 is:
  leader:
+   25.89%  postgres  postgres          [.] shm_mq_receive
+   24.78%  postgres  postgres          [.] SetLatch
+   14.63%  postgres  postgres          [.] AllocSetReset
+    7.31%  postgres  postgres          [.] ExecGather
  worker:
+   14.02%  postgres  postgres            [.] ExecInterpExpr
+   11.63%  postgres  postgres            [.] shm_mq_send_bytes
+   11.25%  postgres  postgres            [.] heap_getnext
+    6.78%  postgres  postgres            [.] SetLatch
+    6.38%  postgres  postgres            [.] slot_deform_tuple
  still a lot of cycles in the queue code, but proportionally less.

4) Modulo computations in shm_mq are expensive:
      │       shm_mq_send_bytes():
      │                               Size            offset = mq->mq_bytes_written % (uint64) ringsize;
 0.12 │1  70:   xor    %edx,%edx
      │                               Size            sendnow = Min(available, ringsize - offset);
      │         mov    %r12,%rsi
      │                               Size            offset = mq->mq_bytes_written % (uint64) ringsize;
43.75 │         div    %r12
      │                               memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
 7.23 │         movzbl 0x31(%r15),%eax


      │       shm_mq_receive_bytes():
      │                       used = written - mq->mq_bytes_read;
 1.17 │         sub    %rax,%rcx
      │                       offset = mq->mq_bytes_read % (uint64) ringsize;
18.49 │         div    %rbp
      │         mov    %rdx,%rdi
 

  That we end up with a full blown div makes sense - the compiler can't
  know anything about ringsize, therefore it can't optimize to a mask.
  I think we should force the size of the ringbuffer to be a power of
  two, and use a mask instead of a size for the buffer.
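  I.e. something along these lines (sketch only; mq_ring_size exists, but
  nothing currently guarantees it's a power of two, and the helper below
  is made up):

/*
 * Sketch: with the ring size forced to a power of two at creation time,
 * the offset computation in shm_mq_send_bytes()/shm_mq_receive_bytes()
 * becomes a mask instead of a 64-bit div.  shm_mq_ring_offset() is a
 * hypothetical helper, not existing code.
 */
static inline Size
shm_mq_ring_offset(uint64 counter, Size ringsize)
{
    Assert((ringsize & (ringsize - 1)) == 0);   /* power of two */
    return (Size) (counter & (ringsize - 1));   /* == counter % ringsize */
}

    /* in shm_mq_send_bytes(), replacing the '%' computations above */
    offset = shm_mq_ring_offset(mq->mq_bytes_written, ringsize);
    sendnow = Min(available, ringsize - offset);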
 

5) There are a *lot* of latch interactions. The biggest issue actually is
   the memory barrier implied by a SetLatch; waiting for the latch
   barely shows up.
 
  from 4) above:
  leader:
+   25.89%  postgres  postgres          [.] shm_mq_receive
+   24.78%  postgres  postgres          [.] SetLatch
+   14.63%  postgres  postgres          [.] AllocSetReset
+    7.31%  postgres  postgres          [.] ExecGather

      │      0000000000781b10 <SetLatch>:
      │      SetLatch():
      │              /*
      │               * The memory barrier has to be placed here to ensure that any flag
      │               * variables possibly changed by this process have been flushed to main
      │               * memory, before we check/set is_set.
      │               */
      │              pg_memory_barrier();
77.43 │        lock   addl   $0x0,(%rsp)
      │
      │              /* Quick exit if already set */
      │              if (latch->is_set)
 0.12 │        mov    (%rdi),%eax
 

  Commenting out the memory barrier - which is NOT CORRECT - improves
  timing:
  before: 4675.626
  after: 4125.587

  The correct fix obviously is not to break latch correctness. I think
  the big problem here is that we perform a SetLatch() for every read
  from the queue.
 
  I think we should
  a) introduce a batch variant for receiving like:

extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
                                     shm_mq_iovec *iov, int *iovcnt,
                                     bool nowait)

     which then only does the SetLatch() at the end. And maybe if it
     wrapped around.
  b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
     queue whenever it's not empty when a tuple is ready.

  I've not benchmarked this, but I'm pretty certain that the benefit
  isn't just going to be the reduced cost of SetLatch() calls, but also
  increased performance due to fewer context switches.
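  As a very rough sketch of how the leader side could then consume a
  whole batch per wakeup - shm_mq_receivev() as prototyped above doesn't
  exist yet, handle_tuple_message() stands in for tqueue.c's per-message
  work, and reader->queue is assumed to be the worker's shm_mq_handle:

/*
 * Sketch: receive up to lengthof(iov) messages in one go, so the sender's
 * latch is only set once per batch rather than once per tuple.  Assumes
 * the proposed shm_mq_receivev() fills iov[] and reports the count back
 * through *iovcnt, mirroring shm_mq_sendv().
 */
shm_mq_iovec    iov[64];
int             iovcnt = lengthof(iov);
shm_mq_result   res;
int             i;

res = shm_mq_receivev(reader->queue, iov, &iovcnt, true /* nowait */);
if (res == SHM_MQ_SUCCESS)
{
    for (i = 0; i < iovcnt; i++)
        handle_tuple_message(reader, iov[i].data, iov[i].len);  /* hypothetical */
}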
 

6) I've observed, using strace, debug outputs with timings, and top with
   a short interval, that quite often only one backend has sufficient
   work, while other backends are largely idle.

  I think the problem here is that the "anti round robin" provisions from
  bc7fcab5e36b9597857, while much better than the previous state, have
  swung a bit too far into the other direction. Especially if we were
  to introduce batching as I suggest in 5), but even without, this
  leads to back-and-forth on half-empty queues between the
  gatherstate->nextreader backend and the leader.

  I'm not 100% certain what the right fix here is.

  One fairly drastic solution would be to move away from a
  single-producer-single-consumer, per-worker queue, to a global
  queue. That'd avoid fairness issues between the individual workers,
  at the price of potential added contention. One disadvantage is that
  such a combined queue approach is not easily applicable for gather
  merge.

  One less drastic approach would be to try to drain the queue fully in
  one batch, and then move to the next queue. That'd avoid triggering a
  small wakeup for each individual tuple, as one currently would get
  without the 'stickiness'.

  It might also be a good idea to use a more directed form of wakeups,
  e.g. using a per-worker latch + a wait event set, to avoid iterating
  over all workers.
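  For the drain-then-advance variant, roughly what I'm thinking of in
  gather_readnext() - only a sketch against my reading of nodeGather.c;
  remove_reader() is made up and the wait/latch handling is omitted:

/*
 * Hedged sketch of "drain the current queue, then advance": keep pulling
 * tuples from gatherstate->nextreader until its queue is empty, and only
 * then round-robin to the next worker.  TupleQueueReaderNext() is the
 * existing tqueue.c entry point; the control flow is simplified compared
 * to the real gather_readnext().
 */
for (;;)
{
    TupleQueueReader *reader = gatherstate->reader[gatherstate->nextreader];
    HeapTuple         tup;
    bool              readerdone;

    tup = TupleQueueReaderNext(reader, true /* nowait */, &readerdone);

    if (tup != NULL)
        return tup;             /* stay on this queue while it has data */

    if (readerdone)
    {
        /* this worker is finished; drop it before advancing */
        remove_reader(gatherstate, gatherstate->nextreader);    /* hypothetical */
        if (gatherstate->nreaders == 0)
            return NULL;
    }

    /* queue empty (or reader gone): move on to the next worker */
    gatherstate->nextreader =
        (gatherstate->nextreader + 1) % gatherstate->nreaders;

    /* after visiting all queues without data, wait on the latch (omitted) */
}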
 


Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing. I suspect that if we'd
go for the above solutions there'd be only very small benefit in
implementing such per-worker local queues.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
