Re: Gather performance analysis

From: Dilip Kumar
Subject: Re: Gather performance analysis
Date:
Msg-id: CAFiTN-t8NMa-UVVTbm57jyZRfGjyWumDSDtXxuGfUKP2yuKcpQ@mail.gmail.com
In reply to: Re: Gather performance analysis  (Andres Freund <andres@anarazel.de>)
Responses: Re: Gather performance analysis  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
 
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

Yeah, for profiling I generally build without optimizations so that I can see all the functions in the stack. So the profile results are from a build without optimizations, but the performance results are from an optimized build.
 

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)                  │
│   Workers Planned: 2                                                                                         │
│   Workers Launched: 2                                                                                        │
│   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms                                                                                      │
│ Execution Time: 24954.012 ms                                                                                 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)


Is this with or without the patch? I mean, can we see a comparison showing whether the patch improved anything in your environment?

Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.

Yeah, that could be done. I can test with this change as well to see how much we gain from it.
 

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).

Good suggestion, I will do some experiments around this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In the pgsql-hackers list, by date sent:

Previous
From: Michael Paquier
Date:
Message: Re: [UNVERIFIED SENDER] Re: Challenges preventing us moving to 64 bit transaction id (XID)?
Next
From: Kyotaro Horiguchi
Date:
Message: Re: .ready and .done files considered harmful