Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

Поиск
Список
Период
Сортировка
От Alexey Kondratov
Тема Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions
Дата
Msg-id 8eda5118-2dd0-79a1-4fe9-eec7e334de17@postgrespro.ru
обсуждение исходный текст
Ответ на Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions  (Alexey Kondratov <a.kondratov@postgrespro.ru>)
Ответы Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
Hi Tomas,

>>>> Interesting. Any idea where does the extra overhead in this particular
>>>> case come from? It's hard to deduce that from the single flame graph,
>>>> when I don't have anything to compare it with (i.e. the flame graph 
>>>> for
>>>> the "normal" case).
>>> I guess that bottleneck is in disk operations. You can check
>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>> please, see attached flame graph for the following transaction:
>>>
>>> INSERT INTO large_text
>>> SELECT (SELECT string_agg('x', ',')
>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>
>>> Execution Time: 44519.816 ms
>>> Time: 98333,642 ms (01:38,334)
>>>
>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for 
>>> tests.
>>>
>>> Therefore, probably you may write changes on receiver in bigger chunks,
>>> not each change separately.
>>>
>> Possibly, I/O is certainly a possible culprit, although we should be
>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>> not sure why would it be cheaper to do the writes in batches.
>>
>> BTW does this mean you see the overhead on the apply side? Or are you
>> running this on a single machine, and it's difficult to decide?
>
> I run this on a single machine, but walsender and worker are utilizing 
> almost 100% of CPU per each process all the time, and at apply side 
> I/O syscalls take about 1/3 of CPU time. Though I am still not sure, 
> but for me this result somehow links performance drop with problems at 
> receiver side.
>
> Writing in batches was just a hypothesis and to validate it I have 
> performed test with large txn, but consisting of a smaller number of 
> wide rows. This test does not exhibit any significant performance 
> drop, while it was streamed too. So it seems to be valid. Anyway, I do 
> not have other reasonable ideas beside that right now.

I've checked recently this patch again and tried to elaborate it in 
terms of performance. As a result I've implemented a new POC version of 
the applier (attached). Almost everything in streaming logic stayed 
intact, but apply worker is significantly different.

As I wrote earlier I still claim, that spilling changes on disk at the 
applier side adds additional overhead, but it is possible to get rid of 
it. In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers), that 
are connected with main logical apply worker via shm_mq's. Each worker 
is dedicated to the processing of specific streamed transaction.

2) When we receive a streamed change for some transaction, we check 
whether there is an existing dedicated bgworker in HTAB (xid -> 
bgworker), or there are some in the idle list, or spawn a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via 
shm_mq_send without intermediate waiting. However, we wait for bgworker 
to apply the entire changes chunk at STREAM STOP, since we don't want 
transactions reordering.

4) When transaction is commited/aborted worker is being added to the 
idle list and is waiting for reassigning message.

5) I have used the same machinery with apply_dispatch in bgworkers, 
since most of actions are practically very similar.

Thus, we do not spill anything at the applier side, so transaction 
changes are processed by bgworkers as normal backends do. In the same 
time, changes processing is strictly serial, which prevents transactions 
reordering and possible conflicts/anomalies. Even though we trade off 
performance in favor of stability the result is rather impressive. I 
have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
     SELECT round(random()*10), random(), random()*142
     FROM generate_series(1, 1000000) s(i);

with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and 
synchronous_standby_names = 'FIRST 1 (large_sub)'. Table schema is 
following:

CREATE TABLE large_test (
     id serial primary key,
     num1 bigint,
     num2 double precision,
     num3 double precision
);

Here are the results:

-------------------------------------------------------------------
| N | Time on master, sec | Total xact time, sec |     Ratio      |
-------------------------------------------------------------------
|                        On commit (master, v13)                  |
-------------------------------------------------------------------
| 1kk | 6.5               | 17.6                 | x2.74          |
-------------------------------------------------------------------
| 3kk | 21                | 55.4                 | x2.64          |
-------------------------------------------------------------------
| 5kk | 38.3              | 91.5                 | x2.39          |
-------------------------------------------------------------------
|                        Stream + spill                           |
-------------------------------------------------------------------
| 1kk | 5.9               | 18                   | x3             |
-------------------------------------------------------------------
| 3kk | 19.5              | 52.4                 | x2.7           |
-------------------------------------------------------------------
| 5kk | 33.3              | 86.7                 | x2.86          |
-------------------------------------------------------------------
|                        Stream + BGW pool                        |
-------------------------------------------------------------------
| 1kk | 6                 | 12                   | x2             |
-------------------------------------------------------------------
| 3kk | 18.5              | 30.5                 | x1.65          |
-------------------------------------------------------------------
| 5kk | 35.6              | 53.9                 | x1.51          |
-------------------------------------------------------------------

It seems that overhead added by synchronous replica is lower by 2-3 
times compared with Postgres master and streaming with spilling. 
Therefore, the original patch eliminated delay before large transaction 
processing start by sender, while this additional patch speeds up the 
applier side.

Although the overall speed up is surely measurable, there is a room for 
improvements yet:

1) Currently bgworkers are only spawned on demand without some initial 
pool and never stopped. Maybe we should create a small pool on 
replication start and offload some of idle bgworkers if they exceed some 
limit?

2) Probably we can track somehow that incoming change has conflicts with 
some of being processed xacts, so we can wait for specific bgworkers 
only in that case?

3) Since the communication between main logical apply worker and each 
bgworker from the pool is a 'single producer --- single consumer' 
problem, then probably it is possible to wait and set/check flags 
without locks, but using just atomics.

What do you think about this concept in general? Any concerns and 
criticism are welcome!


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. This patch shloud be applicable to your last patch set. I would rebase it against master, but it depends on 2pc
patch,that I don't know well enough.
 


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: RFC: seccomp-bpf support
Следующее
От: Andres Freund
Дата:
Сообщение: Re: no mailing list hits in google