Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Дата
Msg-id CAHut+Pv3z19dS+WQ0hSSSNdDJMwxQjJEnaH=BiKq65t9OF280g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication  (Peter Smith <smithpb2250@gmail.com>)
Ответы Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Список pgsql-hackers
On Fri, Jul 28, 2023 at 5:22 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Melih,
>
> BACKGROUND
> ----------
>
> We wanted to compare performance for the 2 different reuse-worker
> designs, when the apply worker is already busy handling other
> replications, and then simultaneously the test table tablesyncs are
> occurring.
>
> To test this scenario, some test scripts were written (described
> below). For comparisons, the scripts were then run using a build of
> HEAD; design #1 (v21); design #2 (0718).
>
> HOW THE TEST WORKS
> ------------------
>
> Overview:
> 1. The apply worker is made to subscribe to a 'busy_tbl'.
> 2. After the SUBSCRIPTION is created, the publisher-side then loops
> (forever) doing INSERTS into that busy_tbl.
> 3. While the apply worker is now busy, the subscriber does an ALTER
> SUBSCRIPTION REFRESH PUBLICATION to subscribe to all the other test
> tables.
> 4. We time how long it takes for all tablsyncs to complete
> 5. Repeat above for different numbers of empty tables (10, 100, 1000,
> 2000) and different numbers of sync workers (2, 4, 8, 16)
>
> Scripts
> -------
>
> (PSA 4 scripts to implement this logic)
>
> testrun script
> - this does common setup (do_one_test_setup) and then the pub/sub
> scripts (do_one_test_PUB and do_one_test_SUB -- see below) are run in
> parallel
> - repeat 10 times
>
> do_one_test_setup script
> - init and start instances
> - ipc setup tables and procedures
>
> do_one_test_PUB script
> - ipc setup pub/sub
> - table setup
> - publishes the "busy_tbl", but then waits for the subscriber to
> subscribe to only this one
> - alters the publication to include all other tables (so subscriber
> will see these only after the ALTER SUBSCRIPTION PUBLICATION REFRESH)
> - enter a busy INSERT loop until it informed by the subscriber that
> the test is finished
>
> do_one_test_SUB script
> - ipc setup pub/sub
> - table setup
> - subscribes only to "busy_tbl", then informs the publisher when that
> is done (this will cause the publisher to commence the stay_busy loop)
> - after it knows the publishing busy loop has started it does
> - ALTER SUBSCRIPTION REFRESH PUBLICATION
> - wait until all the tablesyncs are ready <=== This is the part that
> is timed for the test RESULT
>
> PROBLEM
> -------
>
> Looking at the output files (e.g. *.dat_PUB and *.dat_SUB)  they seem
> to confirm the tests are working how we wanted.
>
> Unfortunately, there is some slot problem for the patched builds (both
> designs #1 and #2). e.g. Search "ERROR" in the *.log files and see
> many slot-related errors.
>
> Please note - running these same scripts with HEAD build gave no such
> errors. So it appears to be a patch problem.
>

Hi

FYI, here is some more information about ERRORs seen.

The patches were re-tested -- applied in stages (and also against the
different scripts) to identify where the problem was introduced. Below
are the observations:

~~~

Using original test scripts

1. Using only patch v21-0001
- no errors

2. Using only patch v21-0001+0002
- no errors

3. Using patch v21-0001+0002+0003
- no errors

~~~

Using the "busy loop" test scripts for long transactions

1. Using only patch v21-0001
- no errors

2. Using only patch v21-0001+0002
- gives errors for "no copy in progress issue"
e.g. ERROR:  could not send data to WAL stream: no COPY in progress

3. Using patch v21-0001+0002+0003
- gives the same "no copy in progress issue" errors as above
e.g. ERROR:  could not send data to WAL stream: no COPY in progress
- and also gives slot consistency point errors
e.g. ERROR:  could not create replication slot
"pg_16700_sync_16514_7261998170966054867": ERROR:  could not find
logical decoding starting point
e.g. LOG:  could not drop replication slot
"pg_16700_sync_16454_7261998170966054867" on publisher: ERROR:
replication slot "pg_16700_sync_16454_7261998170966054867" does not
exist

------
Kind Regards,
Peter Smith.
Fujitsu Australia



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Incorrect handling of OOM in WAL replay leading to data loss
Следующее
От: Palak Chaturvedi
Дата:
Сообщение: Re: Extension Enhancement: Buffer Invalidation in pg_buffercache