Re: WAL recycled despite logical replication slot

From Jeff Janes
Subject Re: WAL recycled despite logical replication slot
Msg-id CAMkU=1yrRmUVDq5c+hHUEhLsyT1A-Nx7PrKs9GSXc6nG=Lo_7Q@mail.gmail.com
In reply to Re: WAL recycled despite logical replication slot  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Fri, Sep 20, 2019 at 11:27 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Is there an innocent explanation for this?  I thought logical replication
>> slots provided an iron-clad guarantee that WAL would be retained until it
>> was no longer needed.  I am just using pub/sub, none of the lower level
>> stuff.
>>
>
> I think you're right - this should not happen with replication slots.
> Can you provide more detailed setup instructions, so that I can try to
> reproduce and investigate the issue?

It is a bit messy, because this isn't what I was trying to test.

The basic setup is pretty simple:

On master:

pgbench -i -s 100
create publication pgbench for table pgbench_accounts, pgbench_branches, pgbench_history, pgbench_tellers;
pgbench -R200 -c4 -j4 -P60 -T360000 -n

On replica:

pgbench -i -s 1
truncate pgbench_history, pgbench_accounts, pgbench_branches, pgbench_tellers;
create subscription sub CONNECTION 'host=192.168.0.15' publication pgbench;

The messy part: It looked like the sync was never going to finish, so first I cut the rate down to -R20.  Then what I thought I did was this: drop the primary key on pgbench_accounts (manually doing a kill -15 on the sync worker to release the lock), wait for the copy to start again and then finish, start getting "ERROR:  logical replication target relation "public.pgbench_accounts" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL" log messages, and then re-add the primary key.  Then I increased the -R back to 200, and about 50 minutes later got the WAL already removed error.

But now I can't seem to reproduce this.  The next time I tried to do the sync with no primary key, there didn't seem to be a commit after the COPY finished, so once it tries to replay the first update it hits the above "no primary key" error, rolls back *the entire COPY* as well as the single-row update, and starts the entire COPY over again before you have a chance to intervene and build the index.  So I'm guessing now that either the lack of a commit (which itself seems like a spectacularly bad idea) is situation dependent, or the very slow COPY had finished between the time I decided to drop the primary key and the time I actually implemented the drop.

Perhaps important here is that the replica is rather underpowered.  Write IO and fsyncs periodically become painfully slow, which is probably why there are replication timeouts, and since the problem happened when trying to reestablish after a timeout I would guess that that is critical to the issue.

I was running the master with fsync=off, but since the OS never crashed that should not be the source of corruption.


I'll try some more to reproduce this, but I wanted to make sure there was actually something here to reproduce, and not just my misunderstanding of how things are supposed to work.
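
For anyone else trying to reproduce this, one check that may help is to see which WAL segment file the slot's restart_lsn (from pg_replication_slots) falls in, and compare that with the segment reported as already removed. Below is a minimal sketch of that mapping, assuming the default 16MB wal_segment_size and that you know the timeline ID. wal_segment_name is a hypothetical helper written for illustration, not something that ships with PostgreSQL, and the LSN in the usage line is made up.

```shell
#!/bin/sh
# Hypothetical helper: map an LSN such as 2/A1000000 to the name of the
# WAL segment file that contains it.
#   $1 = timeline ID (decimal)
#   $2 = LSN in the usual hi/lo hex form
#   $3 = segment size in bytes (optional; assumes the 16MB default)
# The body runs in a subshell so its variables do not leak to the caller.
wal_segment_name() (
    tli=$1
    lsn=$2
    seg_bytes=${3:-16777216}

    hi=${lsn%%/*}      # hex digits before the slash
    lo=${lsn##*/}      # hex digits after the slash

    # Absolute byte position in the WAL stream.
    pos=$(( (0x$hi << 32) | 0x$lo ))

    # Segment number, then split it the way the file name is laid out.
    segno=$(( pos / seg_bytes ))
    per_id=$(( 0x100000000 / seg_bytes ))

    printf '%08X%08X%08X\n' "$tli" $(( segno / per_id )) $(( segno % per_id ))
)

# Example with a made-up LSN on timeline 1:
wal_segment_name 1 2/A1000000   # prints 0000000100000002000000A1
```

If the segment named in the error sorts at or after the one this prints for restart_lsn, then the slot should have been holding that WAL, which is the case I am worried about.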

Cheers,

Jeff
