Thread: [HACKERS] logical replication - possible remaining problem
I am not sure whether what I found here amounts to a bug, I might be
doing something dumb.
During the last few months I did tests by running pgbench over logical
replication. Earlier emails have details.
The basic form of that now works well (and the fix has been committed)
but as I looked over my testing program I noticed one change I made to
it, already many weeks ago:
In the cleanup during startup (pre-flight check you might say) and also
before the end, instead of
echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)
I changed that (as I say, many weeks ago) to:
echo "delete from pg_subscription;
delete from pg_subscription_rel;
delete from pg_replication_origin; " | psql -qXp $port2 -- (2)
This occurs (2x) inside the bash function clean_pubsub(), in main test
script pgbench_detail2.sh
This change was an effort to ensure a 'clean' start (and end) state
that would always be the same.
All my more recent testing (and that of Mark, I have to assume) was thus
done with (2).
Now, looking at the script again I am thinking that it would be
reasonable to expect that after issuing delete from pg_subscription;
the other 2 tables are /also/ cleaned, automatically, as a consequence.
(Is this reasonable? this is really the main question of this email).
So I removed the latter two delete statements again, and ran the tests
again with the form in (1)
I have established that (after a number of successful cycles) the test
stops succeeding, and the replica log shows repetitions of:
2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply
worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free
replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT: Increase
max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical
replication worker for subscription 29235 (PID 2421) exited with exit
code 1
When I manually 'clean up' by doing: delete from pg_replication_origin;
then, and only then, does the session finish and succeed ('replica ok').
So to me it looks as if there is an omission of
pg_replication_origin-cleanup when pg_subscription is deleted.
Does that make sense? All this is probably vague and I am only posting
in the hope that Petr (or someone else) perhaps immediately understands
what goes wrong, even with this limited amount of info.
In the meantime I will try to dig up more detailed info...
thanks,
Erik Rijkers

Erik Rijkers wrote:
> Now, looking at the script again I am thinking that it would be reasonable
> to expect that after issuing
> delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is
> this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption. You should be
using the DDL that removes subscriptions instead.

-- Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 2017-06-07 23:18, Alvaro Herrera wrote:
> Erik Rijkers wrote:
>
>> Now, looking at the script again I am thinking that it would be
>> reasonable to expect that after issuing
>> delete from pg_subscription;
>>
>> the other 2 tables are /also/ cleaned, automatically, as a
>> consequence. (Is this reasonable? this is really the main question
>> of this email).
>
> I don't think it's reasonable to expect that the system recovers
> automatically from what amounts to catalog corruption. You should be
> using the DDL that removes subscriptions instead.

You're right, that makes sense. Thanks.
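
The advice above (use the DDL rather than deleting catalog rows) could look roughly like this in a cleanup step. This is only a sketch: the helper names drop_subscription_sql and clean_subscription are hypothetical and are not taken from pgbench_detail2.sh; "sub1" is the subscription name seen in the log, and $port2 is the subscriber port used in the original commands. Note that DROP SUBSCRIPTION by default also tries to drop the associated replication slot on the publisher, so the publisher should be reachable when it runs.

```shell
#!/usr/bin/env bash
# Sketch only: remove a subscription through DDL instead of deleting
# rows from pg_subscription by hand.

# Hypothetical helper: print the DDL statement for one subscription name.
drop_subscription_sql() {
  local subname=$1
  printf 'DROP SUBSCRIPTION IF EXISTS "%s";\n' "$subname"
}

# Hypothetical cleanup step: run the DDL on the subscriber.
# Assumes $port2 is set to the subscriber's port, as in the original script.
clean_subscription() {
  local subname=$1
  drop_subscription_sql "$subname" | psql -qXp "$port2"
}

# Usage (needs a running subscriber on $port2):
#   clean_subscription sub1
```

Keeping the statement generation separate from the psql call makes it possible to inspect the SQL without a server.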
Hi,
On 07/06/17 22:49, Erik Rijkers wrote:
> I am not sure whether what I found here amounts to a bug, I might be
> doing something dumb.
>
> During the last few months I did tests by running pgbench over logical
> replication. Earlier emails have details.
>
> The basic form of that now works well (and the fix has been committed)
> but as I looked over my testing program I noticed one change I made to
> it, already many weeks ago:
>
> In the cleanup during startup (pre-flight check you might say) and also
> before the end, instead of
>
> echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)
>
> I changed that (as I say, many weeks ago) to:
>
> echo "delete from pg_subscription;
> delete from pg_subscription_rel;
> delete from pg_replication_origin; " | psql -qXp $port2 -- (2)
>
> This occurs (2x) inside the bash function clean_pubsub(), in main test
> script pgbench_detail2.sh
>
> This change was an effort to ensure to arrive at a 'clean' start (and
> end-) state which would always be the same.
>
> All my more recent testing (and that of Mark, I have to assume) was thus
> done with (2).
>
> Now, looking at the script again I am thinking that it would be
> reasonable to expect that after issuing
> delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence.
> (Is this reasonable? this is really the main question of this email).
>

Hmm, they are not cleaned automatically. Deleting from system catalogs
manually like this never propagates to related tables; we don't use FKs
there.

> So I removed the latter two delete statements again, and ran the tests
> again with the form in (1)
>
> I have established that (after a number of successful cycles) the test
> stops succeeding with in the replica log repetitions of:
>
> 2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply
> worker for subscription "sub1" has started
> 2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free
> replication state slot for replication origin with OID 11
> 2017-06-07 22:10:29.057 CEST [2421] HINT: Increase
> max_replication_slots and try again.
> 2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical
> replication worker for subscription 29235 (PID 2421) exited with exit
> code 1
>
> when I manually 'clean up' by doing:
> delete from pg_replication_origin;
>

Yeah, because you consumed all the origins (I am still not a huge fan of
how that limit works, but that's a separate discussion).

> then, and only then, does the session finish and succeed ('replica ok').
>
> So to me it looks as if there is an omission of
> pg_replication_origin-cleanup when pg_subscription is deleted.
>

There is no omission; the origin is not supposed to be deleted automatically
unless you use DROP SUBSCRIPTION.

-- Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
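
To see whether origins were left behind after a manual delete from pg_subscription, a query along the following lines might help. It is a sketch under assumptions: the helper names are hypothetical, and it assumes subscription origins are named "pg_" followed by the subscription OID, so origins created under any other name are ignored. An origin that turns up here can be removed with pg_replication_origin_drop() rather than by deleting the catalog row.

```shell
#!/usr/bin/env bash
# Sketch only: list replication origins that look like subscription
# origins but have no matching row in pg_subscription.

# Hypothetical helper: print the diagnostic query.
# Assumption: subscription origins are named 'pg_<subscription oid>'.
orphaned_origins_sql() {
  cat <<'SQL'
SELECT o.roident, o.roname
FROM pg_replication_origin o
LEFT JOIN pg_subscription s ON o.roname = 'pg_' || s.oid::text
WHERE o.roname ~ '^pg_[0-9]+$'
  AND s.oid IS NULL;
SQL
}

# Hypothetical wrapper: run the query on the subscriber.
# Assumes $port2 is set to the subscriber's port, as in the original script.
list_orphaned_origins() {
  orphaned_origins_sql | psql -qXp "$port2"
}

# Usage (needs a running subscriber on $port2):
#   list_orphaned_origins
```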