Обсуждение: [HACKERS] Help required to debug pg_repack breaking logical replication

Поиск
Список
Период
Сортировка

[HACKERS] Help required to debug pg_repack breaking logical replication

От
Daniele Varrazzo
Дата:
Hello,

we have been reported, and I have experienced a couple of times,
pg_repack breaking logical replication.

- https://github.com/reorg/pg_repack/issues/135
- https://github.com/2ndQuadrant/pglogical/issues/113

In my experience, after the botched run, the replication slot was
"stuck", and any attempt of reading (including
pg_logical_slot_peek_changes()) blocked until ctrl-c. I've tried
replicating the issue but first attempts have failed to fail.

In the above issue #113, Petr Jelinek commented:

> From quick look at pg_repack, the way it does table rewrite is almost guaranteed
> to break logical decoding unless there is zero unconsumed changes for a given table
> as it does not build the necessary mappings info for logical decoding that standard
> heap rewrite in postgres does.

unfortunately he didn't follow up to further details requests.

I've started drilling down the problem, observing that the swap
function, swap_heap_or_index_files() [1] was cargoculted by the
original author from the CLUSTER command code as of PG 8.2 [2] (with a
custom addition to update the relfrozenxid which seems backwards to me
as it sets the older frozen xid on the new table [3]).

[1] https://github.com/reorg/pg_repack/blob/ver_1.4.1/lib/repack.c#L1082
[2] https://github.com/postgres/postgres/blob/REL8_2_STABLE/src/backend/commands/cluster.c#L783
[3] https://github.com/reorg/pg_repack/issues/152

so that code is effectively missing a good 10 years of development.
Before jumping into fast-forwarding it, I would like to ask for some
help, i.e.

- Is Petr diagnosis right and freezing of logical replication is to be
blamed to missing mapping?
- Can you suggest a test to reproduce the issue reliably?
- What are mapped relations anyway?

Thank you in advance for any help (either info about how to fix the
issue properly, or a patch by someone who happens to really understand
what we are talking about).

-- Daniele


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Help required to debug pg_repack breaking logical replication

От
Craig Ringer
Дата:
On 8 October 2017 at 02:37, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> Hello,
>
> we have been reported, and I have experienced a couple of times,
> pg_repack breaking logical replication.
>
> - https://github.com/reorg/pg_repack/issues/135
> - https://github.com/2ndQuadrant/pglogical/issues/113

Yeah, I was going to say I've seen reports of this with pglogical, but
I see you've linked to them.

I haven't had a chance to look into it though, and haven't had a
suitable reproducible test case.

> In the above issue #113, Petr Jelinek commented:
>
>> From quick look at pg_repack, the way it does table rewrite is almost guaranteed
>> to break logical decoding unless there is zero unconsumed changes for a given table
>> as it does not build the necessary mappings info for logical decoding that standard
>> heap rewrite in postgres does.
>
> unfortunately he didn't follow up to further details requests.

At a guess he's referring to src/backend/access/heap/rewriteheap.c .

I'd explain better if I understood what was going on myself, but I
haven't really understood the logical decoding parts of that code.

> - Is Petr diagnosis right and freezing of logical replication is to be
> blamed to missing mapping?
> - Can you suggest a test to reproduce the issue reliably?
> - What are mapped relations anyway?

I can't immediately give you the answers you seek, but start by
studying src/backend/access/heap/rewriteheap.c . Notably
logical_end_heap_rewrite, logical_rewrite_heap_tuple,
logical_begin_heap_rewrite.

At a wild "I haven't read any of the relevant code in detail yet" stab
in the dark, pg_repack is failing to do the bookkeeping required by
logical decoding around relfilenode changes, cmin/cmax, etc.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Help required to debug pg_repack breaking logicalreplication

От
Petr Jelinek
Дата:
On 08/10/17 15:21, Craig Ringer wrote:
> On 8 October 2017 at 02:37, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>> Hello,
>>
>> we have been reported, and I have experienced a couple of times,
>> pg_repack breaking logical replication.
>>
>> - https://github.com/reorg/pg_repack/issues/135
>> - https://github.com/2ndQuadrant/pglogical/issues/113
> 
> Yeah, I was going to say I've seen reports of this with pglogical, but
> I see you've linked to them.
> 
> I haven't had a chance to look into it though, and haven't had a
> suitable reproducible test case.
> 
>> In the above issue #113, Petr Jelinek commented:
>>
>>> From quick look at pg_repack, the way it does table rewrite is almost guaranteed
>>> to break logical decoding unless there is zero unconsumed changes for a given table
>>> as it does not build the necessary mappings info for logical decoding that standard
>>> heap rewrite in postgres does.
>>
>> unfortunately he didn't follow up to further details requests.
> 
> At a guess he's referring to src/backend/access/heap/rewriteheap.c .
> 
> I'd explain better if I understood what was going on myself, but I
> haven't really understood the logical decoding parts of that code.
> 
>> - Is Petr diagnosis right and freezing of logical replication is to be
>> blamed to missing mapping?
>> - Can you suggest a test to reproduce the issue reliably?
>> - What are mapped relations anyway?
> 
> I can't immediately give you the answers you seek, but start by
> studying src/backend/access/heap/rewriteheap.c . Notably
> logical_end_heap_rewrite, logical_rewrite_heap_tuple,
> logical_begin_heap_rewrite.
> 

Yes that's exactly it. When table is rewritten we need to create mapping
for every tuple that was created or removed (ie, insert, update or
delete operation happened on it) since the oldest replication slot xmin
for logical decoding to continue to work on that table after the
rewrite. And pg_repack doesn't create that mapping.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Help required to debug pg_repack breaking logical replication

От
Robert Haas
Дата:
On Sat, Oct 7, 2017 at 2:37 PM, Daniele Varrazzo
<daniele.varrazzo@gmail.com> wrote:
> (with a
> custom addition to update the relfrozenxid which seems backwards to me
> as it sets the older frozen xid on the new table [3]).
>
> [3] https://github.com/reorg/pg_repack/issues/152

Wow.  That's really bad.  It will corrupt your database.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Help required to debug pg_repack breaking logicalreplication

От
Oleksii Kliukin
Дата:
Hello,

Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> On 08/10/17 15:21, Craig Ringer wrote:
>> On 8 October 2017 at 02:37, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
>>> Hello,
>>>
>>> we have been reported, and I have experienced a couple of times,
>>> pg_repack breaking logical replication.
>>>
>>> - https://github.com/reorg/pg_repack/issues/135
>>> - https://github.com/2ndQuadrant/pglogical/issues/113
>>
>> Yeah, I was going to say I've seen reports of this with pglogical, but
>> I see you've linked to them.
>>
>> I haven't had a chance to look into it though, and haven't had a
>> suitable reproducible test case.
>>
>>> In the above issue #113, Petr Jelinek commented:
>>>
>>>> From quick look at pg_repack, the way it does table rewrite is almost guaranteed
>>>> to break logical decoding unless there is zero unconsumed changes for a given table
>>>> as it does not build the necessary mappings info for logical decoding that standard
>>>> heap rewrite in postgres does.
>>>
>>> unfortunately he didn't follow up to further details requests.
>>
>> At a guess he's referring to src/backend/access/heap/rewriteheap.c .
>>
>> I'd explain better if I understood what was going on myself, but I
>> haven't really understood the logical decoding parts of that code.
>>
>>> - Is Petr diagnosis right and freezing of logical replication is to be
>>> blamed to missing mapping?
>>> - Can you suggest a test to reproduce the issue reliably?
>>> - What are mapped relations anyway?
>>
>> I can't immediately give you the answers you seek, but start by
>> studying src/backend/access/heap/rewriteheap.c . Notably
>> logical_end_heap_rewrite, logical_rewrite_heap_tuple,
>> logical_begin_heap_rewrite.
>
> Yes that's exactly it. When table is rewritten we need to create mapping
> for every tuple that was created or removed (ie, insert, update or
> delete operation happened on it) since the oldest replication slot xmin
> for logical decoding to continue to work on that table after the
> rewrite. And pg_repack doesn't create that mapping.

Excuse me for posting to almost 2-years old thread. I haven’t found any more
recent discussions. I use pg_repack quite extensively and while I am not
running it together with logical replication yet, there is a pressing need
to do so. I came across this discussion and after reading it I feel the
answers to the question of whether it is safe to use pg_repack and logical
decoding together given there are lacking necessary details to make a
potential test case to reproduce the problem.

Looking at the source code inside rewriteheap.c, the functions mentioned
above are no-op if state->rs_logical_rewrite is set to false, and the flag
is only set to true for system catalogs (that repack doesn’t cause rewriting
of anyway) and user catalog tables (that are uncommon). Does it imply the
scope of the breakage is limited to only those two types of tables, or am I
missing some other part of the code Petr had in mind in the original comment
that stated that pg_repack is almost guaranteed to break logical decoding?

Cheers,
Oleksii