Re: Skipping logical replication transactions on subscriber side

Поиск
Список
Период
Сортировка
От David G. Johnston
Тема Re: Skipping logical replication transactions on subscriber side
Дата
Msg-id CAKFQuwZyyN=9J_KGV1j_MiKz59rWXcPkCe5eYCEM4w-PhhhXag@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Skipping logical replication transactions on subscriber side  (Masahiko Sawada <sawada.mshk@gmail.com>)
Ответы Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Skipping logical replication transactions on subscriber side  (Masahiko Sawada <sawada.mshk@gmail.com>)
Re: Skipping logical replication transactions on subscriber side  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-hackers
On Sun, Jan 23, 2022 at 11:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 24, 2022 at 1:49 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> > I really dislike the user experience this provides, and given it is new in v15 (and right now this table seems to exist solely to support this feature) changing this seems within the realm of possibility. I have to imagine these workers have a sense of local state that would just be "no errors, no need to touch pg_stat_subscription_workers at the end of this transaction's commit".  It would save a local state of the error_xid and if a successfully committed transaction has that xid it would clear the error.  The skip code path would also check for and see the matching xid value and clear the error.  Even if the local state thing doesn't work, one catalog lookup per transaction seems like potentially reasonable overhead to incur here.
>> >
>>
>> Are you telling to update the catalog to save error_xid when an error
>> occurs? If so, that has many challenges like we are not supposed to
>> perform any such operations when the transaction is in an error state.
>> We have discussed this and other ideas in the beginning. I don't find
>> any of your arguments convincing to change the basic approach here but
>> I would like to see what others think on this matter?
>>
>
> Then how does the table get updated to that state in the first place since it doesn't know the error details until there is an error?

I think your idea is based on storing error information including XID
is stored in the system catalog. I think that the reasons why we use
the stats collector

I noticed this dynamic while skimming the patch (and also pondering why the new worker table was not in a catalog chapter) but am only now fully beginning to appreciate its impact on this discussion.
 
to store error information including
last_error_xid are (1) as Amit mentioned, it would have many
challenges if updating the catalog when the transaction is in an error
state, and

I'm going on faith right now that this is a problem.  But from my prior outline I hope you can see why I find it surprising.  Don't try to update a catalog while in an error state.  Get out of the error state first.  e.g., A transient "holding pattern" would seem to work.  Upon a server restart the transient state would be forgotten, it would attempt to reapply the wal, would see the same error, and would then go back into the transient holding pattern.  I do intend to read the other discussion on this particular topic so a detailed rebuttal, if warranted, can be withheld.
 
(2) we can store more information such as error messages,
action, etc. other than XID so that users can identify that the
reported error is a conflict error but not other types of error such
as OOM error.

I mentioned only XID because of the focus on SKIP.  The other data already present in that table is ok.  Whether we use a catalog or the stats collector seems irrelevant.  If anything the catalog makes more sense - calling an error message a statistic is a bit of a reach.

>Similarly, the same is true
>for clearing subskipxid. I agree that it's useful if
>pg_subscription.subskipxid is automatically set when executing ALTER
>SUBSCRIPTION SKIP but it might not work in some cases because of this
>restriction. For these reasons to me, it makes sense to store
>subscribers' error information by using the stats collector.

I'm confused - pg_subscription is a catalog, not a stat view.  Why is it affected?

I don't see how point 2 prevents using a system catalog.  I accept point 1 as true but will need to read some of the prior discussion to really understand it.

When it comes to reporting a message to the stats collector, we need
to note that it's not guaranteed that all messages arrive at the stats
collector. Therefore, last_error_xid doesn't not necessarily get
updated after the worker reports an error.

You'll forgive me for not considering this due to its apparent lack of mention in the documentation [*] and it's arguable classification as a POLA violation.


What I do read there seems compatible with the desired user experience.  500ms lag, idle transaction oriented, reset upon unclean shutdown, and consumers seeing a stable transactional view: none of these seem like show-stoppers.

Anyway, even if subskipxid is automatically set when ALTER
SUBSCRIPTION SKIP, I think we need to provide a way to clear it as the
current patch does (setting NONE) just in case.

With my suggestion of requiring a matching xid the whole option for skip_xid = { xid | NONE } remains.

> 5(out). wait for the user to manually restart the replication stream

Do you mean that there always is user intervention after error so the
replication stream can resume?
 
That is my working assumption.  It doesn't seem like the system would auto-resume without a DBA doing something (I'll attribute a server crash to the DBA for convenience).

Apparently I need to read more about how the system works today to understand how this varies from and integrates with today's user experience.

That said, at present my two dislikes:

1) ALTER SYSTEM SKIP accepts any xid value (I need to consider further the timing of when this resets to zero)
2) pg_stat_subscription_worker.last_error_* fields remain populated even while the system is in a normal operating state.

are preventing me from preferring this patch over the status quo (yes, I know the 2nd point is about a committed feature).  Regardless of how far off I may be regarding our technical ability to change them to a more (IMO) user-friendly design.

David J.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: typos
Следующее
От: Noah Misch
Дата:
Сообщение: Re: XLogReadRecord() error in XlogReadTwoPhaseData()