RE: CDC/ETL system on top of logical replication with pgoutput, custom client

From: José Neves
Subject: RE: CDC/ETL system on top of logical replication with pgoutput, custom client
Date:
Msg-id: PR3P193MB04916D9CB45890D810908FB3890FA@PR3P193MB0491.EURP193.PROD.OUTLOOK.COM
In response to: RE: CDC/ETL system on top of logical replication with pgoutput, custom client  (José Neves <rafaneves3@msn.com>)
Responses: Re: CDC/ETL system on top of logical replication with pgoutput, custom client  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hi there, I hope this finds you all well.

A follow-up on this. Indeed, the new commit-based approach solved my missing-data issues.
But, going back to the previous examples, how are server times expected to be logged for the xlog records containing these changes?

With these 2 transactions:
T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Arriving this way:
BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000

BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Are server times for them expected to be:

BEGIN
INSERT LSN1-1000 - 2
UPDATE LSN2-2000 - 4
UPDATE LSN3-3000 - 6
COMMIT  LSN4-4000 - 7

BEGIN
INSERT LSN1-500 - 1
UPDATE LSN2-1500 - 3
UPDATE LSN3-2500 - 5
COMMIT  LSN4-5500 - 8

Or:

BEGIN
INSERT LSN1-1000 - 1
UPDATE LSN2-2000 - 2
UPDATE LSN3-3000 - 3
COMMIT  LSN4-4000 - 4

BEGIN
INSERT LSN1-500 - 5
UPDATE LSN2-1500 - 6
UPDATE LSN3-2500 - 7
COMMIT  LSN4-5500 - 8

I'm asking because, although I'm no longer missing data, I have a second async process that can fail (publishing data to an event messaging service), so there is still a possibility of data duplication. Worse, I have to split large transactions because message sizes are limited. It would be nice if I could rely on the server timestamps to discard duplicated data...
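
For illustration, the timestamp-free alternative would be an idempotency key built from the commit LSN plus each change's position within its transaction, tracked on the consumer side. A minimal Python sketch (all names here are hypothetical, not part of my actual client):

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class ChangeKey:
    commit_lsn: int    # LSN of the transaction's COMMIT record
    change_index: int  # 0-based position of the change within that transaction


class Deduplicator:
    """Drops changes that were already published to the downstream service."""

    def __init__(self, last_published: Optional[ChangeKey] = None):
        # last_published would be persisted together with the sink's state.
        self.last_published = last_published

    def should_publish(self, key: ChangeKey) -> bool:
        # Anything at or below the last published key was already delivered
        # in a previous (possibly failed and retried) run.
        return self.last_published is None or key > self.last_published

    def mark_published(self, key: ChangeKey) -> None:
        self.last_published = key

The ordering comparison works because, as discussed below, the commit LSNs are the one sequence that is guaranteed to be monotonic across transactions.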

Thanks.
Regards,
José Neves

From: José Neves <rafaneves3@msn.com>
Sent: 1 August 2023 10:13
To: Andres Freund <andres@anarazel.de>
Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: RE: CDC/ETL system on top of logical replication with pgoutput, custom client
 
Hi Andres.

Oh, I see the error of my ways... :(

By ignoring commits and acknowledging individual operation LSNs, I was effectively rolling back the subscription. In the previous example, if I acknowledged the LSN of the first insert of the second transaction (LSN1-500), I was basically telling Postgres to send everything again, including the already-processed T1.
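
In other words: buffer each transaction as it streams in, and only acknowledge its commit LSN once the whole transaction has been processed. A rough sketch of that rule, using Python and psycopg2's replication support purely for illustration (the slot name, publication name, and process_transaction() sink are made-up placeholders, not my actual client):

import psycopg2
import psycopg2.extras


def process_transaction(changes):
    # Placeholder for the real sink (e.g. publish to the messaging service).
    pass


conn = psycopg2.connect(
    "dbname=mydb user=repl",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(
    slot_name="my_slot",
    decode=False,  # pgoutput is a binary protocol
    options={"proto_version": "1", "publication_names": "my_pub"},
)

tx_buffer = []  # changes of the transaction currently being streamed


def consume(msg):
    kind = msg.payload[:1]
    if kind == b"B":                    # Begin: start buffering a transaction
        tx_buffer.clear()
    elif kind in (b"I", b"U", b"D"):    # Insert/Update/Delete: buffer only
        tx_buffer.append(msg.payload)
    elif kind == b"C":                  # Commit: apply, then acknowledge
        process_transaction(tx_buffer)
        # Only now may the slot advance past this transaction.
        cur.send_feedback(flush_lsn=msg.data_start)
    # Relation/Origin/etc. messages are ignored in this sketch.


cur.consume_stream(consume)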

> what you mean with the "different from the numeric order"
I'm probably lacking terminology. I mean that LSN4-5500 > LSN4-4000 > LSN3-3000 > LSN3-2500...

But, if I'm understanding correctly, I can only rely on the incremental sequence to be true for the commit events. Which explains my pain.
The world makes sense again.

Thank you very much. I will try to implement this new logic, and hopefully won't bug you again with this issue.
Regards,
José Neves

From: Andres Freund <andres@anarazel.de>
Sent: 1 August 2023 00:21
To: José Neves <rafaneves3@msn.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom client
 
Hi,

On 2023-07-31 21:25:06 +0000, José Neves wrote:
> Ok, if I understood you correctly, I start to see where my logic is faulty. Just to make sure that I got it right, taking the following example again:
>
> T-1
> INSERT LSN1-1000
> UPDATE LSN2-2000
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
>
> T-2
> INSERT LSN1-500
> UPDATE LSN2-1500
> UPDATE LSN3-2500
> COMMIT  LSN4-5500
>
> Where data will arrive in this order:
>
> INSERT LSN1-500
> INSERT LSN1-1000
> UPDATE LSN2-1500
> UPDATE LSN2-2000
> UPDATE LSN3-2500
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
> COMMIT  LSN4-5500

No, they won't arrive in that order. They will arrive as

BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000
BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Because T1 committed before T2. Changes are only streamed out at commit /
prepare transaction (*). Within a transaction, they however *will* be ordered
by LSN.

(*) Unless you use streaming mode, in which case it'll all be more
complicated, as you'll also receive changes for transactions that might still
abort.
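
For a custom pgoutput client, this framing is visible in the protocol itself: each transaction arrives as a Begin ('B') message carrying the transaction's final LSN and commit timestamp, then its changes in LSN order, then a Commit ('C') message. A minimal Python sketch of decoding just those two headers, following the documented logical streaming replication message formats:

import struct

# Microseconds between the Unix epoch and the PostgreSQL epoch (2000-01-01).
PG_EPOCH_OFFSET_US = 946_684_800 * 1_000_000


def parse_begin(payload: bytes):
    # 'B' | final LSN (int64) | commit timestamp (int64, us since 2000-01-01) | xid (int32)
    final_lsn, commit_ts, xid = struct.unpack_from(">qqi", payload, 1)
    return final_lsn, commit_ts + PG_EPOCH_OFFSET_US, xid


def parse_commit(payload: bytes):
    # 'C' | flags (int8) | commit LSN (int64) | end LSN (int64) | commit timestamp (int64)
    flags, commit_lsn, end_lsn, commit_ts = struct.unpack_from(">bqqq", payload, 1)
    return commit_lsn, end_lsn, commit_ts + PG_EPOCH_OFFSET_US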


> You are saying that the LSN3-3000 will never be missing, either the entire
> connection will fail at that point, or all should be received in the
> expected order (which is different from the "numeric order" of LSNs).

I'm not quite sure what you mean with the "different from the numeric order"
bit...


> If the connection is down, upon restart, I will receive the entire T-1
> transaction again (well, all example data again).

Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only
asking for newer transactions when reconnecting.


> In addition to that, if I commit LSN4-4000, even tho that LSN has a "bigger
> numeric value" than the ones representing INSERT and UPDATE events on T-2, I
> will be receiving the entire T-2 transaction again, as the LSN4-5500 is
> still uncommitted.

I don't quite know what you mean with "commit LSN4-4000" here.


> This makes sense to me, but just to be extra clear, I will never receive a
> transaction commit before receiving all other events for that transaction.

Correct.

Greetings,

Andres Freund
