Re: Logical replication is missing block of rows when sending initial sync?

Поиск
Список
Период
Сортировка
От hubert depesz lubaczewski
Тема Re: Logical replication is missing block of rows when sending initial sync?
Дата
Msg-id ZT/SC0A3+9tIE8qi@depesz.com
обсуждение исходный текст
Ответ на RE: Logical replication is missing block of rows when sending initial sync?  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Список pgsql-bugs
On Mon, Oct 30, 2023 at 07:10:35AM +0000, Hayato Kuroda (Fujitsu) wrote:
> * Table definitions on both nodes

Well, table has 50 columns, 28 indexes, 10 fkeys, is referenced back by 26 tables, has two simple triggers.
Showing it all would be rather complicated and time consuming (legal issues).

Used datatypes:

- 14 columns with bigint
-  8 columns with character varying(255)
-  7 columns with timestamp without time zone
-  6 columns with boolean
-  3 columns with text
-  3 columns with integer
-  3 columns with double precision
-  2 columns with character varying(16)
-  1 column with uuid
-  1 column with numeric(6,2)
-  1 column with character varying(5)
-  1 column with character varying

I know it's not full \d, but I don't think sharing it in full is realistic.

Schema on recipient is restored from schema dump made ~ 5 minutes earlier.

> * GUC settings

Which GUCs?

> * Publication/subscription options

=# \dRp focal14
                           List of publications
  Name   │  Owner   │ All tables │ Inserts │ Updates │ Deletes │ Truncates
─────────┼──────────┼────────────┼─────────┼─────────┼─────────┼───────────
 focal14 │ postgres │ f          │ t       │ t       │ t       │ t
(1 row)

and

=# \dRs focal14
           List of subscriptions
  Name   │  Owner   │ Enabled │ Publication
─────────┼──────────┼─────────┼─────────────
 focal14 │ postgres │ t       │ {focal14}
(1 row)

> * Server log and output when you met the failure

The failure was just fail of adding fkey. These logs are gone now, I can't
keep failed host when I'm in process of preparing boxes for upgrade.

Error message was:

ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES
c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".

> * A minimal script for reproducing the failure (if you succeeded)

No. I think it's some kind of race condition.

> * Are there incoming data on publisher during the initial sync?

Yes. Pg12 is working as normal primary server for cluster that is having traffic 24h/day.

> * A method how you checked pg_subscription_rel

#v+
WITH info AS (
    SELECT
        format(
            '%s:%s',
            sr.srsubstate,
            count(*)
        ) AS part,
        min( o.ord ) AS ord
    FROM
        pg_subscription AS s
        JOIN pg_subscription_rel AS sr ON s.oid = sr.srsubid
        JOIN unnest( '{i,d,f,s,r}'::text[] ) WITH ORDINALITY AS o ( s, ord ) ON sr.srsubstate = o.s
    WHERE
        s.subname = 'focal14'
    GROUP BY
        sr.srsubstate
)
SELECT
    string_agg(
        part,
        ' ; '
        order by ord
    )
FROM
    info;
#v-

This query I used while adding tables. Add 40 tables to publication, wait for all states to be 0, with exception of
"r".
Afterwards I used:

#v+
SELECT
    n.nspname,
    c.relname
FROM
    pg_subscription s
    join pg_subscription_rel sr on s.oid = sr.srsubid
    join pg_class c on sr.srrelid = c.oid
    join pg_namespace n on c.relnamespace = n.oid
WHERE
    s.subname = 'focal14'
ORDER BY n.nspname, c.relname
#v-

and compared it against

#v+
SELECT
    schemaname,
    tablename
FROM
    pg_publication_tables
WHERE
    pubname = 'focal14'
ORDER BY schemaname, tablename
#v-

> Could you share an error message what you met? I thought you saw like
> "Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste here.

#v+
ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES
c33s16831.subs(id);
ERROR:  insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL:  Key (sub_id)=(19733141) is not present in table "subs".
#v-

> If so, are tuples really missing at that time?

Yes. I checked, and the whole block of values were missing - like I showed in previous mail.

> FYI - Below workflow is a brief description for initial synchronization.
> 1. Tablesync worker connects to publisher
> 2. Starts a transaction on both pub/sub
> 3. Creates a replication slot on publisher
> 4. Executes "COPY ... TO STDOUT" on publisher. This means that all tuples are extracted
> 5. Does BeginCopyFrom() and CopyFrom(), they are same infrastacture as normall "COPY FROM" statement.
> 6. Commits the transaction on both pub/sub
> 7. Catches up changes since the replication slot was defined at 4.

I kinda, understand it. That's why I thought it's important to report that
sometimes, something, somewhere breaks.

Since my project is time sensitive, I have to proceed, and if such problem
happens on my boxes, I just redo the whole sync from scratch.

So I don't mind that there is bug, as I have workaround, but I think it would
be cool to figure out what could be the potential source, and perhaps find a
fix.

> > Unfortunately I don't have a way to repeat the problem. Just noticed
> > that on some dbs that we're upgrading the problem is more common, and on
> > some - it generally doesn't happen.
> Just to confirm - You met the error only when the logical replication was set
> up from old clusters to newer, right?

Yes. That's what we use LR for. For upgrades.

It happens randomly, on some db clusters more than on others, but all in all,
at most I had to rebuilt a upgrade box 3 times, and the problem was gone.
Usually (in > 90% of cases) there is no problem.

Best regards,

depesz




В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly