WAL segments removed from primary despite the fact that logical replication slot needs it.

From: hubert depesz lubaczewski
Subject: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date:
Msg-id: Yz2hivgyjS1RfMKs@depesz.com
Replies: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
List: pgsql-bugs
Hi,
We are still in the process of preparing upgrades from Ubuntu Bionic with
Pg 12 to Focal with Pg 14.

And today (well, yesterday) we hit something that looks like a bug.

Jumping to the problem: WAL files needed by logical replication were
removed from the primary, despite the slot being there.

Now, a bit more details.

Our replication is set up using the following method:

1. We create all tables and primary keys on Pg 14.
2. A replication slot is created, with no tables.
3. We add some tables to the replication set, and wait for all of them to
   switch state to "ready".
4. Rinse and repeat until all tables are in the replication set.
5. Once all is ready, we add the rest of the schema (indexes, constraints,
   functions, fkeys, and so on).
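
For reference, a minimal sketch of what this workflow looks like with native
logical replication (which the pgoutput plugin and the subscription name in
the logs below suggest); the publication name, connection string and table
name are placeholders, not the exact ones we used:

-- On the Pg 12 primary: start with an empty publication.
CREATE PUBLICATION focal14_pub;

-- On the Pg 14 subscriber: creating the subscription also creates the
-- logical replication slot on the primary.
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic-primary dbname=dbname'
    PUBLICATION focal14_pub;

-- Repeat per batch of tables:
ALTER PUBLICATION focal14_pub ADD TABLE schema_name.table_name;  -- on the primary
ALTER SUBSCRIPTION focal14 REFRESH PUBLICATION;                  -- on the subscriber

-- Wait until no table is left in a non-ready state on the subscriber:
SELECT srrelid::regclass, srsubstate
  FROM pg_subscription_rel
 WHERE srsubstate <> 'r';  -- 'r' = ready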

This works. Worked?

Yesterday we saw something that broke the replication.

First, data from the bionic side:
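
These are periodic samples of pg_replication_slots; the exact query isn't
included here, but a minimal sketch of one that produces this shape of
output would be:

SELECT now() AS "when",
       pg_current_wal_lsn() AS "currentLSN",
       s.*
  FROM pg_replication_slots s
 WHERE s.slot_name = 'focal14';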
when                    | currentLSN    | slot_name | plugin   | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn   | confirmed_flush_lsn
2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743431068   | CA09/BE23BC70 | CA09/C82B2A58
2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743451605   | CA09/C9DA89B0 | CA09/D3FA9898
2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743479469   | CA09/DD32CE50 | CA09/E65E7F80
2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743516374   | CA09/EDD61CB8 | CA09/F3A856D8
2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743635890   | CA0A/2157B8F8 | CA0A/2A4C48F8
2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743686305   | CA0A/38BE5AB0 | CA0A/3D2B0C20
2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N   | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 29207      | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:34:30 UTC | CA10/3B78E10  | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N   | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
 

As you can see, around 2:20 pm replication stopped fetching data.

At this time, the Pg 14 logs don't show any problems whatsoever.
pg_restore was doing its work, creating indexes and running queries
like:
ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;

At 2:20:48pm there was:
2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
 

But these errors happened earlier too (minutes/hours earlier).

But then, 3 minutes later I got:

2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.",,,,,,,,,"","logical replication worker",,0

2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0

2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0

2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0

2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
 


The WAL file had been removed. Please note that, as shown earlier, the file was still covered by the slot's
restart_lsn as visible on pg12/bionic.
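
For reference, the mapping from an LSN to a segment file name can be checked
on the primary with pg_walfile_name(); with the default 16MB segment size and
timeline 1, the slot's restart_lsn falls exactly into the segment the
subscriber was refused:

SELECT pg_walfile_name('CA0A/49ED91C8');
-- 000000010000CA0A00000049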

Unfortunately I don't know what the contents of pg_wal on bionic were at
that moment, but currently I see:

$ ls -l pg_wal/| head
total 297534552
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054

So, 4 files are missing (000000010000CA0A00000049 through 000000010000CA0A0000004C).
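
The same check can also be done from SQL (a sketch, using pg_ls_waldir(),
which needs superuser or pg_monitor):

SELECT name, size, modification
  FROM pg_ls_waldir()
 WHERE name >= '000000010000CA0A00000049'
 ORDER BY name
 LIMIT 10;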

These were archived properly, and I tried to restore them from the archive and put
them back in pg_wal, but even then Pg 12 rejected attempts to connect to the focal14
slot with the same message: "requested WAL segment
000000010000CA0A00000049 has already been removed".

I know that at this moment we're hosed. This replication was removed and
everything cleared (after all, we had already accumulated 300G of WAL, and
we can't extend the FS forever, so we needed to clean up the situation).
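
For context, the amount of WAL a slot holds back can be estimated with
something along these lines (a sketch):

SELECT slot_name,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;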

But, the way I look at it, Pg 12 shouldn't have removed these 4 WAL files,
and the fact that it did means that either we did something wrong (what,
though?), or there is some kind of bug in Pg.

Does anyone have any idea how to investigate/fix this?

Best regards,

depesz



