Re: Fix race condition in XLogLogicalInfo and ProcSignal initialization.
| From | Chao Li |
|---|---|
| Subject | Re: Fix race condition in XLogLogicalInfo and ProcSignal initialization. |
| Date | |
| Msg-id | C38A7D00-1B20-4947-BBC6-38267DF8D6DD@gmail.com |
| In response to | Re: Fix race condition in XLogLogicalInfo and ProcSignal initialization. (Chao Li <li.evan.chao@gmail.com>) |
| Responses | Re: Fix race condition in XLogLogicalInfo and ProcSignal initialization. |
| List | pgsql-hackers |
> On Apr 29, 2026, at 09:28, Chao Li <li.evan.chao@gmail.com> wrote:
>
>
>
>> On Apr 29, 2026, at 05:15, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> Hi all,
>>
>> I found a race condition issue between XLogLogicalInfo and ProcSignal
>> initialization while reviewing another issue[1]. I'm starting a
>> separate thread for the subject as it's not related to the issue
>> reported on that thread.
>>
>> The issue is that child processes could miss the
>> PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO signal during the
>> initialization and end up in an inconsistent state because
>> InitializeProcessXLogLogicalInfo() is called (in BaseInit()) before
>> ProcSignalInit(). If the startup emits the signal to a process who is
>> between two steps, the process would not reflect the latest
>> XLogLogicalInfo state. I think we should move
>> InitializeProcessXLogLogicalInfo() after ProcSignalInit() like we do
>> so for InitLocalDataChecksumState().
>
> I think this is correct.
>
> After moving InitializeProcessXLogLogicalInfo() out of BaseInit(), background worker processes (BackgroundWorkerMain) will no longer hold a valid value of XLogLogicalInfo, but I guess that is fine as those processes don’t call ProcSignalInit() anyway.
>
>>
>> I've attached the patch to fix this issue. Feedback is very welcome.
>>
>
> Just found a typo:
>
> ```
> + * These initialization intentionally happens afater initializing the
> ```
>
> afater => after
>
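To make the ordering problem discussed above concrete, here is a toy model in plain C. It is not PostgreSQL code: every identifier is hypothetical, and it only assumes that a barrier updates the local state of processes that have already registered for signals.

```c
#include <stdbool.h>

/* Toy model only; every identifier below is hypothetical. */
static bool toy_shared_logical;      /* stands in for the shared XLogLogicalInfo state */

typedef struct ToyProc
{
    bool registered;                 /* true once the process can receive barriers */
    bool local_logical;              /* process-local cached copy of the shared state */
} ToyProc;

/* Copy the shared state into the process-local cache. */
static void toy_init_local(ToyProc *p)
{
    p->local_logical = toy_shared_logical;
}

/* Register the process so that later barriers reach it. */
static void toy_signal_init(ToyProc *p)
{
    p->registered = true;
}

/* Change the shared state and notify only processes that are registered. */
static void toy_emit_barrier(ToyProc *p, bool enabled)
{
    toy_shared_logical = enabled;
    if (p->registered)
        p->local_logical = enabled;
}

/*
 * Run one startup sequence with a barrier arriving between the two steps.
 * Returns true if the local cache matches the shared state afterwards.
 */
bool toy_startup_consistent(bool cache_before_register)
{
    ToyProc p = {false, false};

    toy_shared_logical = false;
    if (cache_before_register)
    {
        toy_init_local(&p);          /* cache first ... */
        toy_emit_barrier(&p, true);  /* ... barrier is missed here ... */
        toy_signal_init(&p);         /* ... registration comes too late */
    }
    else
    {
        toy_signal_init(&p);         /* register first */
        toy_emit_barrier(&p, true);  /* barrier reaches the process */
        toy_init_local(&p);          /* cache reads the current shared state */
    }
    return p.local_logical == toy_shared_logical;
}
```

In this model, toy_startup_consistent(true) leaves the cache stale while toy_startup_consistent(false) does not, which is the reason for registering before caching.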
I met Zhijie Hou at HOW 2026 a few days ago. When we talked about a feature requirement I recently heard from a DBA,
Zhijie pointed me to 67c20979ce (Toggle logical decoding dynamically based on logical slot presence).
The requirement is that storage is expensive today, and users are sensitive to the total size of WAL. In some
deployments, users may only want to replicate a small set of tables intermittently, but to enable logical replication,
they still have to set wal_level to logical, which significantly increases the total WAL volume. I believe this feature
could help address that concern, so I reviewed the code and played a bit with it.
I found an issue related to this patch, so I am sharing my findings here, although the problem already existed before
this patch.
In InitPostgres(), in the standalone backend path, StartupXLOG() is called, but XLogLogicalInfo is not updated. As a
result, if we switch to standalone mode for some emergency maintenance, make data changes, and then switch back to
normal mode, the changes made during standalone mode would not include logical replication metadata, which may
break future logical replication.
To verify that, I did a test like:
* Start a new instance with wal_level = replica
* Create a table, insert some data, then create a logical replication slot
```
evantest=# CREATE TABLE t1(id int);
CREATE TABLE
evantest=# INSERT INTO t1 VALUES (1), (2);
INSERT 0 2
evantest=# SELECT * FROM pg_create_logical_replication_slot('s1', 'test_decoding');
 slot_name |    lsn
-----------+------------
 s1        | 0/01D6E6D0
(1 row)
```
* Stop the server, start it in standalone mode, and truncate the table:
```
% postgres --single -F -D . evantest
PostgreSQL stand-alone backend 19devel
backend> show effective_wal_level;
1: effective_wal_level (typeid = 25, len = -1, typmod = -1, byval = f)
----
1: effective_wal_level = "replica" (typeid = 25, len = -1, typmod = -1, byval = f)
----
backend> truncate t1;
backend> 2026-04-29 21:13:24.625 CST [68316] LOG: checkpoint starting: shutdown fast
```
* Start the server normally, and read WAL through the logical slot:
```
evantest=# SELECT data FROM pg_logical_slot_get_changes('s1', NULL, NULL);
    data
------------
 BEGIN 721
 COMMIT 721
(2 rows)
```
The TRUNCATE does not appear, which I think is wrong. To fix that, we only need to call
InitializeProcessXLogLogicalInfo() after StartupXLOG() in the standalone path. Since the fix is based on this patch, I
added it as 0002 in this patch set.
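The idea behind the fix can be sketched with a small toy model in plain C. This is not the actual patch: all names are hypothetical, and it assumes only that the process-local value is cached before startup decides the shared state, so it has to be refreshed afterwards.

```c
#include <stdbool.h>

/* Toy model only; every identifier below is hypothetical. */
static bool sa_shared_logical;   /* shared state, decided during startup */
static bool sa_local_logical;    /* process-local cached copy */

/* Copy the shared state into the process-local cache. */
static void sa_cache_local(void)
{
    sa_local_logical = sa_shared_logical;
}

/* Stand-in for startup enabling logical info because a logical slot exists. */
static void sa_startup(bool has_logical_slot)
{
    sa_shared_logical = has_logical_slot;
}

/*
 * Model the standalone path and return the value the backend would act on.
 * refresh_after_startup selects whether the cache is re-read after startup.
 */
bool sa_standalone_local_state(bool has_logical_slot, bool refresh_after_startup)
{
    sa_shared_logical = false;
    sa_cache_local();                /* early initialization, before startup */
    sa_startup(has_logical_slot);    /* startup may change the shared state */
    if (refresh_after_startup)
        sa_cache_local();            /* the extra call proposed above */
    return sa_local_logical;
}
```

With a logical slot present, sa_standalone_local_state(true, false) returns false (stale cache, matching the missing TRUNCATE in the test), while sa_standalone_local_state(true, true) returns true.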
One more thought: I think this feature partially addresses the user requirement I described earlier. When wal_level is
replica and some logical slots are created, the extra WAL data should only be enabled for tables included in those slots.
That avoids generating unnecessary WAL data for tables that are not targets of replication, and therefore saves storage.
WDYT? Maybe a candidate for v20?
BTW, in 0001, I helped fix the typos.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/