Re: Implement waiting for wal lsn replay: reloaded
| From | Xuneng Zhou |
|---|---|
| Subject | Re: Implement waiting for wal lsn replay: reloaded |
| Date | |
| Msg-id | CABPTF7WJf5BnG1yCVk032+QiGuCmrhgnfNFO2HTgcuWAeRD9+A@mail.gmail.com |
| In reply to | Re: Implement waiting for wal lsn replay: reloaded (Xuneng Zhou <xunengzhou@gmail.com>) |
| Responses | Re: Implement waiting for wal lsn replay: reloaded |
| List | pgsql-hackers |
Hi,

On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Alexander,
>
> Thanks for your feedback!
>
> > I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
> > mode parameter. Should we allow this?
>
> I think this constraint could be relaxed if needed. I was previously
> unsure about the use cases.

Flush mode on the primary seems useful when synchronous_commit is set
to off [1]. In that mode, a transaction on the primary may return
success before its WAL is durably flushed to disk, trading durability
for lower latency. A "wait for primary flush" operation provides an
explicit durability barrier for cases where applications or tools
occasionally need stronger guarantees.

[1] https://postgresqlco.nf/doc/en/param/synchronous_commit/

> > If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be
> > separate mode value or the same with WAIT_LSN_TYPE_STANDBY_FLUSH? In
> > principle, we could encode both as just 'flush' mode, and detect which
> > WaitLSNType to pick by checking if recovery is in progress. However,
> > how should we then react to unreached flush location after standby
> > promotion (technically it could be still reached but on the different
> > timeline)?
> >
>
> Technically, we can use 'flush' mode to specify WAIT FOR behavior in
> both primary and replica. Currently, wait for commands error out if
> promotion occurs since: either the requested LSN type does not exist
> on the primary, or we do not yet have the infrastructure to support
> continuing the wait. If we allow waiting for flush on the primary as a
> user-visible command and the wake-up calls for flush in primary are
> introduced, the question becomes whether we should still abort the
> wait on promotion, or continue waiting, as you noted, given that the
> target LSN might still be reached, albeit on a different timeline. The
> question behind this might be: do users care and should be aware of
> the state change of the server while waiting? If they do, then we
> better stop the waiting and report the error. In this case, I am
> inclined to break the unified flush mode to something like
> primary_flush/standby_flush mode and
> WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
>

After further consideration, it also seems reasonable to use a single,
unified flush mode that works on both primary and standby servers,
provided its semantics are clearly documented to avoid confusion on
failure. I don't have a strong preference between these two and would
be interested in your thoughts.

If a standby is promoted while a session is waiting, the command
should abort and return an error (or report "not in recovery" when
using NO_THROW). At that point, the meaning of the LSN being waited
for may have changed due to the timeline switch and the transition
from standby to primary. An LSN such as 0/5000000 on TLI 2 can
represent entirely different WAL content from 0/5000000 on TLI 1.
Allowing the wait to silently continue across promotion risks giving
users a false sense of safety, for example by interpreting "wait
completed" as "the original data is now durable," which would no
longer be true.

--
Best,
Xuneng
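
To illustrate the abort-on-promotion semantics argued for above, here is a
toy Python model. It is only a sketch and not code from the patch:
ServerState, WaitResult, parse_lsn and wait_for_standby_flush are
hypothetical names invented for this example, and the list of snapshots
stands in for whatever wake-up mechanism the server would really use.

```python
from dataclasses import dataclass
from enum import Enum


class WaitResult(Enum):
    """Possible outcomes of the toy wait (hypothetical, for illustration)."""
    SUCCESS = "success"
    NOT_IN_RECOVERY = "not in recovery"
    TIMEOUT = "timeout"


def parse_lsn(text: str) -> int:
    """Convert 'hi/lo' hexadecimal LSN text such as '0/5000000' to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass
class ServerState:
    """Toy snapshot of the server as observed by the waiting session."""
    in_recovery: bool
    timeline: int
    flush_lsn: int


def wait_for_standby_flush(target: int, snapshots: list) -> WaitResult:
    """Resolve a standby-flush wait against successive server snapshots."""
    for state in snapshots:
        # Check promotion first: once the server has left recovery, the
        # target LSN may refer to different WAL content on a new timeline,
        # so the wait is aborted even if the numeric LSN has been passed.
        if not state.in_recovery:
            return WaitResult.NOT_IN_RECOVERY
        if state.flush_lsn >= target:
            return WaitResult.SUCCESS
    # Ran out of snapshots without reaching the target: model as a timeout.
    return WaitResult.TIMEOUT


target = parse_lsn("0/5000000")
promoted = [
    ServerState(in_recovery=True, timeline=1, flush_lsn=parse_lsn("0/4000000")),
    ServerState(in_recovery=False, timeline=2, flush_lsn=parse_lsn("0/6000000")),
]
print(wait_for_standby_flush(target, promoted).value)
```

In the second snapshot the flush position on TLI 2 is numerically past the
target, yet the model reports "not in recovery" rather than success, which
is the behavior proposed above for a wait interrupted by promotion.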
