Re: [WiP] B-tree page merge during vacuum to reduce index bloat
From | Matthias van de Meent
---|---
Subject | Re: [WiP] B-tree page merge during vacuum to reduce index bloat
Date |
Msg-id | CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw@mail.gmail.com
In reply to | [WiP] B-tree page merge during vacuum to reduce index bloat (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses | Re: [WiP] B-tree page merge during vacuum to reduce index bloat
List | pgsql-hackers
On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> Together with Kirk and Nik we spent several online hacking sessions to tackle index bloat problems [0,1,2]. As a result we concluded that B-tree index page merge should help to keep an index from pressuring shared_buffers.
>
> We are proposing a feature to automatically merge nearly-empty B-tree leaf pages during VACUUM operations to reduce index bloat. This addresses a common issue where deleted tuples leave behind sparsely populated pages that traditional page deletion cannot handle because they're not completely empty.
>
> *** Implementation Overview:
>
> The patch adds a new `mergefactor` reloption (default 5%, range 0-50%) that defines when a page becomes a candidate for merging. During vacuum, when a page exceeds this threshold (e.g., 95% empty with default settings), we attempt to move the remaining tuples to the right sibling page and then delete the source page using existing page deletion mechanisms.
>
> Key changes:
> - New `mergefactor` reloption for configurable merge thresholds
> - Detection logic in `btvacuumpage()` to identify merge candidates
> - Tuple movement implementation in `_bt_unlink_halfdead_page()`
> - WAL logging enhancements to handle cross-page dependencies during replay
>
> The last part needs further improvements (it's simply REGBUF_FORCE_IMAGE for now), but I want to start a discussion and ask for known problems of the approach.
>
> *** Correctness:
>
> The implementation reuses existing locking, critical sections and WAL logging infrastructure. To handle cross-page dependencies during WAL replay, when tuples are merged, the right sibling buffer is registered with `REGBUF_FORCE_IMAGE`; this is a temporary workaround.
>
> 1. Backward Scan Correctness: The primary concern is ensuring backward scans work correctly when pages are being merged concurrently.
> Since we maintain the same locking protocol as existing page deletion, I believe this should be safe, but would appreciate expert review of the approach.

I'm fairly sure there is a correctness issue here; I don't think you correctly detect the two cases that follow from this sequence of events:

1.) a page (P0) is scanned by a scan, which finishes processing the results and releases its pin. It prepares to scan the next page of the scan (P1).
2.) a page (A) is split, with new right sibling page B,
3.) the newly created page B is merged into its right sibling C,
4.) the scan continues on to the next page.

For backward scans, if page A is the same page as the one identified with P1, the scan won't notice that tuples from P1 (aka A) have been moved through B to P0 (C), causing the scan to skip processing for those tuples.

For forward scans, if page A is the same page as the one identified with P0, the scan won't notice that tuples from P0 (A) have been moved through B to P1 (C), causing the scan to process those tuples twice in the same scan, potentially duplicating results.

NB: Currently, the only way for a "merge" to happen is when the index page is completely empty. This guarantees that there is no movement of scan-visible tuples to pages we've already visited or are about to visit. This invariant is used extensively to limit lock and pin coupling (and thus: improve performance) in index scans; see e.g. commit 1bd4bc85. This patch will invalidate that invariant, and therefore it will require (significantly) more work in the scan code (incl. nbtsearch.c) to guarantee exactly-once results and no false negatives.

Kind regards,

Matthias van de Meent
Databricks