Re: generic plans and "initial" pruning
| From | Amit Langote |
|---|---|
| Subject | Re: generic plans and "initial" pruning |
| Date | |
| Msg-id | CA+HiwqEF9SgKyQ1HrYOURpv8DGRGHDNwBT9Y6yEBVCW+=kh_=w@mail.gmail.com |
| In reply to | Re: generic plans and "initial" pruning (Amit Langote <amitlangote09@gmail.com>) |
| List | pgsql-hackers |
Hi,

On Tue, Jul 22, 2025 at 3:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Thu, Jul 17, 2025 at 9:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > The refinements I described in my email above might help mitigate some
> > of those executor-related issues. However, I'm starting to wonder if
> > it's worth reconsidering our decision to handle pruning, locking, and
> > validation entirely at executor startup, which was the approach taken
> > in the reverted patch.
> >
> > The alternative approach, doing initial pruning and locking within
> > plancache.c itself (which I floated a while ago), might be worth
> > revisiting. It avoids the complications we've discussed around the
> > executor API and preserves the clear separation of concerns that
> > plancache.c provides, though it does introduce some new layering
> > concerns, which I describe further below.
> >
> > To support this, we'd need a mechanism to pass pruning results to the
> > executor alongside each PlannedStmt. For each PartitionPruneInfo in
> > the plan, that would include the corresponding PartitionPruneState and
> > the bitmapset of surviving relids determined by initial pruning. Given
> > that a CachedPlan can contain multiple PlannedStmts, this would
> > effectively be a list of pruning results, one per statement. One
> > reasonable way to handle that might be to define a parallel data
> > structure, separate from PlannedStmt, constructed by plancache.c and
> > carried via QueryDesc. The memory and lifetime management would mirror
> > how ParamListInfo is handled today, leaving the executor API unchanged
> > and avoiding intrusive changes to PlannedStmt.
> >
> > However, one potentially problematic aspect of this design is managing
> > the lifecycle of the relations referenced by PartitionPruneState.
> > Currently, partitioned table relations are opened by the executor
> > after entering ExecutorStart() and closed automatically by
> > ExecEndPlan(), allowing cleanup of pruning states implicitly. If we
> > perform initial pruning earlier, we'd need to keep these relations
> > open longer, necessitating explicit cleanup calls (e.g., a new
> > FinishPartitionPruneState()) invoked by the caller of the executor,
> > such as from ExecutorEnd() or even higher-level callers. This
> > introduces some questionable layering by shifting responsibility for
> > relation management tasks, which ideally belong within the executor,
> > into its callers.
> >
> > My sense is that the complexity involved in carrying pruning results
> > via this parallel data structure was one of the concerns Tom raised
> > previously, alongside the significant pruning code refactoring that
> > the earlier patch required. The latter, at least, should no longer be
> > necessary given recent code improvements.
>
> One point I forgot to mention about this approach is that we'd also
> need to ensure permissions on parent relations are checked before
> performing initial pruning in plancache.c, since pruning may involve
> evaluating user-provided expressions. So in effect, we'd need to
> invoke not just ExecDoInitialPruning(), but also
> ExecCheckPermissions(), or some variant of it, prior to executor
> startup. While manageable, it does add slightly to the complexity.

Sorry for the absence. I've now implemented the approach mentioned above and split it into a series of reasonably isolated patches. The key idea is to avoid taking unnecessary locks when reusing a cached plan.
To achieve that, we need to perform initial partition pruning during cached plan reuse in plancache.c so that only surviving partitions are locked. This requires some plumbing to reuse the result of this "early" pruning during executor startup, because repeating the pruning logic there would be both inefficient and potentially unsafe: a second run that produced different results could leave the executor working with partitions that were never locked. (I don't have proof that this can actually happen, but earlier messages in this thread mention the theoretical risk, so it seems better to be safe.)

This patch set therefore introduces ExecutorPrep(), which allows executor metadata -- such as the initial pruning results (the valid subplan indexes) and the full unpruned_relids set -- to be computed ahead of execution and reused later by ExecutorStart(), as well as during QueryDesc setup in parallel workers using the results shared by the leader. The parallel-query aspect was discussed previously at [1], though I didn't have a solution I liked at the time.

This revives an idea that was last implemented in the patch (v30) posted on Dec 16, 2022. In retrospect, I understand the hesitation Tom might have had about the patch at the time -- its changes to enable early pruning and then feed the results into ExecutorStart() were less than pretty. Thanks to the initial pruning code refactoring that I committed in Postgres 18, those changes now seem much more principled and modular, IMO.

The patch set is structured as follows:

* Refactor partition pruning initialization (0001): separates the setup of the pruning state from its execution by introducing ExecCreatePartitionPruneStates(). This makes the pruning logic easier to reuse and adds the flexibility to perform only the setup and skip pruning in cases where it isn't needed.

* Introduce ExecutorPrep infrastructure (0002): adds ExecutorPrep() and ExecPrep as a formal way to perform executor setup ahead of execution. This allows pruning results and other metadata to be cached or transferred without triggering execution, and ExecutorStart() can now consume the precomputed prep state from the EState created during ExecutorPrep(). ExecPrepCleanup() handles cleanup when the plan is invalidated during prep and therefore never executed; otherwise the prep state is released in the regular ExecutorEnd() path.

* Allow parallel workers to reuse leader pruning results (0003): lets workers reuse the leader's initial pruning results (valid subplan indexes) and unpruned_relids via ExecutorPrep(). A verification step has the worker cross-check its own pruning decisions against the leader's and throw an error on a mismatch -- so the shared results serve more as a consistency check than as saved work. Perhaps that check should eventually be made debug-only, though I'm not convinced it should be. As mentioned above, this was previously discussed at [1].

* Enable pruning-aware locking in cached/generic plan reuse (0004): extends GetCachedPlan() and CheckCachedPlan() to call ExecutorPrep() on each PlannedStmt in the CachedPlan, locking only the surviving partitions. Adds CachedPlanPrepData to pass the prep results through the plan cache APIs and down to execution via QueryDesc. Also reinstates the firstResultRel locking rule added in 28317de72 but later lost when the earlier pruning patch was reverted, to ensure correctness when all target partitions are pruned.

This approach keeps the plan caching and validation logic self-contained in plancache.c and avoids invasive changes to the executor API.
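To make the intended flow concrete, here is a rough sketch of how the pieces above might fit together. It is illustration only, not code from the patches: ExecutorPrep() and CachedPlanPrepData are named in the descriptions above, but their signatures, the helper AcquireLocksOnUnprunedRelids(), and the QueryDesc field plan_prep are invented here.

/*
 * Sketch only (invented signatures): prep each statement of a generic
 * CachedPlan so that only surviving partitions need to be locked.
 */
List       *prepList = NIL;
ListCell   *lc;

foreach(lc, cplan->stmt_list)
{
    PlannedStmt *pstmt = lfirst_node(PlannedStmt, lc);
    CachedPlanPrepData *prep;

    /* Run initial pruning; record valid subplan indexes and unpruned_relids. */
    prep = ExecutorPrep(pstmt, boundParams);

    /* Lock only the relations that survived initial pruning. */
    AcquireLocksOnUnprunedRelids(prep);

    prepList = lappend(prepList, prep);
}

/*
 * At execution time, the prep data rides along with the QueryDesc, so
 * ExecutorStart() consumes the precomputed state instead of re-pruning.
 */
queryDesc->plan_prep = prep;
ExecutorStart(queryDesc, 0);
/* ... ExecutorRun() / ExecutorFinish() as usual ... */
ExecutorEnd(queryDesc);     /* also releases the prep state */

The point of the sketch is only the ordering: pruning and locking happen while the cached plan is being (re)validated in plancache.c, and the executor later trusts those results instead of repeating the work.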
Benchmark results:

echo "plan_cache_mode = force_generic_plan" >> $PGDATA/postgresql.conf

for p in 32 64 128 256 512 1024; do pgbench -i --partitions=$p > /dev/null 2>&1; echo -ne "$p\t"; pgbench -n -S -T10 -Mprepared | grep tps; done

Master
32    tps = 23841.822407 (without initial connection time)
64    tps = 21578.619816 (without initial connection time)
128   tps = 18090.500707 (without initial connection time)
256   tps = 14152.248201 (without initial connection time)
512   tps = 9432.708423 (without initial connection time)
1024  tps = 5873.696475 (without initial connection time)

Patched
32    tps = 24724.245798 (without initial connection time)
64    tps = 24858.206407 (without initial connection time)
128   tps = 24652.655269 (without initial connection time)
256   tps = 23656.756615 (without initial connection time)
512   tps = 22299.865769 (without initial connection time)
1024  tps = 21911.704317 (without initial connection time)

Comments welcome.

[1] https://www.postgresql.org/message-id/CA%2BHiwqFA%3DswkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw%40mail.gmail.com

--
Thanks, Amit Langote