Обсуждение: A very quick observation of dangling pointers in Postgres pathlists

Поиск
Список
Период
Сортировка

A very quick observation of dangling pointers in Postgres pathlists

От
Andrei Lepikhov
Дата:
Hi,

It looks like a community decision has been developing that Postgres should
separate optimisation features into 'conventional' and 'magic' classes [1]. This
has raised my concern that hidden contracts about pathlists' state and ordering
could lead to subtle bugs if an extension optimisation goes too far.

I think this topic is of interest because of the growing number of features that
impact path choice, such as ‘disable node’ or pg_plan_advice. Also, emerging
techniques that involve two or more levels of plan trees, like ‘eager
aggregation’, might catch another dangling pointer hidden in path lists for a
while. Don’t forget complicated cases with FDW and Custom nodes too.

For this purpose, a tiny debugging extension module, pg_pathcheck [2], has been
invented. It uses create_upper_paths_hook and planner_shutdown_hook. The
extension walks the entire Path tree, starting from the top PlannerInfo, then
recurses into glob::subroots, traversing each RelOptInfo and each pathlist.
Also, it traverses the path→subpath subtrees to ensure that potentially quite
complex path trees are covered when implemented as a single RelOptInfo. For each
pointer it visits, it checks if the NodeTag matches a known Path type. If not,
the memory was freed (and, with CLOBBER_FREED_MEMORY, set to 0x7F) or reused for
something else.

This approach is not free of caveats. For example, most Path nodes and many Plan
nodes fall within the 128-byte gap of the minimal allocated chunk. That means
freeing one path allows the optimiser to immediately allocate another Path node
at a potentially different query tree level. I had such a case at least once in
production. It was actually hard to realise, reproduce, and fix.

Running make check-world tests with the debug module loaded at startup revealed
many cases in which RelOptInfo structures contain dangling pointers. What
exactly do we see there?

The pathlist contents at the moment of an ‘Invalid’ path detection:

* ProjectionPath, Invalid — by far the most common, on JOIN RelOptInfos.
* ProjectionPath, Invalid, SortPath.
* AggPath, Invalid.
* NestPath, Invalid
* HashPath, Invalid
* cheapest_startup_path referencing a dangling pointer, on what looks
like a join of two partitions.
* cheapest_startup_path referencing a dangling pointer on a plain base
RelOptInfo.

The best-known problematic code example causing this issue is
apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
0927d2f46dd.  Quickly fixing it, I see some more combinations have emerged:

* UniquePath, Invalid
* MergePath, Invalid
* SubqueryScanPath, Invalid
* SetOpPath, Invalid
* GatherPath, Path, Invalid
* AppendPath, AggPath, Invalid, AggPath
* HashPath, Invalid
* AppendPath, HashPath, Invalid

These new invalid references occur outside the originally identified code path,
showing that fixing one place does not address the broader issue (maybe my fixes
were wrong?). While some claim that the cost-dominance principle ('the cheapest
path is never invalid') provides safety, I have not found any acknowledgment of
this. As the planner is expanded, undocumented rules leave the system vulnerable.

The purpose of this email is basically to highlight the issue and raise a
discussion on how to solve it. Ashutosh designed a 'smart pointer' approach,
which seems the most balanced and bulletproof way. Another approach: 'used' flag
seems less interesting as well as local memory contexts - we should always
remember about multi-children cases that need freeing unnecessary paths in-place
to reduce memory consumption. But before diving into the code and identifying
origins of these cases, I’d like to know: is it an actual problem, or is the
cost-dominance contract enough?

[1]
https://www.postgresql.org/message-id/CA+TgmoaPgXYYEivQWxyVV=eYhN+T9JAgS9Xe4m7g9wVitVPF8g@mail.gmail.com
[2] https://github.com/danolivo/pg_pathcheck

-- 
regards, Andrei Lepikhov,
pgEdge




Re: A very quick observation of dangling pointers in Postgres pathlists

От
Andrei Lepikhov
Дата:
On 17/04/2026 10:56, Andrei Lepikhov wrote:
> The best-known problematic code example causing this issue is
> apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
> 0927d2f46dd.  Quickly fixing it, I see some more combinations have emerged:

On closer inspection, it looks like all the detected cases come from the same
issue in create_ordered_paths. The ordered_rel has the same path in its pathlist
as the input_rel. Sometimes, this path is removed and freed from ordered_rel,
which leads to a dangling pointer in the child RelOptInfo.

I've attached a patch that shows how to fix the issue. Some regression tests
change because of a hidden rule where a projection and its subpath have
different target lists. Right now, the patch always enforces a projection, even
if the target lists are the same. This is still open for discussion on whether
there's a better way to handle it.

-- 
regards, Andrei Lepikhov,
pgEdge
Вложения

Re: A very quick observation of dangling pointers in Postgres pathlists

От
David Rowley
Дата:
On Tue, 21 Apr 2026 at 19:29, Andrei Lepikhov <lepihov@gmail.com> wrote:
>
> On 17/04/2026 10:56, Andrei Lepikhov wrote:
> > The best-known problematic code example causing this issue is
> > apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
> > 0927d2f46dd.  Quickly fixing it, I see some more combinations have emerged:
>
> On closer inspection, it looks like all the detected cases come from the same
> issue in create_ordered_paths. The ordered_rel has the same path in its pathlist
> as the input_rel. Sometimes, this path is removed and freed from ordered_rel,
> which leads to a dangling pointer in the child RelOptInfo.
>
> I've attached a patch that shows how to fix the issue. Some regression tests
> change because of a hidden rule where a projection and its subpath have
> different target lists. Right now, the patch always enforces a projection, even
> if the target lists are the same. This is still open for discussion on whether
> there's a better way to handle it.

IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo.  Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.

David



Re: A very quick observation of dangling pointers in Postgres pathlists

От
Alena Rybakina
Дата:
On 17.04.2026 11:56, Andrei Lepikhov wrote:

> Hi,
>
> It looks like a community decision has been developing that Postgres should
> separate optimisation features into 'conventional' and 'magic' classes [1]. This
> has raised my concern that hidden contracts about pathlists' state and ordering
> could lead to subtle bugs if an extension optimisation goes too far.
>
> I think this topic is of interest because of the growing number of features that
> impact path choice, such as ‘disable node’ or pg_plan_advice. Also, emerging
> techniques that involve two or more levels of plan trees, like ‘eager
> aggregation’, might catch another dangling pointer hidden in path lists for a
> while. Don’t forget complicated cases with FDW and Custom nodes too.
>
> For this purpose, a tiny debugging extension module, pg_pathcheck [2], has been
> invented. It uses create_upper_paths_hook and planner_shutdown_hook. The
> extension walks the entire Path tree, starting from the top PlannerInfo, then
> recurses into glob::subroots, traversing each RelOptInfo and each pathlist.
> Also, it traverses the path→subpath subtrees to ensure that potentially quite
> complex path trees are covered when implemented as a single RelOptInfo. For each
> pointer it visits, it checks if the NodeTag matches a known Path type. If not,
> the memory was freed (and, with CLOBBER_FREED_MEMORY, set to 0x7F) or reused for
> something else.
>
> This approach is not free of caveats. For example, most Path nodes and many Plan
> nodes fall within the 128-byte gap of the minimal allocated chunk. That means
> freeing one path allows the optimiser to immediately allocate another Path node
> at a potentially different query tree level. I had such a case at least once in
> production. It was actually hard to realise, reproduce, and fix.

Hi! I raised such a problem before in this thread and proposed a patch 
to delete freed refused paths from pathlist.

You can find it here [0] of you are interested.

[0] 

https://www.postgresql.org/message-id/flat/CAExHW5uhc5JVOUExjo24oYLLcJAyD04%2BBRb080sV08pO_%3D7w%3DA%40mail.gmail.com#2c0c5f2aca79753e1c8886d3f54e7d25

-- 
-----------
Best regards,
Alena Rybakina




Re: A very quick observation of dangling pointers in Postgres pathlists

От
Andrei Lepikhov
Дата:
On 21/04/2026 10:35, David Rowley wrote:
> On Tue, 21 Apr 2026 at 19:29, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> I've attached a patch that shows how to fix the issue. Some regression tests
>> change because of a hidden rule where a projection and its subpath have
>> different target lists. Right now, the patch always enforces a projection, even
>> if the target lists are the same. This is still open for discussion on whether
>> there's a better way to handle it.
> 
> IMO, we should write a function like copy_path() or reparent_path(),
> which creates a copy of the given Path, or the latter also would copy
> then set the ->parent to the given RelOptInfo.  Any time we use a path
> directly from the pathlist of another RelOptInfo, we should reparent
> or copy it. We could add an Assert in add_path() to check the new path
> has the correct parent to help us find the places where we forget to
> do this.

It would be great to have a copy_path() function. At the moment, I create a
limited version each time in an extension module, using
reparameterize_path_by_child as a guide since it ensures the core can handle
path copies.
Do you mean we can introduce such a copy routine to fix current issue? Here is
the problem: dangling pointers are detected only by external tools. I can't
imagine an SQL reproducer to test this machinery.

-- 
regards, Andrei Lepikhov,
pgEdge



Re: A very quick observation of dangling pointers in Postgres pathlists

От
David Rowley
Дата:
On Tue, 21 Apr 2026 at 20:54, Andrei Lepikhov <lepihov@gmail.com> wrote:
>
> On 21/04/2026 10:35, David Rowley wrote:
> > IMO, we should write a function like copy_path() or reparent_path(),
> > which creates a copy of the given Path, or the latter also would copy
> > then set the ->parent to the given RelOptInfo.  Any time we use a path
> > directly from the pathlist of another RelOptInfo, we should reparent
> > or copy it. We could add an Assert in add_path() to check the new path
> > has the correct parent to help us find the places where we forget to
> > do this.
>
> It would be great to have a copy_path() function. At the moment, I create a
> limited version each time in an extension module, using
> reparameterize_path_by_child as a guide since it ensures the core can handle
> path copies.
> Do you mean we can introduce such a copy routine to fix current issue? Here is
> the problem: dangling pointers are detected only by external tools. I can't
> imagine an SQL reproducer to test this machinery.

I had anticipated that we'd only fix in master as we'd probably need a
new callback in CustomPathMethods.

David