Discussion: New access method for b-tree.


New access method for b-tree.

From:
Alexandre Felipe
Date:
Hello Hackers,

Please take a look at the attached patch.

It is an access method that scans a table ordered by the second column of an index while filtering on the first, using a btree index on (x, y). Queries like

SELECT x, y FROM grid
WHERE x IN (array of Nx elements)
ORDER BY y, x
LIMIT M

can execute by streaming the rows directly from disk instead of loading everything.

On an N x N grid it fetches only what is necessary.
A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, and O(N * Nx * log(N * Nx)) compute (assuming a generic in-memory sort).

The proposed access method does it in O(M + Nx) I/O, O(Nx) space, and O(M * log(Nx)) compute.
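To make the idea concrete, here is a rough Python sketch of the core loop (the names and the dict-of-lists standing in for per-prefix index scans are mine for illustration, not from the patch): one cursor per prefix value, merged through a binary heap, so producing M rows touches O(M + Nx) entries and costs O(M * log(Nx)) comparisons.

```python
import heapq

def merge_scan(index, prefixes, limit):
    """Merge-scan sketch: one cursor per prefix value, advanced lazily.

    `index` maps each prefix x to its list of y values in ascending
    order (a stand-in for a per-prefix b-tree index scan).  Yields
    (y, x) pairs in (y, x) order, stopping after `limit` rows, so only
    O(limit + len(prefixes)) entries are ever touched.
    """
    heap = []
    for x in prefixes:
        rows = index.get(x, [])
        if rows:                        # one "index descent" per prefix
            heapq.heappush(heap, (rows[0], x, 0))
    produced = 0
    while heap and produced < limit:
        y, x, pos = heapq.heappop(heap)
        yield (y, x)
        produced += 1
        rows = index[x]
        if pos + 1 < len(rows):         # advance only the winning cursor
            heapq.heappush(heap, (rows[pos + 1], x, pos + 1))

# Toy grid: every prefix 1..5 has suffixes 1..4
grid = {x: [1, 2, 3, 4] for x in range(1, 6)}
print(list(merge_scan(grid, [2, 4], 3)))  # -> [(1, 2), (1, 4), (2, 2)]
```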


Kind Regards,


Alexandre Felipe

Research & Development Engineer

   

 


Attachments

Re: New access method for b-tree.

From:
Tomas Vondra
Date:
Hello Felipe,

On 2/1/26 11:02, Alexandre Felipe wrote:
> Hello Hackers,
> 
> Please check this out,
> 
> It is an access method to scan a table sorting by the second column of
> an index, filtered on the first.
> Queries like
> SELECT x, y FROM grid
> WHERE x in (array of Nx elements)
> ORDER BY y, x
> LIMIT M
> 
> Can execute streaming the rows directly from disk instead of loading
> everything.
> 
> Using  btree index on (x, y)
> 
> On an N x N grid it fetches only what is necessary.
> A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, and
> O(N * Nx * log(N * Nx)) compute (assuming a generic in-memory sort).
> 
> The proposed access method does it O(M + Nx) I/O, O(Nx) space, and O(M *
> log(Nx)) compute.
> 

So how does this compare to skip scan in practice? It's hard to compare,
as the patch does not implement an actual access path, but I tried this:

  CREATE TABLE merge_test_int (
    prefix_col int4,
    suffix_col int4
  );

  INSERT INTO merge_test_int
  SELECT p, s
  FROM generate_series(1, 10000) AS p,
       generate_series(1, 1000) AS s;
  CREATE INDEX merge_test_int_idx
      ON merge_test_int (prefix_col, suffix_col);

and then

1) master

  SELECT * FROM merge_test_int
  WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
    AND suffix_col >= 900
  ORDER BY suffix_col LIMIT 100;

vs.

2) merge scan

  SELECT * FROM test_btree_merge_scan_int(

      'merge_test_int',
      'merge_test_int_idx',
      ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
      900,
      100);

And with explain analyze we get this:

1) Buffers: shared hit=26 read=25

2) Buffers: shared hit=143 read=17

So it seems to access many more buffers, even if the number of reads is
lower. Presumably the merge scan is not always better than skip scan,
probably depending on number of prefixes in the query etc. What is the
cost model to decide between those two?

If you had to construct the best case and worst cases (vs. skip scan),
what would that look like?

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?


regards

-- 
Tomas Vondra




Re: New access method for b-tree.

From:
Matthias van de Meent
Date:
On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:
>
> Hello Felipe,
>
> On 2/1/26 11:02, Alexandre Felipe wrote:
> > Hello Hackers,
> >
> > Please check this out,
> >
> > It is an access method to scan a table sorting by the second column of
> > an index, filtered on the first.
> > Queries like
> > SELECT x, y FROM grid
> > WHERE x in (array of Nx elements)
> > ORDER BY y, x
> > LIMIT M
> >
> > Can execute streaming the rows directly from disk instead of loading
> > everything.

+1 for the idea, it does sound interesting. I haven't looked in depth
at the patch, so no comments on the execution yet.

> So how does this compare to skip scan in practice? It's hard to compare,
> as the patch does not implement an actual access path, but I tried this:
[...]
> 1) master
>
>   SELECT * FROM merge_test_int
>   WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
[...]
> 2) merge scan
>
>   SELECT * FROM test_btree_merge_scan_int(
[...]
>       ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[...]
> And with explain analyze we get this:
>
> 1) Buffers: shared hit=26 read=25
> 2) Buffers: shared hit=143 read=17

(FYI: your first query was missing "2" from its IN list while it was
present in the merge scan input; this makes the difference worse by a
few pages)

> So it seems to access many more buffers, even if the number of reads is
> lower. Presumably the merge scan is not always better than skip scan,
> probably depending on number of prefixes in the query etc. What is the
> cost model to decide between those two?

Skip scan always returns data in index order, while this merge scan
would return tuples in suffix order. The cost model would thus weigh
the cost of sorting the result of an index skipscan against the cost
of doing a merge join on n_in_list_items distinct (partial) index
scans.

As for when you would benefit in buffers accessed: The merge scan
would mainly benefit in number of buffers accessed when the selected
prefix values are non-sequential, and the prefixes cover multiple
pages at a time, and when there is a LIMIT clause on the scan. Normal
btree index skip scan infrastructure efficiently prevents new index
descents into the index when the selected SAOP key ranges are directly
adjacent, while merge scan would generally do at least one index
descent for each of its N scan heads (*) - which in the proposed
prototype patch guarantees O(index depth * num scan heads) buffer
accesses.

(*) It is theoretically possible to reuse an earlier index descent if
the SAOP entry's key range of the last descent starts and ends on the
leaf page that the next SAOP entry's key range also starts on
(applying the ideas of 5bf748b86b to this new multi-headed index scan
mode), but that infrastructure doesn't seem to be in place in the
current patch. That commit is also why your buffer access count for
master is so low compared to the merge scan's; if your chosen list of
numbers was multiples of 5 (so that matching tuples are not all
sequential) you'd probably see much more comparable buffer access
counts.

> If you had to construct the best case and worst cases (vs. skip scan),
> what would that look like?

Presumably the best case would be:

-- mytable.a has very few distinct values (e.g. bool or enum);
mytable.b many distinct values (e.g. uuid)
SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;

which the index's merge scan would turn into an index scan that
behaves similar to the following, possibly with the merge join pushed
down into the index:

SELECT * FROM (
    SELECT ... FROM mytable WHERE a = 1
  UNION
    SELECT ... FROM mytable WHERE a = 2
) ORDER BY b;


The worst case would be the opposite:

-- mytable.a has many distinct values (random uuid); mytable.b few
(e.g. boolean; enum)
SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b

As the merge scan maintains one internal indexscan head per SAOP array
element, it'd have significant in-memory and scan startup overhead,
while few values are produced for each of those scan heads.

> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?

I'm not sure, but it seems like it might be common/useful in
queue-like access patterns:

With an index on (state, updated_timestamp) you're probably interested
in all messages in just a subset of states, ordered by recent state
transitions. An index on (updated_timestamp, state) might be
considered more optimal, but won't be able to efficiently serve
queries that only want data on uncommon states: The leaf pages would
mainly contain data on common states, reducing the value of those leaf
pages.

Right now, you can rewrite the "prefix IN (...) ORDER BY suffix" query
using UNION, or add an index for each conceivable IN list, but it'd be
great if the user didn't have to rewrite their query or create
n_combinations indexes with their respective space usage to get this
more efficient query execution.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)



Re: New access method for b-tree.

From:
Ants Aasma
Date:
On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

The workaround I have proposed to users has been to rewrite the
query as a UNION ALL of single-value prefix queries, each wrapped in
an ORDER BY ... LIMIT. This gives exactly the needed Merge Append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
    SELECT * FROM posts
    WHERE user_id = id
    ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

The main problem I can see is that at planning time the cardinality of
the prefix array might not be known, and in theory could be in the
millions. Having millions of index scans open at the same time is not
viable, so the method needs to somehow degrade gracefully. The idea I
had is to pick some limit, based on work_mem and/or benchmarking, and
once the limit is hit, populate the first batch and then run the next
batch of index scans, merging with the first result. Or something like
that, I can imagine a few different ways to handle it with different
tradeoffs.
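One of those ways could be sketched like this (purely my illustration of the batching idea, not code from the patch): cap the number of simultaneously open scan heads, take the top-M of each batch of prefixes with a k-way merge, and fold successive batches into the running result.

```python
import heapq

def batched_top_m(per_prefix_rows, limit, max_heads):
    """Graceful-degradation sketch: never hold more than `max_heads`
    per-prefix scans open at once.  Each batch of prefixes is merged
    for its first `limit` rows, then folded into the running result.

    `per_prefix_rows` maps each prefix to its sorted list of suffix
    values (standing in for per-prefix index scans).
    """
    result = []                                   # best `limit` rows so far
    prefixes = list(per_prefix_rows)
    for start in range(0, len(prefixes), max_heads):
        batch = prefixes[start:start + max_heads]
        streams = [per_prefix_rows[p] for p in batch]
        batch_top = []
        for row in heapq.merge(*streams):         # k-way merge, k <= max_heads
            batch_top.append(row)
            if len(batch_top) == limit:
                break
        # fold this batch into the accumulated top-`limit`
        result = list(heapq.merge(result, batch_top))[:limit]
    return result

rows = {1: [3, 9], 2: [1, 7], 3: [2, 8], 4: [5, 6]}
print(batched_top_m(rows, 3, 2))  # -> [1, 2, 3]
```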

I can imagine that this would really nicely benefit from ReadStream'ification.

One other connection I see is with block nested loops. In a perfect
future PostgreSQL could run the following as a set of merged index
scans that terminate early:

SELECT posts.*
FROM follows f
    JOIN posts p ON f.followed_id = p.user_id
WHERE f.follower_id = :userid
ORDER BY p.tstamp DESC LIMIT 100;

In practice this is not a huge issue - it's not that hard to transform
this to array_agg and = ANY subqueries.

Regards,
Ants Aasma



Re: New access method for b-tree.

From:
Tomas Vondra
Date:

On 2/3/26 17:01, Matthias van de Meent wrote:
> On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hello Felipe,
>>
>> On 2/1/26 11:02, Alexandre Felipe wrote:
>>> Hello Hackers,
>>>
>>> Please check this out,
>>>
>>> It is an access method to scan a table sorting by the second column of
>>> an index, filtered on the first.
>>> Queries like
>>> SELECT x, y FROM grid
>>> WHERE x in (array of Nx elements)
>>> ORDER BY y, x
>>> LIMIT M
>>>
>>> Can execute streaming the rows directly from disk instead of loading
>>> everything.
> 
> +1 for the idea, it does sound interesting. I haven't looked in depth
> at the patch, so no comments on the execution yet.
> 
>> So how does this compare to skip scan in practice? It's hard to compare,
>> as the patch does not implement an actual access path, but I tried this:
> [...]
>> 1) master
>>
>>   SELECT * FROM merge_test_int
>>   WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
> [...]
>> 2) merge scan
>>
>>   SELECT * FROM test_btree_merge_scan_int(
> [...]
>>       ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
> [...]
>> And with explain analyze we get this:
>>
>> 1) Buffers: shared hit=26 read=25
>> 2) Buffers: shared hit=143 read=17
> 
> (FYI: your first query was missing "2" from its IN list while it was
> present in the merge scan input; this makes the difference worse by a
> few pages)
> 
>> So it seems to access many more buffers, even if the number of reads is
>> lower. Presumably the merge scan is not always better than skip scan,
>> probably depending on number of prefixes in the query etc. What is the
>> cost model to decide between those two?
> 
> Skip scan always returns data in index order, while this merge scan
> would return tuples in suffix order. The cost model would thus weigh
> the cost of sorting the result of an index skipscan against the cost
> of doing a merge join on n_in_list_items distinct (partial) index
> scans.
> 

Makes sense.

> As for when you would benefit in buffers accessed: The merge scan
> would mainly benefit in number of buffers accessed when the selected
> prefix values are non-sequential, and the prefixes cover multiple
> pages at a time, and when there is a LIMIT clause on the scan. Normal
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjacent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
> 

Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?

> (*) It is theoretically possible to reuse an earlier index descent if
> the SAOP entry's key range of the last descent starts and ends on the
> leaf page that the next SAOP entry's key range also starts on
> (applying the ideas of 5bf748b86b to this new multi-headed index scan
> mode), but that infrastructure doesn't seem to be in place in the
> current patch. That commit is also why your buffer access count for
> master is so low compared to the merge scan's; if your chosen list of
> numbers was multiples of 5 (so that matching tuples are not all
> sequential) you'd probably see much more comparable buffer access
> counts.
> 
>> If you had to construct the best case and worst cases (vs. skip scan),
>> what would that look like?
> 
> Presumably the best case would be:
> 
> -- mytable.a has very few distinct values (e.g. bool or enum);
> mytable.b many distinct values (e.g. uuid)
> SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
> 
> which the index's merge scan would turn into an index scan that
> behaves similar to the following, possibly with the merge join pushed
> down into the index:
> 
> SELECT * FROM (
>     SELECT ... FROM mytable WHERE a = 1
>   UNION
>     SELECT ... FROM mytable WHERE a = 2
> ) ORDER BY b;
> 
> 
> The worst case would be the opposite:
> 
> -- mytable.a has many distinct values (random uuid); mytable.b few
> (e.g. boolean; enum)
> SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b
> 
> As the merge scan maintains one internal indexscan head per SAOP array
> element, it'd have significant in-memory and scan startup overhead,
> while few values are produced for each of those scan heads.
> 

OK. It'll be interesting to see how this performs in practice for the
whole gamut between the best and worst case.

>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
> 
> I'm not sure, but it seems like it might be common/useful in
> queue-like access patterns:
> 
> With an index on (state, updated_timestamp) you're probably interested
> in all messages in just a subset of states, ordered by recent state
> transitions. An index on (updated_timestamp, state) might be
> considered more optimal, but won't be able to efficiently serve
> queries that only want data on uncommon states: The leaf pages would
> mainly contain data on common states, reducing the value of those leaf
> pages.
> 
> Right now, you can rewrite the "prefix IN (...) ORDER BY suffix" query
> using UNION, or add an index for each conceivable IN list, but it'd be
> great if the user didn't have to rewrite their query or create
> n_combinations indexes with their respective space usage to get this
> more efficient query execution.
> 

I think the examples presented by Ants (with timeline view) are quite
plausible in practice.


regards

-- 
Tomas Vondra




Re: New access method for b-tree.

From:
Tomas Vondra
Date:
On 2/3/26 22:42, Ants Aasma wrote:
> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
> 
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
> 

Makes sense. I guess filtering products by category + order by price
could also produce this query pattern.

> The workaround I have proposed to users has been to rewrite the
> query as a UNION ALL of single-value prefix queries, each wrapped in
> an ORDER BY ... LIMIT. This gives exactly the needed Merge Append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
> 
> SELECT * FROM unnest(:friends) id, LATERAL (
>     SELECT * FROM posts
>     WHERE user_id = id
>     ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
> 
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.
> 

True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.

> The main problem I can see is that at planning time the cardinality of
> the prefix array might not be known, and in theory could be in the
> millions. Having millions of index scans open at the same time is not
> viable, so the method needs to somehow degrade gracefully. The idea I
> had is to pick some limit, based on work_mem and/or benchmarking, and
> once the limit is hit, populate the first batch and then run the next
> batch of index scans, merging with the first result. Or something like
> that, I can imagine a few different ways to handle it with different
> tradeoffs.
> 

Doesn't the proposed merge scan have a similar issue? Because that will
also have to keep all the index scans open (even if only internally).
Indeed, it needs to degrade gracefully in some way. I'm afraid the
proposed batched execution will be rather complex, so I'd say v1 should
simply have a threshold, and fall back to full scan + sort for more items.

> I can imagine that this would really nicely benefit from ReadStream'ification.
> 

Not sure, maybe.

> One other connection I see is with block nested loops. In a perfect
> future PostgreSQL could run the following as a set of merged index
> scans that terminate early:
> 
> SELECT posts.*
> FROM follows f
>     JOIN posts p ON f.followed_id = p.user_id
> WHERE f.follower_id = :userid
> ORDER BY p.tstamp DESC LIMIT 100;
> 
> In practice this is not a huge issue - it's not that hard to transform
> this to array_agg and = ANY subqueries.
> 

Automating that transformation seems quite non-trivial (to me).


regards

-- 
Tomas Vondra




Re: New access method for b-tree.

From:
Michał Kłeczek
Date:


On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:

> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
>
> I have seen this pattern multiple times. My nickname for it is the
> timeline view. Think of the social media timeline, showing posts from
> all followed accounts in timestamp order, returned in reasonably sized
> batches. The naive SQL query will have to scan all posts from all
> followed accounts and pass them through a top-N sort. When the total
> number of posts is much larger than the batch size this is much slower
> than what is proposed here (assuming I understand it correctly) -
> effectively equivalent to running N index scans through Merge Append.
>
> The workaround I have proposed to users has been to rewrite the
> query as a UNION ALL of single-value prefix queries, each wrapped in
> an ORDER BY ... LIMIT. This gives exactly the needed Merge Append plan
> shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
>     SELECT * FROM posts
>     WHERE user_id = id
>     ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had
> to look at one index tuple.

GiST can be used to handle this kind of query, as it supports multiple sort orders.
The only problem is that GiST does not support ORDER BY column.
One possible workaround is [1], but as described there it does not play well with partitioning.
I've started drafting support for ORDER BY column in GiST - see [2].
I think it would be easier to implement and maintain than a new IAM (but I don't have enough knowledge and experience to implement it myself).


Kind regards,
Michal

Re: New access method for b-tree.

From:
Alexandre Felipe
Date:
Thank you for looking into this.

Now we can execute a (still narrow) family of queries!

Maybe it helps to see this as a social network feed. Imagine a social network: you have a few friends, or follow a few people, and you want to see their updates ordered by date. For each user there is a different combination of users to display. But even with hundreds of followed users we may only show the first 10 posts.

There is low-hanging fruit for the skip scan: if we need M rows, and one group has already produced M rows, we can stop scanning that group.
Let Nx be the number of friends and M the number of posts to show.
That runs with complexity O(Nx * M) row accesses followed by an O(Nx * M) sort, instead of O(Nx * N) followed by an O(Nx * N) sort.
With M = 10 and N = 1000 this is a significant improvement.
But if M ~ N, the merge scan wins, running with O(M + Nx) row accesses and O(M + Nx) heap operations.
If everything is on the same page, the skip scan would win.

The cost estimation is probably far off.
I am also not considering the filters applied after this operator, and I don't know if the planner infrastructure is able to adjust for that by itself.
This is where I would like reviewers' feedback; I think the planner costs are something to be determined experimentally.

Next I will make it slightly more general, handling:
* More index columns: an index on (a, b, s...) could support WHERE a IN (...) ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefixes: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefixes: WHERE b IN (...) AND a = const ORDER BY c on an index (a, b, c)

---
Kind Regards,
Alexandre

On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <michal@kleczek.org> wrote:


> On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:
>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
>>> I'm also wondering how common is the targeted query pattern? How common
>>> it is to have an IN condition on the leading column in an index, and
>>> ORDER BY on the second one?
>>
>> I have seen this pattern multiple times. My nickname for it is the
>> timeline view. Think of the social media timeline, showing posts from
>> all followed accounts in timestamp order, returned in reasonably sized
>> batches. The naive SQL query will have to scan all posts from all
>> followed accounts and pass them through a top-N sort. When the total
>> number of posts is much larger than the batch size this is much slower
>> than what is proposed here (assuming I understand it correctly) -
>> effectively equivalent to running N index scans through Merge Append.
>>
>> The workaround I have proposed to users has been to rewrite the
>> query as a UNION ALL of single-value prefix queries, each wrapped in
>> an ORDER BY ... LIMIT. This gives exactly the needed Merge Append plan
>> shape. But repeating the query N times can get unwieldy when the
>> number of values grows, so the fallback is:
>>
>> SELECT * FROM unnest(:friends) id, LATERAL (
>>     SELECT * FROM posts
>>     WHERE user_id = id
>>     ORDER BY tstamp DESC LIMIT 100)
>> ORDER BY tstamp DESC LIMIT 100;
>>
>> The downside of this formulation is that we still have to fetch a
>> batch worth of items from scans where we otherwise would have only had
>> to look at one index tuple.
>
> GiST can be used to handle this kind of query, as it supports multiple sort orders.
> The only problem is that GiST does not support ORDER BY column.
> One possible workaround is [1], but as described there it does not play well with partitioning.
> I've started drafting support for ORDER BY column in GiST - see [2].
> I think it would be easier to implement and maintain than a new IAM (but I don't have enough knowledge and experience to implement it myself).
>
> Kind regards,
> Michal
Attachments