Thread: Why is this query touching 4gb of buffers?

Why is this query touching 4gb of buffers?

From:
hubert depesz lubaczewski
Date:
Hi,
I have a weird-ish case that I can't grok, or at least explain in
a hand-wavy way.

Very simple query:

SELECT
    some_table.communication_channel_id,
    some_table.root_account_id
FROM
    some_schema.some_table
WHERE
    workflow_state = 'pending' AND
    send_at <= '2025-10-23 12:01:13';

On a complex-ish table:

                                                      Table "some_schema.some_table"
             Column              │            Type             │ Collation │ Nullable │
Default

═════════════════════════════════╪═════════════════════════════╪═══════════╪══════════╪═══════════════════════════════════════════════════════════════════
 id                              │ bigint                      │           │ not null │
nextval('some_schema.some_table_id_seq'::regclass)
 notification_id                 │ bigint                      │           │          │
 notification_policy_id          │ bigint                      │           │          │
 context_id                      │ bigint                      │           │          │
 context_type                    │ character varying(255)      │           │          │
 communication_channel_id        │ bigint                      │           │          │
 frequency                       │ character varying(255)      │           │          │
 workflow_state                  │ character varying(255)      │           │          │
 batched_at                      │ timestamp without time zone │           │          │
 created_at                      │ timestamp without time zone │           │          │
 updated_at                      │ timestamp without time zone │           │          │
 send_at                         │ timestamp without time zone │           │          │
 link                            │ text                        │           │          │
 name_of_topic                   │ text                        │           │          │
 summary                         │ text                        │           │          │
 root_account_id                 │ bigint                      │           │          │
 notification_policy_override_id │ bigint                      │           │          │
Indexes:
    "some_table_pkey" PRIMARY KEY, btree (id), tablespace "data1"
…
    "index_some_table_pending" btree (send_at) WHERE workflow_state::text = 'pending'::text, tablespace "data1"
…
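
For reference, the partial index being hit corresponds to a definition along these lines (reconstructed from the \d output above, not the original DDL):

CREATE INDEX index_some_table_pending
    ON some_schema.some_table (send_at)
    WHERE workflow_state = 'pending';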

Running this via explain shows:

 Index Scan using index_some_table_pending on some_table  (cost=0.43..399992.44 rows=2215063 width=16) (actual time=382.466..382.466 rows=0 loops=1)
   Index Cond: (send_at <= '2025-10-23 12:01:13'::timestamp without time zone)
   Buffers: shared hit=554347
 Planning Time: 0.082 ms
 Execution Time: 382.482 ms
(5 rows)

Specifically, I see that it returned 0 rows, and to do it, it had to "touch" 554k buffers - ~ 4GB of data.

Sizes of the relevant things:
                 oid                  │ relkind │ pg_relation_size │ pg_table_size │ pg_total_relation_size
══════════════════════════════════════╪═════════╪══════════════════╪═══════════════╪════════════════════════
 some_schema.some_table               │ r       │      15339020288 │   15376539648 │            22347988992
 some_schema.index_some_table_pending │ i       │         77127680 │      77168640 │               77168640
(2 rows)
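
For reference, numbers like these come from pg_class plus the size functions, roughly something like:

SELECT c.oid::regclass AS oid,
       c.relkind,
       pg_relation_size(c.oid),
       pg_table_size(c.oid),
       pg_total_relation_size(c.oid)
FROM pg_class c
WHERE c.oid IN ('some_schema.some_table'::regclass,
                'some_schema.index_some_table_pending'::regclass);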


I didn't check the analyze time, so I can't comment on that, but I did a reindex of
this index, which shrank it to 16973824 bytes, and now the query runs, as expected, in < 1ms:

 Index Scan using index_some_table_pending on some_table  (cost=0.43..392423.37 rows=2215272 width=16) (actual time=0.005..0.005 rows=0 loops=1)
   Index Cond: (send_at <= '2025-10-23 12:01:13'::timestamp without time zone)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.787 ms
 Execution Time: 0.016 ms
(7 rows)
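
For reference, the reindex was roughly something like this (on a busy table one would normally use CONCURRENTLY):

REINDEX INDEX some_schema.index_some_table_pending;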

While talking with others, certain terms came up, like "unbounded
range", which I understand, but I still don't see why, with the same
stats, and NO rows returned by the index - before the reindex - Pg has
to do stuff to 4GB of data?! Could anyone try to explain?

If it matters, this is a rather old Pg: PostgreSQL 14.7, and the query was
running on a streaming replica, but results on the primary were basically the
same.

Best regards,

depesz




Re: Why is this query touching 4gb of buffers?

From:
hubert depesz lubaczewski
Date:
On Fri, Oct 24, 2025 at 01:01:48PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> I have a weird-ish case that I can't grok, or at least explain in
> a hand-wavy way.

A bit more info. Due to how the database is set up we have MANY "copies"
of the same table - same name, same columns, different schema, different
(but similar) data.

So, I tested the problem on another instance of this table. And there
was something really, really weird.

(This query has a DISTINCT on two columns, but that doesn't seem to be
relevant; tests with and without it showed no meaningful difference.)
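
The query itself was roughly of this shape (reconstructed from the plan below; the exact text isn't shown here):

SELECT DISTINCT
    communication_channel_id,
    root_account_id
FROM
    some_schema.some_table
WHERE
    workflow_state = 'pending' AND
    send_at <= '2025-10-23 12:35:48';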

First run of the query generated:

                                                                                QUERY PLAN

══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 HashAggregate  (cost=181382.51..181745.19 rows=36268 width=16) (actual time=27962.826..27962.826 rows=0 loops=1)
   Group Key: communication_channel_id, root_account_id
   Batches: 1  Memory Usage: 1561kB
   Buffers: shared hit=4624 read=117838 dirtied=486
   ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178322.57 rows=611988 width=16) (actual time=27962.567..27962.567 rows=0 loops=1)
         Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
         Buffers: shared hit=4624 read=117838 dirtied=486
 Planning:
   Buffers: shared hit=174
 Planning Time: 1.863 ms
 Execution Time: 27963.620 ms
(11 rows)

Then, immediately I reran it, without reindex, without analyze, without anything. And I got:

                                                                            QUERY PLAN

══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 HashAggregate  (cost=181388.32..181751.00 rows=36268 width=16) (actual time=0.653..0.653 rows=0 loops=1)
   Group Key: communication_channel_id, root_account_id
   Batches: 1  Memory Usage: 1561kB
   Buffers: shared hit=424
   ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178328.27 rows=612009 width=16) (actual time=0.438..0.438 rows=0 loops=1)
         Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
         Buffers: shared hit=424
 Planning Time: 0.123 ms
 Execution Time: 1.237 ms
(9 rows)

Time is irrelevant, the point is that we are going down from ~120k buffers
"touched" to 424 buffers. What is going on?

I tested the same case on yet another setup, and ran a similar query
multiple times in a row on the secondary, and got:
Buffers: shared hit=113849 read=198047 => Execution Time: 1359.661 ms
Buffers: shared hit=311896 => Execution Time: 246.143 ms

But when I ran the query on the primary server there was a very visible
change in buffer accesses:

Buffers: shared hit=114893 read=197776 dirtied=5528 => Execution Time: 75863.479 ms
Buffers: shared hit=775 => Execution Time: 2.360 ms

This didn't change the buffer "touches" on the secondary, though.

What am I missing?

Best regards,

depesz





Re: Why is this query touching 4gb of buffers?

From:
Tom Lane
Date:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> First run of the query generated:
> ...
>    ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178322.57 rows=611988 width=16) (actual time=27962.567..27962.567 rows=0 loops=1)
>          Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
>          Buffers: shared hit=4624 read=117838 dirtied=486

> Then, immediately I reran it, without reindex, without analyze, without anything. And I got:

>    ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178328.27 rows=612009 width=16) (actual time=0.438..0.438 rows=0 loops=1)
>          Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
>          Buffers: shared hit=424

> Time is irrelevant, the point is that we are going down from ~120k buffers
> "touched" to 424 buffers. What is going on?

The first execution probably had to set hint bits on a whole lot
of recently-deleted rows.
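
A rough way to gauge that (a sketch; uses the stats views and the table name from upthread) would be:

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 WHERE schemaname = 'some_schema' AND relname = 'some_table';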

            regards, tom lane



Re: Why is this query touching 4gb of buffers?

From:
hubert depesz lubaczewski
Date:
On Fri, Oct 24, 2025 at 08:54:06AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > First run of the query generated:
> > ...
> >    ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178322.57 rows=611988 width=16) (actual time=27962.567..27962.567 rows=0 loops=1)
> >          Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
> >          Buffers: shared hit=4624 read=117838 dirtied=486
> 
> > Then, immediately I reran it, without reindex, without analyze, without anything. And I got:
> 
> >    ->  Index Scan using index_some_table_pending on some_table  (cost=0.42..178328.27 rows=612009 width=16) (actual time=0.438..0.438 rows=0 loops=1)
> >          Index Cond: (send_at <= '2025-10-23 12:35:48'::timestamp without time zone)
> >          Buffers: shared hit=424
> 
> > Time is irrelevant, the point is that we are going down from ~120k buffers
> > "touched" to 424 buffers. What is going on?
> 
> The first execution probably had to set hint bits on a whole lot
> of recently-deleted rows.

But why doesn't it happen/help on the secondary?

Subsequent runs on the secondary still have to "touch" hundreds of thousands
of pages, even if I do the run on the primary that would set the hint
bits.

So, on the primary - rerunning the query fixes the "how many pages we touch"
- but on the secondary, the number generally doesn't change, at least
within a 15-20 minute window.

Best regards,

depesz




Re: Why is this query touching 4gb of buffers?

From:
Tom Lane
Date:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Fri, Oct 24, 2025 at 08:54:06AM -0400, Tom Lane wrote:
>> The first execution probably had to set hint bits on a whole lot
>> of recently-deleted rows.

> But why it doesn't happen/help on secondary?

IIRC, secondaries aren't authorized to update hint bits for
themselves, they have to wait for the primary to do it and then
propagate the new data.  There might also be some question of
what the oldest open transaction is ...

            regards, tom lane



Re: Why is this query touching 4gb of buffers?

From:
hubert depesz lubaczewski
Date:
On Fri, Oct 24, 2025 at 09:01:11AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > On Fri, Oct 24, 2025 at 08:54:06AM -0400, Tom Lane wrote:
> >> The first execution probably had to set hint bits on a whole lot
> >> of recently-deleted rows.
> 
> > But why it doesn't happen/help on secondary?
> 
> IIRC, secondaries aren't authorized to update hint bits for
> themselves, they have to wait for the primary to do it and then
> propagate the new data.  There might also be some question of
> what the oldest open transaction is ...

OK. So I tested this idea on yet another "setup".

We have a single primary, and two streaming replicas. Before the test I ran:
select now() - min(xact_start), now() - pg_last_xact_replay_timestamp() from pg_stat_activity

on both replicas, and got:

    ?column?     │    ?column?
═════════════════╪═════════════════
 00:00:00.003007 │ 00:00:00.003673
(1 row)

and

 ?column? │     ?column?
══════════╪══════════════════
 00:00:00 │ -00:00:00.006129
(1 row)

Then I ran the problematic query on replica 1 (the first one), and got
numbers:

Buffers: shared hit=21107
Execution Time: 18.621 ms

Subsequent runs on the replica showed the same buffers usage, and
similar time.

Then I ran this query twice on the primary, and noticed an improvement:
Buffers: shared hit=569 read=20927 dirtied=498
Execution Time: 2596.283 ms

and on 2nd run:

Buffers: shared hit=391
Execution Time: 2.015 ms

Awesome. Well, mostly.

Now, I waited ~1 minute, checked replication lag and the oldest
transactions again; all lags were < 1s, and the longest transaction across
primary/secondary1/secondary2 was ~4s.

Then I re-ran the query on secondary 1 and got:
Buffers: shared hit=21107
Execution Time: 18.403 ms

Interestingly, on one "setup" running vacuum analyze of the table helped
execution on secondary, but on this one, it doesn't.
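
For reference, the vacuum analyze was just the plain per-table form, roughly:

VACUUM (ANALYZE) some_schema.some_table;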

Best regards,

depesz