[Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

From: Adam Brusselback
Date:
Attached is a patch implementing support for a WHERE clause in REFRESH MATERIALIZED VIEW.

The syntax allows for targeted refreshes:
    REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
    REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';

I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ

This allows the user to restrict the refresh operation to a subset of the view. The qualification is applied to the view's output columns. The optimizer can then push this condition down to the underlying base tables, avoiding a full scan when only a known subset of data has changed.
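
As a concrete illustration (loosely modeled on the invoice_summary view from the attached benchmark, though the exact column definitions here are hypothetical), the predicate targets the MV's output column, which is also the grouping key of the defining query:
    CREATE MATERIALIZED VIEW invoice_summary AS
    SELECT i.invoice_id,
           count(*)      AS line_count,
           sum(l.amount) AS total_amount
    FROM invoices i
    JOIN invoice_lines l USING (invoice_id)
    GROUP BY i.invoice_id;

    -- The WHERE clause references the MV's output column invoice_id; since it
    -- is the grouping key, the planner can push the condition into the scans
    -- of invoices and invoice_lines rather than recomputing the whole view.
    REFRESH MATERIALIZED VIEW invoice_summary WHERE invoice_id = ANY('{1,2,3}');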

Implementation notes:

1. The grammar accepts an optional WHERE clause. We forbid volatile functions in the clause to ensure correctness.

2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified, the operation performs an in-place modification using a `ROW EXCLUSIVE` lock.
    *   This mode requires a unique index to ensure constraint violations are handled correctly (e.g., when a row's values change such that it "drifts" into or out of the `WHERE` clause scope).
    *   It executes a Prune + Upsert strategy (a rough SQL-level sketch follows this list):
        * `DELETE` all rows in the materialized view that match the `WHERE` clause.
        * `INSERT` the new data from the source query.
    *   It uses `ON CONFLICT DO UPDATE` during the insert phase to handle concurrency edge cases, ensuring the refresh is robust against constraint violations.

3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting the scope of the diff (and the temporary table population) to the rows matching the predicate. This requires an `EXCLUSIVE` lock and a unique index, consistent with existing concurrent refresh behavior. It is much slower than the non-concurrent partial refresh.

4. The execution logic uses SPI to inject the predicate into the source queries at refresh time.
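
To make the Prune + Upsert strategy from item 2 concrete, here is a rough SQL-level equivalent of what a non-concurrent partial refresh performs internally (illustrative only: DML cannot actually be run against a materialized view, and the names reuse the hypothetical invoice_summary definition above):
    -- Prune: remove every MV row matching the predicate.
    DELETE FROM invoice_summary WHERE invoice_id = ANY('{1,2,3}');

    -- Upsert: re-insert those rows from the view's defining query, restricted
    -- to the same predicate.  ON CONFLICT handles rows that drift into the
    -- predicate's scope and collide with tuples that were not pruned; it
    -- relies on the required unique index on invoice_id.
    INSERT INTO invoice_summary (invoice_id, line_count, total_amount)
    SELECT i.invoice_id, count(*), sum(l.amount)
    FROM invoices i
    JOIN invoice_lines l USING (invoice_id)
    WHERE i.invoice_id = ANY('{1,2,3}')
    GROUP BY i.invoice_id
    ON CONFLICT (invoice_id) DO UPDATE
        SET line_count   = EXCLUDED.line_count,
            total_amount = EXCLUDED.total_amount;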

I have attached a benchmark suite to validate performance and correctness:

*   `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing system (`invoices` and `invoice_lines`). It includes an aggregated materialized view (`invoice_summary`) and a control table (`invoice_summary_table`).
*   `workload_*.sql`: pgbench scripts simulating a high-churn environment (45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset sizes while generating significant refresh work.
*   `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across multiple scale factors and concurrency levels.

The benchmark compares strategies for keeping a summary up to date (vs baseline):
*   Partial Refresh: Triggers on the base table collect modified IDs and execute `REFRESH MATERIALIZED VIEW ... WHERE ...` (a rough trigger sketch follows this list).
*   Materialized Table (Control): A standard table maintained via complex PL/pgSQL triggers (the traditional manual workaround).
*   Full Refresh (Legacy): Manually refresh the view after changes.
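
For reference, a rough sketch of what such a partial-refresh trigger could look like (hypothetical; the attached workload scripts are the authoritative version, and this assumes the mv_benchmark schema from setup.sql):
    -- Gather the invoice ids touched by the statement and refresh only the
    -- matching rows of the summary MV.
    CREATE FUNCTION mv_benchmark.refresh_invoice_summary() RETURNS trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        ids int[];
    BEGIN
        SELECT array_agg(DISTINCT invoice_id) INTO ids
        FROM (SELECT invoice_id FROM new_rows
              UNION ALL
              SELECT invoice_id FROM old_rows) AS touched;
        IF ids IS NOT NULL THEN
            EXECUTE format(
                'REFRESH MATERIALIZED VIEW mv_benchmark.invoice_summary WHERE invoice_id = ANY(%L)',
                ids);
        END IF;
        RETURN NULL;
    END;
    $$;

    -- Transition tables cannot be attached to a multi-event trigger, so only
    -- the UPDATE case is shown; INSERT (NEW TABLE only) and DELETE (OLD TABLE
    -- only) need analogous triggers.
    CREATE TRIGGER invoice_lines_refresh_summary
        AFTER UPDATE ON mv_benchmark.invoice_lines
        REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
        FOR EACH STATEMENT
        EXECUTE FUNCTION mv_benchmark.refresh_invoice_summary();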

Results are below:
Concurrency: 1 client(s)
----------------------------------------------------------------------------------
Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000           1 | 5309.05      | 0.002x        0.437x        0.470x      
20000          50 | 1209.32      | 0.010x        0.600x        0.598x      
20000        1000 | 56.05        | 0.164x        0.594x        0.576x      
400000          1 | 5136.91      | 0    x        0.450x        0.487x      
400000         50 | 1709.17      | 0    x        0.497x        0.482x      
400000       1000 | 110.35       | 0.006x        0.507x        0.460x      

Concurrency: 4 client(s)
----------------------------------------------------------------------------------
Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000           1 | 19197.50     | 0x            0.412x        0.435x      
20000          50 | 1016.14      | 0.007x        0.966x        1.036x      
20000        1000 | 9.94         | 0.708x        1.401x        1.169x      
400000          1 | 19637.36     | 0x            0.436x        0.483x      
400000         50 | 4669.32      | 0x            0.574x        0.566x      
400000       1000 | 23.26        | 0.029x        1.147x        0.715x      

Concurrency: 8 client(s)
----------------------------------------------------------------------------------
Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000           1 | 30358.32     | 0x            0.440x        0.457x
20000          50 | 262.75       | 0.026x        2.943x        2.740x
20000        1000 | 11.28        | 0.575x        0.840x        0.578x
400000          1 | 36007.15     | 0x            0.430x        0.464x
400000         50 | 6664.58      | 0x            0.563x        0.494x
400000       1000 | 11.61        | 0.058x        1.000x        1.277x



In these tests, the partial refresh behaves as O(delta) rather than O(total), performing comparably to the manual PL/pgSQL approach but with significantly lower code complexity for the user.

I recognize that adding a WHERE clause to REFRESH is an extension to the SQL standard. I believe the syntax is intuitive, but I am open to discussion regarding alternative implementation strategies or syntax if the community feels a different approach is warranted.

New regression tests are included in the patch.

This is my first time submitting a patch to PostgreSQL, so please bear with me if I've missed anything or made any procedural mistakes. I'm happy to address any feedback.

Thanks,
Adam Brusselback
Attachments

Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

From: Kirk Wolak
Date:
On Mon, Dec 8, 2025 at 3:58 PM Adam Brusselback <adambrusselback@gmail.com> wrote:
Attached is a patch implementing support for a WHERE clause in REFRESH MATERIALIZED VIEW.

The syntax allows for targeted refreshes:
    REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
    REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';

I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ

+1 (But I was in that hacking session).

  Our situation was a wonderful MV with all the columns we needed (some hard to calculate) to augment search data done millions of times/day.  It was a thing of beauty.  Until we realized we needed to update 1 record (vendor inventory UPDATE date/time) every time we processed a file (something we do 24x7, a hundred times each hour!).

  For that ONE field, we ended up doing REFRESH MV concurrently;  OVER 2,000 times per day.

  Our understanding is that many people run into this exact issue.  The cache needs small frequent updates.
(After reading the code that handles MVs, we just created our own TABLE, and maintain it with a scheduler to rebuild HOURLY,
and when we process the file, a Simple UPDATE is issued for the one column).

  While this "Works", the CONCEPT of this patch (untested by me, as of yet), would have fixed this with far less effort,
and would be easier to maintain.

  After I review the code, I will add additional comments.

  I am curious what others think?  (And FWIW, I believe that the larger the MV, the MORE this feature is needed,
vs refreshing the ENTIRE view).

Regards...

Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

From: Nico Williams
Date:
On Tue, Dec 09, 2025 at 12:08:58AM -0500, Kirk Wolak wrote:
>   Our understanding is that many people run into this exact issue.  The
> cache needs small frequent updates.
> (After reading the code that handles MVs, we just created our own TABLE,
> and maintain it with a scheduler to rebuild HOURLY,
> and when we process the file, a Simple UPDATE is issued for the one column).
> 
>   While this "Works", the CONCEPT of this patch (untested by me, as of
> yet), would have fixed this with far less effort,
> and would be easier to maintain.
> 
>   After I review the code, I will add additional comments.
> 
>   I am curious what others think?  (And FWIW, I believe that the larger the
> MV, the MORE this feature is needed,
> vs refreshing the ENTIRE view).

What I do is I have my own materialized view infrastructure, written
entirely in PL/pgSQL, and I completely avoid PG's MV support.  This
alternative MV scheme creates an actual table for each MV, which means:

 - one can update the MV directly (and I do, via triggers)

 - one can have triggers on the MV (e.g., to record history)

This has been very handy for me.

I also have a state table in which to keep track of whether an MV needs
a refresh, and I have a function I can use to mark an MV as needing a
refresh.  Marking an MV as needing a refresh sends a NOTIFY, and then I
have a daemon that will refresh views as needed (with some
debouncing/coalescing of notifications).
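
Roughly (hypothetical names here; the real code is in the repo linked below), the state/notify part looks like this:
    CREATE TABLE mv_state (
        mv_name        text PRIMARY KEY,
        needs_refresh  boolean NOT NULL DEFAULT false,
        last_refreshed timestamptz
    );

    CREATE FUNCTION mark_mv_needs_refresh(p_mv text) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        UPDATE mv_state SET needs_refresh = true WHERE mv_name = p_mv;
        -- Wake the refresh daemon; it LISTENs on this channel, debounces
        -- bursts of notifications, and refreshes the flagged views.
        PERFORM pg_notify('mv_refresh', p_mv);
    END;
    $$;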

This way I can have MVs with very complex underlying queries: for some
kinds of updates I can easily write fast triggers, and for others where I
can't (or where they would slow down transactions too much) I simply mark
the MV as needing a refresh.

Typical MV queries I have that this works very well for include
transitive reachability closure computations (e.g., all the groups a
thing is a member of, directly and indirectly, or vice versa --
recursive CTEs basically).  Though I do now have triggers that can do a
reasonably good job of synchronously and quickly updating MVs with such
queries, I didn't always.
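
A deliberately simplified, hypothetical example of that kind of query:
    CREATE TABLE group_member (
        member_id int NOT NULL,
        group_id  int NOT NULL,
        PRIMARY KEY (member_id, group_id)
    );

    -- All groups a thing belongs to, directly or transitively.
    CREATE MATERIALIZED VIEW all_memberships AS
    WITH RECURSIVE closure AS (
        SELECT member_id, group_id FROM group_member
        UNION
        SELECT c.member_id, gm.group_id
        FROM closure c
        JOIN group_member gm ON gm.member_id = c.group_id
    )
    SELECT member_id, group_id FROM closure;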

Refreshes are always 'concurrent'.

This is my 80/20 solution to the "Incremental View Maintenance" (IVM)
problem.

A not very current version is here:

https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql

If you like it I might be able to get a newer version out.  The version
above has a few minor issues:

 - it uses DELETE FROM instead of TRUNCATE for its sort-of temp tables

 - using TRUNCATE ends up requiring some care to avoid occasional
   deadlocks with VACUUM that are due to using tables as types of the
   columns of the deltas tables

 - logging -- lots of logging in the newest version

Another issue is that I rely on NATURAL FULL OUTER JOIN to avoid having
to generate ON conditions, but that means that all columns of the
underlying VIEW must not have NULLs.  As I've not needed to support
nullable columns in these MVs, I don't mind.
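
To illustrate the join trick (hypothetical snapshot tables with identical, NOT NULL columns):
    -- Rows present in only one snapshot are the delta.  Because NATURAL joins
    -- on every common column, a NULL anywhere would keep otherwise-identical
    -- rows from matching -- hence the no-NULLs requirement.
    SELECT coalesce(o.id, n.id) AS id,
           (n.id IS NULL)       AS only_in_old,   -- to be deleted from the MV
           (o.id IS NULL)       AS only_in_new    -- to be inserted into the MV
    FROM old_snapshot o
    NATURAL FULL OUTER JOIN new_snapshot n
    WHERE o.id IS NULL OR n.id IS NULL;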

Nico
-- 



Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

From: Adam Brusselback
Date:
>   Our situation was a wonderful MV with all the columns we needed (some
> hard to calculate) to augment search data done millions of times/day.  It
> was a thing of beauty.  Until we realized we needed to update 1 record
> (vendor inventory UPDATE date/time) every time we processed a file
> (something we do 24x7, a hundred times each hour!
>
>   For that ONE field, we ended up doing REFRESH MV concurrently;  OVER
> 2,000 times per day.

Thanks for the feedback and the use case, Kirk.

Regarding that specific scenario where a single column ("last updated" or similar) churns significantly faster than the heavy-computation columns:
Even with this patch, you might find it beneficial to separate that high-velocity column into its own small materialized view (or regular view) and join it to the main MV at query time. That will reduce the bloat you get on the main MV by quite a lot, especially if you have very wide rows (which it seems like you do).
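
A rough sketch of what I mean (hypothetical names, heavily simplified from your description):
    -- Hypothetical base table standing in for your source data.
    CREATE TABLE vendor_inventory (
        vendor_id    int PRIMARY KEY,
        item_count   int NOT NULL,
        last_file_at timestamptz
    );

    -- Wide MV with the hard-to-calculate columns; refreshed rarely.
    CREATE MATERIALIZED VIEW vendor_search AS
    SELECT vendor_id, item_count
    FROM vendor_inventory;
    CREATE UNIQUE INDEX ON vendor_search (vendor_id);

    -- The high-velocity timestamp lives outside the MV, so updating it never
    -- rewrites (or bloats) the wide MV rows.
    CREATE VIEW vendor_last_update AS
    SELECT vendor_id, last_file_at FROM vendor_inventory;

    -- Query-time join gives the same shape as the original single wide MV.
    SELECT s.*, u.last_file_at
    FROM vendor_search s
    JOIN vendor_last_update u USING (vendor_id);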

I initially tried to implement logic that would allow for direct UPDATEs (which would enable HOT updates). However, to handle rows that matched the predicate but were no longer present in the new source data, I had to run an anti-join to identify them for deletion. That approach caused performance issues, so I settled on the "Prune + Upsert" strategy (DELETE matching rows, then INSERT from source).

Because this patch performs a delete/insert cycle, updating that one timestamp column will still result in rewriting the whole tuple in the MV.

>   For that ONE field, we ended up doing REFRESH MV concurrently;  OVER
> 2,000 times per day.

That said, 2,000 refreshes per day is nothing for this implementation, provided your updates are selective enough and your queries allow for predicate push-down to the base tables.

I look forward to your thoughts after reviewing the code.

Thanks,
Adam Brusselback