Discussion: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW
Attached is a patch implementing support for a WHERE clause in REFRESH MATERIALIZED VIEW.
The syntax allows for targeted refreshes:
REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';
I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ
This allows the user to restrict the refresh operation to a subset of the view. The qualification is applied to the view's output columns. The optimizer can then push this condition down to the underlying base tables, avoiding a full scan when only a known subset of data has changed.
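To illustrate what "applied to the view's output columns" means in practice (hypothetical view definition, not from the patch): the predicate conceptually wraps the view's defining query, and the planner pushes it down into the base-table scan.

```sql
-- Illustrative only.  Suppose mv is defined as an aggregate over a
-- hypothetical orders table; a partial refresh conceptually re-runs:
SELECT *
FROM (
    SELECT customer_id, sum(total) AS total   -- mv's defining query
    FROM orders
    GROUP BY customer_id
) mv_source
WHERE customer_id = 42;   -- user-supplied predicate, pushed into the scan
```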
Implementation notes:
1. The grammar accepts an optional WHERE clause. We forbid volatile functions in the clause to ensure correctness.
2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified, the operation performs an in-place modification using a `ROW EXCLUSIVE` lock.
* This mode requires a unique index to ensure constraint violations are handled correctly (e.g., when a row's values change such that it "drifts" into or out of the `WHERE` clause scope).
* It executes a Prune + Upsert strategy:
* `DELETE` all rows in the materialized view that match the `WHERE` clause.
* `INSERT` the new data from the source query.
* It uses `ON CONFLICT DO UPDATE` during the insert phase to handle concurrency edge cases, ensuring the refresh is robust against constraint violations.
3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting the scope of the diff (and the temporary-table population) to the rows matching the predicate. This requires an `EXCLUSIVE` lock and a unique index, consistent with existing concurrent refresh behavior. It is much slower than the non-concurrent partial refresh.
4. The execution logic uses SPI to inject the predicate into the source queries during execution.
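To make the Prune + Upsert flow in (2) concrete, the statements the patch builds internally are conceptually equivalent to the following (illustrative SQL over a hypothetical `mv` with unique key `id`; the real queries are generated via SPI):

```sql
-- Step 1 (prune): remove existing rows covered by the predicate.
DELETE FROM mv WHERE id = ANY('{1,2,3}');

-- Step 2 (upsert): re-insert fresh rows from the view's defining query,
-- restricted to the same predicate.  ON CONFLICT catches rows whose key
-- "drifted" into scope, so the unique index keeps the refresh robust.
INSERT INTO mv (id, total)
SELECT id, total
FROM ( /* view's defining query (hypothetical) */
       SELECT id, sum(amount) AS total FROM base GROUP BY id ) src
WHERE id = ANY('{1,2,3}')
ON CONFLICT (id) DO UPDATE SET total = EXCLUDED.total;
```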
I have attached a benchmark suite to validate performance and correctness:
* `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing system (`invoices` and `invoice_lines`). It includes an aggregated materialized view (`invoice_summary`) and a control table (`invoice_summary_table`).
* `workload_*.sql`: pgbench scripts simulating a high-churn environment (45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset sizes while generating significant refresh work.
* `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across multiple scale factors and concurrency levels.
The benchmark compares strategies for keeping a summary up to date (vs baseline):
* Partial Refresh: Triggers on the base table collect modified IDs and execute `REFRESH MATERIALIZED VIEW ... WHERE ...`.
* Materialized Table (Control): A standard table maintained via complex PL/pgSQL triggers (the traditional manual workaround).
* Full Refresh (Legacy): Manually refresh the view after changes.
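The Partial Refresh strategy above can be sketched as a statement-level trigger over a transition table (hypothetical names; this covers inserts/updates only, and a companion trigger using `OLD TABLE` would cover deletes):

```sql
-- Illustrative: collect modified invoice IDs and refresh only those rows.
CREATE FUNCTION refresh_invoice_summary() RETURNS trigger AS $$
DECLARE
    ids bigint[];
BEGIN
    SELECT array_agg(DISTINCT invoice_id) INTO ids FROM changed_rows;
    IF ids IS NOT NULL THEN
        EXECUTE format(
            'REFRESH MATERIALIZED VIEW invoice_summary WHERE invoice_id = ANY(%L)',
            ids);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER invoices_partial_refresh
    AFTER INSERT OR UPDATE ON invoices
    REFERENCING NEW TABLE AS changed_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION refresh_invoice_summary();
```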
Results are below:
Concurrency: 1 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 5309.05 | 0.002x 0.437x 0.470x
20000 50 | 1209.32 | 0.010x 0.600x 0.598x
20000 1000 | 56.05 | 0.164x 0.594x 0.576x
400000 1 | 5136.91 | 0x 0.450x 0.487x
400000 50 | 1709.17 | 0x 0.497x 0.482x
400000 1000 | 110.35 | 0.006x 0.507x 0.460x
Concurrency: 4 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 19197.50 | 0x 0.412x 0.435x
20000 50 | 1016.14 | 0.007x 0.966x 1.036x
20000 1000 | 9.94 | 0.708x 1.401x 1.169x
400000 1 | 19637.36 | 0x 0.436x 0.483x
400000 50 | 4669.32 | 0x 0.574x 0.566x
400000 1000 | 23.26 | 0.029x 1.147x 0.715x
Concurrency: 8 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 30358.32 | 0x 0.440x 0.457x
20000 50 | 262.75 | 0.026x 2.943x 2.740x
20000 1000 | 11.28 | 0.575x 0.840x 0.578x
400000 1 | 36007.15 | 0x 0.430x 0.464x
400000 50 | 6664.58 | 0x 0.563x 0.494x
400000 1000 | 11.61 | 0.058x 1.000x 1.277x
In these tests, the partial refresh behaves as O(delta) rather than O(total), performing comparably to the manual PL/pgSQL approach but with significantly lower code complexity for the user.
I recognize that adding a WHERE clause to REFRESH is an extension to the SQL standard. I believe the syntax is intuitive, but I am open to discussion regarding alternative implementation strategies or syntax if the community feels a different approach is warranted.
New regression tests are included in the patch.
This is my first time submitting a patch to PostgreSQL, so please bear with me if I've missed anything or made any procedural mistakes. I'm happy to address any feedback.
Thanks,
Adam Brusselback
On Mon, Dec 8, 2025 at 3:58 PM Adam Brusselback <adambrusselback@gmail.com> wrote:
> Attached is a patch implementing support for a WHERE clause in REFRESH MATERIALIZED VIEW.
> The syntax allows for targeted refreshes:
> REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
> REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
> REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';
> I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ
+1 (But I was in that hacking session).
Our situation was a wonderful MV with all the columns we needed (some hard to calculate) to augment search data queried millions of times/day. It was a thing of beauty, until we realized we needed to update one record (vendor inventory UPDATE date/time) every time we processed a file (something we do 24x7, a hundred times each hour).
For that ONE field, we ended up doing REFRESH MV CONCURRENTLY over 2,000 times per day.
Our understanding is that many people run into this exact issue. The cache needs small frequent updates.
(After reading the code that handles MVs, we just created our own TABLE, maintain it with a scheduler that rebuilds it HOURLY, and issue a simple UPDATE for the one column when we process the file.)
While this "works", the CONCEPT of this patch (untested by me, as of yet) would have fixed this with far less effort, and would be easier to maintain.
After I review the code, I will add additional comments.
I am curious what others think. (And FWIW, I believe that the larger the MV, the MORE this feature is needed, vs. refreshing the ENTIRE view.)
Regards...
On Tue, Dec 09, 2025 at 12:08:58AM -0500, Kirk Wolak wrote:
> Our understanding is that many people run into this exact issue. The
> cache needs small frequent updates.
> (After reading the code that handles MVs, we just created our own TABLE,
> and maintain it with a scheduler to rebuild HOURLY,
> and when we process the file, a Simple UPDATE is issued for the one column).
>
> While this "Works", the CONCEPT of this patch (untested by me, as of
> yet), would have fixed this with far less effort,
> and would be easier to maintain.
>
> After I review the code, I will add additional comments.
>
> I am curious what others think? (And FWIW, I believe that the larger the
> MV, the MORE this feature is needed,
> vs refreshing the ENTIRE view).

What I do is I have my own materialized view infrastructure, written entirely in PL/pgSQL, and I completely avoid PG's MV support. This alternative MV scheme creates an actual table for each MV, which means:

- one can update the MV directly (and I do, via triggers)
- one can have triggers on the MV (e.g., to record history)

This has been very handy for me.

I also have a state table in which to keep track of whether an MV needs a refresh, and a function I can use to mark an MV as needing a refresh. Marking an MV as needing a refresh sends a NOTIFY, and then a daemon refreshes views as needed (with some debouncing/coalescing of notifications). This way I can have MVs with very complex underlying queries: for some kinds of updates I can easily write fast triggers, and where I can't (or where the triggers would slow down transactions too much) I simply mark the MV as needing a refresh.

Typical MV queries this works very well for include transitive reachability closure computations (e.g., all the groups a thing is a member of, directly and indirectly, or vice versa -- recursive CTEs, basically). Though I do now have triggers that can do a reasonably good job of synchronously and quickly updating MVs with such queries, I didn't always. Refreshes are always "concurrent". This is my 80/20 solution to the "Incremental View Maintenance" (IVM) problem.

A not very current version is here: https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql

If you like it I might be able to get a newer version out. The version above has a few minor issues:

- it uses DELETE FROM instead of TRUNCATE for its sort-of temp tables
- using TRUNCATE ends up requiring some care to avoid occasional deadlocks with VACUUM that are due to using tables as the types of the columns of the deltas tables
- logging -- there is lots of logging in the newest version

Another issue is that I rely on NATURAL FULL OUTER JOIN to avoid having to generate ON conditions, which means that no column of the underlying VIEW may contain NULLs. As I've not needed to support nullable columns in these MVs, I don't mind.

Nico
--
> Our situation was a wonderful MV with all the columns we needed (some
> hard to calculate) to augment search data done millions of times/day. It
> was a thing of beauty. Until we realized we needed to update 1 record
> (vendor inventory UPDATE date/time) every time we processed a file
> (something we do 24x7, a hundred times each hour!
>
> For that ONE field, we ended up doing REFRESH MV concurrently; OVER
> 2,000 times per day.
Thanks for the feedback and the use case, Kirk.
Regarding that specific scenario where a single column ("last updated" or similar) churns significantly faster than the heavy-computation columns:
Even with this patch, you might find it beneficial to separate that high-velocity column into its own small materialized view (or regular view) and join it to the main MV at query time. That will reduce the bloat you get on the main MV by quite a lot, especially if you have very wide rows (which it seems like you do).
I initially tried to implement logic that would allow for direct UPDATEs (which would enable HOT updates). However, to handle rows that matched the predicate but were no longer present in the new source data, I had to run an anti-join to identify them for deletion. That approach caused performance issues, so I settled on the "Prune + Upsert" strategy (DELETE matching rows, then INSERT from source).
Because this patch performs a delete/insert cycle, updating that one timestamp column will still result in rewriting the whole tuple in the MV.
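As an illustration of the split I mean (hypothetical names and schema): keep the expensive aggregates in the main MV, put the volatile timestamp in a tiny side MV, and join at read time, so only the small MV needs frequent refreshes.

```sql
-- Heavy, rarely-refreshed aggregates stay in the main MV.
CREATE MATERIALIZED VIEW vendor_summary AS
    SELECT vendor_id, count(*) AS n_items   -- plus the expensive columns
    FROM inventory
    GROUP BY vendor_id;

-- High-churn column lives in its own narrow MV (cheap to refresh often).
CREATE MATERIALIZED VIEW vendor_last_update AS
    SELECT vendor_id, max(updated_at) AS last_update
    FROM inventory
    GROUP BY vendor_id;

-- Join at query time; refreshing vendor_last_update rewrites only
-- narrow tuples instead of the wide summary rows.
SELECT s.*, u.last_update
FROM vendor_summary s
JOIN vendor_last_update u USING (vendor_id);
```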
> For that ONE field, we ended up doing REFRESH MV concurrently; OVER
> 2,000 times per day.
That said, 2,000 refreshes per day is nothing for this implementation, provided your updates are selective enough and your queries allow for predicate push-down to the base tables.
I look forward to your thoughts after reviewing the code.
Thanks,
Adam Brusselback