Discussion: Import Statistics in postgres_fdw before resorting to sampling.
Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.
This involves adding two new options to postgres_fdw at the server and table level.
The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.
The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.
If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
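For illustration, here is how the proposed options would be toggled with ordinary postgres_fdw option syntax; this is a usage sketch assuming the patch is applied, and the server/table names are made up.

```sql
-- Hypothetical usage sketch (assumes the patch's fetch_stats/remote_analyze options).
ALTER SERVER foreign_pg
    OPTIONS (ADD remote_analyze 'true');   -- opt in to remote ANALYZE server-wide

ALTER FOREIGN TABLE remote_view_tbl
    OPTIONS (ADD fetch_stats 'false');     -- remote relation is a view; skip the fetch

ANALYZE remote_view_tbl;                   -- falls back to normal sampling
```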
On Tue, Aug 12, 2025 at 10:33 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.
>
> This involves adding two new options to postgres_fdw at the server and table level.
>
> The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.
>
> The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.
>
> If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
>
> Note patches 0001 and 0002 are already a part of a separate thread https://www.postgresql.org/message-id/flat/CADkLM%3DcpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR%2BgJX-Og%40mail.gmail.com regarding a bug (0001) and a nitpick (0002) that came about as a side-effect of this effort, but I expect those to be resolved one way or another soon. Any feedback on those two can be handled there.

I think this is very useful to avoid fetching rows from the foreign server and analyzing them locally.

This isn't a full review. I looked at the patches mainly to find out how they fit into the current method of analysing a foreign table.

Right now, do_analyze_rel() is called with an FDW-specific acquireFunc, which collects a sample of rows. The sample is passed to the attribute-specific compute_stats, which fills VacAttrStats for that attribute. VacAttrStats for all the attributes is passed to update_attstats(), which updates pg_statistic. The patch changes that to fetch the statistics from the foreign server and call pg_restore_attribute_stats for each attribute. Instead I was expecting that after fetching the stats from the foreign server, it would construct VacAttrStats and call update_attstats(). That might be marginally faster since it avoids the SPI call and updates stats for all the attributes at once. Did you consider this alternate approach, and why was it discarded?

If a foreign table points to an inheritance parent on the foreign server, we will receive two rows for that table - one with inherited = false and the other with true, in that order. I think the stats with inherited=true are the ones relevant to the local server, since querying the parent will fetch rows from its children. Since those stats are applied last, pg_statistic will retain the intended statistics. But why fetch two rows in the first place and waste computing cycles?

--
Best Wishes,
Ashutosh Bapat
This isn't a full review. I looked at the patches mainly to find out
how they fit into the current method of analysing a foreign table.
Any degree of review is welcome. We're chasing views, reviews, etc.
Right now, do_analyze_rel() is called with an FDW-specific acquireFunc,
which collects a sample of rows. The sample is passed to the
attribute-specific compute_stats, which fills VacAttrStats for that
attribute. VacAttrStats for all the attributes is passed to
update_attstats(), which updates pg_statistic. The patch changes that
to fetch the statistics from the foreign server and call
pg_restore_attribute_stats for each attribute.
That recap is accurate.
Instead I was expecting that after fetching the
stats from the foreign server, it would construct VacAttrStats and
call update_attstats(). That might be marginally faster since it
avoids the SPI call and updates stats for all the attributes at once.
Did you consider this alternate approach, and why was it discarded?
It might be marginally faster, but it would duplicate a lot of the pair-checking logic (a most-common-freqs must be accompanied by a most-common-vals, etc.) and type-checking logic (the values in a most-common-vals must all coerce, via the input function, to the correct datatype for the _destination_ column, etc.), and we've already got that in pg_restore_attribute_stats. There used to be a non-fcinfo function that took a long list of Datum/isnull pairs, but that wasn't pretty to look at and was replaced with the positional fcinfo version we have today. This use case might be a reason to bring that back, or to expose the existing positional fcinfo function (presently static) if we want to avoid SPI badly enough.

As it is, the SPI code is fairly future-proof, in that it doesn't have to change as new stat types are created. My first attempt at this patch tried to make a FunctionCallInvoke() on the variadic pg_restore_attribute_stats, but that would require a filled-out fn_expr, and to get that we'd have to duplicate a lot of logic from the parser, so the infrastructure isn't available to easily call a variadic function.
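For context, the SPI path boils down to one call per attribute along the lines of the following sketch (relation and column names are illustrative, and the exact pairs passed depend on which stats the remote side returned); adding a future statistic kind means passing another name/value pair rather than new C plumbing:

```sql
-- Minimal sketch of the per-attribute restore call issued via SPI
-- (illustrative relation/column names and values).
SELECT pg_catalog.pg_restore_attribute_stats(
    'relation', 'public.remote_orders'::regclass,
    'attname', 'order_id'::name,
    'inherited', false,
    'version', 180000,
    'null_frac', 0.0::real,
    'avg_width', 4,
    'n_distinct', -1.0::real);
```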
If a foreign table points to an inheritance parent on the foreign
server, we will receive two rows for that table - one with inherited =
false and the other with true, in that order. I think the stats with
inherited=true are the ones relevant to the local server, since
querying the parent will fetch rows from its children. Since those
stats are applied last, pg_statistic will retain the intended
statistics. But why fetch two rows in the first place and waste
computing cycles?
Glad you agree that we're fetching the right statistics.
That was the only way I could think of to do one client fetch and still get exactly one row back.
Anything else involved fetching the inh=true stats first and, if that failed, fetching the inh=false statistics. That adds extra round-trips, especially given that inherited statistics are rarer than non-inherited statistics. Moreover, we're making decisions (analyze yes/no, fall back to sampling yes/no) based on whether or not we got statistics back from the foreign server at all, and having to consider the result of two fetches instead of one makes that logic more complicated.
If, however, you were referring to the work we're handing the remote server, I'm open to queries that you think would be more lightweight. However, the pg_stats view is a security barrier view, so we reduce overhead by passing through that barrier as few times as possible.
On Fri, Sep 12, 2025 at 1:17 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Rebased.
Rebased again.
On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
> Rebased again.

Hearing an opinion from Fujita-san would be interesting here, so I am adding him in CC. I have been looking a little bit at this patch.

+    ImportStatistics_function ImportStatistics;

All FDW callbacks are documented in fdwhandler.sgml. This new one is not.

I am a bit uncomfortable regarding the design you are using here, where the ImportStatistics callback, if defined, takes priority over the existing AnalyzeForeignTable, especially regarding the fact that both callbacks attempt to retrieve the same data, except that the existing callback has a different idea of the timing to use to retrieve reltuples and relpages. The original callback AnalyzeForeignTable retrieves immediately the total number of pages via SQL, to feed ANALYZE. reltuples is then fed to ANALYZE from a callback function that's defined in AnalyzeForeignTable().

My point is: rather than trying to implement a second solution with a new callback, shouldn't we try to rework the existing callback so that it would fit better with what you are trying to do here: feed data that ANALYZE would then be in charge of inserting? Relying on pg_restore_relation_stats() for the job feels incorrect to me, knowing that ANALYZE is the code path in charge of updating the stats of a relation. The new callback is a shortcut that bypasses what ANALYZE does, so the problem, at least it seems to me, is that we want to retrieve all the data in a single step, like your new callback, not in two steps, something that only the existing callback allows. Hence, wouldn't it be more natural to retrieve the total number of pages and reltuples from one callback, meaning that for local relations we should delay RelationGetNumberOfBlocks() inside the existing "acquire_sample_rows" callback (renaming it would make sense)? This requires some redesign of AnalyzeForeignTable and the internals of analyze.c, but it would let FDW extensions know about the potential efficiency gain.

There is also a performance concern to be made with the patch, but as it's an ANALYZE path that may be acceptable: if we fail the first callback, then we may call the second callback.

Fujita-san, what do you think?
--
Michael
On Mon, Oct 20, 2025 at 03:45:14PM +0900, Michael Paquier wrote:
> On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
> > Rebased again.
>
> Hearing an opinion from Fujita-san would be interesting here, so I am
> adding him in CC. I have been looking a little bit at this patch.

By the way, as far as I can see this patch is still in the past commit fest:
https://commitfest.postgresql.org/patch/5959/

You may want to move it if you are planning to continue working on that.
--
Michael
I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages. The original callback
AnalyzeForeignTable retrieves immediately the total number of pages
via SQL, to feed ANALYZE. reltuples is then fed to ANALYZE from a
callback function that's defined in AnalyzeForeignTable().
They don't try to retrieve the same data. AnalyzeForeignTable tries to retrieve a table sample which it feeds into the normal ANALYZE process. If that sample is going to be any good, it has to be a lot of rows, and that's a lot of network traffic.
ImportStatistics just grabs the results that ANALYZE computed for the remote table, using a far better sample than we'd want to pull across the wire.
My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so that
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?
To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.
Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation.
That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.
As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had in-place updates because that's what ANALYZE did.
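For reference, the relation-level counterpart is a single call of the same shape (a sketch with made-up values); it fills the same pg_class counters (relpages, reltuples) that ANALYZE otherwise maintains:

```sql
-- Sketch of the relation-level restore call (illustrative values).
SELECT pg_catalog.pg_restore_relation_stats(
    'relation', 'public.remote_orders'::regclass,
    'version', 180000,
    'relpages', 1250,
    'reltuples', 100000::real);
```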
The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows. Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)? This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.
I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.
One problem I do see, however, is that the queries for fetching relation (pg_class) stats should never fail and always return one row, even if the values returned are meaningless. It's the query against pg_stats that lets us know if 1) the database on the other side is a real-enough postgres and 2) the remote table is itself analyzed. Only once we're happy that we have good attribute stats should we bother with the relation stats. All of this logic is specific to postgres, so it was confined to postgres_fdw. I don't know that other FDWs would be that much different, but minimizing the generic impact was a goal.
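Put differently, the attribute-stats query doubles as the "has this table actually been analyzed on a real PostgreSQL?" probe; conceptually it is equivalent to something like the following (not the patch's literal query):

```sql
-- Conceptual probe only (not the patch's literal query): zero rows in
-- pg_stats for the remote table means there is nothing worth importing.
SELECT EXISTS (
    SELECT 1
    FROM pg_catalog.pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'remote_orders');
```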
I'll look at this again, but I'm not sure I'm going to come up with much different.
There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.
I think the big win is the network traffic savings.
Fujita-san, what do you think?
Very interested to know as well.
Hi Corey,
This is an important feature. Thanks for working on it.
On Mon, Nov 3, 2025 at 2:26 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>> My point is: rather than trying to implement a second solution with a
>> new callback, shouldn't we try to rework the existing callback so as
>> it would fit better with what you are trying to do here: feed data
>> that ANALYZE would then be in charge of inserting?
> To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.
>> Relying on
>> pg_restore_relation_stats() for the job feels incorrect to me knowing
>> that ANALYZE is the code path in charge of updating the stats of a
>> relation.
> That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.
>
> As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had in-place updates because that's what ANALYZE did.
To me it seems like a good idea to rely on pg_restore_relation_stats
and pg_restore_attribute_stats, rather than doing some rework on
analyze.c; IMO I don't think it's a good idea to do such a thing for
something rather special like this.
>> The new callback is a shortcut that bypasses what ANALYZE
>> does, so the problem, at least it seems to me, is that we want to
>> retrieve all the data in a single step, like your new callback, not in
>> two steps, something that only the existing callback allows. Hence,
>> wouldn't it be more natural to retrieve the total number of pages and
>> reltuples from one callback, meaning that for local relations we
>> should delay RelationGetNumberOfBlocks() inside the existing
>> "acquire_sample_rows" callback (renaming it would make sense)? This
>> requires some redesign of AnalyzeForeignTable and the internals of
>> analyze.c, but it would let FDW extensions know about the potential
>> efficiency gain.
> I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.
+1 for the minimal disruption.
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
On the other hand:
This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.
I'm not sure the wasted effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.
That's it for now.
My apologies for the delayed response.
Best regards,
Etsuro Fujita
Other initial comments:
The commit message says:
This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.
If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.
If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.
I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.
Obviously there is no way to know the quality/freshness of remote stats if they are found.
The analyze option was born of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but its relative simplicity did make me want to put it out there to see who else liked it.
On the other hand:
This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.
I'm not sure the wasted effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.
The stats fetch query is pretty light, but I can see fetching the relkind along with the relstats, and making decisions on whether to continue from there, only applying the relstats after attrstats have been successfully applied.
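A sketch of what that combined fetch could look like (relation names illustrative); on the local side, anything other than relkind 'r' (table), 'p' (partitioned table) or 'm' (materialized view) would stop the import:

```sql
-- Sketch: fetch relkind alongside the relation stats, then decide locally
-- whether to proceed (only 'r', 'p' and 'm' can have stored statistics).
SELECT c.relkind, c.relpages, c.reltuples
FROM pg_catalog.pg_class AS c
JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relname = 'remote_orders';
```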
That's it for now.
I'll see what I can do to make that work.
My apologies for the delayed response.
Valuable responses are worth waiting for.
I've reorganized some things a bit, mostly to make resource cleanup simpler.
> On Nov 23, 2025, at 15:27, Corey Huinker <corey.huinker@gmail.com> wrote:
>
> My apologies for the delayed response.
>
> Valuable responses are worth waiting for.
>
> I've reorganized some things a bit, mostly to make resource cleanup simpler.
> <v4-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>
Few comments:
1 - commit message
```
effort and the user is better of setting fetch_stats to false for that
```
I guess “of” should be “off”
2 - postgres-fdw.sgml
```
+ server, determines wheter an <command>ANALYZE</command> on a foreign
```
Typo: wheter => whether
3 - postgres-fdw.sgml
```
+ data sampling if no statistics were availble. This option is only
```
Typo: availble => available
4 - option.c
```
+ /* fetch_size is available on both server and table */
+ {"fetch_stats", ForeignServerRelationId, false},
+ {"fetch_stats", ForeignTableRelationId, false},
```
In the comment, I guess fetch_size should be fetch_stats.
5 - analyze.c
```
+ * XXX: Should this be it's own fetch type? If not, then there might be
```
Typo: “it’s own” => “its own”
6 - analyze.c
```
+ case FDW_IMPORT_STATS_NOTFOUND:
+ ereport(INFO,
+ errmsg("Found no remote statistics for \"%s\"",
+ RelationGetRelationName(onerel)));
```
`Found no remote statistics for \"%s\"` could be rephrased as `No remote statistics found for foreign table \"%s\"`, which sounds like better wording in the server log.
Also, I wonder if this message at INFO level is too noisy?
7 - postgres_fdw.c
```
+ default:
+ ereport(INFO,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("Remote table %s.%s does not support statistics.",
+ quote_identifier(remote_schemaname),
+ quote_identifier(remote_relname)),
+ errdetail("Remote relation if of relkind \"%c\"", relkind));
```
I think “if of” should be “is of”.
8 - postgres_fdw.c
```
+ errmsg("Attribute statistics found for %s.%s but no columns matched",
+ quote_identifier(schemaname),
+ quote_identifier(relname))));
```
Instead of using "%s.%s" and calling quote_identifier() twice, there is a simpler function to use:
quote_qualified_identifier(schemaname, relname).
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>> Other initial comments:
>>
>> The commit message says:
>>
>> This is managed via two new options, fetch_stats and remote_analyze,
>> both are available at the server level and table level. If fetch_stats
>> is true, then the ANALYZE command will first attempt to fetch statistics
>> from the remote table and import those statistics locally.
>>
>> If remote_analyze is true, and if the first attempt to fetch remote
>> statistics found no attribute statistics, then an attempt will be made
>> to ANALYZE the remote table before a second and final attempt to fetch
>> remote statistics.
>>
>> If no statistics are found, then ANALYZE will fall back to the normal
>> behavior of sampling and local analysis.
>>
>> I think the first step assumes that the remote stats are up-to-date;
>> if they aren't, it would cause a regression. (If the remote relation
>> is a plain table, they are likely to be up-to-date, but for example,
>> if it is a foreign table, it's possible that they are stale.) So how
>> about making it the user's responsibility to make them up-to-date? If
>> doing so, we wouldn't need to do the second and third steps anymore,
>> making the patch simple.
>
> Obviously there is no way to know the quality/freshness of remote stats if they are found.
>
> The analyze option was born of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but its relative simplicity did make me want to put it out there to see who else liked it.

Actually, I have some concerns about the ANALYZE and fall-back options. As for the former, if the remote user didn't have the MAINTAIN privilege on the remote table, remote ANALYZE would be just wasted effort. As for the latter, imagine the situation where a user ANALYZEs a foreign table whose remote table is significantly large. When the previous attempts fail, the user might want to re-try importing remote stats after ANALYZEing the remote table on the remote side in some way, rather than postgres_fdw automatically falling back to the normal lengthy processing. I think just throwing an error if the first attempt fails would make the system not only simple but reliable, while providing some flexibility to users.

>> On the other hand:
>>
>> This operation will only work on remote relations that can have stored
>> statistics: tables, partitioned tables, and materialized views. If the
>> remote relation is a view then remote fetching/analyzing is just wasted
>> effort and the user is better of setting fetch_stats to false for that
>> table.
>>
>> I'm not sure the wasted effort is acceptable; IMO, if the remote table
>> is a view, I think that the system should detect that in some way, and
>> then just do the normal ANALYZE processing.
>
> The stats fetch query is pretty light, but I can see fetching the relkind along with the relstats, and making decisions on whether to continue from there, only applying the relstats after attrstats have been successfully applied.

Good idea! I would vote for throwing an error if the relkind is view, making the user set fetch_stats to false for the foreign table.

Best regards,
Etsuro Fujita
On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> I've reorganized some things a bit, mostly to make resource cleanup simpler.

Thanks for updating the patch! I will look into it.

Best regards,
Etsuro Fujita
On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> I've reorganized some things a bit, mostly to make resource cleanup simpler.
Thanks for updating the patch! I will look into it.
Best regards,
Etsuro Fujita
Rebase, no changes.
> On Dec 12, 2025, at 05:59, Corey Huinker <corey.huinker@gmail.com> wrote:
>
> On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> > > I've reorganized some things a bit, mostly to make resource cleanup simpler.
> >
> > Thanks for updating the patch! I will look into it.
>
> Rebase, no changes.
> <v5-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>

A kind reminder: I don't see that my comments have been addressed:
https://postgr.es/m/F9C73EF2-F977-46E4-9F61-B6CF72BF1A69@gmail.com

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Nov 27, 2025 at 9:46 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> >> I think the first step assumes that the remote stats are up-to-date;
> >> if they aren't, it would cause a regression. (If the remote relation
> >> is a plain table, they are likely to be up-to-date, but for example,
> >> if it is a foreign table, it's possible that they are stale.) So how
> >> about making it the user's responsibility to make them up-to-date? If
> >> doing so, we wouldn't need to do the second and third steps anymore,
> >> making the patch simple.
> >
> > Obviously there is no way to know the quality/freshness of remote stats if they are found.
> >
> > The analyze option was born of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but its relative simplicity did make me want to put it out there to see who else liked it.
>
> Actually, I have some concerns about the ANALYZE and fall-back options. As for the former, if the remote user didn't have the MAINTAIN privilege on the remote table, remote ANALYZE would be just wasted effort. As for the latter, imagine the situation where a user ANALYZEs a foreign table whose remote table is significantly large. When the previous attempts fail, the user might want to re-try importing remote stats after ANALYZEing the remote table on the remote side in some way, rather than postgres_fdw automatically falling back to the normal lengthy processing. I think just throwing an error if the first attempt fails would make the system not only simple but reliable, while providing some flexibility to users.

I think I was narrow-minded here; as for the ANALYZE option, if the remote user has the privilege, it would work well, so +1 for it, but I don't think it's a must-have option, so I'd vote for making it a separate patch. As for the fall-back behavior, however, sorry, I still think it reduces flexibility.

Best regards,
Etsuro Fujita
On Fri, Dec 12, 2025 at 6:59 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> Rebase, no changes.
Thanks for rebasing! I reviewed the changes made to the core:
@@ -196,13 +196,56 @@ analyze_rel(Oid relid, RangeVar *relation,
{
/*
* For a foreign table, call the FDW's hook function to see whether it
- * supports analysis.
+ * supports statistics import and/or analysis.
*/
FdwRoutine *fdwroutine;
bool ok = false;
fdwroutine = GetFdwRoutineForRelation(onerel, false);
+ if (fdwroutine->ImportStatistics != NULL)
+ {
+ FdwImportStatsResult res;
+
+ /*
+ * Fetching pre-existing remote stats is not guaranteed to be a quick
+ * operation.
+ *
+ * XXX: Should this be it's own fetch type? If not, then there might be
+ * confusion when a long stats-fetch fails, followed by a regular analyze,
+ * which would make it look like the table was analyzed twice.
+ */
+ pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
+ RelationGetRelid(onerel));
+
+ res = fdwroutine->ImportStatistics(onerel);
+
+ pgstat_progress_end_command();
+
+ /*
+ * If we were able to import statistics, then there is no need to collect
+ * samples for local analysis.
+ */
+ switch(res)
+ {
+ case FDW_IMPORT_STATS_OK:
+ relation_close(onerel, ShareUpdateExclusiveLock);
+ return;
+ case FDW_IMPORT_STATS_DISABLED:
+ break;
+ case FDW_IMPORT_STATS_NOTFOUND:
+ ereport(INFO,
+ errmsg("Found no remote statistics for \"%s\"",
+ RelationGetRelationName(onerel)));
+ break;
+ case FDW_IMPORT_STATS_FAILED:
+ default:
+ ereport(INFO,
+ errmsg("Fetching remote statistics from
\"%s\" failed",
+ RelationGetRelationName(onerel)));
+ }
+ }
Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.
IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.
bool
StatisticsAreImportable(Relation relation)
Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.
void
ImportStatistics(Relation relation, List *va_cols, int elevel)
Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.
I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VacuumParams should be added as well.
That's it for now. I'll continue to review the patch.
Best regards,
Etsuro Fujita
A kind reminder, I don’t see my comments are addressed:
My apologies. Will get into the next rev.
CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this feature.
Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.
Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.
The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we get for it, we're storing them as inh = false.
On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we will still store them locally as inh = false.
The DISTINCT ON (a.attname) ... ORDER BY a.attname, s.inherited DESC part of the query ensures that we get the inh=true stats, if they're there, in preference to the inh=false stats.
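A stripped-down illustration of that dedup rule (not the patch's exact query):

```sql
-- DISTINCT ON keeps the first row per attname, and "inherited DESC" sorts
-- true before false, so inherited stats win whenever both variants exist.
SELECT DISTINCT ON (s.attname) s.attname, s.inherited, s.n_distinct
FROM pg_catalog.pg_stats AS s
WHERE s.schemaname = 'public' AND s.tablename = 'parent_tbl'
ORDER BY s.attname, s.inherited DESC;
```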
I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some weighted mix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few of those export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway.
None of this takes away from your suggestions down below.
IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.
bool
StatisticsAreImportable(Relation relation)
Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.
+1
void
ImportStatistics(Relation relation, List *va_cols, int elevel)
Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.
I can still see a situation where a local table expects the remote table to eventually have proper statistics on it, but until that happens it will fall back to table samples. This design decision means that either the user lives without any statistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I understand the desire to first implement something very simple, I think that adding the durability that fallback allows for will be harder to implement if we don't build it in from the start. I'm interested to hear what Nathan and/or Jonathan have to say about that.
I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VacuumParams should be added as well.
Good catch, I forgot about that one.
Going to think some more on this before I work incorporating your
On Fri, Dec 12, 2025 at 2:45 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this feature.
>
>> Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
>> foreign table is an inheritance parent, we would fail to do inherited
>> stats.
>
> Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.
>
> The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we get for it, we're storing them as inh = false.
>
> On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we will still store them locally as inh = false.
>
> The DISTINCT ON (a.attname) ... ORDER BY a.attname, s.inherited DESC part of the query ensures that we get the inh=true stats, if they're there, in preference to the inh=false stats.
>
> I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some weighted mix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few of those export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway.
>
> None of this takes away from your suggestions down below.
>
>> IIUC, the FDW needs to do two things within the ImportStatistics
>> callback function: check the importability, and if ok, do the work. I
>> think that would make the API complicated. To avoid that, how about
>> 1) splitting the callback function into the two functions shown below
>> and then 2) rewriting the above to something like the attached? The
>> attached addresses the returning issue mentioned above as well.
>>
>> bool
>> StatisticsAreImportable(Relation relation)
>>
>> Checks the importability, and if ok, returns true, in which case the
>> following callback function is called. In the postgres_fdw case, we
>> could implement this to check if the fetch_stats flag for the foreign
>> table is set to true, and if so, return true.
>
> +1
>
>> void
>> ImportStatistics(Relation relation, List *va_cols, int elevel)
>>
>> Imports the stats for the foreign table from the foreign server. As
>> mentioned in previous emails, I don't think it's a good idea to fall
>> back to the normal processing when the attempt to import the stats
>> fails, in which case I think we should just throw an error (or
>> warning) so that the user can re-try to import the stats after fixing
>> the foreign side in some way. So I re-defined this as a void
>> function. Note that this re-definition removes the concern mentioned
>> in the comment starting with "XXX:". In the postgres_fdw case, as
>> mentioned in a previous email, I think it would be good to implement
>> this so that it checks whether the remote table is a view or not when
>> importing the relation stats from the remote server, and if so, just
>> throws an error (or warning), making the user reset the fetch_stats
>> flag.
>
> I think I'm ok with this design as the decision, as it still leaves open the fdw-specific options of how to handle initially finding no remote stats.
>
> I can still see a situation where a local table expects the remote table to eventually have proper statistics on it, but until that happens it will fall back to table samples. This design decision means that either the user lives without any statistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I understand the desire to first implement something very simple, I think that adding the durability that fallback allows for will be harder to implement if we don't build it in from the start. I'm interested to hear what Nathan and/or Jonathan have to say about that.
>
>> I added two arguments to the callback function: va_cols, for
>> supporting the column list in the ANALYZE command, and elevel, for
>> proper logging. I'm not sure if VacuumParams should be added as well.
>
> Good catch, I forgot about that one.
>
> Going to think some more on this before I work incorporating your
Heh, the word "changes" got cut off.
Anyway, here's v6 incorporating both threads of feedback.
On Sun, Dec 14, 2025 at 4:12 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> On Fri, Dec 12, 2025 at 2:45 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>> CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this feature.
>>> Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
>>> foreign table is an inheritance parent, we would fail to do inherited
>>> stats.
>> Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.
>>
>> The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we get for it, we're storing them as inh = false.
>>
>> On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we will still store them locally as inh = false.
>>
>> The DISTINCT ON (a.attname) ... ORDER BY a.attname, s.inherited DESC part of the query ensures that we get the inh=true stats, if they're there, in preference to the inh=false stats.
>>
>> I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some weighted mix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few of those export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway.

Ah, I mean the case where the foreign table is an inheritance parent on the *local* side. In that case, the return would cause us to skip the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading to no inherited stats. I agree that the case is minor, but I don't think that that's acceptable.

>>> void
>>> ImportStatistics(Relation relation, List *va_cols, int elevel)
>>>
>>> Imports the stats for the foreign table from the foreign server. As
>>> mentioned in previous emails, I don't think it's a good idea to fall
>>> back to the normal processing when the attempt to import the stats
>>> fails, in which case I think we should just throw an error (or
>>> warning) so that the user can re-try to import the stats after fixing
>>> the foreign side in some way. So I re-defined this as a void
>>> function. Note that this re-definition removes the concern mentioned
>>> in the comment starting with "XXX:". In the postgres_fdw case, as
>>> mentioned in a previous email, I think it would be good to implement
>>> this so that it checks whether the remote table is a view or not when
>>> importing the relation stats from the remote server, and if so, just
>>> throws an error (or warning), making the user reset the fetch_stats
>>> flag.
>> I think I'm ok with this design as the decision, as it still leaves open the fdw-specific options of how to handle initially finding no remote stats.
>>
>> I can still see a situation where a local table expects the remote table to eventually have proper statistics on it, but until that happens it will fall back to table samples. This design decision means that either the user lives without any statistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I understand the desire to first implement something very simple, I think that adding the durability that fallback allows for will be harder to implement if we don't build it in from the start. I'm interested to hear what Nathan and/or Jonathan have to say about that.

My concern about the fall-back behavior is that it reduces flexibility in some cases, as mentioned upthread. Maybe that could be addressed by making the behavior an option, but another (bigger) concern of mine is that, considering that it's the user's responsibility to make remote stats up-to-date when he/she uses this feature, the "no remote stats found" result would be his/her fault; it might not be worth complicating the code for something like that. Anyway, I too would like to hear the opinions of them (or anyone else).

> Anyway, here's v6 incorporating both threads of feedback.

Thanks for updating the patch! I will review the postgres_fdw changes next.

Best regards,
Etsuro Fujita
Ah, I mean the case where the foreign table is an inheritance parent
on the *local* side. In that case, the return would cause us to skip
the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading
to no inherited stats. I agree that the case is minor, but I don't
think that that's acceptable.
When such a corner case occurs (stats import configured to true, but table is an inheritance parent), should we raise an error, or raise a warning and return false on the CanImportStats() call? I guess the answer may depend on the feedback we get.
On Mon, Dec 15, 2025 at 5:01 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>> Ah, I mean the case where the foreign table is an inheritance parent
>> on the *local* side. In that case, the return would cause us to skip
>> the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading
>> to no inherited stats. I agree that the case is minor, but I don't
>> think that that's acceptable.
> When such a corner case occurs (stats import configured to true, but table is an inheritance parent), should we raise an error, or raise a warning and return false on the CanImportStats() call? I guess the answer may depend on the feedback we get.

As mentioned upthread, the FDW API that I proposed addresses this issue; even in such a case it allows the FDW to import stats, instead of doing the normal non-recursive ANALYZE, and then do the recursive ANALYZE, for the inherited stats.

Best regards,
Etsuro Fujita
On Mon, Dec 15, 2025 at 3:40 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sun, Dec 14, 2025 at 4:12 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> > Anyway, here's v6 incorporating both threads of feedback.
>
> I will review the postgres_fdw changes next.

I spent some time reviewing the changes.

+      <term><literal>fetch_stats</literal> (<type>boolean</type>)</term>
+      <listitem>
+       <para>
+        This option, which can be specified for a foreign table or a foreign
+        server, determines if <command>ANALYZE</command> on a foreign table
+        will first attempt to fetch and import the existing relation and
+        attribute statistics from the remote table, and only attempt regular
+        data sampling if no statistics were available. This option is only
+        useful if the remote relation is one that can have regular statistics
+        (tables and materialized views).
+        The default is <literal>true</literal>.

This version of the patch wouldn't fall back to the normal ANALYZE processing anymore, so this documentation should be updated as such. Also, as I think it's the user's responsibility to ensure the existing statistics are up-to-date, as I said before, I think we should add a note about that here. Also, as some users wouldn't be able to ensure it, I'm wondering if the default should be false.

+      <term><literal>remote_analyze</literal> (<type>boolean</type>)</term>
+      <listitem>
+       <para>
+        This option, which can be specified for a foreign table or a foreign
+        server, determines whether an <command>ANALYZE</command> on a foreign
+        table will attempt to <command>ANALYZE</command> the remote table if
+        the first attempt to fetch remote statistics fails, and will then
+        make a second and final attempt to fetch remote statistics. This option
+        is only meaningful if the foreign table has
+        <literal>fetch_stats</literal> enabled at either the server or table
+        level.

If the user wasn't able to ensure the statistics are up-to-date, I think he/she might want to do remote ANALYZE *before* fetching the statistics, not after trying to do so. So I think we could instead provide this option as such. What do you think about that? Anyway, I'd vote for leaving this for another patch, as I think it's a nice-to-have option rather than a must-have one, as I said before.

ISTM that the code is well organized overall. Here are a few comments:

+static const char *relstats_query_18 =
+   "SELECT c.relkind, c.relpages, c.reltuples, c.relallvisible, c.relallfrozen "
+   "FROM pg_catalog.pg_class AS c "
+   "JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace "
+   "WHERE n.nspname = $1 AND c.relname = $2";

We don't use relallvisible and relallfrozen for foreign tables (note that do_analyze_rel() calls vac_update_relstats() with relallvisible=0 and relallfrozen=0 for them). Do we really need to retrieve (and restore) them?

+static const char *attstats_query_17 =
+   "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
+   "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
+   "s.histogram_bounds, s.correlation, s.most_common_elems, "
+   "s.most_common_elem_freqs, s.elem_count_histogram, "
+   "s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram "
+   "FROM pg_catalog.pg_stats AS s "
+   "WHERE s.schemaname = $1 AND s.tablename = $2 "
+   "ORDER BY s.attname, s.inherited DESC";

I think we should retrieve the attribute statistics for only the referenced columns of the remote table, not all the columns of it, to reduce the data transfer and the cost of matching local/remote attributes in import_fetched_statistics().

In fetch_remote_statistics():

+   if (server_version_num >= 180000)
+       relation_sql = relstats_query_18;
+   else if (server_version_num >= 140000)
+       relation_sql = relstats_query_14;
+   else
+       relation_sql = relstats_query_default;

I think that having a definition for each of relstats_query_18, relstats_query_14 and relstats_query_default makes the code redundant and the maintenance hard. To avoid that, how about building the relation_sql query dynamically, as is done for the query to fetch all table data from the remote server in postgresImportForeignSchema()? Same for the attribute_sql query.

That's all I have for now. I will continue to review the changes.

Best regards,
Etsuro Fujita
This version of the patch wouldn't fall back to the normal ANALYZE
processing anymore, so this documentation should be updated as such.
Also, as I think it's the user's responsibility to ensure the existing
statistics are up-to-date, as I said before, I think we should add a
note about that here. Also, as some users wouldn't be able to ensure
it, I'm wondering if the default should be false.
I agree, if there is no fallback, then the default should be false. When I was initially brainstorming this patch, Nathan Bossart had suggested making it the default because 1) that would be an automatic benefit to users and 2) the cost of attempting to import stats was small in comparison to a table sample, so it was worth the attempt. I still want users to get that automatic benefit, but if there is no fallback to sampling then the default only makes sense as false.
If the user wasn't able to ensure the statistics are up-to-date, I
think he/she might want to do remote ANALYZE *before* fetching the
statistics, not after trying to do so. So I think we could instead
provide this option as such. What do you think about that? Anyway,
I'd vote for leaving this for another patch, as I think it's a
nice-to-have option rather than a must-have one, as I said before.
I hadn't thought of that one. I think it has some merit, but I think we'd want the try-after case as well. So the settings would be: "off" (never remote analyze, default), "on" (always analyze before fetching), and "retry" (analyze if no stats were found). This feature feeds into the same thinking that the default setting did, which was to make this feature just do what is usually the smart thing, and do it automatically. Building it in pieces might be easier to get committed, but it takes away the dream of seamless automatic improvement, and establishes defaults that can't be changed in future versions. That dream was probably unrealistic, but I wanted to try.
ISTM that the code is well organized overall. Here are a few comments:
Thanks!
We don't use relallvisible and relallfrozen for foreign tables (note
that do_analyze_rel() calls vac_update_relstats() with relallvisible=0
and relallfrozen=0 for them). Do we really need to retrieve (and
restore) them?
No, and as you stated, we wouldn't want to. The query was lifted verbatim from pg_dump, with a vague hope of moving the queries to a common library that both pg_dump and postgres_fdw could draw upon. But that no longer makes sense, so I'll fix.
+static const char *attstats_query_17 =
+ "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
+ "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
+ "s.histogram_bounds, s.correlation, s.most_common_elems, "
+ "s.most_common_elem_freqs, s.elem_count_histogram, "
+ "s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram "
+ "FROM pg_catalog.pg_stats AS s "
+ "WHERE s.schemaname = $1 AND s.tablename = $2 "
+ "ORDER BY s.attname, s.inherited DESC";
I think we should retrieve the attribute statistics for only the
referenced columns of the remote table, not all the columns of it, to
reduce the data transfer and the cost of matching local/remote
attributes in import_fetched_statistics().
I thought about this, and decided that we wanted to 1) avoid per-column round trips and 2) keep the remote queries simple. It's not such a big deal to add "AND s.attname = ANY($3)" and construct a '{att1,att2,"ATt3"}' string, as we already do in pg_dump in a few places.
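Roughly, the narrowed query would just gain one predicate, with $3 being an array built on the local side (sketch only):

```sql
-- $3 would be an array literal such as '{order_id,created_at,"MixedCase"}'
-- so only the referenced columns are fetched.
SELECT DISTINCT ON (s.attname) s.attname, s.null_frac, s.avg_width, s.n_distinct
FROM pg_catalog.pg_stats AS s
WHERE s.schemaname = $1
  AND s.tablename = $2
  AND s.attname = ANY ($3)
ORDER BY s.attname, s.inherited DESC;
```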
In fetch_remote_statistics()
+ if (server_version_num >= 180000)
+ relation_sql = relstats_query_18;
+ else if (server_version_num >= 140000)
+ relation_sql = relstats_query_14;
+ else
+ relation_sql = relstats_query_default;
I think that having the definition for each of relstats_query_18,
relstats_query_14 and relstats_query_default makes the code redundant
and the maintenance hard. To avoid that, how about building the
relation_sql query dynamically as done for the query to fetch all
table data from the remote server in postgresImportForeignSchema().
Same for the attribute_sql query.
It may be a moot point. If we're not fetching relallfrozen, then the 14 & 18 cases are now the same, and since the pre-14 case concerns differentiating between analyzed and unanalyzed tables, we would just map that to 0 IF we kept those stats, but we almost never would because an unanalyzed remote table would not have the attribute stats necessary to qualify as a good remote fetch. So we're down to just one static query.
That's all I have for now. I will continue to review the changes.
Much appreciated.
On Sun, Jan 4, 2026 at 11:56 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>> This version of the patch wouldn't fall back to the normal ANALYZE
>> processing anymore, so this documentation should be updated as such.
>> Also, as I think it's the user's responsibility to ensure the existing
>> statistics are up-to-date, as I said before, I think we should add a
>> note about that here. Also, as some users wouldn't be able to ensure
>> it, I'm wondering if the default should be false.
> I agree, if there is no fallback, then the default should be false. When I was initially brainstorming this patch,
> Nathan Bossart had suggested making it the default because 1) that would be an automatic benefit to users and 2) the
> cost for attempting to import stats was small in comparison to a table sample, so it was worth the attempt. I still
> want users to get that automatic benefit, but if there is no fallback to sampling then the default only makes sense as
> false.
I think that the FDW API that I proposed could actually allow us to
fall back to sampling, by modifying StatisticsAreImportable so that it
also checks if 1) there are statistics on the remote server and 2) the
data is fresh enough, and if so, returns true; otherwise, returns
false; in the latter case we could fall back to sampling. And if we
modified it as such, I think we could change the default to true.
(Checking #2 is necessary to avoid importing stale data, which would
degrade plan quality.)
>> If the user wasn't able to ensure the statistics are up-to-date, I
>> think he/she might want to do remote ANALYZE *before* fetching the
>> statistics, not after trying to do so. So I think we could instead
>> provide this option as such. What do you think about that? Anyway,
>> I'd vote for leaving this for another patch, as I think it's a
>> nice-to-have option rather than a must-have one, as I said before.
> I hadn't thought of that one. I think it has some merit, but I think we'd want the try-after case as well. So the
> settings would be: "off" (never remote analyze, default), "on" (always analyze before fetching), and "retry" (analyze if
> no stats were found). This feature feeds into the same thinking that the default setting did, which was to make this
> feature just do what is usually the smart thing, and do it automatically. Building it in pieces might be easier to get
> committed, but it takes away the dream of seamless automatic improvement, and establishes defaults that can't be changed
> in future versions. That dream was probably unrealistic, but I wanted to try.
Remote ANALYZE would be an interesting idea, but to get the automatic
improvement, I think we should first work on the issue I mentioned
above. So I still think we should leave this for future work.
From a different perspective, even without the automatic improvement
including remote ANALYZE, I think this feature is useful for many
users.
>> +static const char *attstats_query_17 =
>> + "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
>> + "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
>> + "s.histogram_bounds, s.correlation, s.most_common_elems, "
>> + "s.most_common_elem_freqs, s.elem_count_histogram, "
>> + "s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram "
>> + "FROM pg_catalog.pg_stats AS s "
>> + "WHERE s.schemaname = $1 AND s.tablename = $2 "
>> + "ORDER BY s.attname, s.inherited DESC";
>>
>> I think we should retrieve the attribute statistics for only the
>> referenced columns of the remote table, not all the columns of it, to
>> reduce the data transfer and the cost of matching local/remote
>> attributes in import_fetched_statistics().
> I thought about this, and decided that we wanted to 1) avoid per-column round trips and 2) keep the remote queries
> simple. It's not such a big deal to add "AND s.attname = ANY($3)" and construct a '{att1,att2,"ATt3"}' string, as we
> already do in pg_dump in a few places.
Yeah, I was also thinking of modifying the query as you proposed.
>> In fetch_remote_statistics()
>>
>> + if (server_version_num >= 180000)
>> + relation_sql = relstats_query_18;
>> + else if (server_version_num >= 140000)
>> + relation_sql = relstats_query_14;
>> + else
>> + relation_sql = relstats_query_default;
>>
>> I think that having the definition for each of relstats_query_18,
>> relstats_query_14 and relstats_query_default makes the code redundant
>> and the maintenance hard. To avoid that, how about building the
>> relation_sql query dynamically as done for the query to fetch all
>> table data from the remote server in postgresImportForeignSchema().
>> Same for the attribute_sql query.
> It may be a moot point. If we're not fetching relallfrozen, then the 14 & 18 cases are now the same, and since the
> pre-14 case concerns differentiating between analyzed and unanalyzed tables, we would just map that to 0 IF we kept
> those stats, but we almost never would because an unanalyzed remote table would not have the attribute stats necessary
> to qualify as a good remote fetch. So we're down to just one static query.
You are right. As the relation_sql query is only used in
fetch_remote_statistics(), shouldn't the query be defined within that
function?
Best regards,
Etsuro Fujita
I think that the FDW API that I proposed could actually allow us to
fall back to sampling, by modifying StatisticsAreImportable so that it
also checks if 1) there are statistics on the remote server and 2) the
data is fresh enough, and if so, returns true; otherwise, returns
false; in the latter case we could fall back to sampling. And if we
modified it as such, I think we could change the default to true.
(Checking #2 is necessary to avoid importing stale data, which would
degrade plan quality.)
So while I haven't checked for "freshness" of the statistics, I have added checks that ensure that every asked-for column in the local table with attstattarget != 0 will be sent in the column filter, and we:
1. Find remote stats for all the columns that made our list
2. Do not get any stats over the wire with no matching target column.
3. Sort the list of expected remote column names, which means the list matching is effectively a merge, so O(N) vs O(N^2). This is done with a name+attnum structure, but it could just as easily have been done with a local_name+remote_name, as pg_restore_attribute_stats() will take either attnum or attname as a parameter.
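Here's roughly what that merge-style matching looks like, using hypothetical structure and helper names (the patch's actual names differ):

/* both arrays are sorted by remote-side attname */
int i = 0, j = 0;

while (i < num_expected && j < num_fetched)
{
    int cmp = strcmp(expected[i].attname, fetched[j].attname);

    if (cmp == 0)
    {
        /* matching local column and remote stats row: restore it */
        restore_one_attribute(&expected[i], &fetched[j]);   /* hypothetical helper */
        i++;
        j++;
    }
    else if (cmp < 0)
        elog(ERROR, "no remote statistics found for column \"%s\"",
             expected[i].attname);
    else
        elog(ERROR, "remote statistics returned for unexpected column \"%s\"",
             fetched[j].attname);
}

/* leftovers on either side also indicate a mismatch */
if (i < num_expected)
    elog(ERROR, "no remote statistics found for column \"%s\"",
         expected[i].attname);
if (j < num_fetched)
    elog(ERROR, "remote statistics returned for unexpected column \"%s\"",
         fetched[j].attname);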
Aside from a pre-emptive ANALYZE, how would you propose we check for and/or measure "freshness" of the remote statistics?
Remote ANALYZE would be an interesting idea, but to get the automatic
improvement, I think we should first work on the issue I mentioned
above. So I still think we should leave this for future work.
From a different perspective, even without the automatic improvement
including remote ANALYZE, I think this feature is useful for many
users.
I'm still hoping to hear from Nathan on this subject.
>> I think we should retrieve the attribute statistics for only the
>> referenced columns of the remote table, not all the columns of it, to
>> reduce the data transfer and the cost of matching local/remote
>> attributes in import_fetched_statistics().
> I thought about this, and decided that we wanted to 1) avoid per-column round trips and 2) keep the remote queries simple. It's not such a big deal to add "AND s.attname = ANY($3)" and construct a '{att1,att2,"ATt3"}' string, as we already do in pg_dump in a few places.
Yeah, I was also thinking of modifying the query as you proposed.
That is done in the patch attached.
> It may be a moot point. If we're not fetching relallfrozen, then the 14 & 18 cases are now the same, and since the pre-14 case concerns differentiating between analyzed and unanalyzed tables, we would just map that to 0 IF we kept those stats, but we almost never would because an unanalyzed remote table would not have the attribute stats necessary to qualify as a good remote fetch. So we're down to just one static query.
You are right. As the relation_sql query is only used in
fetch_remote_statistics(), shouldn't the query be defined within that
function?
Oddly enough, I already moved it inside of fetch_remote_statistics() in the newest patch.
I wasn't planning on posting this patch until we had heard back from Nathan, but since I'd already been working on a few of the items you mentioned in your last email, I thought I'd show you that work in progress. Some things, like the documentation, haven't been updated yet, but it does pass the tests.
Summary of key points of this v7 WIP:
* Reduced columns on relstats query, query moved inside calling function.
* Per-column filters on attstats queries, filtering on destination attstattarget != 0.
* Verification that all filtered-for columns have stats, and that all stats in the result set have a matching target column.
* Expected column list is now sorted by remote-side attname, allowing a merge join of the two lists.
* No changes to the documentation, but I know they are needed.
Attachments
Hi, thanks for working on this. Generally I think that this is a good
idea to avoid fetching rows from a foreign table to compute statistics
that may already be available on the foreign server.
I started reviewing the v7 patch and here are my initial comments. I
still want to do another round of review and run some more tests.
+ if (fdwroutine->ImportStatistics != NULL &&
+ fdwroutine->StatisticsAreImportable != NULL &&
+ fdwroutine->StatisticsAreImportable(onerel))
+ import_stats = true;
+ else
+ {
+ if (fdwroutine->AnalyzeForeignTable != NULL)
+ ok = fdwroutine->AnalyzeForeignTable(onerel,
+ &acquirefunc,
+ &relpages);
+
+ if (!ok)
+ {
+ ereport(WARNING,
+ errmsg("skipping \"%s\" -- cannot analyze this foreign table.",
+ RelationGetRelationName(onerel)));
+ relation_close(onerel, ShareUpdateExclusiveLock);
+ return;
+ }
+ }
+
if (fdwroutine->AnalyzeForeignTable != NULL)
ok = fdwroutine->AnalyzeForeignTable(onerel,
&acquirefunc,
&relpages);
if (!ok)
{
ereport(WARNING,
(errmsg("skipping \"%s\" --- cannot analyze this foreign table",
RelationGetRelationName(onerel))));
relation_close(onerel, ShareUpdateExclusiveLock);
return;
}
It seems that we have the same code within the else branch and again after the
if/else check; is this correct?
---
+ * Every row of the result should be an attribute that we specificially
s/specificially/specifically
---
+ if (TupleDescAttr(tupdesc, i)->attisdropped)
+ continue;
+
+ /* Ignore generated columns. */
+ if (TupleDescAttr(tupdesc, i)->attgenerated)
+ continue;
+
+ attname = NameStr(TupleDescAttr(tupdesc, i)->attname);
Wouldn't it be better to call TupleDescAttr() a single time and save the value?
Form_pg_attribute attr = TupleDescAttr(tupdesc, i);
/* Ignore dropped columns. */
if (attr->attisdropped)
continue;
...
--
Matheus Alcantara
EDB: https://www.enterprisedb.com
Hi Matheus,
On Wed, Jan 7, 2026 at 6:38 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> + if (fdwroutine->ImportStatistics != NULL &&
> + fdwroutine->StatisticsAreImportable != NULL &&
> + fdwroutine->StatisticsAreImportable(onerel))
> + import_stats = true;
> + else
> + {
> + if (fdwroutine->AnalyzeForeignTable != NULL)
> + ok = fdwroutine->AnalyzeForeignTable(onerel,
> + &acquirefunc,
> + &relpages);
> +
> + if (!ok)
> + {
> + ereport(WARNING,
> + errmsg("skipping \"%s\" -- cannot analyze this foreign table.",
> + RelationGetRelationName(onerel)));
> + relation_close(onerel, ShareUpdateExclusiveLock);
> + return;
> + }
> + }
> +
> if (fdwroutine->AnalyzeForeignTable != NULL)
> ok = fdwroutine->AnalyzeForeignTable(onerel,
> &acquirefunc,
> &relpages);
>
> if (!ok)
> {
> ereport(WARNING,
> (errmsg("skipping \"%s\" --- cannot analyze this foreign table",
> RelationGetRelationName(onerel))));
> relation_close(onerel, ShareUpdateExclusiveLock);
> return;
> }
>
> It seems that we have the same code within the else branch after the if/else
> check, is this correct?
No. This should be something like the attached in [1]. (I didn't
look at the core changes in v6...)
Thanks!
Best regards,
Etsuro Fujita
[1] https://www.postgresql.org/message-id/CAPmGK17Dfjy_zLH1yjPqybpSueHWP7Gy_xBZXA2NpRso1qya7A%40mail.gmail.com
> I agree, if there is no fallback, then the default should be false. When I was initially brainstorming this patch, Nathan Bossart had suggested making it the default because 1) that would be an automatic benefit to users and 2) the cost for attempting to import stats was small in comparison to a table sample, so it was worth the attempt. I still want users to get that automatic benefit, but if there is no fallback to sampling then the default only makes sense as false.
I think that the FDW API that I proposed could actually allow us to
fall back to sampling, by modifying StatisticsAreImportable so that it
also checks if 1) there are statistics on the remote server and 2) the
data is fresh enough, and if so, returns true; otherwise, returns
false; in the latter case we could fall back to sampling. And if we
modified it as such, I think we could change the default to true.
(Checking #2 is necessary to avoid importing stale data, which would
degrade plan quality.)
I've given this some more thought.
First, we'd have to add the va_cols param to StatisticsAreImportable, which isn't itself terrible.
Then, we'd have to determine that there are stats available for every mapped column (filtered by va_cols, if any). But the only way to do that is to query the pg_stats view on the remote end, and if we have done that, then we've already fetched the stats. Yes, we could avoid selecting the actual statistical values, and that would save some network bandwidth at the cost of having to do the query again with stats. So I don't really see the savings.
Also, the pg_stats view is our security-barrier black box into statistics, and it gives no insight into how recently those stats were acquired. We could poke pg_stat_all_tables, assuming we can even query it, and then make a value judgement based on (CURRENT_TIMESTAMP - GREATEST(last_analyze, last_autoanalyze)), but that judgement is highly subjective.
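For the record, such a probe would look something like this (the one-day threshold is an arbitrary placeholder, and this assumes we can read the view at all; a NULL result would mean the remote table was never analyzed):

SELECT CURRENT_TIMESTAMP - GREATEST(last_analyze, last_autoanalyze) < INTERVAL '1 day'
FROM pg_catalog.pg_stat_all_tables
WHERE schemaname = $1 AND relname = $2;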
I suppose we could move all of the statistics fetching into StatisticsAreImportable, and carry those values forward if they are satisfactory. That would leave ImportStatistics() with little to do other than form up the calls to pg_restore_*_stats(), but those could still fail, and at that point we'd have no way to fall back to sampling and analysis.
I really want to make sampling fallback possible.
Anyway, here's v8, incorporating the documentation feedback and Matheus's notes.
Attachments
On Wed Jan 7, 2026 at 3:04 AM -03, Corey Huinker wrote:
>>
>> > I agree, if there is no fallback, then the default should be false. When
>> I was initially brainstorming this patch, Nathan Bossart had suggested
>> making it the default because 1) that would be an automatic benefit to
>> users and 2) the cost for attempting to import stats was small in
>> comparison to a table sample, so it was worth the attempt. I still want
>> users to get that automatic benefit, but if there is no fallback to
>> sampling then the default only makes sense as false.
>>
>> I think that the FDW API that I proposed could actually allow us to
>> fall back to sampling, by modifying StatisticsAreImportable so that it
>> also checks if 1) there are statistics on the remote server and 2) the
>> data is fresh enough, and if so, returns true; otherwise, returns
>> false; in the latter case we could fall back to sampling. And if we
>> modified it as such, I think we could change the default to true.
>> (Checking #2 is necessary to avoid importing stale data, which would
>> degrade plan quality.)
>
>
> I've given this some more thought.
>
> First, we'd have to add the va_cols param to StatisticsAreImportable, which
> isn't itself terrible.
>
> Then, we'd have to determine that there are stats available for every
> mapped column (filtered by va_cols, if any). But the only way to do that is
> to query the pg_stats view on the remote end, and if we have done that,
> then we've already fetched the stats. Yes, we could avoid selecting the
> actual statistical values, and that would save some network bandwidth at
> the cost of having to do the query again with stats. So I don't really see
> the savings.
>
> Also, the pg_stats view is our security-barrier black box into statistics,
> and it gives no insight into how recently those stats were acquired. We
> could poke pg_stat_all_tables, assuming we can even query it, and then make
> a value judgement on the value of (CURRENT_TIMESTAMP -
> GREATEST(last_analyze, last_autoanalyze), but that value is highly
> subjective.
>
> I suppose we could move all of the statistics fetching
> into StatisticsAreImportable, And carry those values forward if they are
> satisfactory. That would leave ImportStatistics() with little to do other
> than form up the calls to pg_restore_*_stats(), but those could still fail,
> and at that point we'd have no way to fall back to sampling and analysis.
>
> I really want to make sampling fallback possible.
>
> Anyway, here's v8, incorporating the documentation feedback and Matheus's
> notes.
Thanks for the new version.
+static bool
+postgresStatisticsAreImportable(Relation relation)
+...
+
+ /*
+ * Server-level options can be overridden by table-level options, so check
+ * server-level first.
+ */
+ foreach(lc, server->options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ if (strcmp(def->defname, "fetch_stats") == 0)
+ {
+ fetch_stats = defGetBoolean(def);
+ break;
+ }
+ }
+
+ foreach(lc, table->options)
+ {
+ DefElem *def = (DefElem *) lfirst(lc);
+
+ if (strcmp(def->defname, "fetch_stats") == 0)
+ {
+ fetch_stats = defGetBoolean(def);
+ break;
+ }
+ }
+
I don't think that it's good to make the StatisticsAreImportable() routine
check if fetch_stats is enabled on foreign server/table options, because
if so, every fdw implementation would need this same block of code, and
fdw implementations may also forget or bypass these options, which I
don't think would be a desired behavior. What about moving this
check to analyze_rel()? Perhaps create a function that just checks if
fetch_stats is enabled.
If the above statement makes sense, it seems to me that
StatisticsAreImportable() may not be needed at all.
I think that we could make analyze_rel() check if fetch_stats is enabled
on the foreign server/table and then call ImportStatistics(), which could
return true or false. If it returns true, it means that the statistics
were imported successfully; otherwise, if it returns false, we could
fall back to table sampling as we already do today. ImportStatistics
could return false if the foreign server doesn't have statistics for the
requested table, even after running ANALYZE if remote_analyze is true.
Does that make sense? Any thoughts?
--
Matheus Alcantara
EDB: https://www.enterprisedb.com
I don't think that it's good to make the StatisticsAreImportable() routine
check if fetch_stats is enabled on foreign server/table options, because
if so, every fdw implementation would need this same block of code, and
fdw implementations may also forget or bypass these options, which I
don't think would be a desired behavior. What about moving this
check to analyze_rel()? Perhaps create a function that just checks if
fetch_stats is enabled.
StatisticsAreImportable() is a virtual function whose goal is to determine if this specific table supports stats exporting.
postgresStatisticsAreImportable() is the postgres_fdw implementation of that virtual function.
Any other FDWs that want to implement stats import will need to invent their own tests and configurations to determine if that is possible.
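For context, the two hooks hang off FdwRoutine roughly like this (a sketch; the ImportStatistics arguments shown here are just illustrative shorthand on my part, not necessarily the final signature):

    bool  (*StatisticsAreImportable) (Relation relation);
    void  (*ImportStatistics) (Relation relation, List *va_cols);  /* args illustrative */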
If the above statement makes sense, it seems to me that
StatisticsAreImportable() may not be needed at all.
It wasn't there, initially.
I think that we could make analyze_rel() check if fetch_stats is enabled
on the foreign server/table and then call ImportStatistics(), which could
return true or false.
That we can't do, because there's a chance that those FDWs already have a setting named fetch_stats.
If it returns true, it means that the statistics
were imported successfully; otherwise, if it returns false, we could
fall back to table sampling as we already do today. ImportStatistics
could return false if the foreign server doesn't have statistics for the
requested table, even after running ANALYZE if remote_analyze is true.
Does that make sense? Any thoughts?
That sounds very similar to the design that was presented in v1.
On Wed Jan 7, 2026 at 4:17 PM -03, Corey Huinker wrote:
>> I don't think that it's good to make the StatisticsAreImportable() routine
>> check if fetch_stats is enabled on foreign server/table options, because
>> if so, every fdw implementation would need this same block of code, and
>> fdw implementations may also forget or bypass these options, which I
>> don't think would be a desired behavior. What about moving this
>> check to analyze_rel()? Perhaps create a function that just checks if
>> fetch_stats is enabled.
>
> StatisticsAreImportable() is a virtual function whose goal is to determine
> if this specific table supports stats exporting.
>
> postgresStatisticsAreImportable() is the postgres_fdw implementation of
> that virtual function.
>
> Any other FDWs that want to implement stats import will need to invent
> their own tests and configurations to determine if that is possible.

Ok, now I understand. I thought that fetch_stats and remote_analyze were
general FDW options rather than options specific to postgres_fdw. Now I
understand that it is up to the fdw implementation to decide how this
should be enabled or disabled. Thanks for making it clear.

>> If it returns true, it means that the statistics
>> were imported successfully; otherwise, if it returns false, we could
>> fall back to table sampling as we already do today. ImportStatistics
>> could return false if the foreign server doesn't have statistics for the
>> requested table, even after running ANALYZE if remote_analyze is true.
>>
>> Does that make sense? Any thoughts?
>
> That sounds very similar to the design that was presented in v1.

Yeah, I think my suggestion doesn't make sense; I misunderstood the
feature. Sorry about the noise. I'll continue reviewing the v8 patch.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com
On Wed Jan 7, 2026 at 3:04 AM -03, Corey Huinker wrote:
> Anyway, here's v8, incorporating the documentation feedback and Matheus's
> notes.

+CREATE FOREIGN TABLE remote_analyze_ftable (id int, a text, b bigint)
+  SERVER loopback
+  OPTIONS (table_name 'remote_analyze_table',
+           fetch_stats 'true',
+           remote_analyze 'true');

I think that it would be good also to have a test case where remote_analyze
is false. The test could manually execute an ANALYZE on the target table and
ensure that an ANALYZE on the foreign table fetches the statistics correctly.

---

If the table doesn't have columns, it fails to fetch the statistics with
remote_analyze=false even if the target table has statistics:

ERROR: P0002: Failed to import statistics from remote table public.t2, no statistics found.

And if I set remote_analyze=true it fails with the following error:

postgres=# analyze t2_fdw;
ERROR: 08006: could not obtain message string for remote error
CONTEXT: remote SQL command: SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, s.n_distinct, s.most_common_vals, s.most_common_freqs, s.histogram_bounds, s.correlation, s.most_common_elems, s.most_common_elem_freqs, s.elem_count_histogram, s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram FROM pg_catalog.pg_stats AS s WHERE s.schemaname = $1 AND s.tablename = $2 AND s.attname = ANY($3::text[]) ORDER BY s.attname, s.inherited DESC
LOCATION: pgfdw_report_internal, connection.c:1037

---

If we try to run ANALYZE on a specific table column that doesn't exist we get:

postgres=# analyze t(c);
ERROR: 42703: column "c" of relation "t" does not exist
LOCATION: do_analyze_rel, analyze.c:412

With fetch_stats=false we get the same error:

postgres=# ALTER FOREIGN TABLE t_fdw OPTIONS (add fetch_stats 'false');
ALTER FOREIGN TABLE
postgres=# ANALYZE t_fdw(c);
ERROR: 42703: column "c" of relation "t_fdw" does not exist

But with fetch_stats=true we get a different error:

postgres=# ALTER FOREIGN TABLE t_fdw OPTIONS (drop fetch_stats);
ALTER FOREIGN TABLE
postgres=# ANALYZE t_fdw(c);
ERROR: P0002: Failed to import statistics from remote table public.t, no statistics found.

Should all these errors be consistent?

---

I hope that these comments are more useful now. Thanks.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com
On Wed, Jan 7, 2026 at 11:34 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v8, incorporating the documentation feedback and Matheus's notes.

I went through the patches. I have one question: The column names of
the foreign table on the local server are sorted using qsort, which
uses C sorting. The column names in the result obtained from the foreign
server are sorted using an ORDER BY clause. If the default collation on
the foreign server is different from the one used by qsort(), the
merge sort in import_fetched_statistics() may fail. Shouldn't these
two sorts use the same collation?

Some minor comments

 /* avoid including explain_state.h here */
 typedef struct ExplainState ExplainState;
-

unintended line deletion?

fetch_remote_statistics() fetches the statistics twice if the first
attempt fails. I think we need the same check after the second attempt as
well. The second attempt should not fail, but I think we need some
safety checks and an Assert at least, in case the foreign server
misbehaves.

--
Best Wishes,
Ashutosh Bapat
On Fri, Jan 9, 2026 at 4:35 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Wed, Jan 7, 2026 at 11:34 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v8, incorporating the documentation feedback and Matheus's notes.
I went through the patches. I have one question: The column names of
the foreign table on the local server are sorted using qsort, which
uses C sorting. The column names in the result obtained from the foreign
server are sorted using an ORDER BY clause. If the default collation on
the foreign server is different from the one used by qsort(), the
merge sort in import_fetched_statistics() may fail. Shouldn't these
two sorts use the same collation?
I wondered about that, but somehow thought that because they're of type pg_catalog.name rather than text/varchar, they weren't subject to collation settings. We definitely should explicitly sort by C collation regardless.
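Concretely, that would mean pinning the remote ordering (and keeping the local qsort comparator on plain strcmp), along the lines of:

ORDER BY s.attname COLLATE "C", s.inherited DESC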
Some minor comments
/* avoid including explain_state.h here */
typedef struct ExplainState ExplainState;
-
unintended line deletion?
Yeah, must have been.
fetch_remote_statistics() fetches the statistics twice if the first
attempt fails. I think we need the same check after the second attempt as
well. The second attempt should not fail, but I think we need some
safety checks and an Assert at least, in case the foreign server
misbehaves.
I think the only test *not* done on re-fetch is checking the relkind of the relation; are you speaking of another check? It's really no big deal to do that one again, but I made the distinction early on in the coding, and in retrospect there are relatively few checks we'd consider skipping on the second pass, so maybe it's not worth the distinction.
Since you're joining the thread, we have an outstanding debate about what the desired basic workflow should be, and I think we should get some consensus before we paint ourselves into a corner.
1. The Simplest Possible Model
* There is no remote_analyze functionality
* fetch_stats defaults to false
* Failure to fetch stats results in an error; no failover to sampling.
2. Simplest Model, but with Failover
* Same as #1, but if we aren't satisfied with the stats we get from the remote, we issue a WARNING, then fall back to sampling, trusting that the user will eventually turn off fetch_stats on tables where it isn't working.
3. Analyze and Retry
* Same as #2, but we add remote_analyze option (default false).
* If the first attempt fails AND remote_analyze is set on, then we send the remote analyze, then retry. Only if that fails do we fall back to sampling.
4. Analyze and Retry, Optimistic
* Same as #3, but fetch_stats defaults to ON, because the worst case scenario is that we issue a few queries that return 0-1 rows before giving up and just sampling.
* This is the option that Nathan advocated for in our initial conversation about the topic, and I found it quite persuasive at the time, but he's been slammed with other stuff and hasn't been able to add to this thread.
5. Fetch With Retry Or Sample, Optimistic
* If fetch_stats is on, AND the remote table is seemingly capable of holding stats, attempt to fetch them, possibly retrying after ANALYZE depending on remote_analyze.
* If fetching stats failed, just error, as a way to prime the user into changing the table's setting.
* This is what's currently implemented, and it's not quite what anyone wants. Defaulting fetch_stats to true doesn't seem great, but not defaulting it to true will reduce adoption of this feature.
6. Fetch With Retry Or Sample, Pessimistic
* Same as #5, but with fetch_stats = false.