Re: Ideas about a better API for postgres_fdw remote estimates

From: Ashutosh Bapat
Subject: Re: Ideas about a better API for postgres_fdw remote estimates
Date:
Msg-id: CAG-ACPVQCoLaUEGrkKUV6Jd=AGVScab1MXMCwxGzOYQAGE22zQ@mail.gmail.com
In response to: Re: Ideas about a better API for postgres_fdw remote estimates (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Ideas about a better API for postgres_fdw remote estimates
List: pgsql-hackers


On Fri, 4 Sep 2020 at 20:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:


4) I wonder if we actually want/need to simply output pg_statistic data
verbatim like this. Is postgres_fdw actually going to benefit from it? I
kinda doubt that, and my assumption was that we'd return only a small
subset of the data, needed by get_remote_estimate.

This has a couple of issues. Firstly, it requires the knowledge of what
the stakind constants in pg_statistic mean and how to interpret it - but
OK, it's true that does not change very often (or at all). Secondly, it
entirely ignores extended statistics - OK, we might extract those too,
but it's going to be much more complex. And finally it entirely ignores
costing on the remote node. Surely we can't just apply local
random_page_cost or whatever, because those may be entirely different.
And we don't know if the remote is going to use index etc.

So is extracting data from pg_statistic the right approach?


There are two different problems, which ultimately might converge.
1. If use_remote_estimates = false, more generally if querying costs from foreign server for costing paths is impractical, we want to use local estimates and try to come up with costs. For that purpose we keep some statistics locally and user is expected to refresh it periodically by running ANALYZE on the foreign table. This patch is about a. doing this efficiently without requiring to fetch every row from the foreign server b. through autovacuum automatically without user firing ANALYZE. I think this also answers your question about vacuum_rel() above.

2. How to efficiently extract costs from an EXPLAIN plan when use_remote_estimate is true. That's the subject of a nearby thread, and I think that is the problem you are referring to here; hence your next point.
 
Using EXPLAIN to get costs from the foreign server isn't efficient. It increases planning time a lot; sometimes planning time exceeds execution time. If usage of foreign tables becomes more and more common, this isn't ideal. I think we should move towards a model in which the optimizer can decide whether a subtree involving a foreign server should be evaluated locally or on the foreign server without the foreign server's help. One way to do that (I am not saying it is the only or the best way) is to estimate the cost of the foreign query locally, based on the information about the foreign server and the foreign table that is available locally. That might mean fetching the information from the foreign server, caching it locally, and reusing it many times, including the indexes on the foreign table, the values of the various cost parameters, and so on. This approach doesn't solve all of those problems, but it is one step forward, and it also makes the current scenario more efficient.
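For reference, the per-path round trip being criticized here can be sketched as follows (server name is hypothetical; the EXPLAIN text shown is illustrative, not actual output):

```sql
-- Enable remote estimation for the hypothetical server.
ALTER SERVER remote_srv OPTIONS (ADD use_remote_estimate 'true');

-- During local planning, postgres_fdw then issues an EXPLAIN like this
-- to the remote server for each candidate foreign scan, and parses the
-- rows and cost figures out of the plan's top line, which has the shape
--   ... (cost=<startup>..<total> rows=<n> width=<w>)
-- Each such round trip adds network latency to local planning time.
EXPLAIN SELECT id, total FROM public.orders WHERE total > 100;
```

Caching the remote cost parameters and index definitions locally, as suggested above, would replace these per-query round trips with occasional refreshes.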

I agree that the patch needs some work though, especially the code dealing with serialization and deserialization of statistics.
--
Best Wishes,
Ashutosh
