Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)

From: Nino Arsov
Subject: Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)
Date:
Msg-id CAMGNKZ+5ipK+o5==gbpLrdgcTEtGVx68H1NFzJzQNVwer3XG+w@mail.gmail.com
Replies: Re: Estimating the execution cost of a query in a partitioned schema: Weird execution plans and totally wrong execution costs (pg_class and pg_statistic)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-admin
Hello,

I've taken part in some interesting research on estimating the execution cost of a query by directly updating the statistics stored in pg_class and pg_statistic. The idea is to range-partition the tables in a schema and then update the statistics in the two system catalogs mentioned above without transferring any data into the new schema. This keeps the algorithm fast to execute, since we use some global optimization methods.
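As a minimal sketch of the catalog-update step, assuming superuser access on a throwaway cluster (the table name and numbers are illustrative, not from the original setup):

```sql
-- WARNING: direct DML on system catalogs bypasses all safety checks;
-- do this only on a scratch instance.
BEGIN;

-- Pretend the (empty) child partition holds data.
UPDATE pg_class
SET    reltuples = 1200000,   -- fake row count
       relpages  = 18000      -- fake size in 8 kB pages
WHERE  relname = 'lineorder_p1';

-- Zero out the master so the planner treats it as empty.
UPDATE pg_class
SET    reltuples = 0,
       relpages  = 0
WHERE  relname = 'lineorder';

COMMIT;
```

The planner reads reltuples and relpages (rescaled by the current physical file size) when costing a scan, which is why the empty relation file also has to be made to look large on disk.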

Using EXPLAIN, we have successfully obtained execution plans and costs from the planner that are very similar to those produced when the newly created partitions actually contain data. This method works just fine as long as the query to be estimated joins no more than two tables.
When we try to estimate the cost of a query joining three or more tables, we get wrong plans and huge costs (on the order of millions of cost units). When data is actually loaded into all of the partitions, the cost does not exceed a few thousand cost units.

The strategy for determining the partitions that the query should be executed against uses a simple interval tree to find overlapping intervals of the range values. For each partitioned table, a new master table is created and its statistics are set to "zero" values, meaning that the master (parent) table contains no data, while the statistics of the child tables that inherit from the master are updated properly.
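The partition-selection step can be sketched as follows (all names are illustrative; the original uses a proper interval tree, while a linear scan over half-open ranges is enough to show the overlap test):

```python
# Given each child partition's [lo, hi) key range, return the partitions
# whose range overlaps the query predicate's [q_lo, q_hi) range.
def overlapping_partitions(partitions, q_lo, q_hi):
    """partitions: dict mapping partition name -> (lo, hi) half-open range."""
    return [name for name, (lo, hi) in partitions.items()
            if lo < q_hi and q_lo < hi]

parts = {
    "lineorder_p1": (0, 1000),
    "lineorder_p2": (1000, 2000),
    "lineorder_p3": (2000, 3000),
}
print(overlapping_partitions(parts, 500, 1500))
# -> ['lineorder_p1', 'lineorder_p2']
```

An interval tree replaces the linear scan with an O(log n + k) lookup, which matters only when the number of partitions is large.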

We use the Star Schema Benchmark (a modification of TPC-H) and assume uniformity of data.

The question: Is the following list of updated statistics enough to fool the planner into generating accurate execution plans as if there were data in those partitions?

I've been searching around for a few weeks but I'm getting nowhere. Here's what we update:
  • pg_statistic
    • stawidth
    • staop
    • stanumbersN
    • stakindN
    • stavaluesN
  • pg_class
    • relfilenode (we create an empty file and fool the file system about its size, so the planner actually sees that there is data physically stored on disk, although the relation file is empty)
    • reltuples
    • relpages
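A sanity check we find useful (illustrative SSB table and column names; assumes the catalog updates above have been applied) is to read the stats back through the pg_stats view and confirm the planner's row estimates with EXPLAIN:

```sql
-- Planner-visible statistics for the faked child partition.
SELECT tablename, attname, n_distinct, most_common_vals
FROM   pg_stats
WHERE  tablename = 'lineorder_p1';

-- Row estimates should now reflect the injected statistics.
EXPLAIN
SELECT *
FROM   lineorder_p1 lo
JOIN   customer c ON c.c_custkey = lo.lo_custkey
WHERE  lo.lo_orderdate BETWEEN 19940101 AND 19941231;
```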
This is probably both a tough and demanding question, but I guess this is the right place to ask.

Best regards,
Nino Arsov
