Query runtime strongly dependent on generated statistics (and fewer statistics are better?)

Поиск
Список
Период
Сортировка
От Viktor Rosenfeld
Тема Query runtime strongly dependent on generated statistics (and fewer statistics are better?)
Дата
Msg-id 20120706185353.GA91028@client195-161.wlan.hu-berlin.de
обсуждение исходный текст
Список pgsql-general
Hi,

(I've sent this mail about month ago, but received no reply. So I'm
trying again.)

I've noticed that the selection of the executed query plan (and
therefore query runtime) is dependent on the statistics generated by
an ANALYZE run. As an demonstration, I chose the best runtime of 5
consecutive runs of the query linked below, regenerated the statistics
for the column node_annotation.value and re-ran the query. This
experiment was repeated a hundred times each for the statistics
targets 10, 100 (default), 1000, 10000. I've used PostgreSQL 9.1.3.

Query: http://www.informatik.hu-berlin.de/~rosenfel/analyze/query.sql
Schema: http://www.informatik.hu-berlin.de/~rosenfel/analyze/schema.pdf

The next table shows for each statistics target the number of distinct
plans generated in the experiment, how often the most common plan was
generated, the runtime (ms) of the best and worst non-unique plan, and
the number of timeouts where the query did not finish within 60 seconds.

| Statistics | # plans | most common plan | Best | Worst | Timeouts |
|------------+---------+------------------+------+-------+----------|
|         10 |      37 |               60 | 1876 |  2180 |        0 |
|        100 |      90 |                4 | 2225 |  7927 |       14 |
|       1000 |      75 |                6 | 2214 |  6329 |       22 |
|      10000 |       6 |               85 | 2195 |  2900 |        3 |

The distribution for each statistics target is linked below:

Distribution: http://www.informatik.hu-berlin.de/~rosenfel/analyze/histogram.pdf

As one can see, using the default value of 100 (and also 1000) there is
a considerable spread in the runtime of the query. The best plan is
chosen most often, but only about a quarter of the time and there are
also many timeouts. The best results can be achieved with a statistics
target of 10: The most stable query plan is the second best and there
are no timeouts. Using a statistics target of 10000 generates the most
stable plan selection, i.e. the same plan is chosen most often, but it
is almost a second slower than the best plan.

I would like to know how I can mitigate against these random results.

Cheers,
Viktor

В списке pgsql-general по дате отправления:

Предыдущее
От: Viktor Rosenfeld
Дата:
Сообщение: BUG? Regular expression matching of optional character group at beginning of RE
Следующее
От: Don Parris
Дата:
Сообщение: Tutorial On Connecting LibreOffice to PostgreSQL Available