Re: Predicting query runtime

Поиск
Список
Период
Сортировка
От Oleg Ivanov
Тема Re: Predicting query runtime
Дата
Msg-id 57D92512.1080202@postgrespro.ru
обсуждение исходный текст
Ответ на Predicting query runtime  (Vinicius Segalin <vinisegalin@gmail.com>)
Ответы Re: Predicting query runtime  (Vinicius Segalin <vinisegalin@gmail.com>)
Список pgsql-general
Hi Vinicius,

I recommend you to read this (http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf) paper. Authors make a nice classification of different query performance prediction methods and propose their own solution for this problem.

You can also read (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be warned about possible pitfalls in PostgreSQL query optimizer. In my opinion, the most unpleasant one is that you often cannot rely on cardinality estimations made by PostgreSQL for path nodes. Typically, the more complicated query is, the less reliable cardinality estimations become. The good news is that cost model allows to predict query execution time precisely enough with good cardinality estimations.

In paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf) there is no machine learning. Nevertheless, you may find it interesting. It contains good description of PostgreSQL cost model and a method for automatic costs calibration (similar to proposed by Jeff in this thread).
The issue with the calibrating is follows: the multipliers for each *_cost factor are not provided or even directly computed in PostgreSQL for the majority of path nodes. The typical way of computations is not, for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost, but total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 * (random_page_cost / 2). Mathematically these formulas are equivalent, but practically you will spend more time and write more code to extract the multipliers in the second case.
In the above paper authors decided to calibrate costs using only those nodes, for which the computations are not very complicated and, therefore, the multipliers can be extracted relatively easy. Anyway, cost models are available in src/backend/optimizer/path/costsize.c, and you have to get inside it somehow to obtain extra information.

As for me, the paper (http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf) is interesting mostly by their introduction of graph editing distance as a distance on the space of paths. It is interesting because some machine learning methods do not require feature representations of objects, but only a valid distance function on each pair of them.

The paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about predicting query execution time for concurrent workloads and also contains machine learning.

I hope listed papers will be useful for your master's thesis.

The post related to (https://pgconf.ru/en/2016/89977) is available here (http://tigvarts.livejournal.com/691.html). Please note, that this post was published in February 2016, so the information in this post is partially outdated. Some main principles were changed during my work, some issues for further research are closed now, while some other issues appeared. I believe I will have a paper on my current results completed in the early October.

------
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 09/12/2016 05:03 PM, Vinicius Segalin wrote:
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provide, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.

Thank you.

Best regards,

Vinicius Segalin

В списке pgsql-general по дате отправления:

Предыдущее
От: Alex Sviridov
Дата:
Сообщение: pgAdmin3 backup over ssh tunnel
Следующее
От: Durumdara
Дата:
Сообщение: Re: Restricted access on DataBases