Re: Adaptive query optimization

From Konstantin Knizhnik
Subject Re: Adaptive query optimization
Date
Msg-id b7644989-9c35-d0f2-2775-28f9c886c2f5@postgrespro.ru
In reply to Re: Adaptive query optimization  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Adaptive query optimization  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List pgsql-hackers


On 12.06.2019 0:43, Tomas Vondra wrote:


I don't think "learning phase" is an issue, in fact I think that's
something we need to do - it ensures we have enough data to make good
decisions.

What is wrong with the learning phase is that it requires some DBA assistance: somebody has to determine when to start learning,
provide a relevant workload and determine when learning is finished.
One of the most recent trends in DBMSes is autonomous databases with zero administration effort.
This is especially important for clouds. And one of the main advantages of AQO is that it allows queries to be optimized without human interaction.

But unfortunately I really do not know how to avoid the learning phase, especially if we want to run queries at a replica.


2. It saves collected data in Postgres tables, which makes a read-only transaction that executes only queries become a read-write transaction obtaining an XID...

Yeah, that's an issue because it makes it useless on standbys etc. I
think it'd be enough to do something similar to pg_stat_statements, i.e.
store it in memory and flush it to disk once in a while.

This is why my AQO implementation stores the collected data in a file.

3. It doesn't take into account the concrete values of literals used in clauses, so it is not able to address data skew.

Yep. I don't think it's necessarily an issue with all approaches to
adaptive optimization, though. But I agree we should detect both
systemic estimation issues, and misestimates for particular parameter
values. I think that's doable.

4. Machine learning can be quite expensive and seems to be overkill if we just want to adjust selectivities according to the actual number of affected rows.


I think that depends - some machine learning approaches are not that
bad. But I think there's a more serious issue - explainability. We need
a solution where we can explain/justify why it makes some decisions. I
really don't want a black box that produces numbers that you just need
to take at face value.

The good thing is that the simpler the method, the less expensive and
more explainable it is.

I tried to create a much simpler version of AQO based on the auto_explain extension.
This extension provides all the necessary infrastructure to analyze statements with long execution time.
I have added two new modes to auto_explain:
1. Auto generation of multicolumn statistics for variables used in clauses with large estimation error.
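
For illustration only (the table and column names are hypothetical, borrowed from the t1 example further below): when clauses on two columns keep producing large estimation errors, this mode can generate something along the lines of:

   -- Just a sketch, not the literal output of the patch: a multicolumn
   -- statistics object for frequently misestimated clauses on t1.a and t1.b.
   CREATE STATISTICS t1_a_b_stats (dependencies, ndistinct) ON a, b FROM t1;
   ANALYZE t1;   -- populate the new statistics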

Interesting! I probably wouldn't consider this part of adaptive query
optimization, but it probably makes sense to make it part of this. I
wonder if we might improve this to also suggest "missing" indexes?

I think that this should be the next step of adaptive query optimization:
- autogeneration of indexes
- auto adjustment of optimizer cost parameters (CPU cost, random/sequential page access cost, ...)

There is already the hypothetical index extension https://github.com/HypoPG/hypopg,
which can be used to estimate the effect of introducing new indexes.
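
Just to illustrate how it could be combined with index autogeneration (the concrete index definition here is hypothetical):

   -- hypopg lets the planner "see" an index without actually building it
   CREATE EXTENSION IF NOT EXISTS hypopg;
   SELECT * FROM hypopg_create_index('CREATE INDEX ON t1 (a, b)');
   EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2;  -- plan may use the hypothetical index
   SELECT hypopg_reset();                           -- discard all hypothetical indexes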


Right. But I think I might have an idea how to address (some of) this.

As I already mentioned, I was experimenting with something similar,
maybe two or three years ago (I remember chatting about it with Teodor
at pgcon last week). I was facing the same issues, and my approach was
based on hooks too.

But my idea was not to track stats for a plan as a whole, but instead
decompose it into individual nodes, categorized into three basic groups -
scans, joins and aggregations. And then apply this extracted information
to other plans with "matching" nodes.

For example, let's consider a simple "single-scan" query

   SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?;

Now, if you execute this enough times (say, 100x or 1000x), tracking
the estimates and actual row counts, you may then compute the average
misestimate (maybe a geometric mean would be more appropriate here?):

   AVG(actual/estimate)

Certainly stats should be collected for each plan node, not for the whole plan.
And this is what is done now in both Oleg's and my implementation.
Oleg is using a gradient descent method. I first tried to calculate an average, but then found that building something like a "histogram",
where the bin is determined by log10 of the estimated number of rows, works better.
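
To make both ideas concrete, here is a rough sketch (the feedback table and node signatures are hypothetical, only for illustration): the correction coefficient is a geometric mean of actual/estimated rows per node, and feedback is grouped by the log10 bin of the estimated row count:

   -- Hypothetical per-node feedback collected at execution time.
   CREATE TABLE aqo_node_feedback (
       node_signature text,               -- identifies the plan node (relation + clauses)
       estimated_rows double precision,
       actual_rows    double precision
   );

   -- Geometric mean of actual/estimate per node (the correction coefficient),
   -- grouped by the log10 "histogram" bin of the estimated row count.
   SELECT node_signature,
          floor(log(estimated_rows))                 AS est_bin,
          exp(avg(ln(actual_rows / estimated_rows))) AS correction
     FROM aqo_node_feedback
    WHERE estimated_rows > 0 AND actual_rows > 0
    GROUP BY node_signature, floor(log(estimated_rows));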


and if this is significantly different from 1.0, then we can say there's
a systemic misestimate, and we can use this as a correction coefficient
when computing the scan estimate. (And we need to be careful about
collecting new data, because the estimates will include this correction.
But that can be done by tracking "epoch" of the plan.)
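
(A minimal sketch of that bookkeeping, reusing the hypothetical names from the sketch above; the epoch column and signatures are made up for illustration:

   -- One coefficient per node signature, remembering the epoch in which the
   -- underlying feedback was collected, so feedback gathered while the
   -- correction is already applied is not folded in a second time.
   CREATE TABLE aqo_correction (
       node_signature text PRIMARY KEY,
       correction     double precision,   -- geometric mean of actual/estimate
       epoch          integer             -- planner epoch of the feedback
   );

   -- Applying it: multiply the optimizer's raw estimate (say, 1000 rows)
   -- by the stored coefficient for the matching scan signature.
   SELECT 1000 * correction AS corrected_rows
     FROM aqo_correction
    WHERE node_signature = 't1: a = ? AND b = ? AND c < ?';
)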

Now, if someone uses this same scan in a join, like for example

   SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
    WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
      AND (t2.x = ? AND t2.y = ?)

then we can still apply the same correction to the t1 scan (I think).
But then we can also collect data for the t1-t2 join, and compute a
correction coefficient in a similar way. It requires a bit of care
because we need to compensate for misestimates of inputs, but I think
that's doable.
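
(A hedged sketch of that compensation, reusing the hypothetical tables from the sketches above and assuming the join estimate was built from the uncorrected scan estimates: divide the raw join misestimate by the corrections already attributed to the two input scans, so the join coefficient reflects only the join-level error:

   -- All node signatures are hypothetical; c1/c2 are the per-scan corrections.
   SELECT exp(avg(ln(j.actual_rows / j.estimated_rows)))
              / (c1.correction * c2.correction)      AS join_correction
     FROM aqo_node_feedback j,
          aqo_correction   c1,
          aqo_correction   c2
    WHERE j.node_signature  = 'join: t1.id = t2.id'
      AND c1.node_signature = 't1: a = ? AND b = ? AND c < ?'
      AND c2.node_signature = 't2: x = ? AND y = ?'
      AND j.estimated_rows > 0 AND j.actual_rows > 0
    GROUP BY c1.correction, c2.correction;
)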

Of course, all this is rather speculative, and I never got to anything
beyond a very simple PoC. So I hope it makes at least some sense.


As far as I know, Oleg's AQO is now used by Amazon.
So it is something more than just a PoC. But certainly there are still many problems,
and my experiments with the JOB benchmark showed that there are still a lot of things to improve.
