Discussion: About method of PostgreSQL's Optimizer


About method of PostgreSQL's Optimizer

From:
Pryscila B Guttoski
Date:
Hello all!

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development of PostgreSQL's optimizer, but maybe someone can help me with this question.
PostgreSQL generates all possible plans for executing the query (using an almost exhaustive search), then assigns a cost to each plan, and finally the cheapest one is selected for execution.
There are other methods for query optimization; one of them is based on plan transformations (for example, using the A* algorithm) instead of the plan construction used by PostgreSQL.
Does anyone know why this method was chosen? Is there any paper or research about it?

Thanks a lot,
Pryscila.

Re: About method of PostgreSQL's Optimizer

From:
"Jonah H. Harris"
Date:
Pryscila,

While I haven't been too involved in the open-source PostgreSQL optimizer, I have done some work on it and on optimizers in other database systems.

Based on my work, it is my opinion that PostgreSQL, as well as other databases which use a cost-based optimizer, prefers a breadth-first algorithm because one cannot determine the "real" cost of each node at run time without systematically examining all possibilities through calculation.  This is the opposite of a rule-based optimizer, which defines heuristics that can be evaluated by a best-first algorithm such as A*.

In a cost-based optimizer, the system must calculate the "cost" of each path based on data that changes at run time, including indexing, cardinality, tuple size, available memory, CPU usage, disk access times, etc.  To a cost-based optimizer, every query is unique and therefore cannot follow a weighted path in the same fashion.  I can certainly see A* being used in a rule-based optimizer, but not in a real-time cost-based optimizer.
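
To make "the cost of each path" concrete, here is a minimal sketch of statistics-driven costing.  It is not PostgreSQL's actual code (that lives in the planner's cost-estimation module); the constants and formulas are simplified placeholders in the spirit of PostgreSQL's seq_page_cost, random_page_cost, and cpu_tuple_cost settings:

from dataclasses import dataclass

# Illustrative statistics-driven costing; the constants and formulas are
# simplified placeholders, not PostgreSQL's real cost model.

@dataclass
class TableStats:
    pages: int    # blocks on disk
    tuples: int   # estimated row count

SEQ_PAGE_COST = 1.0      # in the spirit of PostgreSQL's seq_page_cost
RANDOM_PAGE_COST = 4.0   # ...and random_page_cost
CPU_TUPLE_COST = 0.01    # ...and cpu_tuple_cost

def seqscan_cost(t: TableStats) -> float:
    # Sequential scan: read every block once, process every tuple.
    return t.pages * SEQ_PAGE_COST + t.tuples * CPU_TUPLE_COST

def indexscan_cost(t: TableStats, selectivity: float, index_pages: int) -> float:
    # Index scan: read index pages, then fetch matching heap blocks randomly.
    rows = t.tuples * selectivity
    heap_fetches = min(rows, float(t.pages))  # crude clamp; real models also
                                              # account for correlation/caching
    return (index_pages + heap_fetches) * RANDOM_PAGE_COST + rows * CPU_TUPLE_COST

stats = TableStats(pages=17_000, tuples=1_000_000)
print(seqscan_cost(stats))               # full-scan cost
print(indexscan_cost(stats, 0.001, 40))  # selective index-probe cost

The point is that every number feeding these formulas comes from run-time statistics, so the edge weights of the search graph are only known per query.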

Perhaps Tom, Bruce, et al. have more specifics on PostgreSQL's implementation.

-Jonah




--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: About method of PostgreSQL's Optimizer

From:
Josh Berkus
Date:
Pryscila,

> > There are other methods for query optimization; one of them is based on
> > plan transformations (for example, using the A* algorithm) instead of
> > the plan construction used by PostgreSQL.

We certainly do need a specific optimization for large star-schema joins.  I'm 
not certain that A* is suitable for our cost model, though; I think we might 
need to work up something more particular to Postgres.

> > Does anyone know why this method was chosen? Is there any paper or
> > research about it?

There are probably papers on ACM, but I've not read them.   Ours is a pretty 
straightforward implementation of a cost-based optimizer.   You can always 
read the code ;-)

Mark Kirkwood put together this nice paper on planner statistics:
http://www.powerpostgresql.com/PlanStats

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: About method of PostgreSQL's Optimizer

From:
Pryscila B Guttoski
Date:
Hi Jonah,

Thanks for your email; I really appreciate your opinions.

Would it be interesting to use both techniques? For example, given a query, an optimizer:
1. Generates one of the possible execution plans.
2. Does transformations on the original plan, based on rules and heuristics, resulting in new alternative plans.
3. Evaluates the cost of the generated plans by using statistics.
4. Keeps the plans that have a lower cost than the original plan.
5. Repeats steps 2-4 over the new alternative plans.
What do you think about it? Are there any restrictions that I haven't seen?
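
In rough pseudocode (initial_plan, transform_rules, and estimate_cost are placeholders I made up, not any real optimizer's API), I imagine the loop like this:

# Rough sketch of steps 1-5 above; all callables are hypothetical stubs.

def optimize(query, initial_plan, transform_rules, estimate_cost, max_rounds=10):
    best = initial_plan(query)              # step 1: one valid plan
    best_cost = estimate_cost(best)
    frontier = [best]
    for _ in range(max_rounds):             # step 5: iterate on the survivors
        candidates = []
        for plan in frontier:
            for rule in transform_rules:    # step 2: rule/heuristic rewrites
                candidates.extend(rule(plan))
        # steps 3-4: cost the candidates, keep only those cheaper than best
        improved = [p for p in candidates if estimate_cost(p) < best_cost]
        if not improved:
            break                           # no transformation helped; stop
        frontier = improved
        best = min(improved, key=estimate_cost)
        best_cost = estimate_cost(best)
    return best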

About another method: have you heard of using PDDL ("Planning Domain
Definition Language") for query optimization?

[]'s
Pryscila




Re: About method of PostgreSQL's Optimizer

From:
"Jonah H. Harris"
Date:
Pryscila,

Step 2 is basically where you find the difference between a cost-based optimizer (CBO) and a rule-based optimizer (RBO).  A CBO is driven by the computed execution cost of the query, whereas an RBO uses more generalized heuristics.

Let's take an example of what you're proposing and see if we can work it out from there.

Say we have the following (this is a generalized CBO approach, not PostgreSQL-specific):

Oracle's SCOTT.EMP table with a cardinality of 1 million and indexes on empno and ename.  For storage purposes, say that the empno index takes up 3600 blocks, the ename index takes up 7800 blocks, and the table itself takes up 17000 blocks.  We'll also say that we have a 256-megabyte buffer cache in which we have cached 50% of the empno index, 10% of the ename index, and 5% of the emp table data.

A user then issues the following query:

SELECT empno, ename FROM emp;

A cost-based optimizer will see the following:
1. See that the query is a full table scan (FTS) and calculate the cost of retrieving all 17000 blocks from disk.
2. See that it could instead retrieve all the data from the two indexes (11400 blocks) and join them (but with which join algorithm?).

Without performing a breadth-first algorithm, how can one evaluate both options in a way that would allow you to perform heuristic transformations dynamically?  What transformation/heuristic/rule can you use?  A CBO implementation has to calculate the amount of I/O needed for each plan based on several statistics, such as what's *potentially* in the cache, the access time for block I/O (including prefetching, if the storage manager has it), and other factors.  If you can name a database that uses a best-first algorithm such as A*, please send me the link to its docs; I'd be interested in reading about the implementation.
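
To put rough numbers on the two plans, assume (unrealistically) that a cached block is free and every uncached block costs one read.  This is back-of-the-envelope arithmetic, not any real optimizer's formula:

# Back-of-the-envelope costing of the two plans above.  The "cached blocks
# are free" model is a deliberate oversimplification for illustration.

def io_cost(blocks: int, cached_fraction: float) -> float:
    # Count only the blocks we expect to miss the buffer cache.
    return blocks * (1.0 - cached_fraction)

# Plan 1: full table scan of emp (17000 blocks, 5% cached).
fts = io_cost(17_000, 0.05)                        # 16150 expected misses

# Plan 2: scan both indexes (3600 blocks 50% cached, 7800 blocks 10%
# cached) and join them; the join's CPU/memory cost is left out here.
idx = io_cost(3_600, 0.50) + io_cost(7_800, 0.10)  # 8820 expected misses

print(f"full table scan: {fts:.0f} block misses")
print(f"index-only plan: {idx:.0f} block misses (before join cost)")

Even this toy version shows why the answer flips as the cache contents change, which is exactly what a static weighted graph can't capture.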

As for using both in the same optimizer, I could only see an algorithm such as a customized A* being used to plan *some* large queries.  The reason I say this is that the cost calculation, which would still need to be breadth-first, could calculate and cache the cost of most nodes, thereby allowing you to perform transformations at the tail of the calculation.
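
For illustration only, a best-first search over plan space might be shaped as below.  expand() and the heuristic h() are hypothetical, and producing an h() that never overestimates the remaining cost is exactly where the run-time statistics problem comes back in:

import heapq

# Illustrative best-first (A*-style) search over a plan space.  expand(),
# g(), h(), and is_complete() are hypothetical stubs.

def best_first_plan(start, expand, g, h, is_complete):
    # Priority = g (cost accumulated so far) + h (optimistic estimate of
    # the cost still to come).  A* is only optimal if h never overestimates.
    counter = 0                      # tie-breaker so heapq never compares plans
    frontier = [(g(start) + h(start), counter, start)]
    while frontier:
        _, _, plan = heapq.heappop(frontier)
        if is_complete(plan):
            return plan              # cheapest-looking complete plan
        for nxt in expand(plan):     # apply transformation rules
            counter += 1
            heapq.heappush(frontier, (g(nxt) + h(nxt), counter, nxt))
    return None                      # no complete plan reachable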

As for references on query optimization using best-first algorithms, I think I saw several types of algorithms used in work from a university query optimization engine.  I can't remember if it was Cornell, Stanford, or Wisconsin... I'll try to get you a link to their info.

-Jonah




--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: About method of PostgreSQL's Optimizer

From:
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> As for using both in the same optimizer, I could only see an algorithm such 
> as a customized A* being used to plan *some* large queries.  The reason I 
> say this is that the cost calculation, which would still need to be 
> breadth-first, could calculate and cache the cost of most nodes, thereby 
> allowing you to perform transformations at the tail of the calculation.

We do already have two different plan search algorithms: the strict
bottom-up dynamic programming approach (System R style) and the GEQO
optimizer, which we switch to when there are too many joins to allow
exhaustive search.  The GEQO code still depends on the normal
plan cost estimation code, but it doesn't consider every possible plan.
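
For anyone unfamiliar with the System R approach, here is a toy sketch of the bottom-up dynamic programming idea with a made-up cost function; the real thing (and the switchover to GEQO, controlled by geqo_threshold) lives under src/backend/optimizer/ and is considerably more involved:

from itertools import combinations

# Toy System R-style dynamic programming over join orders (exhaustive,
# bushy plans).  join_cost() is fabricated; the point is the bottom-up,
# level-by-level construction, not a realistic cost model.

def dp_join_search(rels, join_cost):
    # best[S] maps each non-empty subset of relations (as a frozenset)
    # to (cheapest cost found, plan tree that achieves it).
    best = {frozenset([r]): (0.0, r) for r in rels}
    for level in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, level)):
            for k in range(1, level):                  # every split of subset
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    lc, lp = best[left]
                    rc, rp = best[right]
                    total = lc + rc + join_cost(left, right)
                    if subset not in best or total < best[subset][0]:
                        best[subset] = (total, (lp, rp))
    return best[frozenset(rels)]

# Fake cost: joining larger intermediate results costs more.
cost, plan = dp_join_search(["a", "b", "c", "d"], lambda l, r: len(l) * len(r))
print(cost, plan)

The table over all subsets is why exhaustive search explodes with the join count, and why some cutover to a non-exhaustive strategy is needed.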

I've never been very happy with the GEQO code: the random component of
the algorithm means you get unpredictable (and sometimes awful) plans,
and the particular variant that we are using is really designed to solve
traveling-salesman problems.  It's at best a poor fit to the join
planning problem.

So it seems interesting to me to think about replacing GEQO with a
rule-based optimizer for large join search spaces.

There are previous discussions about this in the archives, I believe.
        regards, tom lane


Re: About method of PostgreSQL's Optimizer

From:
"Jonah H. Harris"
Date:
Tom,

I agree.  There have been several occasions where GEQO has performed poorly for me.  I'll search the archives for the past discussions.

Sorry for sending this to you twice, Tom... forgot to hit reply-all :(




--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: About method of PostgreSQL's Optimizer

From:
"Jonah H. Harris"
Date:
Pryscila,

For a research reference, you may want to look at the work done on the Columbia Query Optimization Framework.  As I recall, it (or its predecessors) had both cost- and rule-based optimization.  If you need the code for it, I can dig it up on one of my old systems.

Albeit dated, another good reference for optimizer implementation is the Cascades query optimization framework.





--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/