Thread: Actual Cost


Actual Cost

From
Donald Dong
Date:
Hi,

When explaining a query, I think knowing the actual rows and pages in addition
to the operation type (e.g. seqscan) would be enough to calculate the actual
cost. The actual cost metric could be useful when we want to look into how far
off the planner's estimate is, and the correlation between time and cost. Would
it be a feature worth considering?

Thank you,
Donald Dong
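[For context, the idea can be sketched as follows. This is a hedged illustration using the sequential-scan formula from the PostgreSQL documentation (cost = seq_page_cost * pages + cpu_tuple_cost * rows), with default parameter values and per-tuple qual evaluation omitted; it is not the actual cost_seqscan() code.]

```python
# Sketch: recompute a seq scan's cost from page/row counts. Plugging in
# instrumented (actual) counts instead of the planner's estimates yields
# an "actual cost" in the same units as the estimate.

SEQ_PAGE_COST = 1.0    # default seq_page_cost
CPU_TUPLE_COST = 0.01  # default cpu_tuple_cost

def seqscan_cost(pages, rows):
    """Seq scan cost per the documented formula (qual costs omitted)."""
    return SEQ_PAGE_COST * pages + CPU_TUPLE_COST * rows

# Planner's estimate vs. a recomputation from instrumented counts:
estimated_cost = seqscan_cost(pages=1000, rows=50000)  # 1500.0
actual_cost = seqscan_cost(pages=1000, rows=80000)     # 1800.0
```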

Re: Actual Cost

From
David Fetter
Date:
On Sat, Feb 16, 2019 at 03:10:44PM -0800, Donald Dong wrote:
> Hi,
> 
> When explaining a query, I think knowing the actual rows and pages
> in addition to the operation type (e.g. seqscan) would be enough to
> calculate the actual cost. The actual cost metric could be useful
> when we want to look into how far off the planner's estimate is,
> and the correlation between time and cost. Would it be a feature
> worth considering?

As someone not volunteering to do any of the work, I think it'd be a
nice thing to have.  How large an effort would you guess it would be
to build a proof of concept?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Actual Cost

From
Tomas Vondra
Date:
On 2/17/19 3:40 AM, David Fetter wrote:
> On Sat, Feb 16, 2019 at 03:10:44PM -0800, Donald Dong wrote:
>> Hi,
>>
>> When explaining a query, I think knowing the actual rows and pages
>> in addition to the operation type (e.g. seqscan) would be enough to
>> calculate the actual cost. The actual cost metric could be useful
>> when we want to look into how far off the planner's estimate is,
>> and the correlation between time and cost. Would it be a feature
>> worth considering?
> 
> As someone not volunteering to do any of the work, I think it'd be a
> nice thing to have.  How large an effort would you guess it would be
> to build a proof of concept?
> 

I don't quite understand what is meant by "actual cost metric" and/or
how that differs from running EXPLAIN ANALYZE.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Actual Cost

From
Donald Dong
Date:
On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
>
> On 2/17/19 3:40 AM, David Fetter wrote:
>>
>> As someone not volunteering to do any of the work, I think it'd be a
>> nice thing to have.  How large an effort would you guess it would be
>> to build a proof of concept?
>
> I don't quite understand what is meant by "actual cost metric" and/or
> how that differs from running EXPLAIN ANALYZE.

Here is an example:

Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500 time=209.820..1168.831 rows=47 loops=3)

Now we have the actual time. Time can have a high variance (a change
in system load, or just noise), but I think the actual cost would be
less likely to change due to external factors.

On 2/17/19 3:40 AM, David Fetter wrote:
> On Sat, Feb 16, 2019 at 03:10:44PM -0800, Donald Dong wrote:
>> Hi,
>>
>> When explaining a query, I think knowing the actual rows and pages
>> in addition to the operation type (e.g. seqscan) would be enough to
>> calculate the actual cost. The actual cost metric could be useful
>> when we want to look into how far off the planner's estimate is,
>> and the correlation between time and cost. Would it be a feature
>> worth considering?
>
> As someone not volunteering to do any of the work, I think it'd be a
> nice thing to have.  How large an effort would you guess it would be
> to build a proof of concept?

Intuitively it does not feel very complicated to me, but I think the
interface when we're planning (PlannerInfo, RelOptInfo) is different
from the interface when we're explaining a query (QueryDesc). Since
I'm very new, if I do it myself, it would probably take many
iterations to get to the right point. But still, I'm happy to work on
a proof of concept if no one else wants to work on it.

regards,
Donald Dong

Re: Actual Cost

From
Tom Lane
Date:
Donald Dong <xdong@csumb.edu> writes:
> On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
>> I don't quite understand what is meant by "actual cost metric" and/or
>> how that differs from running EXPLAIN ANALYZE.

> Here is an example:

> Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500 time=209.820..1168.831 rows=47 loops=3)

> Now we have the actual time. Time can have a high variance (a change
> in system load, or just noise), but I think the actual cost would be
> less likely to change due to external factors.

I'm with Tomas: you have not explained what you think those
numbers mean.

            regards, tom lane


Re: Actual Cost

From
Donald Dong
Date:
On Feb 16, 2019, at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Donald Dong <xdong@csumb.edu> writes:
>> On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
>>> I don't quite understand what is meant by "actual cost metric" and/or
>>> how that differs from running EXPLAIN ANALYZE.
>
>> Here is an example:
>
>> Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500 time=209.820..1168.831 rows=47 loops=3)
>
>> Now we have the actual time. Time can have a high variance (a change
>> in system load, or just noise), but I think the actual cost would be
>> less likely to change due to external factors.
>
> I'm with Tomas: you have not explained what you think those
> numbers mean.

Yeah, I was considering the actual cost to be the output of the cost
model given the actual rows and pages after we instrument the
execution: plug in the values, which are no longer estimates.

For a hash join, we could use the actual inner_rows_total to get the
actual cost. For a seqscan, we can use the actual rows to get the
actual CPU cost.

regards,
Donald Dong



Re: Actual Cost

From
Tomas Vondra
Date:
On 2/17/19 7:45 AM, Donald Dong wrote:
> On Feb 16, 2019, at 9:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Donald Dong <xdong@csumb.edu> writes:
>>> On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
>>>> I don't quite understand what is meant by "actual cost metric" and/or
>>>> how that differs from running EXPLAIN ANALYZE.
>>
>>> Here is an example:
>>
>>> Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500 time=209.820..1168.831 rows=47 loops=3)
>>
>>> Now we have the actual time. Time can have a high variance (a change
>>> in system load, or just noise), but I think the actual cost would be
>>> less likely to change due to external factors.
>>
>> I'm with Tomas: you have not explained what you think those
>> numbers mean.
> 
> Yeah, I was considering the actual cost to be the output of the cost
> model given the actual rows and pages after we instrument the
> execution: plug in the values, which are no longer estimates.
> 
> For a hash join, we could use the actual inner_rows_total to get the
> actual cost. For a seqscan, we can use the actual rows to get the
> actual CPU cost.
> 

Perhaps I'm just too used to comparing the rows/pages directly, but I
don't quite see the benefit of having such an "actual cost", mostly
because the cost model is rather rough anyway.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Actual Cost

From
Jeff Janes
Date:
On Sat, Feb 16, 2019 at 10:33 PM Donald Dong <xdong@csumb.edu> wrote:
On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
>
> On 2/17/19 3:40 AM, David Fetter wrote:
>>
>> As someone not volunteering to do any of the work, I think it'd be a
>> nice thing to have.  How large an effort would you guess it would be
>> to build a proof of concept?
>
> I don't quite understand what is meant by "actual cost metric" and/or
> how that differs from running EXPLAIN ANALYZE.

Here is an example:

Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500 time=209.820..1168.831 rows=47 loops=3)

Now we have the actual time. Time can have a high variance (a change
in system load, or just noise), but I think the actual cost would be
less likely to change due to external factors.

I don't think there is any way to assign an actual cost.  For example, how do you know if a buffer read was "actually" seq_page_cost or random_page_cost?  And if there were a way, it too would have a high variance.

What would I find very useful is a verbosity option to get the cost estimates expressed as a multiplier of each *_cost parameter, rather than just as a scalar.  And at the whole-query level, get an rusage report rather than just wall-clock duration.  And if the HashAggregate node under "explain analyze" would report memory and bucket stats; and if the Aggregate node would report...anything.
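[The multiplier idea above can be sketched as follows. This is a hedged illustration, not an existing EXPLAIN feature: the coefficient vector for the node is hypothetical, and the parameter values are the PostgreSQL defaults.]

```python
# Sketch: report a node's cost as multipliers of each *_cost GUC rather
# than a single scalar. The scalar cost is then just the dot product of
# the coefficient vector with the current parameter settings.

# Current *_cost settings (PostgreSQL defaults).
params = {"seq_page_cost": 1.0, "random_page_cost": 4.0,
          "cpu_tuple_cost": 0.01, "cpu_operator_cost": 0.0025}

# Hypothetical coefficient vector for one node: 1000 sequential page
# fetches and 50000 tuples processed.
node_coeffs = {"seq_page_cost": 1000, "cpu_tuple_cost": 50000}

def scalar_cost(coeffs, params):
    """Collapse a coefficient vector back to the usual scalar cost."""
    return sum(n * params[name] for name, n in coeffs.items())
```

One appeal of this representation is that re-costing a plan under different GUC settings becomes a dot product instead of a re-plan: changing seq_page_cost rescales only the seq_page_cost term.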

Cheers,

Jeff

Re: Actual Cost

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> What would I find very useful is a verbosity option to get the cost
> estimates expressed as a multiplier of each *_cost parameter, rather than
> just as a scalar.

Perhaps, but refactoring to get that seems impractically invasive &
expensive, since e.g. index AM cost estimate functions would have to
be redefined, plus we'd have to carry around some kind of cost vector
rather than single numbers for every Path ...

> And at the whole-query level, get an rusage report
> rather than just wall-clock duration.

I'm sure you're aware of log_statement_stats and friends already.
I agree though that that's not necessarily an optimal user interface.

            regards, tom lane


Re: Actual Cost

From
Donald Dong
Date:
On Feb 17, 2019, at 10:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps, but refactoring to get that seems impractically invasive &
> expensive, since e.g. index AM cost estimate functions would have to
> be redefined, plus we'd have to carry around some kind of cost vector
> rather than single numbers for every Path ...

Maybe we could walk through the final plan tree and fill in the expression, with another tree
structure to hold the cost vectors?
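[A hedged sketch of that walk, using hypothetical structures rather than PlannerInfo/QueryDesc: each node carries its own coefficient vector, and a post-order traversal sums children into parents as a side structure, without refactoring every Path.]

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    label: str
    coeffs: dict                               # node's own cost coefficients
    children: list = field(default_factory=list)
    total: dict = field(default_factory=dict)  # filled in by the walk

def fill_cost_vectors(node):
    """Post-order walk: node.total = own coefficients plus all children's."""
    total = dict(node.coeffs)
    for child in node.children:
        for name, n in fill_cost_vectors(child).items():
            total[name] = total.get(name, 0) + n
    node.total = total
    return total

# Toy two-node plan: a hash join over a seq scan (coefficients invented).
scan = PlanNode("Seq Scan", {"seq_page_cost": 1000, "cpu_tuple_cost": 50000})
join = PlanNode("Hash Join", {"cpu_operator_cost": 2000}, [scan])
fill_cost_vectors(join)
```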

On Feb 17, 2019, at 8:29 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I don't think there is any way to assign an actual cost.  For example, how do you know if a buffer read was "actually"
> seq_page_cost or random_page_cost?  And if there were a way, it too would have a high variance.

Thanks for pointing that out! I think it's probably fine to use the same assumptions as the cost model? For example, we
assume 3/4ths of accesses are sequential, 1/4th are not.

> What would I find very useful is a verbosity option to get the cost estimates expressed as a multiplier of each
> *_cost parameter, rather than just as a scalar.

Yeah, such an expression would also help if we want to plug in the actual values.


On Feb 17, 2019, at 4:11 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Perhaps I'm just too used to comparing the rows/pages directly, but I
> don't quite see the benefit of having such "actual cost". Mostly because
> the cost model is rather rough anyway.

Yeah, I think the "actual cost" is only "actual" for the cost model - the cost it would output given the exact row/page
numbers. Some articles/papers have shown row estimation is the primary reason planners go off, so showing the
actual cost (or the updated assumptions) might also be useful if people want to compare different plans and refer to a
more accurate quantitative measure.

regards,
Donald Dong

Re: Actual Cost

From
Justin Pryzby
Date:
On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> What would I find very useful is [...] an rusage report rather than just
> wall-clock duration.

Most of that's available:

[pryzbyj@database ~]$ psql postgres -xtc "SET client_min_messages=log; SET log_statement_stats=on" -c 'SELECT max(i) FROM generate_series(1,999999)i'
 
SET
LOG:  statement: SELECT max(i) FROM generate_series(1,999999)i
LOG:  QUERY STATISTICS
DETAIL:  ! system usage stats:
!       0.186643 s user, 0.033622 s system, 0.220780 s elapsed
!       [0.187823 s user, 0.037162 s system total]
!       61580 kB max resident size
!       0/0 [0/0] filesystem blocks in/out
!       0/7918 [0/9042] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/3 [2/3] voluntary/involuntary context switches
max | 999999

See also
https://commitfest.postgresql.org/20/1691/ => 88bdbd3f746049834ae3cc972e6e650586ec3c9d
https://www.postgresql.org/message-id/flat/7ffb9dbe-c76f-8ca3-12ee-7914ede872e6%40stormcloud9.net
https://www.postgresql.org/docs/current/runtime-config-statistics.html#RUNTIME-CONFIG-STATISTICS-MONITOR
max RSS added in commit c039ba0716383ccaf88c9be1a7f0803a77823de1

Justin


Re: Actual Cost

From
Donald Dong
Date:
On Feb 17, 2019, at 11:05 AM, Donald Dong <xdong@csumb.edu> wrote:
>
> On Feb 17, 2019, at 10:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps, but refactoring to get that seems impractically invasive &
>> expensive, since e.g. index AM cost estimate functions would have to
>> be redefined, plus we'd have to carry around some kind of cost vector
>> rather than single numbers for every Path ...
>
> Maybe we could walk through the final plan tree and fill in the expression, with another tree
> structure to hold the cost vectors?

Here is a draft patch. I added a new structure called CostInfo to the Plan node. CostInfo is added in
create_plan, and the cost calculation is centered on CostInfo. Is this a reasonable approach?

Thank you,
Donald Dong




Attachments