Re: [HACKERS] CLUSTER command progress monitor

Поиск
Список
Период
Сортировка
От Tatsuro Yamada
Тема Re: [HACKERS] CLUSTER command progress monitor
Дата
Msg-id 59B5F6C0.1090008@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] CLUSTER command progress monitor  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [HACKERS] CLUSTER command progress monitor  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On 2017/09/08 18:55, Robert Haas wrote:
> On Wed, Aug 30, 2017 at 10:12 PM, Tatsuro Yamada
> <yamada.tatsuro@lab.ntt.co.jp> wrote:
>>    1. scanning heap
>>    2. sort tuples
>
> These two phases overlap, though. I believe progress reporting for
> sorts is really hard.  In the simple case where the data fits in
> work_mem, none of the work of the sort gets done until all the data is
> read.  Once you switch to an external sort, you're writing batch
> files, so a lot of the work is now being done during data loading.
> But as the number of batch files grows, the final merge at the end
> becomes an increasingly noticeable part of the cost, and eventually
> you end up needing multiple merge passes.  I think we need some smart
> way to report on sorts so that we can tell how much of the work has
> really been done, but I don't know how to do it.

Thanks for the comment.

As you know, CLUSTER command uses SEQ SCAN or INDEX SCAN as a scan method by
cost estimation. In the case of SEQ SCAN, these two phases not overlap.
However, in INDEX SCAN, it overlaps. Therefore I created the phase of "scan
heap and write new heap" when INDEX SCAN was selected.

I agree that progress reporting for sort is difficult. So it only reports
the phase ("sorting tuples") in the current design of progress monitor of cluster.
It doesn't report counter of sort.


>>   heap_tuples_total   | bigint  |           |          |
>
> The patch is getting the value reported as heap_tuples_total from
> OldHeap->rd_rel->reltuples.  I think this is pointless: the user can
> see that value anyway if they wish.  The point of the progress
> counters is to expose things the user couldn't otherwise see.  It's
> also not necessarily accurate: it's only an estimate in the best case,
> and may be way off if the relation has recently be extended by a large
> amount.  I think it's pretty important that we try hard to only report
> values that are known to be accurate, because users hate (and mock)
> inaccurate progress reports.

Do you mean to use the number of rows by using below calculation instead
OldHeap->rd_rel->reltuples?
 estimate rows = physical table size / average row length

I understand that OldHeap->rd_rel->reltuples is sometimes useless because
it is correct by auto analyze and it can't perform when under a threshold.

I'll add it in next patch and also share more detailed the current design of
progress monitor for cluster.

Regards,
Tatsuro Yamada




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: [HACKERS] Setting pd_lower in GIN metapage
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256